v2.2.0

Summary

Private subnet support, a new retry mechanism, spot interruption detection, bare-metal support, and many fixes and improvements.

Private subnet support

  • New CloudFormation stack parameter Private (false by default), which lets you spawn runners in private subnets and disable public IP allocation. Enabling it creates one NAT Gateway per subnet (3 in total, one for each of the 3 AZs) with 3 Elastic IPs, so it adds at least $150/month to your bill. This is mainly useful if you have stricter security requirements, or need your runners to use fixed IPs to interact with firewalled services.

  • [?] I could also add a parameter to use a single global NAT gateway for all subnets (to reduce costs). Let me know if that would be useful.

Retry mechanism

  • Add a retry mechanism for jobs that could not be scheduled (invalid config, GitHub outage, EC2 outage, etc.). Jobs are retried at most 5 times over 10 minutes.
  • Jobs that still fail end up in the dead-letter queue. From there, they can be retried manually by initiating a Redrive from the AWS UI. This means you can retry all jobs at once, without going through the GitHub UI to cancel and re-trigger them. Before being retried, jobs older than 2 minutes check whether the workflow run is still queued, to avoid launching unnecessary runners.
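
The routing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the `is_still_queued` callback are hypothetical, the spacing of retries within the 10-minute window is not modeled, and none of this is RunsOn's actual code.

```python
from typing import Callable

MAX_RETRIES = 5            # "at most 5 times" in the notes above
STALE_AFTER_SECONDS = 120  # jobs older than 2 minutes are re-checked


def route_after_failure(retries_done: int) -> str:
    """Decide where a job goes after a failed scheduling attempt.

    retries_done counts the retries already performed (hypothetical counter).
    """
    return "retry" if retries_done < MAX_RETRIES else "dead-letter"


def should_launch(job_age_seconds: float, is_still_queued: Callable[[], bool]) -> bool:
    """Decide whether a retried job should still get a runner.

    is_still_queued is a hypothetical callback that asks GitHub whether
    the workflow run is still waiting for a runner.
    """
    if job_age_seconds <= STALE_AFTER_SECONDS:
        return True  # fresh job: launch without the extra check
    return is_still_queued()  # old job: only launch if it is still waiting
```

The second check matters most after a Redrive, where jobs may be hours old and could have been cancelled in the meantime.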

MaxSpotInstanceCountExceeded detection

  • Add support for snoozing spot requests when MaxSpotInstanceCountExceeded is encountered (fixes #49). In that case, instances are forced to launch as on-demand for the next 5 minutes, and you receive an email with details about the issue.
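
A minimal sketch of what such a snooze window could look like. The `SpotSnooze` class and its method names are made up for illustration, and the email notification is left out; only the error code and the 5-minute duration come from the note above.

```python
import time
from typing import Callable

SNOOZE_SECONDS = 5 * 60                       # "next 5 minutes"
SNOOZE_ERROR = "MaxSpotInstanceCountExceeded"  # error code named above


class SpotSnooze:
    """Tracks whether spot requests are temporarily disabled (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._snoozed_until = float("-inf")  # not snoozed initially

    def record_error(self, error_code: str) -> None:
        # Only the quota error starts (or extends) the snooze window.
        if error_code == SNOOZE_ERROR:
            self._snoozed_until = self._clock() + SNOOZE_SECONDS

    def market(self, wants_spot: bool) -> str:
        """Return "spot" or "on-demand" for the next launch."""
        if wants_spot and self._clock() >= self._snoozed_until:
            return "spot"
        return "on-demand"
```

Injecting the clock keeps the window easy to test without waiting for real time to pass.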

Spot interruption detection

  • The agent now properly detects spot interruptions and cleanly stops the GitHub agent, so you no longer have to wait many minutes before the failed status is reflected.
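
These notes do not say how the agent detects interruptions. One common approach on EC2, shown here purely as an assumption, is to poll the instance metadata service: the `spot/instance-action` path returns 404 until an interruption is scheduled, then a small JSON document. The function names below are illustrative.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional, Tuple

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body: str) -> Tuple[str, datetime]:
    """Parse the JSON document served at meta-data/spot/instance-action."""
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], when.replace(tzinfo=timezone.utc)


def poll_interruption(timeout: float = 1.0) -> Optional[Tuple[str, datetime]]:
    """Return (action, time) if an interruption is scheduled, else None.

    Only works from inside an EC2 instance.
    """
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise
```

A caller would poll this every few seconds and stop the GitHub agent as soon as a non-None result comes back.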

CloudWatch agent

  • Runners can now post metric data into the RunsOn/Runners namespace. This makes it easy to add a CloudWatch agent sidecar (#41).
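
As a rough illustration of posting a custom metric into that namespace from a job, here is a sketch using boto3's `put_metric_data`. The metric name and dimension in the usage line are invented for the example, and it assumes boto3 is installed on the runner. Dimensions with empty values are dropped because CloudWatch requires dimension values to be non-empty.

```python
from datetime import datetime, timezone
from typing import Any, Dict

NAMESPACE = "RunsOn/Runners"  # namespace named in the note above


def build_metric(name: str, value: float, unit: str,
                 dimensions: Dict[str, str]) -> Dict[str, Any]:
    """Build one MetricData entry, skipping dimensions with empty values."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in dimensions.items() if val
        ],
    }


def publish(metric: Dict[str, Any]) -> None:
    """Send the metric to CloudWatch (boto3 imported lazily; needs AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_data(
        Namespace=NAMESPACE, MetricData=[metric]
    )
```

For example, `publish(build_metric("DiskUsedPercent", 42.0, "Percent", {"InstanceType": "m7a.large"}))` would send a single data point, where both names are placeholders.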

Bare-metal instance support

  • You can now schedule bare-metal instances, for all your KVM-related hardware-accelerated needs (e.g. Android emulation). Fixes #58. Bare-metal instances are also available as spot instances, but you should mix multiple families so that the least interruptible pool can be chosen.

Metadata improvements in logs

  • Fixes #57. RAM and CPU count are now displayed. SSH details are now hidden if ssh is disabled.

Security

  • The SSH daemon is now disabled by default in AMIs, and only launched if ssh is set to true. This slightly reduces the attack surface in case you haven't configured a CIDR for restricting access.

Base AMIs

  • Improve boot time (5 to 10s in my tests, but this might vary) and remove legacy or redundant software. This might break some workflows; I will put some things back if needed.

Notable changes:

  • Add KVM and QEMU libraries to the base image.
  • Only keep 1 Java version (the one tagged as default in the official image). actions/setup-java only takes 5s if you need another version.
  • The Android SDK is not present (as before), but I can recommend android-actions/setup-android to set it up if needed (15s install time).
  • Remove chromium, as google-chrome is already present.
  • Remove the gcloud CLI, as it's quite massive and GitHub plans to remove it anyway. You can use https://github.com/google-github-actions/setup-gcloud instead.

Config files

  • Until now, RunsOn loaded config from two locations by default: the .github/runs-on.yml file local to the repository and, if no local config was found, the .github/runs-on.yml file from the organisation's .github repository. Since most companies might not want to publicly display their RunsOn config, the fallback has changed. RunsOn now fetches the .github/runs-on.yml file local to the repository and, if no local config file is found, the .github/runs-on.yml file from the organisation's .github-private repository (assuming RunsOn has access to that repo). Addresses #45.
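
The new lookup order can be summarised with a small sketch. The `fetch` callback is a hypothetical stand-in for however the file is actually downloaded, and is assumed to return None when the file is missing or the repository is not accessible.

```python
from typing import Callable, Optional

CONFIG_PATH = ".github/runs-on.yml"
ORG_FALLBACK_REPO = ".github-private"  # was ".github" before this release

# fetch(owner, repo, path) -> file content, or None if not found (hypothetical).
Fetch = Callable[[str, str, str], Optional[str]]


def resolve_config(owner: str, repo: str, fetch: Fetch) -> Optional[str]:
    """Return the repository-local config, else the organisation-level one."""
    local = fetch(owner, repo, CONFIG_PATH)
    if local is not None:
        return local
    # No local file: fall back to the organisation's private config repo.
    return fetch(owner, ORG_FALLBACK_REPO, CONFIG_PATH)
```

Note that the two files are alternatives in this sketch: a local file fully replaces the organisation-level one rather than being merged with it, which matches the "if no local config file is found" wording above.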

Fixes

  • Fix fallback to on-demand when a spot launch has an issue.
  • ssh, hdd, and spot can once again be configured from the config files.
  • Fix empty metric dimension value.

Misc

  • Allow specifying an HTTPS endpoint for receiving SNS alerts (AlertTopicSubscriptionHttpsEndpoint). Fixes #59.
  • Cache repository config (.github/runs-on.yml files) for 1 minute, to avoid needlessly re-downloading the file when many jobs are launched at once.
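
The 1-minute config cache could be pictured as a simple time-to-live cache keyed by repository. The `ConfigCache` class and its `loader` callback below are illustrative assumptions, not the actual implementation.

```python
import time
from typing import Callable, Dict, Tuple

CACHE_TTL_SECONDS = 60  # "for 1 minute"


class ConfigCache:
    """Caches each repository's config for a short time (hypothetical helper)."""

    def __init__(self, loader: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader  # downloads the config for a repository
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, repo: str) -> str:
        now = self._clock()
        hit = self._entries.get(repo)
        if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
            return hit[1]  # still fresh: skip the download
        value = self._loader(repo)
        self._entries[repo] = (now, value)
        return value
```

With a burst of jobs from the same repository, only the first call triggers a download; the rest reuse the cached content until the minute is up.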