v2.2.0 - private networking support, retry mechanism, bare-metal support, faster boot, and much more

RunsOn v2.2.0 has just been released 🎉.

Private subnet support

Runners can now be launched within private subnets, with NAT gateways providing access to Internet resources. This is useful if you:

  • want better security
  • need fixed IPs for egress traffic, for instance to access firewalled services (for deployments, etc.)

Just set the new CloudFormation stack parameter Private to true, and everything will be set up automatically.

Note that going private adds the cost of 3 managed NAT gateways (1 per availability zone).

Retry mechanism

  • Add a retry mechanism for jobs that could not be scheduled (invalid config, GitHub outage, EC2 outage, etc.), with a maximum of 5 retries over 10 minutes.

  • If a job is still in error after that, it ends up in the dead-letter queue. At this point, jobs can be manually retried by initiating a Redrive from the AWS UI. This means you can retry all jobs at once, without going through the GitHub UI to cancel and re-trigger them. Jobs older than 2 minutes check whether the workflow run is still queued before being retried, to avoid launching unnecessary runners.
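
To make these rules concrete, here is a minimal Python sketch of the decision they describe. It is not RunsOn's actual code: the names `next_action`, `MAX_RETRIES`, `STALE_AFTER_SECONDS` and the `is_run_still_queued` callback are hypothetical, and only the numbers (5 retries, 2 minutes) come from the notes above.

```python
from typing import Callable

MAX_RETRIES = 5            # from the notes: max 5 retries
STALE_AFTER_SECONDS = 120  # from the notes: jobs older than 2 minutes are re-checked


def next_action(attempt: int, job_age_seconds: float,
                is_run_still_queued: Callable[[], bool]) -> str:
    """Return "retry", "skip" or "dead-letter" for a job that failed to schedule.

    `attempt` is the number of retries already made (hypothetical bookkeeping).
    `is_run_still_queued` is a hypothetical callback asking GitHub whether the
    workflow run is still waiting for a runner.
    """
    if attempt >= MAX_RETRIES:
        # Retries exhausted: the job waits in the dead-letter queue for a manual Redrive.
        return "dead-letter"
    if job_age_seconds > STALE_AFTER_SECONDS and not is_run_still_queued():
        # The run was cancelled or picked up elsewhere: do not launch a runner.
        return "skip"
    return "retry"
```

For instance, `next_action(1, 300, lambda: False)` would skip the job, since it is older than 2 minutes and its run is no longer queued.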

MaxSpotInstanceCountExceeded detection

  • Add support for snoozing spot requests if MaxSpotInstanceCountExceeded is encountered, in which case instances are forced to launch as on-demand for the next 5 minutes. You will also receive an email with details about the issue.
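
The snoozing behaviour can be pictured with the following Python sketch. It only illustrates the idea and makes no claim about how RunsOn implements it: the `SpotSnooze` class and its method names are hypothetical, while the error code and the 5-minute window come from the note above.

```python
import time
from typing import Optional

SNOOZE_SECONDS = 5 * 60  # from the notes: on-demand for the next 5 minutes


class SpotSnooze:
    """Tracks whether spot requests are temporarily disabled (hypothetical helper)."""

    def __init__(self) -> None:
        self._snoozed_until = 0.0

    def record_error(self, error_code: str, now: Optional[float] = None) -> None:
        """Start (or extend) the snooze window when the spot limit error is seen."""
        now = time.monotonic() if now is None else now
        if error_code == "MaxSpotInstanceCountExceeded":
            self._snoozed_until = now + SNOOZE_SECONDS

    def purchase_option(self, now: Optional[float] = None) -> str:
        """Return "on-demand" while snoozed, "spot" otherwise."""
        now = time.monotonic() if now is None else now
        return "on-demand" if now < self._snoozed_until else "spot"
```

The `now` parameter is there only so the clock can be injected when testing; in normal use it would be left out.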

Spot interruption detection

  • The agent now properly detects spot interruptions and cleanly stops the GitHub agent, so that you don’t have to wait many minutes before the failed status is reflected.

CloudWatch agent

  • Runners can now post metric data into the RunsOn/Runners namespace. This makes it easy to add a CloudWatch agent sidecar.

Bare-metal instance support

  • You can now schedule bare-metal instances, for all your KVM-related hardware-accelerated needs (e.g. Android emulation). Bare-metal instances are also available as spot instances, but you should mix multiple families so that instances can be picked from the least interruptible pool.

Metadata improvements in logs

  • RAM and CPU count are now displayed. SSH details are now hidden if ssh is disabled.

Security

  • The SSH daemon is now disabled by default in AMIs, and only launched if ssh is set to true. This slightly reduces the attack surface in case you haven’t configured a CIDR for restricting access.

Base AMIs

  • Improve boot time (5 to 10s in my tests, but this might vary) and remove legacy or redundant stuff. This might break some workflows, but I will put some stuff back if needed.

Notable changes:

  • Add KVM and QEMU libraries to the base image.
  • Only keep 1 Java version (the one tagged as default in the official image). actions/setup-java only takes 5s if you need another version.
  • The Android SDK is still not present, but I can recommend android-actions/setup-android to set it up if needed (15s install time).
  • Remove chromium, as google-chrome is already present.
  • Remove the Google Cloud CLI, as it’s quite massive and GitHub plans to remove it anyway. You can use https://github.com/google-github-actions/setup-gcloud instead.

Config files

Until now, RunsOn was loading two config files by default:

  1. the .github/runs-on.yml local to the repository,
  2. (if no local config found), the .github/runs-on.yml from the .github organisation repository.

Since most companies might not want to publicly display their RunsOn config, this is now switched to fetching:

  1. .github/runs-on.yml local to the repository
  2. (if no local config file found) .github/runs-on.yml from the .github-private organisation repository (assuming RunsOn has access to that repo).
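
The new lookup order can be sketched in a few lines of Python. This is only an illustration of the fallback described above, not RunsOn's implementation: `resolve_config` and the `fetch` callback are hypothetical names, and how files are actually downloaded from GitHub is left out.

```python
from typing import Callable, Optional

CONFIG_PATH = ".github/runs-on.yml"
ORG_FALLBACK_REPO = ".github-private"  # previously the public .github repository


def resolve_config(owner: str, repo: str,
                   fetch: Callable[[str, str, str], Optional[str]]) -> Optional[str]:
    """Return the repo-local config if present, else the organisation-level one.

    `fetch(owner, repo, path)` is a hypothetical callback returning the file
    content, or None if the file is missing or the repository is not accessible.
    """
    local = fetch(owner, repo, CONFIG_PATH)
    if local is not None:
        return local
    # No local config file found: fall back to the organisation's private repo.
    return fetch(owner, ORG_FALLBACK_REPO, CONFIG_PATH)
```

If neither file exists (or RunsOn has no access to the .github-private repo), this sketch returns None.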

Fixes

  • Fix fallback to on-demand when a spot launch has an issue.
  • ssh, hdd, spot can again be configured from the config files.
  • Fix metric dimension empty value.

Misc

  • Allow specifying an HTTPS endpoint for receiving SNS alerts (AlertTopicSubscriptionHttpsEndpoint). Fixes #59.
  • Cache repository config (.github/runs-on.yml files) for 1 minute, to avoid useless re-downloads of that file when many jobs are launched at once.
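
As an illustration of the 1-minute config cache, here is a minimal time-based cache in Python. It is a sketch under assumptions, not the actual RunsOn code: `ConfigCache`, its `loader` callback and the cache key are hypothetical, and only the 1-minute duration comes from the note above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 60.0  # from the notes: repository config is cached for 1 minute


class ConfigCache:
    """Caches one config per repository for TTL_SECONDS (hypothetical helper)."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader  # hypothetical function that downloads the config
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, repo: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(repo)
        if entry is not None and now - entry[0] < TTL_SECONDS:
            return entry[1]  # still fresh: no re-download
        value = self._loader(repo)
        self._entries[repo] = (now, value)
        return value
```

With such a cache, a burst of jobs from the same repository within one minute triggers a single download of the config file.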