Blog

v2.2.1 - fix repo config inheritance, allow custom IAM policy for runners

RunsOn v2.2.1 has just been released 🎉.

A small patch release, with a fix for extending repository configuration using the _extends attribute, and a new feature to allow custom IAM policies for runners.

v2.2.0 - private networking support, retry mechanism, bare-metal support, faster boot, and much more

RunsOn v2.2.0 has just been released 🎉.

Private subnet support

Runners can now be launched within private subnets, with NAT gateways for accessing Internet resources. This is useful for:

  • better security
  • fixed IPs for egress traffic, for instance to access firewalled services (for deployments, etc.)

Just set the new CloudFormation stack parameter Private to true, and everything will be set up automatically.

Note that this will add the cost of 3 managed NAT gateways (1 per availability zone) should you choose to go private.

Retry mechanism

  • Add retry mechanism for jobs that could not be scheduled: invalid config, GitHub outage, EC2 outage etc. Max 5 retries over 10 mins.

  • If a job still fails after these retries, it ends up in the dead-letter queue. At this point jobs can be manually retried by initiating a Redrive from the AWS UI. This means you can retry all jobs at once, without going through the GitHub UI to cancel and re-trigger jobs. Jobs older than 2 minutes will check whether the workflow run is still queued before being retried (to avoid launching unnecessary runners).
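
The retry rules above can be sketched roughly as follows. This is only an illustrative model, not the RunsOn server code: the function names and the `is_still_queued` callback are hypothetical, and only the limits (5 retries, the 2-minute freshness check) come from the notes above.

```python
from typing import Callable

MAX_RETRIES = 5            # "Max 5 retries" (from the release notes)
STALE_AFTER_SECONDS = 120  # jobs older than 2 minutes are re-checked


def destination_after_failure(retries_done: int) -> str:
    """Where a job goes after one more failed scheduling attempt."""
    return "retry" if retries_done < MAX_RETRIES else "dead-letter queue"


def should_launch(job_created_at: float, now: float,
                  is_still_queued: Callable[[], bool]) -> bool:
    """Decide whether a retried job should still get a runner.

    Recent jobs are launched directly. Older jobs first ask GitHub
    (through the hypothetical is_still_queued callback) whether the
    workflow run is still waiting, to avoid launching useless runners.
    """
    if now - job_created_at <= STALE_AFTER_SECONDS:
        return True
    return is_still_queued()
```

For example, a job redriven from the dead-letter queue an hour after it was created would only get a runner if `is_still_queued()` returns true.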

MaxSpotInstanceCountExceeded detection

  • Add support for snoozing spot requests if MaxSpotInstanceCountExceeded is encountered. In that case, instances will be forced to launch as on-demand for the next 5 minutes. You will also receive an email with details about the issue.
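
A minimal sketch of what such a snooze could look like, assuming a simple in-memory timer. The class and method names are hypothetical, and the real server may track this differently; only the error code and the 5-minute duration come from the note above.

```python
import time

SNOOZE_SECONDS = 5 * 60  # "for the next 5 minutes"


class SpotSnooze:
    """Remembers a MaxSpotInstanceCountExceeded error for a while."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable clock, handy for tests
        self._snoozed_until = None   # no snooze active initially

    def record_error(self, error_code: str) -> None:
        # Only this specific EC2 error code triggers the snooze.
        if error_code == "MaxSpotInstanceCountExceeded":
            self._snoozed_until = self._clock() + SNOOZE_SECONDS

    def purchasing_option(self, spot_requested: bool) -> str:
        if not spot_requested:
            return "on-demand"
        snoozed = (self._snoozed_until is not None
                   and self._clock() < self._snoozed_until)
        return "on-demand" if snoozed else "spot"
```

Once the 5 minutes have elapsed, `purchasing_option(True)` returns `"spot"` again without any manual reset.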

Spot interruption detection

  • Agent properly detects spot interruptions, and will cleanly stop the GitHub agent, so that you don’t have to wait many minutes before the failed status is reflected.

CloudWatch agent

  • Runners now have the ability to post metric data into the RunsOn/Runners namespace. This means it’s now easy to add a CloudWatch agent sidecar.

Bare-metal instance support

  • You can now schedule bare-metal instances, for all your KVM-related hardware-accelerated needs (e.g. Android emulation). Bare-metal instances are also available as spot instances, but you should mix multiple families so that the instance can be picked from the least interruptible pool.

Metadata improvements in logs

  • RAM and CPU count are now displayed. SSH details are now hidden if ssh is disabled.

Security

  • The SSH daemon is now disabled by default in AMIs, and only launched if ssh is set to true. This slightly reduces the attack surface in case you haven’t configured a CIDR for restricting access.

Base AMIs

  • improve boot time (5 to 10s in my tests, but this might vary) and remove legacy or redundant software. This might break some workflows; I will put things back if needed.

Notable changes:

  • add KVM and QEMU libraries to base image.
  • only keep 1 Java version (the one tagged as default in the official image). actions/setup-java only takes 5s if you need another version.
  • the Android SDK is still not present, but I can recommend android-actions/setup-android to set it up if needed (15s install time).
  • remove chromium, as google-chrome is already present.
  • remove the Google Cloud CLI, as it’s quite massive and GitHub plans to remove it anyway. You can use https://github.com/google-github-actions/setup-gcloud instead.

Config files

Until now, RunsOn was loading two config files by default:

  1. the .github/runs-on.yml local to the repository,
  2. (if no local config found), the .github/runs-on.yml from the .github organisation repository.

Since most companies might not want to publicly display their RunsOn config, this is now switched to fetching:

  1. .github/runs-on.yml local to the repository
  2. (if no local config file found) .github/runs-on.yml from the .github-private organisation repository (assuming RunsOn has access to that repo).
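
The new lookup order can be sketched as below. This is a simplified illustration rather than the actual server logic: `fetch_file` is a hypothetical callback standing in for a GitHub API call, returning the file content or `None` when the file (or access to the repository) is missing.

```python
from typing import Callable, Optional

CONFIG_PATH = ".github/runs-on.yml"
ORG_FALLBACK_REPO = ".github-private"

# (org, repo, path) -> file content, or None if not found / not accessible
FetchFile = Callable[[str, str, str], Optional[str]]


def resolve_config(org: str, repo: str, fetch_file: FetchFile) -> Optional[str]:
    """Return the first config found: local repo, then .github-private."""
    for candidate_repo in (repo, ORG_FALLBACK_REPO):
        content = fetch_file(org, candidate_repo, CONFIG_PATH)
        if content is not None:
            return content
    return None  # no config anywhere
```

The organisation-level file is only consulted when the repository has no local config at all.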

Fixes

  • Fix fallback to on-demand when a spot launch has an issue.
  • ssh, hdd, spot can again be configured from the config files.
  • Fix metric dimension empty value.

Misc

  • Allow specifying an HTTPS endpoint for receiving SNS alerts (AlertTopicSubscriptionHttpsEndpoint). Fixes #59.
  • Cache repository config (.github/runs-on.yml files) for 1 minute, to avoid needless re-downloads of that file when many jobs are launched at once.
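
As an illustration of the 1-minute config cache, here is a small time-based cache. It is a sketch under assumptions: the class name, the key format, and the `loader` callback are hypothetical and not taken from the RunsOn code base.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class TTLCache:
    """Keeps loaded values for ttl_seconds, then reloads them on demand."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, str]] = {}

    def get(self, key: Hashable, loader: Callable[[], str]) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no download
        value = loader()             # expired or missing: fetch again
        self._entries[key] = (now, value)
        return value
```

With this in place, a burst of jobs from the same repository triggers a single download per minute instead of one per job.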

Changelog v2.1.0 - new Server and Agent, shared SQS queue, and more

RunsOn v2.1.0 has just been released 🎉.

Main changes

NodeJS => Go

I switched the server to the Go language, for better concurrency control. NodeJS allowed me to put something out quickly, and test the waters. But now that more and more people are using it, large clients (> 10k jobs a day) were running into some hard-to-troubleshoot concurrency issues due to the way NodeJS works. Go has a much better concurrency model, and I think it’s a better fit for the project anyway.

Before: (screenshot)

After: (screenshot)

If you are coming from a previous v2 version, the upgrade can be done in-place.

Agent and Server no longer public with the base license

Agent and Server source code now lives in separate private repositories, added as submodules of runs-on/runs-on. Only the CloudFormation template and base AMIs are public.

A Sponsorship license will give you access to everything, so that you or your security team can review all the code, and choose to build from source if needed. Other licenses only get the compiled agent and server binaries.

The reason for this change is two-fold:

  • make it more difficult for the competition to see how the sausage is made, especially now that RunsOn beats the majority of the competition in terms of concurrency, speed, hardware availability, and pricing.

  • nudge larger clients into buying the more expensive license: until now there was no real incentive to buy a more expensive license. I could put some more advanced features into the more expensive tier, but my current view is to provide the best self-hosted runner solution out there, irrespective of the company size. I also didn’t want to use volume-based pricing, since I like to keep billing simple and predictable for users.

Hopefully this will strike a good balance between keeping RunsOn affordable to everyone, and still being sustainable. Please let me know if you have any feedback about this, nothing is written in stone yet.

Features

  • use an SQS FIFO queue to handle pending job workflows. If your AppRunner service needs to scale up horizontally, this queue will now be shared across all instances, instead of each having its own in-memory queue. This also helps to not lose jobs in case an AppRunner instance goes down. A nice side effect is that it comes with integrated CloudWatch monitoring, so that you can see the number of pending jobs and the maximum delay.

  • allow disabling cost reports: a new parameter CostReportsEnabled is available in the CloudFormation stack, to disable the generation and sending of cost reports if you prefer to look at costs in CostExplorer or by other means.

  • allow specifying the disk size for default and large runner templates: 2 new CloudFormation parameters are now present, to specify the disk size of the default and large runner templates. In your job definition, simply indicate an hdd size and RunsOn will use the default template if hdd <= default size, or the large template if hdd > default size.
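
The template selection rule boils down to a single comparison. The sketch below is illustrative only: the function name is hypothetical, and treating a missing hdd value as "use the default template" is an assumption.

```python
from typing import Optional


def pick_template(requested_hdd_gb: Optional[int], default_size_gb: int) -> str:
    """Choose the runner template from the requested disk size."""
    if requested_hdd_gb is None:
        return "default"  # assumption: no hdd label means default template
    return "default" if requested_hdd_gb <= default_size_gb else "large"
```

For instance, with a default size of 40, `pick_template(40, 40)` selects the default template while `pick_template(41, 40)` selects the large one.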

Fixes

  • remove the AppWorkflowQueueSize parameter from the CF stack. It’s no longer needed, as we align on the EC2 rate-limit for now.

  • bring back default runner and image: you can specify runs-on: runs-on, and it will work again. Likewise, if you don’t specify an image, ubuntu22-full-x64 is used by default.

Breaking changes

  • older runner definitions (i.e. runner=2cpu-linux) are no longer supported. You must now use either runner=2cpu-linux-x64 or runner=2cpu-linux-arm64.
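
To illustrate the label format and the new requirement, here is a small parser sketch. It is not RunsOn's parser: the function name is hypothetical, and the suffix check is an assumption that only models the default runner names mentioned above, not custom runner definitions.

```python
from typing import Dict

ARCH_SUFFIXES = ("-x64", "-arm64")


def parse_labels(value: str) -> Dict[str, str]:
    """Parse e.g. 'runs-on,runner=2cpu-linux-arm64' into {'runner': ...}."""
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if not parts or parts[0] != "runs-on":
        raise ValueError("labels must start with 'runs-on'")
    options: Dict[str, str] = {}
    for part in parts[1:]:
        key, sep, val = part.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"expected key=value, got {part!r}")
        options[key] = val
    runner = options.get("runner")
    if runner is not None and not runner.endswith(ARCH_SUFFIXES):
        # Older definitions such as runner=2cpu-linux are rejected.
        raise ValueError(f"runner {runner!r} must end with -x64 or -arm64")
    return options
```

A bare `runs-on` label parses to an empty option set, which matches the restored default runner and image behaviour.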

Deprecations

  • the base and docker variants of the images as they stand are no longer useful, as the boot time of the full images is now considerably faster. They will most likely be removed in a future version, or rebuilt as a much lighter version of the full images.

Misc

  • setup flow design has changed a bit.

Changelog v2.0.13 - multi-az, multi-region, and much more

RunsOn v2.0.13 has just been released 🎉.

Warning: this is a major release bump, with a new VPC being created. You are advised to upgrade either during a quiet time (no runner running, otherwise the old VPC cannot be destroyed), or simply create a new stack with that template, follow the configuration process, and then Pause the previous AppRunner service until you validate that everything is going fine. Doing it this way will allow you to easily roll back to the previous version by just removing the new stack and clicking Resume on the previous AppRunner service.

Main changes

  • Replaces RunInstances call with CreateFleet, to reduce the number of API calls and increase the chances of finding a spot instance.
  • Multi-AZ support (3 AZs by default for the stack). The stack no longer asks for an AZ choice.
  • capacity-optimized-prioritized allocation, so that it selects the instance type from the pool with the least risk of being interrupted
  • Modify launch sequence so that instance retrieves boot details from the S3 bucket (no more user-data)
  • Make RunsOn region aware (with region label), allowing deployments of RunsOn in multiple regions

General improvements

  • Default runner types are now separated into -x64 and -arm64 variants (simplifies configuration, no need to explicitly specify image), e.g. runs-on: runs-on,runner=2cpu-linux-arm64
  • Implement new rate limiters for EC2 RunInstances and TerminateInstances operations, as well as for workflow queuing. All are configurable.
  • New ubuntu22 full images, with some more cleanup of legacy software to reduce image sizes, and use of an agent to launch the runner earlier, instead of waiting for the execution of the cloud-final service. Current timings (from workflow job created to workflow job running) with full image: x64=39s, arm64=34s
  • Add timings for when the workflow job was created on GitHub, when the workflow job webhook got received, when the workflow started to be scheduled, and when the instance was seen as pending by AWS

Fixes

  • Fix default alarm. Make threshold configurable.
  • Stack no longer requires extended IAM permissions.

Misc

  • Truncate CloudWatch dimension values to 250 chars.
  • Change runner name format (runs-on--<INSTANCE_ID>--<RANDOM>), so that it contains the instance id.
  • No more success email when service is up, since you could receive those whenever the service is scaled up by AppRunner.
  • No more cost email when service is up. Wait 24h before the first one.
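
The runner name format and the dimension truncation above can be illustrated with a few helpers. This is a sketch with assumptions: the helper names are hypothetical, and the random suffix is generated as 8 hex characters here, which may not match what RunsOn actually uses.

```python
import secrets

MAX_DIMENSION_CHARS = 250  # truncation limit from the notes above


def runner_name(instance_id: str) -> str:
    """Build a name of the form runs-on--<INSTANCE_ID>--<RANDOM>."""
    return f"runs-on--{instance_id}--{secrets.token_hex(4)}"


def instance_id_from_runner_name(name: str) -> str:
    """Recover the EC2 instance id embedded in a runner name."""
    parts = name.split("--")
    if len(parts) != 3 or parts[0] != "runs-on":
        raise ValueError(f"unexpected runner name: {name!r}")
    return parts[1]


def truncate_dimension(value: str, limit: int = MAX_DIMENSION_CHARS) -> str:
    """Cut a CloudWatch dimension value down to the allowed length."""
    return value[:limit]
```

Embedding the instance id in the name makes it possible to go from a runner shown in the GitHub UI straight to the matching EC2 instance.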

Breaking changes

  • Stack requires a VPC and subnet change, so perform the upgrade in a quiet time.
  • Runners no longer default to the 2cpu-linux x64 runner. You always need to specify a runner label as a base.
  • Specifying an image or runner label that does not exist will now raise an error, instead of silently falling back to the default image or runner specification.

Changelog v1.7.3 - now in eu-central-1 and us-west-2

RunsOn v1.7.3 has just been released 🎉.

What’s Changed

  • Official support for Frankfurt (eu-central-1) and Oregon (us-west-2) regions.
  • Disable AWS SDK retries for RunInstances API calls, to avoid rate limit issues.
  • Add m7i as an additional family type for default runners. Since m7a/c7a instances are in short supply, this should help make the onboarding for new users easier.