A small patch release, with a fix for extending repository configuration using the _extends attribute, and a new feature to allow custom IAM policies for runners.
Runners can now be launched within private subnets, with NAT gateways for accessing Internet resources. This is useful for:
better security
if you need to have fixed IPs for egress traffic, for instance to access firewalled services (for deployments, etc.)
Just specify true for the new CloudFormation stack parameter Private, and everything will automatically be set up.
Note that this will add the cost of 3 managed NAT gateways (1 per availability zone) should you choose to go private.
Retry mechanism
Add retry mechanism for jobs that could not be scheduled: invalid config, GitHub outage, EC2 outage etc. Max 5 retries over 10 mins.
If still in error, the jobs end up in the dead-letter queue. At this point jobs can be manually retried by initiating a Redrive from the AWS UI. This means you can retry all jobs at once, without going through the GitHub UI to cancel and re-trigger jobs. Jobs older than 2 minutes will check whether the workflow run is still queued before being retried (to avoid launching unnecessary runners).
MaxSpotInstanceCountExceeded detection
Add support for snoozing spot requests if MaxSpotInstanceCountExceeded is encountered, in which case instances will be forced to launch as on-demand for the next 5 minutes. You will also receive an email with details about the issue.
Spot interruption detection
Agent properly detects spot interruptions, and will cleanly stop the GitHub agent, so that you don’t have to wait many minutes before the failed status is reflected.
CloudWatch agent
Runners now have the ability to post metric data into the RunsOn/Runners namespace. This means it’s now easy to add a CloudWatch agent sidecar.
Bare-metal instance support
You can now schedule bare-metal instances, for all your KVM-related hardware-accelerated needs (e.g. Android emulation). Bare-metal instances are also available as spot instances, but you should mix multiple families so that the least interruptible pool can be chosen.
Metadata improvements in logs
RAM and CPU count are now displayed. SSH details are now hidden if ssh is disabled.
Security
The SSH daemon is now disabled by default in AMIs, and only launched if ssh is set to true. This slightly reduces the attack surface in case you haven’t configured a CIDR for restricting access.
Base AMIs
improve boot time (5 to 10s in my tests, but might vary) and remove legacy or redundant software. This might break some workflows, but I will put some of it back if needed.
Notable changes:
add KVM and QEMU libraries to base image.
only keep 1 Java version (the one tagged as default in the official image). actions/setup-java only takes 5s if you need another version.
Android SDK is still not present, but I can recommend android-actions/setup-android to set it up if needed (15s install time).
remove chromium, as google-chrome already present.
I switched the server to the Go language, for better concurrency control. NodeJS allowed me to put something out quickly, and test the waters. But now that more and more people are using it, large clients (> 10k jobs a day) were running into some hard-to-troubleshoot concurrency issues due to the way NodeJS works. Go has a much better concurrency model, and I think it’s a better fit for the project anyway.
If you are coming from a previous v2 version, the upgrade can be done in-place.
Agent and Server no longer public with the base license
Agent and Server source code now lives in separate private repositories, added as submodules of runs-on/runs-on. Only the CloudFormation template and base AMIs are public.
A Sponsorship license will give you access to everything, so that you or your security team can review all the code, and choose to build from source if needed. Other licenses only get the compiled agent and server binaries.
The reason for this change is two-fold:
make it more difficult for the competition to see how the sausage is made, especially now that RunsOn beats the majority of the competition in terms of concurrency, speed, hardware availability, and pricing.
nudge larger clients into buying the more expensive license: until now there was no real incentive to buy a more expensive license. I could put some more advanced features into the more expensive tier, but my current view is to provide the best self-hosted runner solution out there, irrespective of the company size. I also didn’t want to use volume-based pricing, since I like to keep billing simple and predictable for users.
Hopefully this will strike a good balance between keeping RunsOn affordable to everyone, and still being sustainable. Please let me know if you have any feedback about this, nothing is written in stone yet.
Features
use an SQS FIFO queue to handle pending job workflows. If your AppRunner service needs to scale up horizontally, this queue will now be shared across all instances, instead of each having its own in-memory queue. This also helps to not lose jobs in case an AppRunner instance goes down. A nice side effect is that it comes with integrated CloudWatch monitoring, so that you can see the number of pending jobs and the maximum delay.
allow disabling cost reports: a new parameter CostReportsEnabled is in the CloudFormation stack, to disable the generation and sending of cost reports, if you prefer to look at them in CostExplorer or by other means.
allow specifying the disk size for default and large runner templates: 2 new CloudFormation parameters are now present, to specify the disk size of the default and large runner templates. In your job definition, simply indicate an hdd size and RunsOn will use the default template if hdd <= default size, or the large template if hdd > default size.
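The template selection rule for disk sizes described in the last item can be written down in a few lines. This is a sketch of the rule as stated, not the actual implementation; the function name and the 40 GB default used in the example are assumptions.

```go
package main

import "fmt"

// pickTemplate returns which runner template would be used for a requested
// hdd size (in GB), given the default template's configured disk size.
// The names "default" and "large" mirror the wording of the release note.
func pickTemplate(hddGB, defaultGB int) string {
	if hddGB <= defaultGB {
		return "default"
	}
	return "large"
}

func main() {
	const defaultGB = 40 // hypothetical value of the CloudFormation parameter
	for _, hdd := range []int{20, 40, 41, 200} {
		fmt.Printf("hdd=%d -> %s template\n", hdd, pickTemplate(hdd, defaultGB))
	}
}
```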
Fixes
remove the AppWorkflowQueueSize parameter from the CF stack. It’s no longer needed, as we align on the EC2 rate-limit for now.
bring back default runner and image: you can specify runs-on: runs-on, and it will work again. Likewise, if you don’t specify an image, the ubuntu22-full-x64 image is used by default.
Breaking changes
older runner definitions (e.g. runner=2cpu-linux) are no longer supported. You must now use either runner=2cpu-linux-x64 or runner=2cpu-linux-arm64.
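To check existing workflows for the old runner names, something like the following sketch could help. It parses a runs-on label string and flags runner names without an architecture suffix. The helper names are hypothetical, and the suffix test is an assumption based on the two examples above, not the server's actual validation.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLabels splits a label string such as "runs-on,runner=2cpu-linux-arm64"
// into key/value pairs. Labels without "=" are stored with an empty value.
func parseLabels(s string) map[string]string {
	out := map[string]string{}
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		key, value, _ := strings.Cut(part, "=")
		out[key] = value
	}
	return out
}

// checkRunner returns an error for runner names that lack the -x64 or
// -arm64 suffix (assumed here to be what distinguishes the new names).
func checkRunner(name string) error {
	if strings.HasSuffix(name, "-x64") || strings.HasSuffix(name, "-arm64") {
		return nil
	}
	return fmt.Errorf("runner %q has no architecture suffix; use %s-x64 or %s-arm64", name, name, name)
}

func main() {
	for _, raw := range []string{
		"runs-on,runner=2cpu-linux",
		"runs-on,runner=2cpu-linux-arm64",
	} {
		labels := parseLabels(raw)
		if err := checkRunner(labels["runner"]); err != nil {
			fmt.Println("needs update:", err)
			continue
		}
		fmt.Println("ok:", raw)
	}
}
```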
Deprecations
the base and docker variants of the images as they stand are no longer useful, as the boot time of the full images is now considerably faster. They will most likely be removed in a future version, or rebuilt as a much lighter version of the full images.
Warning: this is a major release bump, with a new VPC being created. You are advised to upgrade either during a quiet time (no runner running, otherwise the old VPC cannot be destroyed), or simply create a new stack with that template, follow the configuration process, and then Pause the previous AppRunner service until you validate that everything is going fine. Doing it this way will allow you to easily roll back to the previous version by just removing the new stack and clicking Resume on the previous AppRunner service.
Main changes
Replaces RunInstances call with CreateFleet, to reduce the number of API calls and increase the chances of finding a spot instance.
Multi-AZ support (3 AZs by default for the stack). The stack no longer asks for an AZ choice.
capacity-optimized-prioritized allocation, so that it selects the instance type from the pool with the least risk of being interrupted
Modify launch sequence so that instance retrieves boot details from the S3 bucket (no more user-data)
Make RunsOn region aware (with region label), allowing deployments of RunsOn in multiple regions
General improvements
Default runner types are now separated into -x64 and -arm64 variants (simplifies configuration, no need to explicitly specify image), e.g. runs-on: runs-on,runner=2cpu-linux-arm64
Implement new rate limiters for EC2 RunInstances and TerminateInstances operations, as well as for workflow queuing. All are configurable.
New ubuntu22 full images, with some more cleanup of legacy software to reduce image sizes, and use of an agent to launch the runner earlier, instead of waiting for the execution of the cloud-final service. Current timings (from workflow job created to workflow job running) with full image: x64=39s, arm64=34s
Add timings for when the workflow job was created on GitHub, when the workflow job webhook got received, when the workflow started to be scheduled, and when the instance was seen as pending by AWS
Fixes
Fix default alarm. Make threshold configurable.
Stack no longer requires extended IAM permissions.
Misc
Truncate CloudWatch dimension values to 250 chars.
Change runner name format (runs-on--<INSTANCE_ID>--<RANDOM>), so that it contains the instance id.
No more success email when service is up, since you could receive those whenever the service is scaled up by AppRunner.
No more cost email when service is up. Wait 24h before the first one.
Breaking changes
Stack requires a VPC and subnet change, so perform the upgrade in a quiet time.
Runners no longer default to the 2cpu-linux x64 runner. You always need to specify a runner label as a base.
Specifying an image or runner label that does not exist will now raise an error, instead of silently falling back to the default image or runner specification.
Official support for Frankfurt (eu-central-1) and Oregon (us-west-2) regions.
Disable AWS SDK retries for RunInstances API calls, to avoid rate limit issues.
Add m7i as an additional family type for default runners. Since m7a/c7a instances are in short supply, this should help make the onboarding for new users easier.