About RunsOn
Author
Why RunsOn?
I like GitHub Actions, but:
- some longer workflows (>5min) would merit faster runners.
- running out of free minutes can get expensive pretty quick.
Current landscape:
- official runners: great for simple jobs, but can be costly. Larger runners are both very costly and slow. Good concurrency for standard runners.
- third-party runners (buildjet, warpbuild, ubicloud, etc.): support official images, but you have to trust a 3rd party (they're not certified). Most of them are only 50% cheaper, and/or slow as well. Not flexible in terms of hardware or image. Concurrency can be pricey or not available.
- (artisanal) self-hosted runners: tradeoff between cheap <> max concurrency, maintenance, potential leakage between workflows, security issue for public repos.
- (productized) self-hosted runners with ARC: heavy, require some expertise with k8s, no official image, ~flexibility in terms of hardware. Relevant if using autoscaled pods, otherwise can be costly.
What I wanted
Core features
- cheap!
- fast hardware
- 1-1 workflow compatibility with existing GitHub Actions
- fast boot
- infinite concurrency if I want to
- on-premise, don't want to share sensitive secrets with a 3rd party
- good network throughput for all those downloads / uploads
Nice to have
- fast caches
- one-click fire and forget install
- ability to use a specific base image, to preload software, precompilations, etc.
Solution
- ❌ official runners (expensive, and slow)
- ❌ third-party runners (3rd party, lack of concurrency, can be slow (network and/or hardware))
- ❌ (artisanal) self-hosted (maintenance, lack of concurrency, lack of image) (but can be great!)
- ❌ (productized) self-hosted (maintenance, lack of image, manual config of app credentials)
- ✅ RunsOn - KISS. Faster, 10x cheaper.
GitHub Webhook -> RunsOn -> EC2
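As a rough illustration of the first hop in that flow, here is a minimal sketch of how a webhook receiver could authenticate a GitHub delivery and decide whether a runner is needed. It relies on GitHub's documented `X-Hub-Signature-256` header and the `workflow_job` event; the function names are hypothetical and this is not RunsOn's actual code.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected, signature_header)


def should_launch(event_name: str, body: bytes) -> bool:
    """Return True only for a workflow_job event announcing a newly queued job."""
    if event_name != "workflow_job":
        return False
    payload = json.loads(body)
    return payload.get("action") == "queued"


def job_labels(body: bytes) -> list:
    """Extract the runs-on labels of the job (used here as the hardware hint)."""
    payload = json.loads(body)
    return payload.get("workflow_job", {}).get("labels", [])
```

A receiver would call `verify_signature` first, then `should_launch` with the value of the `X-GitHub-Event` header, and only then start an EC2 instance.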
Architecture
For now: no warm pool or clever shenanigans, stay with the most stupid thing that could work, and see how far that can go:
- instances auto-terminate when job finishes, even if RunsOn app is down, so no risk of overage.
- CloudWatch integration, for graphing consumed minutes (soon: CloudWatch monitoring for CPU/RAM usage).
- huge: integrated S3 cache (with VPC S3 gateway, so free traffic) => UNLIMITED cache. Can also be used to cache Docker layers.
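The text does not say how auto-termination is implemented. One plausible way to get "terminates even if the app is down" is to let the instance shut itself down once the runner exits, with EC2's instance-initiated shutdown behavior set to `terminate`. The sketch below only builds the launch parameters under that assumption; `build_launch_params` and `runner_command` are hypothetical names.

```python
def build_launch_params(ami_id: str, instance_type: str, runner_command: str) -> dict:
    """Build EC2 RunInstances parameters for a single self-terminating runner."""
    # User data script: run the (hypothetical) runner command, then power off.
    # There is no `set -e`, so the shutdown also runs if the runner fails.
    user_data = "\n".join([
        "#!/bin/bash",
        runner_command,
        "shutdown -h now",
    ]) + "\n"
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # An OS-level shutdown then terminates the instance instead of stopping it.
        "InstanceInitiatedShutdownBehavior": "terminate",
        "UserData": user_data,
    }
```

With boto3 installed, these parameters could be passed as `ec2_client.run_instances(**params)`. Because the shutdown is triggered from inside the instance, nothing depends on the controlling app still being reachable.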
Timings:
- from GitHub to RunsOn receiving the webhook: 1-3s delay.
- from RunsOn to Launching instance: ~5s (instance type selection, AMI selection, runner registration with GitHub).
- from Launching to Starting: ~15s (boot, pull AMI, etc.)
- from Starting to Accepting workflow job: ~10s (network, cloud-init, runner binary init + sync with GitHub).
All-in: from 30 to 50s depending on underlying AWS load. Hard to improve upon, unless mixed with warm pools of machines.
Other third-parties: anywhere from 10s (github) to multiple minutes (github!).
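For reference, the per-stage timings listed above can be added up as follows. The stage names and durations come from the list; treating the "~" values as fixed numbers is a simplification.

```python
# (stage, fastest seconds, slowest seconds), taken from the timings list.
STAGES = [
    ("GitHub -> RunsOn webhook", 1, 3),
    ("RunsOn -> Launching", 5, 5),
    ("Launching -> Starting", 15, 15),
    ("Starting -> Accepting job", 10, 10),
]


def total_range(stages):
    """Sum the best-case and worst-case durations across all stages."""
    low = sum(fastest for _, fastest, _ in stages)
    high = sum(slowest for _, _, slowest in stages)
    return low, high


print(total_range(STAGES))  # (31, 33)
```

The nominal sum of 31 to 33s sits at the low end of the 30 to 50s all-in figure; the remainder is the variance attributed to underlying AWS load.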