
Changelog

Details

Summary

Fix buildkit gha exporter, better user error reporting, AppRunner VPC connector integration.

What's changed

  • Report non-retryable user errors directly in GitHub: whenever a job can't be started for user reasons (e.g. bad image, bad runner definition, etc.), RunsOn will now spawn a default runner that will fail at the "Set up runner" step, with an error message explaining why. This will help surface issues. Fixes #307.


  • Fix issue with type=gha buildkit exporter for docker layers. Fixes #328.

  • Automatically enable the AppRunner VPC connector when Private mode is active, so that all AppRunner egress traffic (for the RunsOn orchestrator) goes through the private subnet(s) NAT gateways or equivalent. This means the AppRunner service will use the same static IP(s) as the runners, so that you can whitelist the AppRunner service on your GHES or GitHub Enterprise installation if needed. All ingress traffic is still publicly allowed and handled by AWS.

  • Telemetry: send values for networking_stack (embedded or external) and extras. This will help better understand how RunsOn is set up and which extra features are most used.

Details

Summary

Integrated CPU/Memory/Disk/Network monitoring, integrated job-level cost reporting, official snapshot action release, and many QoL improvements.

Spotlight: Monitoring improvements

  • Allow sending metrics to the CWAgent namespace. This lets runs-on/action@v2 send and graph metrics right within your job output. For instance:
      📊 Disk Writes:
         5973 ┤                ╭╮
         5500 ┤               ╭╯╰╮
         5028 ┼╮             ╭╯  ╰╮                   ╭───╮              ╭
         4555 ┤╰╮           ╭╯    ╰╮               ╭──╯   ╰─╮          ╭─╯
         4083 ┤ ╰╮         ╭╯      ╰─╮          ╭──╯        ╰╮       ╭─╯
         3610 ┤  ╰─╮      ╭╯         ╰╮      ╭──╯            ╰─╮   ╭─╯
         3138 ┤    ╰╮    ╭╯           ╰╮ ╭───╯                 ╰───╯
         2665 ┤     ╰╮  ╭╯             ╰─╯
         2193 ┤      ╰──╯
                                 Disk Writes (Ops/s)
      Stats: min:2040.0 avg:4180.9 max:6026.0 Ops/s
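A minimal workflow sketch (assuming no configuration is needed beyond the stack-level CWAgent option above; the step placement and build command are illustrative):

jobs:
  with-metrics:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64
    steps:
      # runs-on/action@v2 collects CPU/Memory/Disk/Network metrics during the job
      # and graphs them in the job output once it completes
      - uses: runs-on/action@v2
      - uses: actions/checkout@v4
      - run: make build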
  • Create a resource group for EC2 instances in CloudWatch. This means you can go to the CloudWatch EC2 Automatic dashboard, select your resource group (named after your RunsOn stack), and get a high-level overview of metrics for all your runner instances.

  • Allow the instance role to enable detailed monitoring on demand (not used for now, but may become an option of runs-on/action).

Spotlight: Costs computation

  • The runs-on/action@v2 now automatically computes the costs associated for each job, and displays the results right within your job logs. You can also choose to display them as a job summary.

Spotlight: Block-level snapshots

  • The runs-on/snapshot@v1 action is available and can be used to save and restore entire folders between job executions, at a much faster speed (for long jobs) than other methods relying on compression and export to S3 or similar.
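A hypothetical usage sketch (the path input name is an assumption; check the runs-on/snapshot documentation for the actual interface):

jobs:
  with-snapshot:
    runs-on: runs-on=${{ github.run_id }},runner=4cpu-linux-x64
    steps:
      - uses: actions/checkout@v4
      # restores the folder from the latest block-level snapshot at the start of
      # the job and snapshots it again at the end (input name is hypothetical)
      - uses: runs-on/snapshot@v1
        with:
          path: ./node_modules
      - run: npm ci && npm test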

Stack improvements

  • Allow to override the max runner time limit. Fixes #320.
  • Add support for permission boundary for SchedulerInvokeRole. Fixes #315.
  • Add AWS account-id and region to emails. Fixes #298.
  • Remove explicit PublicAccessBlockConfiguration declaration since some SCP policies can incorrectly flag the s3:PutBucketPublicAccessBlock action. This is the default for new buckets anyway.
  • Add ECR Full Access managed policy to obtain higher ECR Public Rate Limits.
  • Make bootstrapping work on more distros.
  • Allow RunsOn to auto-create spot role if absent.
  • Add health checks for the email notification subscription and the EC2 spot role.

Misc

  • Remove legacy (and unused) .env loading.
  • Update go to 1.24.

Details

Summary

Huge improvements to tagging, magic cache is now even faster, and bug fix for jobs tied to environments with no approval required.

What's changed

QoL improvements
  • Check if the spot role exists before starting RunsOn service (preflight check 2 from the installation guide). If not, alert the user over the SNS topic.

  • Cleanup all dangling instances, irrespective of RunsOn version.

  • Rewrote magic cache to more efficiently stream uploads when actions/cache is the client. For bigger (>1GiB) payloads, there should be a very noticeable improvement.

  • If SSHAllowed is set to false at the stack level, discard any ssh=true value coming from label or repo config. Fixes #310.

Improvements to tagging
  • Pass all custom tags to volumes, in addition to instances, when creating the runner. Fixes #264.

  • Allow to set additional custom tags using a custom property in the GitHub settings of a repository. If a custom property with name runs-on-custom-tags exists, RunsOn will parse it in the same way as the stack-level custom tags, and apply them to the instance and volumes. Fixes #297.


    For instance: if the value for property runs-on-custom-tags is set to key1=val1,key2=val2 then instances and volumes will get 2 new tags (key1, key2) with their corresponding values.

    The same restrictions as for stack-level tags apply. Stack-level tags take precedence over tags set in the custom property, and tags set in custom properties take precedence over custom runner tags defined in the .github/runs-on.yml configuration.

  • Pass custom tags and the default branch to the runner config, and write the config to /runs-on/config.json (Linux) or C:\runs-on\config.json (Windows). The config can then be read by actions, scripts, etc. to access all runner details easily.
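For example, a job step can read those details directly (a sketch; the exact JSON keys are not listed here, so the example simply dumps the file):

jobs:
  read-runner-config:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64
    steps:
      # /runs-on/config.json contains the runner details (custom tags, default branch, ...)
      - name: Show runner config
        run: cat /runs-on/config.json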

Bug fixes
  • Fix the race-condition that could lead to 2 instances being started when handling jobs tied to a deployment that does not require approval.
Misc
  • Add goroutine to cleanup dangling volumes and snapshots (prepare for block-level snapshots).
  • Register waiting, in_progress, and completed webhook payloads in S3 (in addition to queued).

Details

Summary

Support for EFS, TMPFS, and ECR ephemeral registry for fast docker builds. Also some bug fixes.

What's changed

EFS
  • Embedded networking stack can now create an Elastic File System (EFS), and runners will auto-mount it at /mnt/efs if the extras label includes efs. Useful for sharing artefacts across job runs with classic filesystem primitives.
jobs:
  with-efs:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=efs
    steps:
      - run: df -ah /mnt/efs
      # 127.0.0.1:/      8.0E   35G  8.0E   1% /mnt/efs
📝 Example use case: maintaining mirrors. For instance, this can be used to maintain local mirrors of very large GitHub repositories and avoid long checkout times for every job:
env:
  MIRRORS: "https://github.com/PostHog/posthog.git"
  # can be ${{ github.ref }} if same repo as the workflow
  REF: main

jobs:
  with-efs:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=efs
    steps:
      - name: Setup / Refresh mirrors
        run: |
          for MIRROR in ${{ env.MIRRORS }}; do
            full_repo_name=$(echo $MIRROR | cut -d/ -f4-)
            MIRROR_DIR=/mnt/efs/mirrors/$full_repo_name
            mkdir -p "$(dirname $MIRROR_DIR)"
            test -d "${MIRROR_DIR}" || git clone --mirror ${MIRROR/https:\/\//https:\/\/x-access-token:${{ secrets.GITHUB_TOKEN }}@} "${MIRROR_DIR}"
            ( cd "$MIRROR_DIR" && \
              git remote set-url origin ${MIRROR/https:\/\//https:\/\/x-access-token:${{ secrets.GITHUB_TOKEN }}@} && \
              git fetch origin ${{ env.REF }} )
          done
      - name: Checkout from mirror
        run: |
          git clone file:///mnt/efs/mirrors/PostHog/posthog.git --branch ${{ env.REF }} --single-branch --depth 1 upstream
Ephemeral registry
  • Support for an Ephemeral ECR registry: RunsOn can now automatically create an ECR repository that acts as an ephemeral registry for pulling/pushing images and cache layers from your runners. Especially useful with the type=registry buildkit cache instruction. If the extras label includes ecr-cache, the runners will automatically set up Docker credentials for that registry at the start of the job.
jobs:
  ecr-cache:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=ecr-cache
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v4
        env:
          TAG: ${{ env.RUNS_ON_ECR_CACHE }}:my-app-latest
        with:
          context: .
          push: true
          tags: ${{ env.TAG }}
          cache-from: type=registry,ref=${{ env.TAG }}
          cache-to: type=registry,ref=${{ env.TAG }},mode=max,compression=zstd,compression-level=22
Tmpfs

Support for setting up a tmpfs volume (size: 100% of available RAM, so only to be used on high-memory instances), and mounting the /tmp, /home/runner, and /var/lib/docker folders on it. /tmp and /home/runner are mounted as overlays, preserving their existing content.

Can speed up some IO-intensive workflows. Note that if tmpfs is active, instances with ephemeral disks won't have those mounted since it would conflict with the tmpfs volume.

jobs:
  with-tmpfs:
    runs-on: runs-on=${{ github.run_id }},family=r7,ram=16,extras=tmpfs
    steps:
      - run: df -ah /mnt/tmpfs
      # tmpfs            16G  724K   16G   1% /mnt/tmpfs
      - run: df -ah /home/runner
      # overlay          16G  724K   16G   1% /home/runner
      - run: df -ah /tmp
      # overlay          16G  724K   16G   1% /tmp
      - run: df -ah /var/lib/docker
      # tmpfs            16G  724K   16G   1% /var/lib/docker

You can of course combine options, e.g. extras=efs+tmpfs+ecr-cache+s3-cache is a valid label 😄

Instance-storage mounting changes

Until now, when an instance had locally attached NVMe SSDs available, they would be automatically formatted and mounted so that the /var/lib/docker and /home/runner/_work directories would end up on the local disks. Since a lot of content (caches, etc.) ends up within the /home/runner folder itself, the agent now uses the same strategy as for the new tmpfs mounts above: the whole /home/runner folder is mounted as an overlay on the local disk volume, as well as the /tmp folder, while /var/lib/docker remains mounted as a normal filesystem on the local disk volume. Fixes #284.

Misc
  • Move all RunsOn-specific config files into the /runs-on folder on Linux. More consistent with Windows (C:\runs-on), and avoids polluting the /opt folder.
  • Fix app_version in logs (was previously empty string due to incorrect env variable being used in v2.8.1).
  • Fix "Require any Amazon EC2 launch template not to auto-assign public IP addresses to network interfaces" from AWS Control Tower. When the Private mode is set to only, no longer enable public ip auto-assignment in the launch templates. Thanks @temap!

Details

Summary

A large release: can now use an external networking stack; encryption enabled on all S3 buckets; lots of quality-of-life improvements and bug fixes; Windows boot times halved and CloudWatch agent monitoring enabled. Be sure to read the upgrade notes.

What's changed

Networking
  • Can now reuse an existing networking stack, if the NetworkingStack stack parameter is set to external instead of embedded. Fixes #198, fixes #265, fixes #230 (a community-provided networking stack can provide this feature).


  • Some not-so-useful stack outputs have been removed. Some outputs may show as - when using an external VPC.
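A deployment sketch, assuming the stack is updated with the AWS CLI (the VPC and subnet parameters required in external mode are passed alongside; their exact names are documented with the networking stack):

# switch an existing stack to an external networking stack
# (other existing parameters should be passed with UsePreviousValue=true, omitted here)
aws cloudformation update-stack \
  --stack-name runs-on \
  --use-previous-template \
  --capabilities CAPABILITY_IAM \
  --parameters ParameterKey=NetworkingStack,ParameterValue=external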
Caching
  • Fix invalid cache key restoration for Magic Cache. Thanks @erikburt from ChainlinkLabs for the troubleshooting.
Security
  • Enable server-side encryption using AWS-managed KMS key on all S3 buckets. Fixes #276.

  • No longer expose the JIT token in cloud-init-output logs. The token is no longer valid once the job has run, but better safe than sorry.

QoL improvements
  • Add AppDebug (true or false) stack parameter, which allows disabling the auto-shutdown of runners when the bootstrap fails. Useful for investigating what is going on when the runner initializes.

  • Add AppCustomPolicy stack parameter: Optional managed IAM Policy ARN to assign to the App runner service role. Can be used to e.g. allow access to KMS decryption keys for AMIs. Thanks @dsme94!

  • Add AppGithubApiStrategy (normal or conservative) stack parameter to opt into minimizing GitHub API usage. If set to conservative, runners won't be automatically unregistered from GitHub's internal database (GitHub will still clean them up after 24h). This helps users with a very large number (20k+) of jobs launched every day. Fixes #285.

  • Now bootstraps runners using runs-on/bootstrap binary, preinstalled on official RunsOn images (faster and more extensible).

  • On spot interruption, give more time to the job to possibly complete before shutdown is triggered. Shutdown is now triggered 20s before the expected time sent by AWS, instead of 15 seconds after the notification is received. Fixes #277.

Windows
  • Shaved about 50s from Windows boot times: SSH is no longer automatically installed on Windows (SSM agent is available now), and no longer using Invoke-WebRequest helped a lot (TIL).

  • CloudWatch agent is automatically installed on Windows AMIs, and EC2Launch logs are shipped to CloudWatch (same naming as for Linux runners: e.g. LOG_GROUP_NAME/INSTANCE_ID/cloud-init-output.log). Also added support for roc connect on Windows AMIs in the RunsOn CLI.

Bug fixes
  • Fix for invalid CreateTags requests - Fixes #288.

  • Fix for invalid EC2 rate-limiter being used when uploading user-data file to S3. Fixes #286.

  • Adjust ownership rule for S3 bucket logging, from BucketOwnerPreferred to BucketOwnerEnforced. Fixes #291.

Details

Summary

GHES support is now available. Allow to specify a custom expiration for objects in the cache bucket.

What's changed

  • GHES support is now available. Fixes #250.
  • Add S3CacheExpirationInDays stack parameter. Fixes #179.
  • Tag launch templates. Fixes #264.
  • Pin launch template version to the specific version active at the time the RunsOn service was deployed. Fixes #274.
  • Tag instances with runs-on-is-ghes and runs-on-integrations-active.
  • Tag instances with InspectorEc2Exclusion to avoid SSM inspector scans on running instances. Possibly fixes #242.

Details

Summary

Hotfix: fix for disk=large handling.

Details

Summary

A few minor breaking changes related to VPC flow logs and hdd label. Plus many fixes.

Breaking changes

This is a minor release, so this comes with the following breaking changes. Please review your CloudFormation parameters and runner configuration accordingly when updating:

  • Fix for #258. VPC Flow Logs are now only enabled if VpcFlowLogFormat is set to a non-empty value. To enable them with the default format (as was the case before when no value was specified), set it to default.
  • Remove support for the deprecated hdd job label. Ensure all your workflows and repository configuration (.github/runs-on.yml) do not use this label. If it is still set, it will have no effect and the default runner configuration will be used for disk sizing. You must now use the disk=default or disk=large label instead.

What's changed

  • Fix for S3 server access logging. Fixes #241.
  • Allow specifying the root volume name. Fixes #207.
  • Expose RUNS_ON_AWS_AZ and RUNS_ON_INSTANCE_LAUNCHED_AT environment variables to jobs.
  • Properly set RUNNER_TOOL_CACHE 🤦, so that some setup-* actions can properly use the hosted toolcache on the VM.
  • Add policy to allow instances to describe their tags. Means we no longer need to enable InstanceMetadataTags for the instances. Cost Allocation Tag and Runner Tags can now contain slashes in their keys.
  • Add runs_on_spot_circuit_breaker_active prometheus metric (1=active, 0=inactive). Fixes #271.
  • Ensure we don't try to auto-retry after spot termination if the workflow run has already been manually re-attempted. Fixes #263.
  • Fixes typo - Fixes #248.
  • Scope the minutes alarm on the stack name. Also add the StackName dimension on all metrics. Fixes #235.
  • (alpha, not fully functional yet) Support for GitHub Enterprise Server (GHES) installations.

Details

Summary

Hotfix for CreateFleet IdempotentParameterMismatch errors, as well as Magic Cache support for newer buildx versions.

What's changed

  • Fixes #251: IdempotentParameterMismatch error.
  • Fix Magic Cache for newer buildx versions. No longer need to set version=1 in cache-from and cache-to.
  • Fixes #249: Add cancelled to the list of conclusion statuses that can trigger an auto-retry.

Details

Summary

New spot circuit breaker for snoozing spot requests if too many interruptions detected. Monitoring improvements. StepSecurity integration, and more.

What's changed

Spot circuit breaker
  • Allow to switch to on-demand requests if spot interruption frequency is too high over a defined time interval. Fixes #226.

For instance, if SpotCircuitBreaker is set to 2/30/60, it means that after at least 2 interruptions in the last 30 minutes, RunsOn will switch to on-demand requests for the next 60 minutes.
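As a sketch, the parameter can be set when updating the stack (CLI form shown; the same value can be entered in the CloudFormation console):

# 2 interruptions within the last 30 minutes => on-demand requests for the next 60 minutes
# (other parameters omitted; keep them with UsePreviousValue=true)
aws cloudformation update-stack \
  --stack-name runs-on \
  --use-previous-template \
  --capabilities CAPABILITY_IAM \
  --parameters ParameterKey=SpotCircuitBreaker,ParameterValue=2/30/60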

Monitoring
  • Add workflow job conclusion to prometheus labels. Fixes #178. Also add job_conclusion and run_attempt to all log lines.
  • Support SQS queue oldest message age alarms. Helps with compliance and to detect whether RunsOn has issues dequeuing messages fast enough. Fixes #228.
  • Use scheduled event to compute and send cost reports at midnight UTC. Fixes #216.
Native integration with StepSecurity
jobs:
  job-with-stepsecurity:
    runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/image=ubuntu24-stepsecurity-x64"
    steps:
      - name: External call
        run: curl https://google.com

Documentation: https://runs-on.com/integrations/stepsecurity/

Misc
  • Reduce agent binary size.
  • Update Go dependencies.
  • Allow injection of custom runner agent (internal testing only).
  • Remove magic cache ON annotation. Fixes #234.

Details

Summary

Fix VpcEndpoints stack parameter.

What's changed

With VpcEndpoints enabled, the CloudFormation template was incorrectly assigning interface endpoints to both public and private subnets, while an interface endpoint can only be defined once per AZ (and only makes sense for private subnets anyway).

Thanks again to Commonwealth Fusion Systems for their quick feedback and help!

Details

Summary

Optimized GPU images, new VpcEndpoints stack parameter, ability to specify custom instance tags for custom runners.

Note: there appear to be some issues with the new VPC endpoints. I'm on it! If you need that feature, please hold on to your current version of RunsOn.

What's Changed

  • New GPU images ubuntu22-gpu-x64 and ubuntu24-gpu-x64: 1-1 compatibility with GitHub base images + NVIDIA GPU drivers, CUDA toolkit, and container toolkit.
  • Add new VpcEndpoints stack parameter (fixes #213), and reorganize template params. Note that the EC2 VPC endpoint was previously automatically created when Private mode was enabled. This is no longer the case, so make sure you select the VPC endpoints that you need when you update your CloudFormation stack.
  • Suspend versioning for cache bucket (fixes #191).
  • Allow to specify instance tags for runners (fixes #205). Tag keys can't start with the runs-on- prefix, and keys and values will be sanitized according to AWS rules.

Details

Summary

CLI 0.0.1 released, fix for Magic Cache, fleet objects deletion.

What's changed

  • CLI released: https://github.com/runs-on/cli. Allows to easily view logs (both server logs and cloud-init logs) for a workflow job by just pasting its GitHub URL or ID. Also allows easy connection to a runner through SSM.
  • Fix race-condition in Magic Cache (fixes #209).
  • Delete the fleet instead of just the instance (fixes #217).

Details

Summary

Fix magic cache handling of actions/upload-artifact. Prepare for RunsOn CLI.

What's changed

  • Store instance id assigned to job (once job has started) in the main S3 bucket (under /runs-on/db/jobs/JOB_ID/instance-id), as well as the payload for the workflow_job queued event. Will be used for #201.
  • Fix magic cache for cache keys with slashes inside.
  • Make magic cache play nice with actions/upload-artifact. For that you must add runs-on/action@v1 in your workflows. Fixes #197.
  • Documentation for magic cache at https://runs-on.com/caching/magic-cache/

Details

Summary

Magic transparent cache for dependencies and docker layers. SSM support for logging into runner instances. And more.

What's changed

  • BETA - Transparent S3-backed caching for dependencies when using the standard actions/cache action (or any third-party action built on the official toolkit), enabled with the extras=s3-cache job label:
jobs:
  look-ma-no-cache-config:
    runs-on: "runs-on=${{github.run_id}}/runner=2cpu-linux-x64/extras=s3-cache"
    steps:
     # standard action is supported, no need to use `runs-on/cache@v4`
     - uses: actions/cache@v4
       with:
         path: my-path
         key: my-key
     # third-party actions that depend on official toolkit (99%) are supported as well
     - uses: ruby/setup-ruby@v1
       with:
         bundler-cache: true
  • BETA - Transparent S3-backed caching for Docker layers when using cache-to: type=gha / cache-from: type=gha. For now, the magic caching is only enabled with the extras=s3-cache job label.
jobs:
  look-ma-no-cache-config:
    runs-on: "runs-on=${{github.run_id}}/runner=2cpu-linux-x64/extras=s3-cache"
      # BEFORE
      - name: "Build and push image (explicit s3 config)"
        uses: docker/build-push-action@v4
        with:
          tags: test
          cache-from: type=s3,blobs_prefix=cache/docker-s3/,manifests_prefix=cache/docker-s3/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
          cache-to: type=s3,blobs_prefix=cache/docker-s3/,manifests_prefix=cache/docker-s3/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max

      # AFTER
      - name: "Build and push image (type=gha, automgically switched to S3)"
        uses: docker/build-push-action@v4
        with:
          tags: test
          cache-from: type=gha
          cache-to: type=gha,mode=max
  • Assign AmazonSSMManagedInstanceCore policy to EC2 instances, so that one can easily connect to the runner instance with SSM. Fixes #129.
AWS_PROFILE=YOUR_PROFILE aws ssm start-session --target INSTANCE_ID --reason "testing ssm"
  • Allow to inject additional environment variables from the preinstall step, by exposing a $GITHUB_ENV variable that you can write to. The variables will automatically be made available to the job steps. Fixes #188.
runners:
  preinstall-with-env:
    image: ubuntu22-full-arm64
    family: ["c7g"]
    preinstall: |
      echo "Adding a custom env var..."
      echo "MY_CUSTOM_VAR=my_custom_value" >> $GITHUB_ENV
  • Support preinstall for Windows runners.

  • Expose RunsOnServiceArn as output, so that one can use it to build the CloudWatch log paths. Fixes #184.

  • Do not send the cost allocation tag warning if the latest cost report was non-zero. Fixes #187.

  • Add Ec2LogRetentionInDays stack parameter. Fixes #189.

  • Allow to read license key from SSM. Fixes #176.


Details

Summary

New stack parameters and best practices compliance changes. No longer defaults to fetching global config when a local repo config is not found. Improve housekeeping to handle an additional AWS internal error case when launching an instance.

What's changed

  • Add parameter to enable/disable IPv6: Ipv6Enabled. Default is now false, which is a change from previous versions where IPv6 was always enabled. The reason for that is that it looks like docker pulls will go through IPv6 IPs, and for some reason they are getting rate-limited much faster than on IPv4. Will have to dig a bit deeper into that. Fixes #177.
  • Add parameter to disable the inbound SSH rule in the default security group for runners: SSHAllowed. Default is true. Fixes #174. Fixes #159.
  • Add VpcFlowLogRetentionInDays stack parameter. Fixes #180.
  • No longer defaults to fetching global config when a local repo config is not found. The current behaviour was a bit broken with the caching mechanism, and led to confusion. Let's make the behaviour explicit by requiring a local repo config file, with an explicit _extends directive. I understand this is a bit cumbersome if you have many repositories, but I think it's also nice to be able to inspect which repositories are inheriting from the global config. I'm introducing this change as part of a patch release because the current behaviour was already broken in v2.6.0.
  • Housekeeping: Detect AWS server issue that sometimes leaves instances in pending state, in which case RunsOn will terminate the current instance, and reschedule.
  • Enable versioning on all S3 buckets. Fixes #181.

Details

Summary

Auto-retry mechanism for spot interruptions, SingleAZ or MultiAZ NAT gateways, and more!

What's changed

  • Spot workflows are now retried once with an on-demand instance if interrupted. Fixes #160. Requires a permission update (write permission for Actions instead of read) for existing installations. You should receive an email with instructions after upgrading. Also add runs-on-workflow-job-interrupted=true to the instance tags if the spot instance was interrupted.
  • Add new label retry, with possible values retry=when-interrupted (default for spot), and retry=false to opt out of any auto-retry (useful for non-idempotent jobs). See the sketch after this list.
  • Add runs-on-workflow-job-id to the instance tags once the job has started. Also add it to prometheus metric labels.
  • Rename tag runs-on-job-started => runs-on-workflow-job-started
  • Allow to use 1 NAT gateway per AZ instead of a single one for all. Fixes #165.
  • Add optional VpcCidrSubnetBits, DefaultPermissionBoundaryArn, VpcFlowLogFormat, and VpcFlowLogS3BucketArn parameters, so that users can get more conformant stacks compared to their internal settings.
  • Set RunsOn env variables on Windows.
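A sketch of opting a non-idempotent job out of auto-retry with the new label (the label form is illustrative; adapt it to the syntax you already use):

jobs:
  deploy:
    # non-idempotent job: never auto-retry, even if the spot instance is interrupted
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,retry=false
    steps:
      - run: ./deploy.sh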

New Contributors

Details

Summary

Fix GitHub webhook custom_properties handling when non-string values.

Details

Summary

Revert x/time dependency to v0.6.0 since v0.7.0 introduced a breaking change for rate-limits when using a zero limit.

Details

Summary

Add Private=only mode, make EBS encryption opt-in, introduce disk label. Plus fixes and minor improvements.

Note: please use v2.5.8+ because this version embeds a dependency upgrade for the rate-limit library, which introduced a regression.

What's changed

  • Update github go library to fix issue with custom properties.
  • Make EBS encryption opt-in, and specify default encryption key (fixes #152).
  • Add Private=only mode for the CloudFormation stack, so that runners are forbidden to launch in a public subnet. Fixes #150.
  • Disable automatic public IP assignment in public subnets when Private=only is set for the stack (helps with conformance).
  • Remove HousekeepingEnabled stack parameter. Housekeeping is now always enabled.
  • No longer display EgressStaticIp in job logs since we don't know which one the runner will end up using.

Deprecations

  • Introduce the disk=default and disk=large labels to simplify disk size selection based on the runner volumes defined in the RunsOn CloudFormation stack. hdd is now deprecated and will be removed in the next non-patch version.

Details

Summary

Enable IPv6 for runners. Allow to specify multiple static IPs for the managed NAT gateway. Allow filtering images based on tags. A lot of changes (again) around GitHub rate-limit handling and housekeeping mechanism.

New features

  • Enable IPv6 for runners (fixes #142). An IPv6 address is attached for both public and private runners, with an egress-only IPv6 gateway (free) for private instances.
  • Allow to specify multiple static IPs for the managed NAT gateway (fixes #139). By default up to 2 are possible, and up to 8 when a quota increase is requested. This helps if you are launching a large number of runners in private subnets, and some external service rate-limits you based on the IP.
  • Allow filtering images based on a tag, in addition to the name wildcard (e.g. is-production-ready=true). Example:
# .github/runs-on.yml
images:
  custom:
    owner: "123456789"
    name: "my-org/my-image-name-*"
    arch: x64
    platform: linux
    tags:
      # filter with specific value
      is-production-ready: "true"
      # allow any value
      other-tag: "*"
  • Automatically bind-mount /var/lib/docker on the ephemeral instance storage, if any. Fixes #144.

Bug fixes

  • Escape shell special characters in env file values.
  • If a matching AMI cannot be found, do not retry and alert on first error.
  • Do not attempt to retry job if generated fleet params configuration is incorrect.
  • Abort early if workflow run status cannot be checked.

Fixes to avoid GitHub rate-limit issues

  • No longer attempt to reschedule jobs where a runner theft is suspected. Instead log a warning message telling users to make sure their jobs have unique enough labels. In some cases this was triggering useless reschedules due to GitHub not reflecting the job state quickly enough.
  • Fix too many GitHub calls when fetching repo config from an extends attribute (cache it).
  • No longer unregister runners from GitHub if API credit is lower than 2500. They will be removed by GitHub 24h later anyway.
  • Reorganize rate-limiters, increase DELAY_SECONDS_FOR_CHECK_BACK to 180s instead of 120s. Enable github rate-limiter, and set burst to the current number of remaining tokens.
  • Only attempt to finalize a job once at most. Instance will auto-terminate anyway so at worst we lose the job usage metrics in CloudWatch. But at least we don't eat into the GitHub / EC2 credits.
  • Set housekeeping and termination queue sizes to 1 to reduce their impact on GitHub API credits.

Details

Summary

Strengthen CF template configuration to better conform to AWS guidelines. Bug fixes.

What's changed

  • Verify that generated JIT token has at least one char.
  • Do not attempt to retry runner creation when we know the original request is invalid (e.g. invalid runner configuration due to mismatched labels etc.)
  • Strengthen CF template configuration to better conform to AWS guidelines.
  • Make sure empty admin values are ignored.
  • If no repository config found, cache the result for 1 minute to avoid hammering GitHub API.

Details

Summary

New ubuntu24 images, new housekeeping task to auto-restart instances that failed to launch, new always-on Private setting, additional runner details in logs, and more.

Notable changes (from v2.5.0 to v2.5.4)

  • Add ubuntu24 official images: ubuntu24-full-x64 and ubuntu24-full-arm64.
  • Private CloudFormation parameter now accepts always as value, in which case the runners will always launch in the private subnets by default (unless opt-out with private=false).
  • Display GitHub current rate-limits in logs (search for tokens).
  • Add 'Private' dimension to cloudwatch stats.
  • Add Environment, IsPrivate, and StaticIp (if IsPrivate) to runner details (in Setup job logs)
  • Increase frequency for spot interruption polling + add logs.
  • Conform to AWS spec when sanitizing custom tags (key and value). Fixes #125.
  • Add housekeeping task to handle edge cases where a job is still seen as queued by GitHub after a few minutes even after an instance has been launched.
  • Allow to disable new housekeeping mechanism.
  • Properly tag instance volumes with cost allocation tag. Cost report email will likely go up.
  • Display app_environment and app_stack_name in logs.
  • Attempt to fix rare preinstall issue ending up with "text file busy".
  • Unregister runner from GitHub when job is completed (i.e. do not wait for auto-expiration since it does not seem that reliable).

Experimental

  • Bring back support for single string label, using / as the separator instead of ,. e.g. runs-on: runs-on/runner=2cpu-linux-x64/other=tag will work. This simplifies passing a runs-on specification as input to dependent workflows. If you have multiple RunsOn stacks, make sure they are all upgraded to this version before using this new syntax in workflows.

Internal

  • Fix issue with private attribute not being properly loaded from the repository configuration file.
  • Switch to semaphores for processing the 3 queues.
  • Check workflow run status before scheduling job.
  • Add termination queue.
  • Update GitHub App (for new installations) to listen for workflow_run events (not used yet, but will be soon).
  • Upgrade default runner version when no runner is preinstalled.

Details

Summary

Summary: refactor rate-limits, fix housekeeping behaviour, add missing cost allocation tags, fix rare preinstall bug, unregister runner from GitHub after job termination.

What's changed

  • Refactor rate-limits, add proper github rate limiter (defaults to 5000 req/h max). Might introduce a stack parameter if it's too low for some users on GitHub Enterprise plans.
  • Reduce concurrency of housekeeping queue, since it's not high priority.
  • Attempt to fix rare preinstall issue ending up with "text file busy".
  • Upgrade default runner version when no runner is preinstalled.
  • Display app_environment and app_stack_name in logs.
  • Properly tag instance volumes with cost allocation tag. Cost report email will likely go up.
  • Unregister runner from GitHub when job is completed (i.e. do not wait for auto-expiration since it does not seem that reliable).

Experimental

  • Bring back support for single string label, using / as the separator instead of ,. e.g. runs-on: runs-on/runner=2cpu-linux-x64/other=tag will work. This simplifies passing a runs-on specification as input to dependent workflows. If you have multiple RunsOn stacks, make sure they are all upgraded to this version before using this new syntax in workflows.

Details

Summary

New ubuntu24 images, additional runner details in logs, scheduling retry mechanism if internal AWS server error when launching, and more.

Note: DO NOT USE this release. The new housekeeping behaviour is not working as expected.

What's Changed

  • Add 'Private' dimension to cloudwatch stats
  • Private CloudFormation parameter now accepts always as value, in which case the runners will always launch in the private subnets by default (unless opt-out with private=false).
  • Fix issue with private attribute not being properly loaded from config file.
  • Add Environment, IsPrivate, and StaticIp (if IsPrivate) to runner details (in Setup job logs)
  • Add ubuntu24 official images
  • Increase frequency for spot interruption polling + add logs.
  • Conform to AWS spec when sanitizing custom tags (key and value). Fixes #125.
  • Add housekeeping task to handle edge cases where a job is still seen as queued by GitHub after a few minutes even after an instance has been launched:
// Cases:
//   - instance is terminated and doesn't have the `runs-on-job-started` tag (due to spot interruption, AWS EC2 error).
//     In this case, we need to launch a new instance, so we reschedule the runner.
//   - instance is running and has the `runs-on-job-started` tag, which means the runner was stolen by another workflow job.
//     In this case, we need to launch a new instance, so we reschedule the runner.

Details

Summary

Summary: Allow to assign an environment name to each RunsOn stack. Allow to specify VPC CIDR block and export outputs to facilitate VPC peering connections. Allow to set custom tags on instances.

Potentially breaking changes

If you have set the Private parameter to true in the CloudFormation template, the behaviour has changed:

  • The stack will now create only 1 managed NAT gateway (instead of 3) when enabling Private mode, to save on costs.
  • Also, runners will be launched in the private subnets only if the label private=true is present in the runs-on: definition. This way, runners will launch in the public subnets by default, and you can selectively use the private subnet (to get the egress static IP) for specific workflows. This saves on NAT bandwidth costs since most workflows don't need static IP.
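For example, a job that needs the egress static IP can opt into the private subnets (a sketch; the label form is illustrative and should be adapted to the syntax of your RunsOn version):

jobs:
  needs-static-ip:
    # launches in the private subnets, egressing through the NAT gateway static IP
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,private=true
    steps:
      - run: curl https://ifconfig.me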

Features

  • Allow assigning an environment name to a RunsOn stack (default production), which can then be targeted by using the env label in the workflow. This allows setting up multiple isolated RunsOn stacks to handle environments such as staging etc. with different IAM permissions or configurations. Fixes #120.
  • Allow specifying a custom VPC CIDR block when creating the stack. This helps if you plan on establishing VPC peering connections with your RunsOn runners. Note that updating this parameter for existing stacks is not recommended. You should create a new stack instead, and remove the old one. Fixes #114.
  • Provide a CloudFormation template to facilitate the establishment of a VPC peering connection between RunsOn's VPC, and a destination VPC.
  • Allow setting custom tags on the instances launched by RunsOn (RunnerCustomTags CloudFormation parameter). Fixes #119.

Details

Summary

Summary: beta windows support, prometheus metrics, disk statistics in workflow logs.

New features

  • Prometheus metrics export, every minute, at /metrics (authenticated with Basic Auth and a new ServerPassword CloudFormation parameter).

    • runs_on_ec2_instances_total, across various labels: image_id, az, instance_type, instance_lifecycle, instance_state, repo_full_name, runner_id, workflow_job_name, workflow_job_started, workflow_name.
    • runs_on_cloudtrail_events_total, across labels event_name, for CreateFleet, RunInstances, and BidEvictedEvent events.
    scrape_configs:
    - job_name: "runs_on"
      metrics_path: /metrics
      scheme: https
      basic_auth:
        username: admin
        password: YOUR_SERVER_PASSWORD
      scrape_interval: 60s
      static_configs:
        - targets: ["APPRUNNER_ID.APPRUNNER_ZONE.awsapprunner.com"]
    
  • Windows support (x64 only for now), with a base image: image=windows22-base-x64. Example. For now, dependencies have to be installed in your workflow steps, or you need to build a custom AMI based on an official Windows 2022 AMI. Current boot time is ~2 min; this will get better. See the workflow sketch after this list.

  • Display disk details in the 'Runner Instance' log group (screenshots for both Linux and Windows runners).
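A workflow sketch for a Windows job using the base image (the label form and the install script are illustrative; dependencies are installed in the workflow since the base image ships with few preinstalled tools):

jobs:
  windows-job:
    runs-on: runs-on=${{ github.run_id }},image=windows22-base-x64
    steps:
      - uses: actions/checkout@v4
      # install your dependencies as part of the workflow for now (script is hypothetical)
      - run: .\scripts\install-deps.ps1
        shell: pwsh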

Misc

  • Agent rewrite, to handle multi-platform (Windows, see above).
  • Windows EC2Launch logs are available at C:\runs-on\output.log.
  • Ability to specify an alternative public ECR registry for the RunsOn docker image.

Details

Summary

Summary: a fix for useless creation of instances when hitting quota errors, reverting the unbounded cpu and ram change (from v2.3.0), and CloudWatch agent now streams instance logs into CloudWatch.

Features

  • The change introduced in v2.3.0 expanded the instance choice by allowing instances with more CPUs and RAM than specified to be included. This has been reverted to avoid confusion, and to avoid hitting quota limits more frequently. Instead, RunsOn will take the lowest and highest value from the cpu and ram definitions, and set those as min and max values when requesting an instance. If you want to keep the behaviour introduced in v2.3.0, you can now simply do e.g. cpu=4+256 and it will evaluate all instances with CPUs from 4 to 256. You no longer need to set multiple values like cpu=4+8+16+32..., since only the min and max values will be used. As another example, setting cpu=4 will only include instances with 4 CPUs, as was the case before v2.3.0.

  • Automatically send cloud-init logs to CloudWatch. Should help a lot with knowing what happened on an instance in case it terminated early. Currently ships /var/log/cloud-init-output.log, /var/log/syslog, and /var/log/cloudwatch-agent.log. Retention set to 7 days. Requires the CloudWatch agent to be installed on the base AMI (amazon-cloudwatch-agent-ctl must be in the PATH).

  • New CloudFormation parameter to enable/disable detailed monitoring for EC2 instances (default: false).

  • Add job_url to all log messages.

  • Add runs-on-workflow-run-id tag on instance, when job has started.

  • All instances will now get a Name assigned when the instance starts processing a job from GitHub. Quite useful to monitor at a glance in EC2 UI which instances have started processing jobs.


Fixes

  • CreateFleet API can sometimes return an instance, even if errors are present in the response. Checking this fixes an issue that was creating more instances than necessary when hitting e.g. quota errors.
  • Ensure runner waits up to 10s until all tags have been set on the instance before shutting down.

Misc

  • Always prepend preinstall script with #!/bin/bash -e, and make RunsOn environment variables accessible.

Details

Summary

Summary: Auto-mounting of ephemeral disks, improvements in dangling instance cleanup, better handling of preinstall.

Features

  • Local NVMe disks (if any) are now automatically arranged in a RAID0 array, and automatically mounted as the workspace folder for the workflow job (i.e. at /home/runner/_work).
  • Add server-side check and cleanup of dangling instances. If an instance has not been tagged with a job name within 15 minutes of its launch, it will be force-terminated by the RunsOn server. This complements the watchdog of 10 minutes on the agent side, in case the agent cannot properly launch.
  • Can now define preinstall within a custom runner definition. This will override any existing preinstall from the image.
  • Abort the job if preinstall failed, and display its output in the log output of the "Set up runner" step in the GitHub UI.

  • Automatically install the latest version of the runner agent, when using custom images not based on the official images provided by RunsOn.

Fixes

  • No longer include bare metal instances by default. They are now included only if one of the family types includes metal in its name.
  • Reset instance creation timeout when falling back to on-demand pricing.
  • Fix ephemeral disk mounts.
  • Display preinstall output in /var/log/cloud-init-output.log, in addition to logging it in the job log output on GitHub.

Misc

  • Internal refactoring for server and agent code.

Details

Summary

Summary: Allow setting custom spot allocation strategy, cpu and ram behaviour change, config file is now read from the current branch for private repositories. And 2 new regions!

Potentially breaking changes

  1. Spot allocation strategy now defaults to price-capacity-optimized instead of capacity-optimized, which should bring even better cost savings while still ensuring low spot interruption percentage. The downside of that strategy might be a higher likelihood of interruption, but you can now override the strategy (see next section). Also, the next change below might reduce interruption likelihood by automatically expanding the instance pools that EC2 chooses from.

  2. No longer specify any max for RAM or CPU when requesting an instance, so that we may get a beefier instance if the spot allocation strategy prioritises it. This could be due to a lower price, or to a lower likelihood of interruption. This means you no longer need to set ram=2+4+8+16+... since ram=2 will automatically include 2+ GB instances (if you set multiple values, all values except the first one will be ignored). Same for CPU.

Note that those two changes might be reverted if many users report increased issues with spot interruptions.

Features

  • New regions: Ohio (us-east-2), and Singapore (ap-southeast-1).
  • Can now override the default spot allocation strategy, using either the full strategy name (e.g. spot=lowest-price), or its initials (e.g. spot=lp). Supported allocation strategies: price-capacity-optimized, lowest-price, capacity-optimized.
  • Automatically mount locally-attached SSD disks if any (for instance types ending with the d suffix). Very useful if you require large disk sizes with the fastest speed.
  • Add new tags runs-on-workflow-name and runs-on-workflow-job-name to the runner instance, once the job has been scheduled on the instance (good for cost allocation, troubleshooting, etc.).
  • For private repositories, the configuration file will now be read from the current branch.

Fixes

  • For non-official images, setup runner user earlier, so that SSH keys can be properly added to that user.
  • Set environment variable RUNNER_TOOL_CACHE to /opt/hostedtoolcache, since some third-party actions have this value hardcoded. This is the default value on official runners as well.

Misc

  • Send instance timings to telemetry API. This will allow better tracking of boot times across all users and regions.