
Changelog

Details

Summary

  • Fix S3 rate-limit initialization and restore correct values.
  • Improve Slack webhook templates by @cfsnate.
  • Pool environment field renamed to env (for consistency with the naming in runner labels). environment is still supported but will be removed in the next minor release. Also, if no env is specified, it defaults to production.
  • Fix dependabot handling.
  • Fix regression for /var/lib/docker bind mount that was active even when no ephemeral disk is present (would cause issues if you had embedded docker images on a custom AMI).
  • Try to circumvent a GitHub bug where in rare occurrences, a workflow_job webhook is received with empty runs-on: labels (labels: []). In that case, manually refresh the job details from GitHub API before proceeding.
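The empty-labels workaround in the last item can be pictured roughly as follows. This is only a sketch under assumptions, not the RunsOn code: `effective_labels` and the injected `fetch_job` callable are hypothetical names, and `fetch_job` stands in for a call to GitHub's "get a job for a workflow run" REST endpoint.

```python
def effective_labels(payload, fetch_job):
    """Return the job labels, refetching the job when the webhook carries none."""
    job = payload["workflow_job"]
    if job.get("labels"):
        return job["labels"]
    # Empty labels: ask the API for a fresh copy of the job.
    # fetch_job is a hypothetical callable (repo_full_name, job_id) -> job dict.
    refreshed = fetch_job(payload["repository"]["full_name"], job["id"])
    return refreshed.get("labels", [])


def fake_fetch(repo_full_name, job_id):
    # Stand-in for the real API call, used only for this illustration.
    return {"id": job_id, "labels": ["runs-on=1", "runner=2cpu-linux-x64"]}


broken = {"workflow_job": {"id": 7, "labels": []}, "repository": {"full_name": "acme/api"}}
healthy = {"workflow_job": {"id": 8, "labels": ["runner=4cpu-linux-x64"]}, "repository": {"full_name": "acme/api"}}

print(effective_labels(broken, fake_fetch))
print(effective_labels(healthy, fake_fetch))
```

Injecting the fetcher keeps the decision logic testable without network access.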

Details

Summary

Note 2025-10-21: please use v2.9.2 instead, since it includes important fixes.

  • Fix pool overflow not picking the correct runner and image spec.
  • Warm and hot pools are available for Windows as well!
  • Fix github rate-limit refresh.

Details

Summary

This is a large release, with many internal and external changes. Please review the first section below carefully.

Note 2025-10-17: please use v2.9.2 instead, since it includes important fixes.

Potentially breaking changes

  • Update default linux image to ubuntu24. Set image=ubuntu22-full-x64 or image=ubuntu22-full-arm64 if you want to keep using Ubuntu22.
  • When a job is canceled due to a spot interruption, all the failed jobs from that failed workflow will get retried, instead of only the first job interrupted. Fixes #192.
  • Prometheus metrics endpoint removed, along with the ServerPassword stack parameter. RunsOn now ships with OTEL integration. Fixes #322. The /metrics endpoint anyway had a long-standing issue where some prometheus scrapers were unable to reach the AppRunner endpoint due to how the Envoy proxy from AWS handles requests.
  • disk=large and disk=default labels are deprecated. If present, they will be automatically translated into the new volume label, but once you have adopted v2.9.0, you should update your workflows for future upgrades.
  • RunnerLargeDiskDeviceName and RunnerDefaultDiskDeviceName are removed (now always use the AMI root volume device name).
  • Add stack parameter RunnerConfigAutoExtendsFrom to always force a specific value for repository configuration _extends directive (even if no local config file exists). Fixes #366. Note that it defaults to .github-private, meaning that if you leave that default, RunsOn will always attempt to load the config file from that repo as a base configuration. Set it to . (only extend from current repo _extends directive) to keep the previous behavior. This has been a long requested feature (and source of confusion) and is what new users expect, so this is why the breaking setting is enabled by default.
  • Custom tag precedence is now: stack custom runner tags < custom runner tags < repository custom runner tags. This lets you set default tags at the stack level, which can be overridden by runner-level tags, while repo-level tags (if set) always take precedence, so that repo admins can control the final tag value when needed.
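The tag precedence order in the last item can be pictured as a layered dictionary merge where later layers win. A minimal sketch, assuming tags are plain key/value maps; `merge_tags` and the sample tags are hypothetical and this is not RunsOn's actual implementation:

```python
def merge_tags(stack_tags, runner_tags, repo_tags):
    """Merge tag layers; later layers override earlier ones."""
    merged = {}
    # Lowest to highest precedence: stack < runner < repository.
    for layer in (stack_tags, runner_tags, repo_tags):
        merged.update(layer)
    return merged


stack = {"team": "platform", "cost-center": "1000"}
runner = {"cost-center": "2000", "tier": "ci"}
repo = {"cost-center": "3000"}

print(merge_tags(stack, runner, repo))
```

Here `cost-center` ends up with the repository value, `tier` comes from the runner, and `team` falls through from the stack defaults.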

Deprecations (please send feedback!)

  • disk label support is going to be removed in the next minor version (replaced by volume label, which is much more flexible).
  • RunnerLargeVolumeThroughput and RunnerLargeDiskSize are deprecated stack parameters.

Also, now that external networking is supported, next version will just set sane defaults for some VPC features when using embedded networking. As such, those parameters will be removed:

  • VpcFlowLogFormat: [DEPRECATED, use external networking if you need to fine-tune this].
  • VpcFlowLogS3BucketArn: [DEPRECATED, use external networking if you need to fine-tune this]
  • VpcFlowLogRetentionInDays: [DEPRECATED, use external networking if you need to fine-tune this]
  • VpcCidrSubnetBits: [DEPRECATED, use external networking if you need to fine-tune this]

Then, DefaultAdmins will be removed, since it's just better for admin-level people to use SSM to log into the runners if needed:

  • DefaultAdmins: [DEPRECATED, prefer to use SSM for admin access].

Finally, I don't think ECInstanceDetailedMonitoring is useful since default cloudwatch metrics are useless anyway, and you're better off using the new runs-on/action metrics.

Now that native Slack webhook integration is supported, I believe we can also remove the AlertTopicSubscriptionHttpsEndpoint parameter, which was originally introduced for that use case (but required an adapter in between). Please reach out if you think we should keep it.

Warm pools (BETA)

RunsOn can now operate pools of stopped or hot instances, which means pick-up times will be improved.

See https://github.com/runs-on/runs-on/blob/main/adrs/20250727-warm-pools.md for all details.

Volume overrides

Runners can now override the default volume settings directly within labels or through the configuration file.

Examples:

# full
runs-on: runner=2cpu-linux-x64/volume=gp3:80g:125mbps:3000iops
# partial
runs-on: runner=2cpu-linux-x64/volume=80g:250mbps
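
To make the label format above concrete, here is a rough sketch of how such a volume spec could be split into its fields. It assumes each field is recognised by its unit suffix and that anything else is the volume type; `parse_volume` and the dictionary keys are hypothetical, and RunsOn's real parser may behave differently.

```python
import re

# A numeric value followed by one of the unit suffixes seen in the examples.
_FIELD = re.compile(r"^(\d+)(g|mbps|iops)$")
_NAMES = {"g": "size_g", "mbps": "throughput_mbps", "iops": "iops"}


def parse_volume(spec):
    """Parse a volume spec such as 'gp3:80g:125mbps:3000iops' into a dict."""
    settings = {}
    for token in spec.split(":"):
        match = _FIELD.match(token)
        if match:
            value, unit = match.groups()
            settings[_NAMES[unit]] = int(value)
        else:
            # Assumption: a token without a numeric unit suffix is the volume type.
            settings["type"] = token
    return settings


print(parse_volume("gp3:80g:125mbps:3000iops"))
print(parse_volume("80g:250mbps"))
```

In the partial form, fields that are left out are simply absent from the result, which matches the idea that they keep their default values.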

GitHub webhook redelivery on failures

RunsOn now ships with a background job to check (every 5min) for failed webhook deliveries from the github side. If it finds some (matching the current stack labels), it will attempt to redeliver them once. This is especially useful under very high load, as AppRunner can sometimes rate-limit incoming webhooks when GitHub sends a burst of webhooks all at once.

You'll get alerted (over SNS, Slack, etc.) if failed webhooks have been redelivered. Cloudwatch dashboard also has a widget showing recent runs and redeliveries (if any).
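
The "redeliver once" rule can be sketched as a filter over a list of webhook deliveries. This is an illustration under assumptions: the field names (`id`, `guid`, `redelivery`, `status_code`) follow my understanding of GitHub's webhook delivery listing, `deliveries_to_retry` is a hypothetical helper, and the matching against stack labels described above is left out.

```python
def deliveries_to_retry(deliveries):
    """Return IDs of failed deliveries that have not been redelivered yet."""
    # Assumption: a redelivery shares its guid with the original delivery.
    already_retried = {d["guid"] for d in deliveries if d["redelivery"]}
    return [
        d["id"]
        for d in deliveries
        if not d["redelivery"]
        and not 200 <= d["status_code"] < 300
        and d["guid"] not in already_retried
    ]


deliveries = [
    {"id": 1, "guid": "a", "redelivery": False, "status_code": 502},
    {"id": 2, "guid": "b", "redelivery": False, "status_code": 200},
    {"id": 3, "guid": "c", "redelivery": False, "status_code": 0},
    {"id": 4, "guid": "c", "redelivery": True, "status_code": 200},
]

print(deliveries_to_retry(deliveries))
```

Only delivery 1 is selected: delivery 2 succeeded, and delivery 3 already has a redelivery attempt (delivery 4).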

Runner details

  • Add original labels
  • Add pool details (if any)

Slack integration

Can now define a slack webhook URL (AlertTopicSlackWebhookUrl stack parameter), so that alerts also get sent there.

OTEL integration

Can now pass OTEL endpoint and headers (OtelExporterOtlpEndpoint, OtelExporterOtlpHeader). Only HTTP transport enabled for now. Metrics will be shipped there. Example dashboard below using Signoz:

OTEL dashboard
Details of all new metrics and logs
Job Metrics
runs_on_jobs_total (Counter)

Total number of jobs by status.

Attributes:

  • status: Job status (queued, scheduled, in_progress, completed)
  • repo_full_name: Repository full name (e.g., owner/repo)
  • workflow_name: GitHub workflow name
  • instance_type: EC2 instance type (e.g., t3.medium) (optional, only for scheduled status)
  • instance_lifecycle: Instance lifecycle (spot or on-demand) (optional, only for scheduled status)
  • pool_name: Pool name if scheduled from a pool (optional, only when scheduled via pool)
  • interrupted: Whether the job was interrupted (bool) (optional, only when true)
  • org: GitHub organization name
  • installation_id: GitHub App installation ID
  • stack_name: Stack name (optional, when provided in JobEvent)
  • region: AWS region (optional, when provided in JobEvent)
  • conclusion: Job conclusion for completed status (success, failure, cancelled, skipped)

Examples:

# Scheduled status (has instance_type and instance_lifecycle)
runs_on_jobs_total{status="scheduled",repo_full_name="acme/api",workflow_name="CI",instance_type="t3.medium",instance_lifecycle="spot",pool_name="default",org="acme",installation_id=12345,stack_name="runs-on-prod",region="us-east-1"} 42

# Completed status (no instance_type or instance_lifecycle)
runs_on_jobs_total{status="completed",repo_full_name="acme/api",workflow_name="CI",pool_name="default",conclusion="success",org="acme",installation_id=12345,stack_name="runs-on-prod",region="us-east-1"} 42
runs_on_internal_queue_duration_seconds (Histogram)

Time from job queued in RunsOn to scheduled (internal queue time). This measures how long the job spends in RunsOn's internal queue before an instance is scheduled.

Attributes: Same as runs_on_jobs_total

Buckets: Default OTEL histogram buckets

runs_on_overall_queue_duration_seconds (Histogram)

Time from job queued by GitHub to started (overall queue time). This measures the total time from when GitHub queues the job to when it actually starts running, including instance launch and runner bootstrap.

Attributes: Same as runs_on_jobs_total

Buckets: Default OTEL histogram buckets

runs_on_job_duration_seconds (Histogram)

Time from job started to completed.

Attributes: Same as runs_on_jobs_total

Buckets: Default OTEL histogram buckets

Pool Metrics
runs_on_pool_instances_total (Observable Gauge)

Number of pool instances by state. This is a pull-based metric that reports current state.

Attributes:

  • pool_name: Pool name
  • state: Instance state (running, stopped, pending, terminating)
  • installation_id: GitHub App installation ID
  • org: GitHub organization name

Example:

runs_on_pool_instances_total{pool_name="default",state="running",installation_id=12345,org="acme"} 5
runs_on_pool_instances_total{pool_name="default",state="stopped",installation_id=12345,org="acme"} 10
Rate Limiter Metrics
runs_on_rate_limiter_tokens (Observable Gauge)

Available tokens in rate limiter. This is a pull-based metric that reports current state.

Attributes:

  • limiter: Rate limiter name (e.g., github_api, ec2_api)

Example:

runs_on_rate_limiter_tokens{limiter="github_api"} 4500.5
runs_on_rate_limiter_burst (Observable Gauge)

Burst capacity of rate limiter. This is a pull-based metric that reports current state.

Attributes:

  • limiter: Rate limiter name

Example:

runs_on_rate_limiter_burst{limiter="github_api"} 5000
Spot Circuit Breaker Metrics
runs_on_spot_circuit_breaker_active (Observable Gauge)

Whether spot circuit breaker is currently active. This is a pull-based metric that reports current state.

Values:

  • 1: Circuit breaker is active (spot instances disabled)
  • 0: Circuit breaker is inactive (spot instances enabled)

Example:

runs_on_spot_circuit_breaker_active{} 0
Go Runtime Metrics

The metrics package automatically instruments Go runtime metrics via go.opentelemetry.io/contrib/instrumentation/runtime:

  • process.runtime.go.mem.heap_alloc
  • process.runtime.go.mem.heap_idle
  • process.runtime.go.mem.heap_inuse
  • process.runtime.go.gc.count
  • process.runtime.go.goroutines.count
  • And more...

These metrics include the standard service.name="runs-on-server" attribute.

Resource Attributes

All metrics include these resource attributes:

Attribute        | Description                           | Example
service.name     | Service name (always runs-on-server)  | runs-on-server
app.version      | Application version (if configured)   | v2.9.0
app.environment  | Environment name (if configured)      | production
stack_name       | Stack name (if configured)            | runs-on-prod
region           | AWS region (if configured)            | us-east-1

Structured Logs

The metrics package emits periodic structured logs (JSON) containing snapshots of all metrics.

Log Types
Job Summary (metric_type=jobs_summary)

Cumulative job counts since server start.

{
  "metric_type": "jobs_summary",
  "queued": 1234,
  "scheduled": 1200,
  "in_progress": 34,
  "completed": 1150,
  "interrupted": 16
}

Note: The interrupted counter tracks jobs that were interrupted (e.g., by spot interruptions), but jobs are recorded with their final status (e.g., completed) and the interrupted attribute set to true.

Job Event (metric_type=job_event)

Individual job lifecycle events (emitted immediately, not periodic).

{
  "metric_type": "job_event",
  "status": "completed",
  "conclusion": "success",
  "repo_full_name": "acme/api",
  "workflow_name": "CI",
  "instance_type": "t3.medium",
  "instance_lifecycle": "spot",
  "pool_name": "default",
  "interrupted": true,
  "internal_queue_duration_seconds": 12.5,
  "overall_queue_duration_seconds": 45.2,
  "job_duration_seconds": 180.3
}

Note: instance_type, instance_lifecycle, pool_name, and interrupted fields are only included when available/applicable.

Pool Instances (metric_type=pool_instances)

Current pool instance counts by state.

{
  "metric_type": "pool_instances",
  "installation_id": 12345,
  "org": "acme",
  "pool_name": "default",
  "running": 5,
  "stopped": 10,
  "pending": 2
}
Rate Limiter (metric_type=rate_limiter)

Current rate limiter state.

{
  "metric_type": "rate_limiter",
  "limiter": "github_api",
  "tokens": 4500.5,
  "burst": 5000
}
Spot Circuit Breaker (metric_type=spot_circuit_breaker)

Current circuit breaker state.

{
  "metric_type": "spot_circuit_breaker",
  "active": false,
  "interruption_count": 42
}
Spot Interruption (metric_type=spot_interruption)

Individual spot interruption events (emitted immediately, not periodic).

{
  "metric_type": "spot_interruption",
  "interruption_time": "2025-10-10T14:30:00Z",
  "trip_count": 3,
  "recovery_minutes": 15,
  "circuit_breaker_active": false,
  "active_until": "2025-10-10T14:45:00Z",
  "instance_id": "i-1234567890abcdef0",
  "job_id": "987654321",
  "job_name": "build",
  "job_url": "https://github.com/owner/repo/actions/runs/123456789/job/987654321",
  "repo_full_name": "owner/repo"
}

Note: active_until is only included when the circuit breaker is active. Job details (instance_id, job_id, job_name, job_url, repo_full_name) are only included when available.
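
Because every structured log record carries a metric_type field, the records are easy to filter once exported. A small sketch, assuming one JSON object per log line; `records_of_type` and the sample lines are illustrative only, and how you export the logs is up to your setup.

```python
import json


def records_of_type(lines, metric_type):
    """Yield parsed JSON log records with the given metric_type, skipping other lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if isinstance(record, dict) and record.get("metric_type") == metric_type:
            yield record


sample = [
    '{"metric_type": "job_event", "status": "completed", "interrupted": true, "job_duration_seconds": 180.3}',
    '{"metric_type": "rate_limiter", "limiter": "github_api", "tokens": 4500.5, "burst": 5000}',
    "plain text line",
    '{"metric_type": "job_event", "status": "completed", "job_duration_seconds": 60.0}',
]

events = list(records_of_type(sample, "job_event"))
# The interrupted field is only present when true, so use .get().
interrupted = [e for e in events if e.get("interrupted")]
print(len(events), len(interrupted))
```

Using `.get()` for optional fields matters here, since several fields are only included when applicable.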

Pre/Post custom job hooks

You can now launch custom scripts within the "Set up runner" and "Complete runner" sections of a workflow. If /runs-on/pre.custom.sh or /runs-on/post.custom.sh scripts are found, the RunsOn agent will execute them in their respective job section. They are executed after the RunsOn-specific scripts, and RunsOn will fail the step if those custom scripts fail. See https://docs.github.com/en/actions/how-tos/manage-runners/self-hosted-runners/run-scripts for more details.

Misc

  • Improve failure message when invalid runner spec (missing family). Fixes #343.
  • Fix permission issue with Docker and ECR login in preinstall scripts on instances with local disks. Fixes #362.
  • Auto-resize Windows disks. Fixes #369.
  • Properly disable all ipv4 public addresses whenever launching in private subnets. Previously this was only done when Private=only stack parameter was set, leading to increased costs when running mixed networking mode (public + private runners allowed) stacks.
  • When an instance receives a spot interruption warning, let AWS perform the termination so that we don't get billed if runtime was <1h. Fixes #365.
  • Surface job error after all schedule attempts have been exhausted. Fixes #357.
  • Fix SSH setup issues on AlmaLinux images. #330.

Details

Summary

Details

Summary

Increase Docker ECR setup timeout to 2min (previously 20s, but could lead to authentication errors).

Details

Summary

Small fix for the CloudFormation template, when using the external networking stack and not passing any public (or private) subnet IDs.

Details

Summary

Bug fixes, and first iteration on integrated CloudWatch dashboard for people managing the stack.

What's changed

  • Better handling of environment variable display in the "Set up runner" step. Fixes #325.
  • Allow ExternalVpcPublicSubnetIds to be left empty when using Private=only mode.
  • Cleanup delete markers and aborted multipart uploads. Fixes #329.
  • Lower MinValue for disk size to 10GB. Fixes #336.
  • Reformat error and cost report subjects, limit to max 100 chars. Fixes #340.
  • Windows: make user-data run on every reboot
  • Add stack parameter EnableDashboard (default: false) to allow creation of CloudWatch dashboard
  • Properly override platform and arch based on retrieved image details (if ami id is provided). Previously you could have windows images getting the linux user-data script if you were just providing the AMI ID (i.e. not using an image spec definition in the config file).
  • Make windows agent resilient to already existing runner user.
  • Do not retry terminating a job if invalid instance id given.
  • Update dependencies.

Beta: integrated CloudWatch dashboard

You can now enable the creation of a CloudWatch dashboard. This is early days, but it can already display widgets for:

  • total runners scheduled for current period
  • runners scheduled over time
  • status of ec2 rate limiters + github api tokens left
  • last 20 error messages for current period (can expand)

Details

Summary

Fix buildkit gha exporter, better user error reporting, AppRunner VPC connector integration.

What's changed

  • Report non-retryable user errors directly in GitHub: whenever a job can't be started for user reasons (e.g. bad image, bad runner definition, etc.), RunsOn will now spawn a default runner that will fail at the "Set up runner" step, with an error message explaining why. This will help surface issues. Fixes #307.

  • Fix issue with type=gha buildkit exporter for docker layers. Fixes #328.

  • Automatically enable the AppRunner VPC connector when Private mode is active, so that all AppRunner egress traffic (for the RunsOn orchestrator) goes through the private subnet(s) NAT gateways or equivalent. This means the AppRunner service will use the same static IP(s) as the runners, so that you can whitelist the AppRunner service on your GHES or GitHub Enterprise installation if needed. All ingress traffic is still publicly allowed and handled by AWS.

  • Telemetry: send values for networking_stack (embedded or external), and extras. Will help better understand how RunsOn is setup and which extra features are most used.

Details

Summary

Integrated CPU/Memory/Disk/Network monitoring, integrated job-level cost reporting, official snapshot action release, and many QoL improvements.

Spotlight: Monitoring improvements

  • Allow to send metrics to CWAgent namespace. This allows runs-on/action@v2 to send and graph metrics right within your job output. For instance:
      📊 Disk Writes:
         5973 ┤                ╭╮
         5500 ┤               ╭╯╰╮
         5028 ┼╮             ╭╯  ╰╮                   ╭───╮              ╭
         4555 ┤╰╮           ╭╯    ╰╮               ╭──╯   ╰─╮          ╭─╯
         4083 ┤ ╰╮         ╭╯      ╰─╮          ╭──╯        ╰╮       ╭─╯
         3610 ┤  ╰─╮      ╭╯         ╰╮      ╭──╯            ╰─╮   ╭─╯
         3138 ┤    ╰╮    ╭╯           ╰╮ ╭───╯                 ╰───╯
         2665 ┤     ╰╮  ╭╯             ╰─╯
         2193 ┤      ╰──╯
                                 Disk Writes (Ops/s)
      Stats: min:2040.0 avg:4180.9 max:6026.0 Ops/s
  • Create resource group for EC2 instances on CloudWatch. This means you can go to the CloudWatch EC2 Automatic dashboard, select your resource group (named after your RunsOn stack) and get a high-level overview of metrics for all your runner instances.

  • Allow instance role to enable detailed monitoring on demand (not used for now, but might become an option of runs-on/action).

Spotlight: Costs computation

  • The runs-on/action@v2 now automatically computes the costs associated for each job, and displays the results right within your job logs. You can also choose to display them as a job summary.

Spotlight: Block-level snapshots

  • The runs-on/snapshot@v1 action is available and can be used to save and restore entire folders between job executions, at a much faster speed (for long jobs) than other methods relying on compression and export to S3 or other.

Stack improvements

  • Allow to override the max runner time limit. Fixes #320.
  • Add support for permission boundary for SchedulerInvokeRole. Fixes #315.
  • Add AWS account-id and region to emails. Fixes #298.
  • Remove explicit PublicAccessBlockConfiguration declaration since some SCP policies can incorrectly flag the s3:PutBucketPublicAccessBlock action. This is the default for new buckets anyway.
  • Add ECR Full Access managed policy to obtain higher ECR Public Rate Limits.
  • Make bootstrapping work on more distros.
  • Allow RunsOn to auto-create spot role if absent.
  • Add health checks for email notification subscription, and ec2 spot role.

Misc

  • Remove legacy (and unused) .env loading.
  • Update go to 1.24.

Details

Summary

Huge improvements to tagging, magic cache is now even faster, and bug fix for jobs tied to environments with no approval required.

What's changed

QoL improvements
  • Check if the spot role exists before starting RunsOn service (preflight check 2 from the installation guide). If not, alert the user over the SNS topic.

  • Cleanup all dangling instances, irrespective of RunsOn version.

  • Rewrote magic cache to more efficiently stream uploads when actions/cache is the client. For bigger (>1GiB) payloads, there should be a very noticeable improvement.

  • If SSHAllowed is set to false at the stack level, discard any ssh=true value coming from label or repo config. Fixes #310.

Improvements to tagging
  • Pass all custom tags to volumes, in addition to instances, when creating the runner. Fixes #264.

  • Allow to set additional custom tags using a custom property in the GitHub settings of a repository. If a custom property with name runs-on-custom-tags exists, RunsOn will parse it in the same way as the stack-level custom tags, and apply them to the instance and volumes. Fixes #297.

    For instance: if the value for property runs-on-custom-tags is set to key1=val1,key2=val2 then instances and volumes will get 2 new tags (key1, key2) with their corresponding values.

    The same restrictions as for stack-level tags apply. Stack-level tags take precedence over tags set in the custom property, and tags set in custom properties take precedence over custom runner tags defined in the .github/runs-on.yml configuration.

  • Pass custom tags and default branch to runner config. And write config in /runs-on/config.json (linux), or C:\runs-on\config.json (Windows). Config can then be read by actions / scripts etc. to access all runner details easily.
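
As an illustration of the runs-on-custom-tags format described above (key1=val1,key2=val2), here is a minimal parsing sketch. `parse_custom_tags` is a hypothetical helper, and silently skipping empty or malformed entries is an assumption rather than documented RunsOn behavior.

```python
def parse_custom_tags(value):
    """Parse 'key1=val1,key2=val2' into a dict of tag keys and values."""
    tags = {}
    for pair in value.split(","):
        key, sep, val = pair.strip().partition("=")
        # Assumption: entries without '=' or with an empty key are ignored.
        if sep and key:
            tags[key] = val
    return tags


print(parse_custom_tags("key1=val1,key2=val2"))
```

Splitting on the first "=" only means tag values may themselves contain "=" characters.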

Bug fixes
  • Fix the race-condition that could lead to 2 instances being started when handling jobs tied to a deployment that does not require approval.
Misc
  • Add goroutine to cleanup dangling volumes and snapshots (prepare for block-level snapshots).
  • Register waiting, in_progress, and completed webhook payloads in S3 (in addition to queued).

Details

Summary

Support for EFS, TMPFS, and ECR ephemeral registry for fast docker builds. Also some bug fixes.

What's changed

EFS
  • Embedded networking stack can now create an Elastic File System (EFS), and runners will auto-mount it at /mnt/efs if the extras label includes efs. Useful to share artefacts across job runs, with classic filesystem primitives.
jobs:
  with-efs:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=efs
    steps:
      - run: df -ah /mnt/efs
      # 127.0.0.1:/      8.0E   35G  8.0E   1% /mnt/efs
📝 Example use case for maintaining mirrors
For instance, this can be used to maintain local mirrors of very large GitHub repositories and avoid long checkout times for every job:
env:
  MIRRORS: "https://github.com/PostHog/posthog.git"
  # can be ${{ github.ref }} if same repo as the workflow
  REF: main

jobs:
  with-efs:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=efs
    steps:
      - name: Setup / Refresh mirrors
        run: |
          for MIRROR in ${{ env.MIRRORS }}; do
            full_repo_name=$(echo $MIRROR | cut -d/ -f4-)
            MIRROR_DIR=/mnt/efs/mirrors/$full_repo_name
            mkdir -p "$(dirname $MIRROR_DIR)"
            test -d "${MIRROR_DIR}" || git clone --mirror ${MIRROR/https:\/\//https:\/\/x-access-token:${{ secrets.GITHUB_TOKEN }}@} "${MIRROR_DIR}"
            ( cd "$MIRROR_DIR" && \
              git remote set-url origin ${MIRROR/https:\/\//https:\/\/x-access-token:${{ secrets.GITHUB_TOKEN }}@} && \
              git fetch origin ${{ env.REF }} )
          done
      - name: Checkout from mirror
        run: |
          git clone file:///mnt/efs/mirrors/PostHog/posthog.git --branch ${{ env.REF }} --single-branch --depth 1 upstream
Ephemeral registry
  • Support for an Ephemeral ECR registry: can now automatically create an ECR repository that can act as an ephemeral registry for pulling/pushing images and cache layers from your runners. Especially useful with the type=registry buildkit cache instruction. If the extras label includes ecr-cache, the runners will automatically setup docker credentials for that registry at the start of the job.
jobs:
  ecr-cache:
    runs-on: runs-on=${{ github.run_id }},runner=2cpu-linux-x64,extras=ecr-cache
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v4
        env:
          TAG: ${{ env.RUNS_ON_ECR_CACHE }}:my-app-latest
        with:
          context: .
          push: true
          tags: ${{ env.TAG }}
          cache-from: type=registry,ref=${{ env.TAG }}
          cache-to: type=registry,ref=${{ env.TAG }},mode=max,compression=zstd,compression-level=22
Tmpfs

Support for setting up a tmpfs volume (size: 100% of available RAM, so only to be used on high-memory instances), and binding the /tmp, /home/runner, and /var/lib/docker folders on it. /tmp and /home/runner are mounted as overlays, preserving their existing content.

Can speed up some IO-intensive workflows. Note that if tmpfs is active, instances with ephemeral disks won't have those mounted since it would conflict with the tmpfs volume.

jobs:
  with-tmpfs:
    runs-on: runs-on=${{ github.run_id }},family=r7,ram=16,extras=tmpfs
    steps:
      - run: df -ah /mnt/tmpfs
      # tmpfs            16G  724K   16G   1% /mnt/tmpfs
      - run: df -ah /home/runner
      # overlay          16G  724K   16G   1% /home/runner
      - run: df -ah /tmp
      # overlay          16G  724K   16G   1% /tmp
      - run: df -ah /var/lib/docker
      # tmpfs            16G  724K   16G   1% /var/lib/docker

You can obviously combine options, i.e. extras=efs+tmpfs+ecr-cache+s3-cache is a valid label 😄

Instance-storage mounting changes

Until now, when an instance had locally attached NVMe SSDs available, they were automatically formatted and mounted so that the /var/lib/docker and /home/runner/_work directories ended up on the local disks. Since a lot of stuff (caches etc.) seems to end up within the /home/runner folder itself, the agent now uses the same strategy as for the new tmpfs mounts above: the whole /home/runner folder is mounted as an overlay on the local disk volume, as is the /tmp folder, while /var/lib/docker remains mounted as a normal filesystem on the local disk volume. Fixes #284.

Misc
  • Move all RunsOn-specific config files into /runs-on folder on Linux. More coherent with Windows (C:\runs-on), and avoids polluting /opt folder.
  • Fix app_version in logs (was previously empty string due to incorrect env variable being used in v2.8.1).
  • Fix "Require any Amazon EC2 launch template not to auto-assign public IP addresses to network interfaces" from AWS Control Tower. When the Private mode is set to only, no longer enable public ip auto-assignment in the launch templates. Thanks @temap!

Details

Summary

A large release: can now use an external networking stack; encryption enabled on all S3 buckets; lots of quality of life improvements and bug fixes; Windows boot times halved and CloudWatch agent monitoring enabled. Be sure to read the upgrade notes.

What's changed

Networking
  • Can now reuse an existing networking stack by setting the NetworkingStack stack parameter to external instead of embedded. Fixes #198, fixes #265, fixes #230 (community-provided networking stack can provide this feature).

  • Some not-so-useful stack outputs have been removed. Some outputs may be set to - when using an external VPC.
Caching
  • Fix invalid cache key restoration for Magic Cache. Thanks @erikburt from ChainlinkLabs for the troubleshooting.
Security
  • Enable server-side encryption using AWS-managed KMS key on all S3 buckets. Fixes #276.

  • No longer expose JIT token in cloud-init-output logs. The token is no longer valid after a job is run, but still.

QoL improvements
  • Add AppDebug (true or false) stack parameter, which allows to disable the auto-shutdown of runners when the bootstrap fails. Useful to investigate what is going on when the runner initializes.

  • Add AppCustomPolicy stack parameter: Optional managed IAM Policy ARN to assign to the App runner service role. Can be used to e.g. allow access to KMS decryption keys for AMIs. Thanks @dsme94!

  • Add AppGithubApiStrategy (normal or conservative) stack parameter to opt into minimizing GitHub API usage. If set to conservative, runners won't be automatically unregistered in GitHub internal database (GitHub will still clean them up after 24h). This helps for users with very large number (20k+) of jobs launched every day. Fixes #285.

  • Now bootstraps runners using runs-on/bootstrap binary, preinstalled on official RunsOn images (faster and more extensible).

  • On spot interruption, give more time to the job to possibly complete before shutdown is triggered. Shutdown is now triggered 20s before the expected time sent by AWS, instead of 15 seconds after the notification is received. Fixes #277.
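
To put numbers on the spot interruption change in the last item, here is a small worked example. It assumes the usual two-minute spot interruption warning from AWS; `shutdown_times` and the timestamps are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def shutdown_times(notice_received_at, expected_termination_at):
    """Return (old, new) shutdown trigger times for a spot interruption."""
    old = notice_received_at + timedelta(seconds=15)       # previous behavior
    new = expected_termination_at - timedelta(seconds=20)  # current behavior
    return old, new


notice = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
termination = notice + timedelta(minutes=2)  # assumed two-minute warning
old, new = shutdown_times(notice, termination)

# Extra seconds the job gets to finish under the new rule.
print((new - old).total_seconds())
```

Under that assumption, the job gets 85 more seconds to complete before shutdown is triggered.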

Windows
  • Shaved about 50s from Windows boot times: SSH is no longer automatically installed on Windows (SSM agent is available now), and no longer using Invoke-WebRequest helped a lot (TIL).

  • CloudWatch agent is automatically installed on Windows AMIs, and EC2Launch logs are shipped to CloudWatch (same naming as for Linux runners: e.g. LOG_GROUP_NAME/INSTANCE_ID/cloud-init-output.log). Also added support for roc connect on Windows AMIs in the RunsOn CLI.

Bug fixes
  • Fix for invalid CreateTags requests - Fixes #288.

  • Fix for invalid EC2 rate-limiter being used when uploading user-data file to S3. Fixes #286.

  • Adjust ownership rule for S3 bucket logging, from BucketOwnerPreferred to BucketOwnerEnforced. Fixes #291.

Details

Summary

GHES support is now available. Allow to specify a custom expiration for objects in the cache bucket.

What's changed

  • GHES support is now available. Fixes #250.
  • Add S3CacheExpirationInDays stack parameter. Fixes #179.
  • Tag launch templates. Fixes #264.
  • Pin launch template version to the specific version active at the time the RunsOn service was deployed. Fixes #274.
  • Tag instances with runs-on-is-ghes and runs-on-integrations-active.
  • Tag instances with InspectorEc2Exclusion to avoid SSM inspector scans on running instances. Possibly fixes #242.

Details

Summary

Hotfix: fix for disk=large handling.

Details

Summary

A few minor breaking changes related to VPC flow logs and hdd label. Plus many fixes.

Breaking changes

This is a minor release, so this comes with the following breaking changes. Please review your CloudFormation parameters and runner configuration accordingly when updating:

  • Fix for #258. VPC Flow Logs are now only enabled if the VpcFlowLogFormat is set to a non-empty value. To enable them with the default format (as it was before if you didn't specify a value), specify default.
  • Remove support for the deprecated hdd job label. Ensure all your workflows and repository configuration (.github/runs-on.yml) do not use this label. If it is still set, it will have no effect and the default runner configuration will be used for disk sizing. You must now use the disk=default or disk=large label instead.

What's changed

  • Fix for S3 server access logging. Fixes #241.
  • Allows specifying the root volume name. Fixes #207.
  • Expose RUNS_ON_AWS_AZ and RUNS_ON_INSTANCE_LAUNCHED_AT environment variables to jobs.
  • Properly set RUNNER_TOOL_CACHE 🤦, so that some setup-* actions can properly use the hosted toolcache on the VM.
  • Add policy to allow instances to describe their tags. Means we no longer need to enable InstanceMetadataTags for the instances. Cost Allocation Tag and Runner Tags can now contain slashes in their keys.
  • Add runs_on_spot_circuit_breaker_active prometheus metric (1=active, 0=inactive). Fixes #271.
  • Ensure we don't try to auto-retry after spot termination if the workflow run has already been manually re-attempted. Fixes #263.
  • Fixes typo - Fixes #248.
  • Scope the minutes alarm on the stack name. Also add the StackName dimension on all metrics. Fixes #235.
  • (alpha, not fully functional yet) Support for GitHub Enterprise Server (GHES) installations.

Details

Summary

Hotfix for CreateFleet IdempotentParameterMismatch errors, as well as Magic Cache support for newer buildx versions.

What's changed

  • Fixes #251: IdempotentParameterMismatch error.
  • Fix Magic Cache for newer buildx versions. No longer need to set version=1 in cache-from and cache-to.
  • Fixes #249: Add cancelled to the list of conclusion statuses that can trigger an auto-retry.

Details

Summary

New spot circuit breaker for snoozing spot requests if too many interruptions are detected. Monitoring improvements. StepSecurity integration, and more.

What's changed

Spot circuit breaker
  • Allow to switch to on-demand requests if spot interruption frequency is too high over a defined time interval. Fixes #226.

For instance, if SpotCircuitBreaker is set to 2/30/60, it means that after at least 2 interruptions in the last 30 minutes, RunsOn will switch to on-demand requests for the next 60 minutes.
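
The 2/30/60 example above can be read as a small state machine. The following is only a sketch of that reading, assuming the three fields are the interruption threshold, the look-back window in minutes, and the on-demand duration in minutes. The class and method names are hypothetical and do not come from the RunsOn code base.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class SpotCircuitBreaker:
    threshold: int        # interruptions needed to trip the breaker
    window: timedelta     # look-back period used to count interruptions
    cooldown: timedelta   # how long to stay on on-demand once tripped
    interruptions: List[datetime] = field(default_factory=list)
    open_until: Optional[datetime] = None

    @classmethod
    def from_spec(cls, spec: str) -> "SpotCircuitBreaker":
        # "2/30/60" -> threshold=2, window=30 min, cooldown=60 min (assumed layout)
        count, window_min, cooldown_min = (int(part) for part in spec.split("/"))
        return cls(count, timedelta(minutes=window_min), timedelta(minutes=cooldown_min))

    def record_interruption(self, at: datetime) -> None:
        # Keep only interruptions that are still inside the look-back window.
        self.interruptions = [t for t in self.interruptions if at - t <= self.window]
        self.interruptions.append(at)
        if len(self.interruptions) >= self.threshold:
            self.open_until = at + self.cooldown

    def use_on_demand(self, now: datetime) -> bool:
        # True while the breaker is tripped, i.e. spot requests are snoozed.
        return self.open_until is not None and now < self.open_until
```

With `SpotCircuitBreaker.from_spec("2/30/60")`, a second interruption recorded within 30 minutes of the first makes `use_on_demand` return true for the following 60 minutes.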

Monitoring
  • Add workflow job conclusion to prometheus labels. Fixes #178. Also add job_conclusion and run_attempt to all log lines.
  • Support SQS queue oldest message age alarms. Helps with compliance and to detect whether RunsOn has issues dequeuing messages fast enough. Fixes #228.
  • Use scheduled event to compute and send cost reports at midnight UTC. Fixes #216.
Native integration with StepSecurity
jobs:
  job-with-stepsecurity:
    runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/image=ubuntu24-stepsecurity-x64"
    steps:
      - name: External call
        run: curl https://google.com

Documentation: https://runs-on.com/integrations/stepsecurity/

Misc
  • Reduce agent binary size.
  • Update Go dependencies.
  • Allow injection of custom runner agent (internal testing only).
  • Remove magic cache ON annotation. Fixes #234.

Details

Summary

Fix VpcEndpoints stack parameter.

What's changed

With VpcEndpoints enabled, the CloudFormation template was incorrectly assigning interface endpoints to both public and private subnets, while an interface endpoint can only be defined once per AZ (and only makes sense for private subnets anyway).

Thanks again to Commonwealth Fusion Systems for their quick feedback and help!

Details

Summary

Optimized GPU images, new VpcEndpoints stack parameter, ability to specify custom instance tags for custom runners.

Note: there appear to be some issues with the new VPC endpoints. I'm on it! If you need that feature, please hold on to your current version of RunsOn.

What's Changed

  • New GPU images ubuntu22-gpu-x64 and ubuntu24-gpu-x64: 1:1 compatibility with GitHub base images + NVIDIA GPU drivers, CUDA toolkit, and container toolkit.
  • Add new VpcEndpoints stack parameter (fixes #213), and reorganize template params. Note that the EC2 VPC endpoint was previously automatically created when Private mode was enabled. This is no longer the case, so make sure you select the VPC endpoints that you need when you update your CloudFormation stack.
  • Suspend versioning for cache bucket (fixes #191).
  • Allow to specify instance tags for runners (fixes #205). Tag keys can't start with runs-on- prefix, and key and values will be sanitized according to AWS rules.

Details

Summary

CLI 0.0.1 released, fix for Magic Cache, fleet objects deletion.

What's changed

  • CLI released: https://github.com/runs-on/cli. It lets you easily view logs (both server logs and cloud-init logs) for a workflow job by just pasting its GitHub URL or ID. It also makes it easy to connect to a runner through SSM.
  • Fix race-condition in Magic Cache (fixes #209).
  • Delete the fleet instead of just the instance (fixes #217).

Details

Summary

Fix magic cache handling of actions/upload-artifact. Prepare for RunsOn CLI.

What's changed

  • Store instance id assigned to job (once job has started) in the main S3 bucket (under /runs-on/db/jobs/JOB_ID/instance-id), as well as the payload for the workflow_job queued event. Will be used for #201.
  • Fix magic cache for cache keys with slashes inside.
  • Make magic cache play nice with actions/upload-artifact. For that you must add runs-on/action@v1 in your workflows. Fixes #197.
  • Documentation for magic cache at https://runs-on.com/caching/magic-cache/

Details

Summary

Magic transparent cache for dependencies and docker layers. SSM support for logging into runner instances. And more.

What's changed

jobs:
  look-ma-no-cache-config:
    runs-on: "runs-on=${{github.run_id}}/runner=2cpu-linux-x64/extras=s3-cache"
    steps:
     # standard action is supported, no need to use `runs-on/cache@v4`
     - uses: actions/cache@v4
       with:
         path: my-path
         key: my-key
     # third-party actions that depend on official toolkit (99%) are supported as well
     - uses: ruby/setup-ruby@v1
       with:
         bundler-cache: true
  • BETA - Transparent S3-backed caching for Docker layers when using cache-to: type=gha / cache-from: type=gha. For now, the magic caching is only enabled with the extras=s3-cache job label.
jobs:
  look-ma-no-cache-config:
    runs-on: "runs-on=${{github.run_id}}/runner=2cpu-linux-x64/extras=s3-cache"
    steps:
      # BEFORE
      - name: "Build and push image (explicit s3 config)"
        uses: docker/build-push-action@v4
        with:
          tags: test
          cache-from: type=s3,blobs_prefix=cache/docker-s3/,manifests_prefix=cache/docker-s3/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }}
          cache-to: type=s3,blobs_prefix=cache/docker-s3/,manifests_prefix=cache/docker-s3/,region=${{ env.RUNS_ON_AWS_REGION }},bucket=${{ env.RUNS_ON_S3_BUCKET_CACHE }},mode=max

      # AFTER
      - name: "Build and push image (type=gha, automagically switched to S3)"
        uses: docker/build-push-action@v4
        with:
          tags: test
          cache-from: type=gha
          cache-to: type=gha,mode=max
  • Assign AmazonSSMManagedInstanceCore policy to EC2 instances, so that one can easily connect to the runner instance with SSM. Fixes #129.
AWS_PROFILE=YOUR_PROFILE aws ssm start-session --target INSTANCE_ID --reason "testing ssm"
  • Allow to inject additional environment variables from the preinstall step, by exposing a $GITHUB_ENV variable that you can write to. The variables will automatically be made available to the job steps. Fixes #188.
runners:
  preinstall-with-env:
    image: ubuntu22-full-arm64
    family: ["c7g"]
    preinstall: |
      echo "Adding a custom env var..."
      echo "MY_CUSTOM_VAR=my_custom_value" >> $GITHUB_ENV
  • Support preinstall for Windows runners.

  • Expose RunsOnServiceArn as output, so that one can use it to build the CloudWatch log paths. Fixes #184.

  • Do not send the cost allocation tag warning if the latest cost report was non-zero. Fixes #187.

  • Add Ec2LogRetentionInDays stack parameter. Fixes #189.

  • Allow to read license key from SSM. Fixes #176.


Details

Summary

New stack parameters and best practices compliance changes. No longer defaults to fetching global config when a local repo config is not found. Improve housekeeping to handle an additional AWS internal error case when launching an instance.

What's changed

  • Add parameter to enable/disable IPv6: Ipv6Enabled. Default is now false, which is a change from previous versions where IPv6 was always enabled. The reason for that is that it looks like docker pulls will go through IPv6 IPs, and for some reason they are getting rate-limited much faster than on IPv4. Will have to dig a bit deeper into that. Fixes #177.
  • Add parameter to disable the inbound SSH rule in the default security group for runners: SSHAllowed. Default is true. Fixes #174. Fixes #159.
  • Add VpcFlowLogRetentionInDays stack parameter. Fixes #180.
  • No longer defaults to fetching global config when a local repo config is not found. The current behaviour was a bit broken with the caching mechanism, and led to confusion. Let's make the behaviour explicit by requiring a local repo config file, with an explicit _extends directive. I understand this is a bit cumbersome if you have many repositories, but I think it's also nice to be able to inspect which repositories are inheriting from the global config. I'm introducing this change as part of a patch release because the current behaviour was already broken on v2.6.0.
  • Housekeeping: Detect AWS server issue that sometimes leaves instances in pending state, in which case RunsOn will terminate the current instance, and reschedule.
  • Enable versioning on all S3 buckets. Fixes #181.

Details

Summary

Auto-retry mechanism for spot interruptions, SingleAZ or MultiAZ NAT gateways, and more!

What's changed

  • Spot workflows are now retried once with an on-demand instance if interrupted. Fixes #160. Requires a permission update (write permission for Actions instead of read) for existing installations. You should receive an email with instructions after upgrading. Also add runs-on-workflow-job-interrupted=true to the instance tags if the spot instance was interrupted.
  • Add new label retry, with possible values retry=when-interrupted (default for spot), and retry=false to opt out of any auto-retry (useful for non-idempotent jobs).
  • Add runs-on-workflow-job-id to the instance tags once the job has started. Also add it to prometheus metric labels.
  • Rename tag runs-on-job-started => runs-on-workflow-job-started
  • Allow to use 1 NAT gateway per AZ instead of a single one for all. Fixes #165.
  • Add optional VpcCidrSubnetBits, DefaultPermissionBoundaryArn, VpcFlowLogFormat, and VpcFlowLogS3BucketAr parameters, so that users can bring their stacks in line with their internal requirements.
  • Set RunsOn env variables on Windows.
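
To make the interaction between the retry label and spot interruptions concrete, here is a minimal sketch of the decision described above. It assumes that jobs not running on spot have no auto-retry by default (the notes only state the default for spot) and that "retried once" means only the first run attempt is eligible. The function name and arguments are hypothetical.

```python
def should_auto_retry(labels: dict, is_spot: bool, interrupted: bool, run_attempt: int) -> bool:
    """Decide whether an interrupted job gets one retry on an on-demand instance."""
    # Assumption: only spot jobs default to retry=when-interrupted.
    default_policy = "when-interrupted" if is_spot else "false"
    policy = labels.get("retry", default_policy)
    # Only the first attempt is eligible, so a job is retried at most once.
    return policy == "when-interrupted" and interrupted and run_attempt == 1
```

A job labelled retry=false is never retried under this reading, which is the opt-out suggested for non-idempotent jobs.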

Details

Summary

Fix handling of GitHub webhook custom_properties when values are not strings.

Details

Summary

Revert x/time dependency to v0.6.0 since v0.7.0 introduced a breaking change for rate-limits when using a zero limit.

Details

Summary

Add Private=only mode, make EBS encryption opt-in, introduce disk label. Plus fixes and minor improvements.

Note: please use v2.5.8+ because this version embeds a dependency upgrade for the rate-limit library, which introduced a regression.

What's changed

  • Update github go library to fix issue with custom properties.
  • Make EBS encryption opt-in, and specify default encryption key (fixes #152).
  • Add Private=only mode for the CloudFormation stack, so that runners are forbidden to launch in a public subnet. Fixes #150.
  • Disable automatic public IP assignment in public subnets when Private=only is set for the stack (helps with conformance).
  • Remove HousekeepingEnabled stack parameter. Housekeeping is now always enabled.
  • No longer display EgressStaticIp in job logs since we don't know which one the runner will end up using.

Deprecations

  • Introduce disk=default or disk=large label to simplify disk size selection based on the runner volumes defined in the RunsOn CloudFormation stack. hdd is now deprecated and will be removed in a future non-patch version.

Details

Summary

Enable IPv6 for runners. Allow to specify multiple static IPs for the managed NAT gateway. Allow filtering images based on tags. A lot of changes (again) around GitHub rate-limit handling and housekeeping mechanism.

New features

  • Enable IPv6 for runners (fixes #142). An IPv6 address is attached to both public and private runners, with a (free) egress IPv6 gateway for private instances.
  • Allow to specify multiple static IPs for the managed NAT gateway (fixes #139). By default up to 2 are possible, and up to 8 when a quota increase is requested. This helps if you are launching a large number of runners in private subnets, and some external service rate-limits you based on the IP.
  • Allow filtering images based on a tag, in addition to the name wildcard (e.g. is-production-ready=true). Example:
# .github/runs-on.yml
images:
  custom:
    owner: "123456789"
    name: "my-org/my-image-name-*"
    arch: x64
    platform: linux
    tags:
      # filter with specific value
      is-production-ready: "true"
      # allow any value
      other-tag: "*"
  • Automatically bind-mount /var/lib/docker on the ephemeral instance storage, if any. Fixes #144.
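
The image tag filter shown in the runs-on.yml example above can be pictured as follows. This is a sketch only: it assumes that a "*" filter still requires the tag to be present on the image, and that the name wildcard behaves like a shell glob. The `image_matches` function and the dictionary shapes are hypothetical, not RunsOn's actual matching code.

```python
from fnmatch import fnmatchcase


def image_matches(image: dict, spec: dict) -> bool:
    """Return True if an image description satisfies an image spec."""
    # Name wildcard, e.g. "my-org/my-image-name-*".
    if not fnmatchcase(image["name"], spec["name"]):
        return False
    image_tags = image.get("tags", {})
    for key, wanted in spec.get("tags", {}).items():
        actual = image_tags.get(key)
        if actual is None:
            # Assumption: a filtered tag must exist on the image, even for "*".
            return False
        if wanted != "*" and actual != wanted:
            return False
    return True
```

For example, with the spec from the configuration above, an image named my-org/my-image-name-v3 tagged is-production-ready=true and other-tag=anything would match, while the same image tagged is-production-ready=false would not.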

Bug fixes

  • Escape shell special characters in env file values.
  • If a matching AMI cannot be found, do not retry and alert on first error.
  • Do not attempt to retry job if generated fleet params configuration is incorrect.
  • Abort early if workflow run status cannot be checked.

Fixes to avoid GitHub rate-limit issues

  • No longer attempt to reschedule jobs where a runner theft is suspected. Instead log a warning message telling users to make sure their jobs have unique enough labels. In some cases this was triggering useless reschedules due to GitHub not reflecting the job state quickly enough.
  • Fix too many GitHub calls when fetching repo config from an extends attribute (cache it).
  • No longer unregister runners from GitHub if API credit is lower than 2500. They will be removed by GitHub 24h later anyway.
  • Reorganize rate-limiters, increase DELAY_SECONDS_FOR_CHECK_BACK to 180s instead of 120s. Enable github rate-limiter, and set burst to the current number of remaining tokens.
  • Only attempt to finalize a job once at most. Instance will auto-terminate anyway so at worst we lose the job usage metrics in CloudWatch. But at least we don't eat into the GitHub / EC2 credits.
  • Set housekeeping and termination queue sizes to 1 to reduce their impact on GitHub API credits.

Details

Summary

Strengthen CF template configuration to better conform to AWS guidelines. Bug fixes.

What's changed

  • Verify that generated JIT token has at least one char.
  • Do not attempt to retry runner creation when we know the original request is invalid (e.g. invalid runner configuration due to mismatched labels etc.)
  • Strengthen CF template configuration to better conform to AWS guidelines.
  • Make sure empty admin values are ignored.
  • If no repository config found, cache the result for 1 minute to avoid hammering GitHub API.
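
The last item above describes a negative cache: a "no config found" result is remembered for one minute so that repeated lookups do not hit the GitHub API. A minimal sketch of that idea follows, assuming a caller-supplied `fetch` function that returns None when the repository has no config. The class name and its parameters are hypothetical.

```python
import time


class RepoConfigCache:
    """Remember 'no config found' results for a short time."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch            # callable: repo name -> config or None
        self._ttl = ttl_seconds
        self._clock = clock
        self._missing_until = {}       # repo name -> time until which the miss is trusted

    def get(self, repo):
        now = self._clock()
        if self._missing_until.get(repo, float("-inf")) > now:
            return None                # recent miss: skip the API call
        config = self._fetch(repo)
        if config is None:
            self._missing_until[repo] = now + self._ttl
        return config
```

Only misses are cached in this sketch, since that is the case the item describes. How found configs are cached is left out.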

Details

Summary

New ubuntu24 images, new housekeeping task to auto-restart instances that failed to launch, new always-on Private setting, additional runner details in logs, and more.

Notable changes (from v2.5.0 to v2.5.4)

  • Add ubuntu24 official images: ubuntu24-full-x64 and ubuntu24-full-arm64.
  • Private CloudFormation parameter now accepts always as value, in which case the runners will always launch in the private subnets by default (unless a job opts out with private=false).
  • Display GitHub current rate-limits in logs (search for tokens).
  • Add 'Private' dimension to CloudWatch stats.
  • Add Environment, IsPrivate, and StaticIp (if IsPrivate) to runner details (in Setup job logs).
  • Increase frequency for spot interruption polling + add logs.
  • Conform to AWS spec when sanitizing custom tags (key and value). Fixes #125.
  • Add housekeeping task to handle edge cases where a job is still seen as queued by GitHub after a few minutes even after an instance has been launched.
  • Allow to disable new housekeeping mechanism.
  • Properly tag instance volumes with cost allocation tag. Cost report email will likely go up.
  • Display app_environment and app_stack_name in logs.
  • Attempt to fix rare preinstall issue ending up with "text file busy".
  • Unregister runner from GitHub when job is completed (i.e. do not wait for auto-expiration since it does not seem that reliable).

Experimental

  • Bring back support for single string label, using / as the separator instead of ,. e.g. runs-on: runs-on/runner=2cpu-linux-x64/other=tag will work. This simplifies passing a runs-on specification as input to dependent workflows. If you have multiple RunsOn stacks, make sure they are all upgraded to this version before using this new syntax in workflows.
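
As an illustration of the single-string syntax, the sketch below splits such a label on / and then on the first =. It assumes that values never contain a / themselves and that the first segment is always runs-on, with or without a value. The `parse_runs_on_label` helper is hypothetical and is not how RunsOn itself parses labels.

```python
def parse_runs_on_label(label: str) -> dict:
    """Split a single-string RunsOn label into key/value pairs."""
    parsed = {}
    for index, part in enumerate(label.split("/")):
        key, sep, value = part.partition("=")
        if index == 0 and key != "runs-on":
            raise ValueError(f"not a RunsOn label: {label!r}")
        # Segments without "=" (such as a bare "runs-on") map to None.
        parsed[key] = value if sep else None
    return parsed
```

Under these assumptions, runs-on/runner=2cpu-linux-x64/other=tag yields the keys runs-on, runner, and other, with runner mapped to 2cpu-linux-x64.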

Internal

  • Fix issue with private attribute not being properly loaded from the repository configuration file.
  • Switch to semaphores for processing the 3 queues.
  • Check workflow run status before scheduling job.
  • Add termination queue.
  • Update GitHub App (for new installations) to listen for workflow_run events (not used yet, but will be soon).
  • Upgrade default runner version when no runner is preinstalled.