
Troubleshooting

Diagnose RunsOn runner and stack issues across Flex and Fleet — webhook delivery, EC2 launch failures, scale-set routing, runner groups, logs, and common configuration mistakes.

Runners can fail to start for a variety of reasons. The diagnostics differ depending on whether you run RunsOn in Flex mode (per-job runners launched from webhooks) or Fleet mode (runner scale sets driven by GitHub job assignment).

Jump to the section that matches your deployment — Flex troubleshooting or Fleet troubleshooting — then use the shared sections below (Viewing logs, CloudTrail events, Unexpected costs, apt and dpkg lock errors) for diagnostics that apply to both modes.

Quick checks

Flex troubleshooting

If an error is raised while attempting to start a workflow, RunsOn will alert you by email (assuming you confirmed the SNS Topic notification when you set up the stack).

Common Flex symptoms

CloudFormation stack fails while creating RunsOnWorkerCluster

If the CloudFormation stack fails on RunsOnWorkerCluster with the following error, the failure happens before RunsOn starts:

Unable to assume the service linked role. Please verify that the ECS service linked role exists.

RunsOnWorkerCluster is the ECS/Fargate cluster used by the RunsOn control plane. AWS needs the account-level ECS service-linked role, AWSServiceRoleForECS, before it can create or use that cluster.

This can happen in fresh AWS Control Tower or multi-account environments when the member account does not have the ECS service-linked role yet, or when an SCP or permission boundary blocks ECS from creating or assuming it.

First check whether the role exists in the target AWS account:

aws iam get-role --role-name AWSServiceRoleForECS

If the role is missing, create it once:

aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com

If that command is denied, your AWS Organizations policy or permission boundary must allow service-linked role creation for ECS. A scoped allowance looks like this:

{
"Effect": "Allow",
"Action": "iam:CreateServiceLinkedRole",
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:AWSServiceName": "ecs.amazonaws.com"
}
}
}

After the role exists and IAM propagation has completed, retry the CloudFormation stack creation.
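The check-then-create steps above can be folded into one idempotent helper. This is a minimal sketch, assuming the AWS CLI is installed and your credentials target the affected member account. The function name ensure_ecs_service_linked_role is hypothetical, and a failed get-role is treated as "role missing" even though it could also mean the lookup itself was denied.

```shell
# Hypothetical helper: create AWSServiceRoleForECS only when it is absent.
# Assumes the AWS CLI is installed and credentials target the right account.
ensure_ecs_service_linked_role() {
  # get-role exits non-zero when the role is missing (or the lookup is denied).
  if aws iam get-role --role-name AWSServiceRoleForECS >/dev/null 2>&1; then
    echo "exists"
  elif aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com >/dev/null; then
    echo "created"
  else
    echo "could not create the role; check SCPs and permission boundaries" >&2
    return 1
  fi
}
```

Run `ensure_ecs_service_linked_role` once per account before retrying the stack. A non-zero exit points back at the Organizations policy or permission boundary described above.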

All jobs are queued indefinitely or long queuing time for some workflows

RunsOn runners consistently start in ~30s for x64 and arm64. If you are seeing abnormal queuing times, let’s review the different possible root causes.

Webhooks not getting delivered

To start a runner, the GitHub webhook needs to be delivered to the RunsOn public ingress endpoint. If you are seeing long queuing times, it is possible that the webhook is not getting delivered.

To check this, you can go to your RunsOn GitHub App settings > Advanced, and you should see the last deliveries with their status code.

A non-200 status code means the delivery failed or was rejected by the endpoint. You can manually trigger a delivery to retry it. If the failures persist and you suspect a bug on the receiver side, please contact support.


Runner stealing

It may be the case that the runner started for your workflow job has been stolen by another workflow job.

For instance, if two workflow jobs A and B with the same runs-on labels are queued at the same time, the runner started for job A may actually start processing job B (since runner A's labels match those of job B), while job A has to wait for runner B to come online.

To avoid this and help with debugging, it is best practice to give each workflow job a unique label. This can be achieved by adding the current workflow run id to the label.

Make sure you are using the single-string syntax available since v2.5.4, so GitHub treats the full RunsOn specification as one label:

jobs:
  my-build-job:
    runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
    # even better, if you have multiple jobs in the same workflow file with the same `runs-on:` labels, use this instead:
    # runs-on: "runs-on=${{ github.run_id }}-my-build-job/runner=2cpu-linux-x64"
  my-release-job:
    runs-on: "runs-on=${{ github.run_id }}-my-release-job/runner=2cpu-linux-x64"

If the problem persists:

  1. ensure that the repository is correctly enabled for your RunsOn GitHub App.
  2. ensure that webhooks are correctly delivered to your RunsOn public ingress endpoint: go to your RunsOn GitHub App settings > Advanced, and you should see the last deliveries with their status code.

If you need more help, please contact support.

Runner stealing and matrix jobs

If you are using matrix jobs, note that the github.run_id is not unique for each matrix job. It is only unique for each workflow run, and unfortunately GitHub still doesn’t expose the JOB_ID variable for a job. So if you want to ensure a deterministic job <-> runner assignment, you can append the strategy job index in addition to the workflow run id. You can also add the run attempt number for good measure:

jobs:
  my-build-job:
    strategy:
      matrix:
        node: [16, 18, 20]
    runs-on: "runs-on=${{ github.run_id }}-my-build-job-${{ strategy.job-index }}/runner=2cpu-linux-x64"
    # or even more complete, although... long (use this instead of the line above):
    # runs-on: "runs-on=${{ github.run_id }}-my-build-job-${{ github.run_attempt }}-${{ strategy.job-index }}/runner=2cpu-linux-x64"

Failed to create instance

This error can happen due to multiple reasons:

PendingVerification

⚠️ Failed to create instance with type c7a.4xlarge: PendingVerification: Your request for accessing resources in this region is being validated, and you will not be able to launch additional resources in this region until the validation is complete. We will notify you by email once your request has been validated. While normally resolved within minutes, please allow up to 4 hours for this process to complete. If the issue still persists, then open a support case. [https://support.console.aws.amazon.com/support/home?region=us-east-1#/case/create?issueType=customer-service&serviceCode=account-management&categoryCode=account-verification]

This is usually resolved automatically within a few minutes, so retry the workflow a few minutes later. If the error persists, open a support case with AWS.

RequestLimitExceeded

This usually happens if you are launching instances too quickly compared to the allowed rate limit for your account.

The rate limit mechanism is detailed in https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html.

RunsOn now defaults to the lowest rate limit (at most 2 RunInstances API calls per second).

If your account has a higher quota for those API calls, you can use a larger AppSize preset. Larger presets increase worker concurrency and assume you have raised the relevant EC2 quotas.

Fleet troubleshooting

Fleet troubleshooting starts with GitHub routing. A healthy AWS stack can still show no runner launches if GitHub is not assigning jobs to the scale set. Before looking at EC2, verify that GitHub assigned the job to the Fleet scale set.

Use this order:

  1. Confirm the workflow label and runner group route the job to the scale set.
  2. Confirm the Fleet worker sees assigned demand.
  3. Confirm EC2 can launch or pick up capacity for the runner fleet.
  4. Confirm the runner registers and receives the job.

Check the workflow label

The workflow label must match the fleet name and environment:

runs-on: runs-on/fleet=linux-small/env=production

Confirm:

  • linux-small exists in the Terraform fleets map.
  • The label environment matches the module-level environment.
  • The repository has access to the runner group that contains the scale set.

Fleet has one routing environment per stack. Fleet entries cannot override env, so a workflow using env=staging will not match a stack deployed with environment = "production".

Check runner groups

In organization mode, the runner group is an organization runner group. In enterprise mode, it is an enterprise runner group.

Fleet looks up the group by name. If runner_group is set to a group that does not exist or does not grant access to the repository, jobs will not route to the runner fleet.

Multiple fleets can share one runner group. Use separate groups only when the GitHub access policy must differ.

Check the GitHub scale set

In GitHub, the runner fleet should appear as a runner scale set named runs-on-<stack>-<fleet>. With stack runs-on-fleet and fleet linux-small, the scale set in GitHub is runs-on-runs-on-fleet-linux-small.
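When several stacks and fleets exist, it can help to compute the expected scale set name before searching for it in GitHub. A minimal sketch of the naming rule above; the helper name fleet_scale_set_name is hypothetical:

```shell
# Hypothetical helper: derive the GitHub scale set name from the stack and fleet names,
# following the runs-on-<stack>-<fleet> pattern described above.
fleet_scale_set_name() {
  printf 'runs-on-%s-%s\n' "$1" "$2"
}

fleet_scale_set_name runs-on-fleet linux-small
# prints: runs-on-runs-on-fleet-linux-small
```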

If the scale set is missing, check Fleet startup logs and GitHub credentials:

  • organization mode needs one active GitHub App installation with organization self-hosted runner write access
  • enterprise mode needs github_enterprise_pat and github_enterprise_name
  • github_base_url must be the host root, not an API path

If the scale set exists but jobs remain queued, the most likely causes are a label mismatch, runner-group access, enterprise organization access, or GitHub workflow restrictions.

Check the Fleet worker logs

The Terraform module creates a CloudWatch log group for the Fleet runtime service. Look for log fields such as:

  • fleet_name
  • fleet_scope
  • runner_group
  • workflow_label
  • assigned-job and claim counts

If logs show the target as ready but assigned demand stays at zero, the issue is usually GitHub routing: label mismatch, runner group access, enterprise organization access, or workflow restrictions.

If assigned demand is non-zero but no runner launches, check EC2 capacity and runner fleet configuration:

  • EC2 service quotas for the selected instance family
  • spot or on-demand availability for the family and Availability Zones
  • image lookup results for the configured image
  • subnet routing and security groups
  • runner IAM permissions and permission boundaries
  • max_launch_batch_size if large bursts are only launching in small waves

Check hot and stopped pools

If a runner fleet uses schedule.hot or schedule.stopped, separate standby inventory issues from GitHub routing issues.

If standby instances are not present:

  • confirm the schedule matches the current time in the runner fleet’s timezone
  • confirm the active schedule has non-zero hot or stopped
  • check EC2 on-demand quota and subnet capacity
  • check Fleet worker logs for the target fleet_name

If standby instances exist but jobs still launch cold:

  • confirm the workflow targets the same fleet key and environment
  • confirm the standby instances belong to the same stack and runner fleet
  • check whether all ready hot or stopped instances were already consumed by earlier assigned jobs
  • after changing runner image, family, networking, or IAM, allow Fleet to replace stale standby inventory

Fleet uses ready hot instances first, then ready stopped instances, then cold CreateFleet overflow.
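To reason about why some jobs still launch cold, the ordering above can be modelled with simple arithmetic. This is an illustrative sketch only, not Fleet's actual implementation; the function names and the idea of passing counts as arguments are assumptions made for the example.

```shell
# Illustrative model of the documented order: hot first, then stopped, then cold.
# Not Fleet's real code; all names here are hypothetical.
min_of() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# $1 = assigned jobs, $2 = ready hot instances, $3 = ready stopped instances
allocate_jobs() (
  jobs=$1
  hot=$2
  stopped=$3
  from_hot=$(min_of "$jobs" "$hot")
  remaining=$((jobs - from_hot))
  from_stopped=$(min_of "$remaining" "$stopped")
  cold=$((remaining - from_stopped))
  echo "hot=$from_hot stopped=$from_stopped cold=$cold"
)

allocate_jobs 10 3 4
# prints: hot=3 stopped=4 cold=3
```

With 10 assigned jobs, 3 ready hot instances, and 4 ready stopped instances, the remaining 3 jobs overflow to cold launches, which is expected behavior rather than a sign that the pool is being ignored.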

Matrix jobs and max-parallel

Fleet can work with GitHub strategy.max-parallel because GitHub assigns jobs to the runner scale set as they become eligible to run. Use this when a large matrix should intentionally limit concurrent runner demand:

strategy:
  max-parallel: 4
  matrix:
    shard: [1, 2, 3, 4, 5, 6, 7, 8]

If a matrix appears slower than expected, check whether max-parallel is intentionally limiting the number of assigned jobs before tuning Fleet capacity.

Runners launch but do not register

When EC2 instances launch but GitHub jobs keep waiting, inspect the runner logs:

  • CloudWatch Logs for the EC2 runner log group (see Viewing logs)
  • EC2 console output for cloud-init or bootstrap failures (see Instance console logs)
  • subnet egress to GitHub, S3, ECR, and any package registries used by the image
  • Secrets Manager and S3 access from the runner instance role

Once Fleet has selected capacity, the same lower-level AWS checks apply as in Flex: EC2 quota, AMI lookup, subnet reachability, IAM permissions, instance bootstrap, and CloudWatch logs. The Fleet-specific difference is the first step — GitHub must assign demand to the scale set before Fleet launches or picks up a runner.

Common Fleet symptoms

| Symptom | Likely cause |
| --- | --- |
| No EC2 runners launch | GitHub did not assign jobs to the scale set. Check label and runner-group access. |
| Fleet startup fails in organization mode | The GitHub App has zero or multiple active installations. Keep one installation for the runtime. |
| Enterprise runner fleet exists but org jobs do not route | The enterprise runner group does not grant access to that organization or workflow. |
| Runners launch but jobs retry | Check EC2 capacity, runner image lookup, IAM, networking, and bootstrap logs. |
| Hot or stopped pool is ignored | Schedule mismatch, stale standby inventory, or all ready standby instances were already consumed. |
| Large bursts launch slowly | Review app_size, EC2 quotas, max_launch_batch_size, and runner family availability. |

Finding run_id and job_id

The run_id and job_id are not easily available from the GitHub UI. The easiest way to find them is to open a job's log output in the GitHub UI and extract the values from the URL.

For instance you may have a URL that looks like this:

https://github.com/YOUR_ORG/YOUR_REPO/actions/runs/12054210358/job/33611707460

In which case the run_id is 12054210358 and the job_id is 33611707460.
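If you need these values in a script, they can be pulled out of the URL with plain shell parameter expansion. A minimal sketch; the function name parse_job_url is hypothetical, and it assumes the URL follows the .../actions/runs/RUN_ID/job/JOB_ID shape shown above:

```shell
# Hypothetical helper: extract run_id and job_id from a GitHub job URL.
parse_job_url() (
  url=${1%%\?*}     # drop any query string
  url=${url%%\#*}   # drop any fragment
  url=${url%/}      # drop a trailing slash
  case "$url" in
    */actions/runs/*/job/*) ;;
    *) echo "not a job URL: $1" >&2; exit 1 ;;
  esac
  job_id=${url##*/job/}    # everything after the last /job/
  rest=${url%/job/*}       # everything before /job/
  run_id=${rest##*/runs/}  # everything after the last /runs/
  echo "run_id=$run_id job_id=$job_id"
)

parse_job_url "https://github.com/YOUR_ORG/YOUR_REPO/actions/runs/12054210358/job/33611707460"
# prints: run_id=12054210358 job_id=33611707460
```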

Viewing logs

Application logs

Application logs for the RunsOn control plane are available in CloudWatch. The log group name is exposed by the CloudFormation RunsOnServiceLogGroupName output.

There are multiple ways to access the logs:

For v3 and later, RunsOn provides a CLI --full mode to export a complete diagnostic archive for a GitHub job:

AWS_PROFILE=your-aws-profile roc logs https://github.com/YOUR_ORG/YOUR_REPO/actions/runs/RUN_ID/job/JOB_ID --full

This writes a roc-logs-<job_id>-<timestamp>.zip archive with the raw workflow-job item, RunsOn control-plane logs for the job and run, CloudTrail events for attempted instances, EC2 console output, and agent logs.

For live streaming, omit --full and pass --watch:

AWS_PROFILE=your-aws-profile roc logs https://github.com/YOUR_ORG/YOUR_REPO/actions/runs/RUN_ID/job/JOB_ID --watch

For more detail about an issue, you can also read the RunsOn logs directly, either from the CloudWatch UI or with the awslogs command:

pip install awslogs

Now replace the log group with the value of your RunsOnServiceLogGroupName stack output, and you can do:

AWS_PROFILE=your-aws-profile awslogs get --aws-region eu-west-1 \
YOUR_RUNS_ON_SERVICE_LOG_GROUP_NAME \
-wGS -s 30m --timestamp

You can also find the logs from the AWS UI, and apply filtering based on e.g. the workflow run id:
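The same filtering can be done from a terminal with the AWS CLI. This is a sketch, assuming AWS CLI v2 (which provides `aws logs tail`) and that the id appears verbatim in the log lines; the helper name tail_logs_for_id is hypothetical.

```shell
# Hypothetical helper: tail a CloudWatch log group, keeping only events that mention an id.
# $1 = log group (value of the RunsOnServiceLogGroupName stack output)
# $2 = run_id or job_id to search for
# $3 = AWS region
tail_logs_for_id() (
  log_group=$1
  id=$2
  region=$3
  # The quoted filter pattern matches the id as a literal term.
  aws logs tail "$log_group" \
    --region "$region" \
    --since 30m \
    --filter-pattern "\"$id\"" \
    --follow
)

# Example (not executed here):
#   tail_logs_for_id YOUR_RUNS_ON_SERVICE_LOG_GROUP_NAME 12054210358 eu-west-1
```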


On current v3 CloudFormation installs, application logs use the built-in retention configured by RunsOn. If you manage the stack yourself with Terraform/OpenTofu or custom infrastructure, verify that your CloudWatch log retention matches your own policy.

Instance cloud-init logs

Official images publish bootstrap logs, including the cloud-init boot process, to the EC2 instance log group in CloudWatch.

If runner OTEL is enabled for a job, the bootstrap output.log file can also be forwarded to your OTLP backend as described on /docs/observability/opentelemetry/.

You can use the RunsOn CLI to view all job logs:

roc logs https://github.com/owner/repo/actions/runs/123/job/456

These logs can also be retrieved from the AWS UI: CloudWatch > Log groups > <STACK_NAME>-runs-on-EC2InstanceLogGroup-<RANDOM_ID>.

Within that log group you will find a log stream for each instance and accompanying log file (e.g. i-0006f3ff78fcd11f4/cloud-init-output). You can filter using the instance ID.
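Because each stream name starts with the instance ID, the streams for one instance can also be read with the AWS CLI by prefix. A sketch under the same assumptions as any `aws logs tail` usage (AWS CLI v2, read access to the log group); the helper name instance_bootstrap_logs is hypothetical.

```shell
# Hypothetical helper: print all log streams for one instance from the EC2 instance log group.
# $1 = log group name, $2 = instance ID, $3 = AWS region
instance_bootstrap_logs() (
  log_group=$1
  instance_id=$2
  region=$3
  # Stream names look like <instance-id>/<log-file>, so filter on the "<instance-id>/" prefix.
  aws logs tail "$log_group" \
    --region "$region" \
    --since 7d \
    --log-stream-name-prefix "$instance_id/"
)

# Example (not executed here):
#   instance_bootstrap_logs "<STACK_NAME>-runs-on-EC2InstanceLogGroup-<RANDOM_ID>" i-0006f3ff78fcd11f4 eu-west-1
```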


Instance logs are kept for 7 days.

Instance console logs

You can use the RunsOn CLI to view the EC2 instance console logs:

roc logs https://github.com/owner/repo/actions/runs/123/job/456 --include=console

You can also retrieve the console logs through the AWS Console. In EC2, select your instance and then Actions > Monitor and troubleshoot > Get system log:


Note that system logs are sometimes only available a few minutes after the instance has been created.
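The system log is also available through the AWS CLI, which is handy when the instance ID comes from another script. A minimal sketch, assuming the AWS CLI is installed and the caller is allowed to call ec2:GetConsoleOutput; the helper name instance_console_output is hypothetical, and the --latest flag is only supported on Nitro-based instance types, so drop it if the call is rejected.

```shell
# Hypothetical helper: print the EC2 system log for one instance.
# $1 = instance ID, $2 = AWS region
instance_console_output() (
  instance_id=$1
  region=$2
  aws ec2 get-console-output \
    --instance-id "$instance_id" \
    --region "$region" \
    --latest \
    --query Output \
    --output text
)

# Example (not executed here):
#   instance_console_output i-0006f3ff78fcd11f4 eu-west-1
```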

CloudTrail events

If you’re getting errors about request limits or quotas, look at the CloudTrail events, especially the RunInstances API events, to see whether you are being rate limited.

For instance, in eu-west-1, the CloudTrail events can be accessed at:

https://eu-west-1.console.aws.amazon.com/cloudtrailv2/home?region=eu-west-1#/events?ReadOnly=false

Checking if a spot instance has been preempted

In the CloudTrail events, you can check if a spot instance has been preempted by checking for events with the name BidEvictedEvent.
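Both lookups in this section (RunInstances and BidEvictedEvent) can be run from the AWS CLI instead of the console. This is a sketch, assuming the AWS CLI is installed and the caller may call cloudtrail:LookupEvents; the helper name lookup_cloudtrail_events is hypothetical.

```shell
# Hypothetical helper: list recent CloudTrail events with a given event name.
# $1 = event name (e.g. RunInstances or BidEvictedEvent), $2 = AWS region
lookup_cloudtrail_events() (
  event_name=$1
  region=$2
  aws cloudtrail lookup-events \
    --region "$region" \
    --lookup-attributes "AttributeKey=EventName,AttributeValue=$event_name" \
    --max-results 50 \
    --query 'Events[].[EventTime, EventName, Username]' \
    --output table
)

# Examples (not executed here):
#   lookup_cloudtrail_events RunInstances eu-west-1
#   lookup_cloudtrail_events BidEvictedEvent eu-west-1
```

The table only shows summary fields. Error codes such as RequestLimitExceeded live inside the CloudTrailEvent JSON of each event, so change the --query expression to include CloudTrailEvent when you need them.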

Unexpected costs

AWS Config

With default settings, AWS Config records an event for every ephemeral EC2 resource RunsOn creates (Fleet, Network Interface, Volume), which can add up quickly. See Cost control › AWS Config for the fix.

Datadog

From one of our users:

We ran into a big spike in registered Datadog Infra Hosts after switching to RunsOn because Datadog’s automatic AWS integration was picking up the new instances. And of course, since this is Datadog, more hosts means a lot more money. https://docs.datadoghq.com/account_management/billing/aws/#aws-resource-exclusion gives an easy approach to ignoring these hosts, I’m just doing EC2: !provider:runs-on.com and that seems to be working.

apt and dpkg lock errors

If you encounter apt or dpkg lock errors like the following during your workflow jobs:

E: Could not get lock /var/lib/apt/lists/lock. It is held by process 1166 (python)
E: Unable to lock directory /var/lib/apt/lists/

or:

Run sudo apt-get update -qq && sudo apt-get install build-essential -y
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 1674 (dpkg)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Error: Process completed with exit code 100.

the most likely cause is that AWS is upgrading the SSM Agent in the background while your job is trying to run apt.

If you followed the recommended RunsOn setup and use a dedicated AWS account for RunsOn, disabling SSM Agent auto-updates is usually the right fix. RunsOn instances are short-lived, so letting AWS update SSM Agent in the background adds little value and can break package installation during job startup.

To fix this, disable SSM Agent auto-updates in your AWS account:

  1. Go to the AWS Systems Manager console.
  2. Navigate to Fleet Manager > Settings.
  3. Under Agent auto update, choose Delete to remove the State Manager association that automatically updates SSM Agent on your managed nodes.

See the AWS documentation for more details.

Alternatively, you can work around this issue by:

  • Adding a retry with backoff to your apt install commands.
  • Pre-installing the required packages in a custom image.
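The retry workaround can be sketched as a small shell function placed at the top of the workflow step. This is one possible shape, not an official RunsOn helper; the function name retry_with_backoff and the attempt and delay values are assumptions to adjust for your jobs.

```shell
# Hypothetical helper: run a command, retrying with exponential backoff on failure.
# $1 = maximum attempts, $2 = initial delay in seconds, remaining args = command to run
retry_with_backoff() (
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && exit 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      exit 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait before the next attempt
  done
)

# Example (not executed here):
#   retry_with_backoff 5 2 sudo apt-get update -qq
#   retry_with_backoff 5 2 sudo apt-get install -y build-essential
```

With 5 attempts and a 2 second initial delay, the function waits 2, 4, 8, and 16 seconds between attempts, which is usually long enough for a background package operation to release the lock.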

Contact support

Once you have gone through the checks in the relevant section above, send us an email at ops@runs-on.com if the issue persists. Include as many details as possible, such as:

  • RunsOn version.
  • AWS region.
  • Any error messages you see in the GitHub UI, email notifications, or CloudWatch logs.
  • CloudWatch logs for the RunsOn control plane (you can filter on the run_id or job_id — see Finding run_id and job_id), and instance logs if you have them.
  • Details about the workflows in error, especially the runs-on labels, number of jobs in the workflow, and any use of matrix jobs.