
Troubleshooting

Runners can fail to start for a variety of reasons. If an error is raised while attempting to start a workflow, RunsOn will alert you by email (assuming you confirmed the SNS topic notification subscription when you set up the stack).

Quick checks

Contact support

Once you’ve made the checks above, have a look at the various error cases below, and send us an email if the issue persists: [email protected]. Include as many details as possible, such as:

  • RunsOn version.
  • AWS region.
  • Any error messages you see.
  • CloudWatch logs for the AppRunner service (you can filter on the run_id or job_id), and instance logs if you have them.
  • Details about the workflows in error, especially the runs-on labels, number of jobs in the workflow, and any use of matrix jobs.

How to find run_id and job_id

The run_id and job_id are not easily visible in the GitHub UI. The easiest way is to open the log output of a job and extract the values from the URL.

For instance you may have a URL that looks like this:

https://github.com/YOUR_ORG/YOUR_REPO/actions/runs/12054210358/job/33611707460

In which case the run_id is 12054210358 and the job_id is 33611707460.
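If you prefer the command line, here is a sketch using the GitHub CLI (assuming gh is installed and authenticated; YOUR_ORG/YOUR_REPO and the run id are placeholders) that lists the job ids for a given run:

# List the job ids (databaseId) and names for a given run_id
gh run view 12054210358 --repo YOUR_ORG/YOUR_REPO \
  --json jobs --jq '.jobs[] | {job_id: .databaseId, name: .name}'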

All jobs queued indefinitely, or long queuing times for some workflows

RunsOn runners consistently start in ~30s for x64 and arm64. If you are seeing abnormal queuing times, the runner started for your workflow job may have been stolen by another workflow job.

For instance, if two workflow jobs A and B with the same runs-on labels are queued at the same time, the runner started for job A may actually pick up job B (since runner A's labels match those of job B), while job A has to wait for runner B to come online.

To avoid this and to help with debugging, it is best practice to ensure that each workflow job gets a unique label. This can be achieved by adding the current workflow run id as an additional label.

Make sure you are using the new single-label syntax available since v2.5.4, to make the label names as deterministic as possible:

runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
# even better
runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/tag=my-unique-tag-for-that-job"

If the problem persists:

  1. ensure that the repository is correctly enabled for your RunsOn GitHub App.
  2. ensure that webhooks are correctly delivered to your AppRunner service: go to your RunsOn GitHub App settings > Advanced, where you should see the most recent deliveries with their status codes.

If you need more help, please contact support (see above for links).

The case of matrix jobs

If you are using matrix jobs, note that github.run_id is not unique for each matrix job; it is only unique per workflow run, and unfortunately GitHub still doesn’t expose a JOB_ID variable for a job. So if you want a deterministic job <-> runner assignment, you can append a custom tag to identify each matrix job item:

jobs:
  build:
    strategy:
      matrix:
        node: [16, 18, 20]
    runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/tag=node-${{ matrix.node }}"

Also note that the max-parallel option is not supported by RunsOn: GitHub does not send an event when a job finally becomes ready to be scheduled; all job events are sent at the time the workflow run is created.

View the application logs

It can be useful to access the RunsOn application logs to see more details about an issue. This can be done either from the CloudWatch UI or with the awslogs command:

pip install awslogs

Now replace the log group (/aws/apprunner/...) with yours, and you can do:

AWS_PROFILE=your-aws-profile awslogs get --aws-region eu-west-1 \
/aws/apprunner/RunsOnService-6Gwxsz1vjfMD/356d75069c2c4ec89b0e452c51778ce8/application \
-wGS -s 30m --timestamp

You can also find the logs in the AWS console and filter them, e.g. by workflow run id.

Note: the log group created by the AppRunner application has no retention period set (not yet supported by CloudFormation). We recommend manually setting the retention to e.g. 30 days to avoid unnecessary costs.
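For example, assuming the log group name from above (replace with yours), a one-off AWS CLI call can set the retention period:

aws logs put-retention-policy --region eu-west-1 \
  --log-group-name /aws/apprunner/RunsOnService-6Gwxsz1vjfMD/356d75069c2c4ec89b0e452c51778ce8/application \
  --retention-in-days 30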

View the instance logs

Starting with v2.3.2, the CloudWatch agent is automatically set up and started on all instances that use or derive from the official images.

These logs can be seen in the AWS UI: CloudWatch > Log groups > <STACK_NAME>-runs-on-EC2InstanceLogGroup-<RANDOM_ID>.

Within that log group you will find a log stream for each instance and accompanying log file (e.g. i-0006f3ff78fcd11f4/cloud-init-output). You can filter using the instance ID.
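The awslogs command shown earlier also works here; as a sketch, this pulls only the streams for a single instance (replace the log group name and instance ID with yours):

AWS_PROFILE=your-aws-profile awslogs get --aws-region eu-west-1 \
  <STACK_NAME>-runs-on-EC2InstanceLogGroup-<RANDOM_ID> \
  "i-0006f3ff78fcd11f4.*" -s 2h --timestamp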

Instance logs are kept for 7 days.

View CloudTrail events

If you’re getting errors about request limits or quota issues, have a look at the CloudTrail events, especially the RunInstances API events, to see whether you are being rate limited.

For instance, in eu-west-1 the CloudTrail events can be accessed at:

https://eu-west-1.console.aws.amazon.com/cloudtrailv2/home?region=eu-west-1#/events?ReadOnly=false
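The same data is available from the AWS CLI. As a sketch (adjust region and profile), this fetches the most recent RunInstances events and counts how many mention RequestLimitExceeded, which indicates rate limiting:

# Count rate-limited calls among the last 50 RunInstances events
aws cloudtrail lookup-events --region eu-west-1 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --max-results 50 --query 'Events[].CloudTrailEvent' --output text \
  | grep -o 'RequestLimitExceeded' | wc -l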

Checking if a spot instance has been preempted

In the CloudTrail events, you can check whether a spot instance has been preempted by looking for events named BidEvictedEvent.
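Here too, a quick CLI sketch (adjust the region) lists those events for the recent lookup window:

# List recent spot preemption events
aws cloudtrail lookup-events --region eu-west-1 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=BidEvictedEvent \
  --query 'Events[].{time:EventTime,id:EventId}' --output table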

Failed to create instance

This error can happen due to multiple reasons:

PendingVerification

⚠️ Failed to create instance with type c7a.4xlarge: PendingVerification: Your request for accessing resources in this region is being validated, and you will not be able to launch additional resources in this region until the validation is complete. We will notify you by email once your request has been validated. While normally resolved within minutes, please allow up to 4 hours for this process to complete. If the issue still persists, then open a support case. [https://support.console.aws.amazon.com/support/home?region=us-east-1#/case/create?issueType=customer-service&serviceCode=account-management&categoryCode=account-verification]

This is usually resolved automatically within a few minutes, so just retry the workflow a bit later and it should work. Otherwise, open a support case.

RequestLimitExceeded

This usually happens if you are launching instances too quickly compared to the allowed rate limit for your account.

The rate limiting mechanism is detailed in https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html ↗, but this should no longer happen since v1.6.2.

RunsOn now defaults to the lowest rate limit (at most 2 RunInstances API calls per second).

If your account has a higher quota for those API calls, you can modify the queue size in the CloudFormation stack parameters to take advantage of it.
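As a sketch only: the change can be made with the AWS CLI, but the parameter key below is a placeholder (check the actual key in your stack's Parameters tab), and update-stack expects every other parameter to be passed with UsePreviousValue=true, so the CloudFormation console is usually the simpler route:

# "QueueSize" and "OtherParameter" are hypothetical keys: look up the real ones in the stack's Parameters tab
aws cloudformation update-stack --stack-name runs-on \
  --use-previous-template --capabilities CAPABILITY_IAM \
  --parameters ParameterKey=QueueSize,ParameterValue=4 \
               ParameterKey=OtherParameter,UsePreviousValue=true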

Unexpected costs

AWS Config

If you have AWS Config enabled in your AWS account with the default settings, it will record an event for every resource created in your account, including every EC2 instance created by RunsOn. Each EC2 instance triggers at least 3 events, which can quickly add up:

  • AWS EC2 Fleet
  • AWS EC2 Network Interface
  • AWS EC2 Volume

To avoid this, you should modify your AWS Config settings to skip recording those resource types in the AWS account where RunsOn is deployed.


You can also skip recording AWS EC2 Instance events if you have really high usage.
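As a sketch, the exclusion can be configured with the AWS CLI, assuming a configuration recorder named default and your existing AWS Config role ARN (check both with aws configservice describe-configuration-recorders). Verify the exact resource type identifiers against the AWS Config supported resource types list, and add AWS::EC2::Instance to the list if you also want to skip instance events:

# Record everything except the EC2 resource types created for each runner
aws configservice put-configuration-recorder \
  --configuration-recorder name=default,roleARN=arn:aws:iam::123456789012:role/your-aws-config-role \
  --recording-group '{
    "recordingStrategy": {"useOnly": "EXCLUSION_BY_RESOURCE_TYPES"},
    "exclusionByResourceTypes": {
      "resourceTypes": ["AWS::EC2::EC2Fleet", "AWS::EC2::NetworkInterface", "AWS::EC2::Volume"]
    }
  }'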