CloudWatch

RunsOn integrates with CloudWatch in two places: a built-in dashboard for control-plane health, and per-job runner metrics collected through `runs-on/action@v2`. Both keep observability inside your AWS account, with no extra backend to run.

Flex CloudWatch dashboard

RunsOn creates a CloudWatch dashboard for the deployed stack. Use it for a quick, AWS-native view of whether GitHub webhooks are arriving, jobs are being scheduled, queues are draining, and runner launches are behaving normally.

The dashboard answers operational questions such as:

Is the internal webhook/job queue growing or draining?
How many runners have been scheduled, and how does that trend over time?
What are the internal vs overall queue-duration percentiles (P50/P90)?
Are there recent error messages in the worker logs?
Is the spot circuit breaker active, and are there recent spot interruptions?

It is not meant to replace a full observability backend. The dashboard mostly uses structured CloudWatch logs and AWS service metrics, so it is less flexible than querying OTLP metrics and traces in Grafana, Datadog, SigNoz, New Relic, or another OTLP-compatible backend.

The dashboard

RunsOn always creates a CloudWatch dashboard for the deployed stack (there is no enable/disable parameter). It is created in the AWS account and region where the stack is deployed. Open CloudWatch → Dashboards and look for the dashboard created for your RunsOn stack.

Runner CloudWatch metrics via `runs-on/action@v2`

`runs-on/action@v2` can collect per-job resource metrics (CPU, memory, disk, network, and I/O) through CloudWatch and render them as ASCII charts in the `Post Run runs-on/action@v2` step. The flow is:

Add `runs-on/action@v2` to the job.
Request one or more metric groups with the `metrics:` input.
The action configures CloudWatch metrics collection.
In the `Post Run runs-on/action@v2` step, the action queries CloudWatch and renders the ASCII charts.

This path does not generate or upload metrics.jsonl, and does not work with container-based jobs (unlike the built-in runner metrics).

Supported metric groups

Metric group	Available metrics	What it helps you answer
CPU	`usage_user`, `usage_system`	Is the runner CPU-bound?
Network	`bytes_recv`, `bytes_sent`	Is the job moving a lot of data?
Memory	`used_percent`	Is the runner memory-constrained?
Disk	`used_percent`, `inodes_used`	Is the workspace or filesystem filling up?
I/O	`io_time`, `reads`, `writes`	Is the job bottlenecked on disk activity?

Configure the action

jobs:
  build:
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64
    steps:
      # Request all five metric groups; the charts are rendered in the post step.
      - uses: runs-on/action@v2
        with:
          metrics: cpu,network,memory,disk,io

      - uses: actions/checkout@v6
      - name: Build application
        run: npm run build

You can also request a smaller subset, for example `metrics: cpu,memory`. Example output from the post step:

📈 Metrics (since 2025-06-30T14:18:56Z):

📊 CPU User:
   100.0 ┤
    87.5 ┤                                        ╭─╮╭───────────╮
    75.0 ┤                                       ╭╯ ╰╯           │
    62.5 ┤                                      ╭╯               ╰╮
    50.0 ┤                                      │                 │
    37.5 ┤                                      │                 ╰╮
    25.0 ┤                                     ╭╯                  │
    12.5 ┤                    ╭─────────╮╭─────╯                   ╰╮
     0.0 ┼────────────────────╯         ╰╯                          ╰
                               CPU User (Percent)
  Stats: min:0.0 avg:29.0 max:93.4 Percent
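
As a minimal sketch of the smaller subset mentioned earlier, the job below requests only the CPU and memory groups. The job name `test` and the `npm test` step are illustrative placeholders; the runner label, action reference, and `metrics:` input are the same as in the example above.

```yaml
jobs:
  test:                                  # hypothetical job name
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64
    steps:
      - uses: runs-on/action@v2
        with:
          # Only the CPU and memory groups are requested here.
          metrics: cpu,memory

      - uses: actions/checkout@v6
      - name: Run tests                  # illustrative step
        run: npm test
```

Requesting fewer groups should leave the post step with correspondingly fewer charts to read.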

For the other per-job metric paths — built-in inline charts and runner OpenTelemetry export — see OpenTelemetry.

Configure OpenTelemetry export

The control plane can also push its own logs, metrics, and traces to any OTLP-compatible backend. For Flex, set the OTLP endpoint (and optional headers) on the stack:

OtelExporterEndpoint — the OTLP endpoint to export to.
OtelExporterHeaders — authentication headers for the backend.

Once an endpoint is configured, server-side export is automatic. For the full list of server metrics, attributes, and resource attributes — plus runner-side export — see the OpenTelemetry reference.
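
One way to set these two parameters is sketched below as a parent CloudFormation template that deploys RunsOn as a nested stack. Only the parameter names `OtelExporterEndpoint` and `OtelExporterHeaders` come from this page. The logical ID, the template URL placeholder, the endpoint, and the header value and its `key=value` format are assumptions for illustration, so check them against your backend and the OpenTelemetry reference. Other stack parameters are omitted.

```yaml
# Hypothetical parent template; only the two Otel* parameter names come from this page.
Resources:
  RunsOnStack:                           # hypothetical logical ID
    Type: AWS::CloudFormation::Stack
    Properties:
      # Placeholder: replace with the URL of the RunsOn template you deploy.
      TemplateURL: "<URL of the RunsOn CloudFormation template>"
      Parameters:
        # Other RunsOn stack parameters are omitted from this sketch.
        OtelExporterEndpoint: "https://otlp.example.com:4318"   # assumed endpoint
        OtelExporterHeaders: "Authorization=Bearer <token>"     # assumed format and value
```

Setting the same two parameters through a regular stack update in the CloudFormation console should have the same effect. The nested-stack form is used here only because it keeps the values in a reviewable file.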

What to use for each problem

Problem	Use
Check whether the control plane is healthy	The Flex CloudWatch dashboard
Get per-job CPU/memory/disk/network charts	The `runs-on/action@v2` metrics above
Debug one GitHub Actions job	Runner metrics and the job log
Send server logs, metrics, and traces to an observability backend	OpenTelemetry
Get failure notifications by email or Slack	Alerts
Track daily spend and budget alarms	Cost control

How it relates to runner metrics

The stack dashboard is about the RunsOn control plane: incoming GitHub events, scheduling, queues, API pressure, and AWS service behavior.

Runner metrics are about a single runner while it executes a job: CPU, memory, disk, network, I/O, and runner metadata.

If a job is slow or oversized, start with runner metrics. If many jobs are delayed, failing to launch, or stuck behind queue/API pressure, start with the dashboard.