Observability

RunsOn exposes several independent monitoring surfaces — some always on per job, some opt-in per job, some wired at the stack level. They are easy to confuse because OpenTelemetry, CloudWatch, the GitHub job log, and the roc CLI can all show similar-looking data. This page lays them out one by one, says where each one ships data, and is explicit about which ones Fleet supports today.

Usage

Flex

Always on. A normal RunsOn job already writes inline ASCII charts and instance metadata into the job log. No label, no action required.

jobs:
  build:
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64

Per-job CloudWatch metrics. Add runs-on/action@v2 with the metric groups you want; the action configures the CloudWatch agent and renders post-step charts. CloudWatch is the more expensive backend, so prefer the OTLP path below when you have an OTLP backend.

- uses: runs-on/action@v2
  with:
    metrics: cpu,memory,disk

Per-job OTLP export. When the Flex stack is configured with OtelExporterEndpoint, opt a job into runner-side OTLP with extras=otel. The runner then ships bootstrap logs, host metrics, and agent traces to your OTLP backend.

jobs:
  observe:
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/extras=otel

roc logs is unrelated to any of the above — it pulls instance and control-plane logs from CloudWatch for a specific job URL, for ad-hoc debugging.

AWS_PROFILE=runs-on-admin roc logs "$JOB_URL" --include=console --watch

Fleet

Fleet supports the shared per-job surfaces: inline runner metrics, runner metadata, and runs-on/action@v2 CloudWatch metrics. Fleet's operator-facing surfaces are narrower today, so use raw CloudWatch logs and the Fleet troubleshooting page for control-plane checks.

Per-job surfaces (always on)

Built-in runner metrics (`metrics.jsonl`)

The local collector runs on every RunsOn runner. It samples host metrics during the job, writes them to metrics.jsonl, renders ASCII charts inside the Complete runner step, and uploads the raw file to the RunsOn S3 bucket for later inspection.

You get this without enabling anything. See OpenTelemetry for the per-job paths and example chart output.

Runner metadata in the job log

The Set up job step in the GitHub UI shows the EC2 instance type, lifecycle (spot vs on-demand), AMI, availability zone, pool name, and boot timings. This is useful for spot-checking what RunsOn actually launched, and it is also always on.

Opt-in per-job surfaces

`runs-on/action@v2` CloudWatch metrics

Add the action with the metric groups you want: cpu, network, memory, disk, io. The action configures the CloudWatch agent on the runner, then a post step queries CloudWatch and renders charts inside the Post Run runs-on/action@v2 step. See the action README for the full input list.

CloudWatch metric ingestion is billed per metric per month, which makes this the more expensive surface for high-cardinality runners. It is still useful when you don’t have an OTLP backend, or when you want the action’s other features (cost summary, sccache, etc.).

This path does not work for container-based jobs and does not write metrics.jsonl.
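
A complete job using this path might look like the sketch below. The runner label and the `metrics` input come from this page; the workflow name, the trigger, the checkout step, and the build command are placeholders. Placing the action first, so the agent is configured before the work starts, is an assumption rather than a documented requirement.

```yaml
# Hypothetical workflow file, e.g. .github/workflows/build.yml
name: build-with-metrics            # placeholder name
on: push                            # placeholder trigger

jobs:
  build:
    # Runner label as shown in the Usage section
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64
    steps:
      # Assumed ordering: configure the CloudWatch agent before the real work
      - uses: runs-on/action@v2
        with:
          # All five metric groups listed above; trim to what you need
          metrics: cpu,network,memory,disk,io
      - uses: actions/checkout@v4
      # Placeholder for your actual build command
      - run: make build
```

Because CloudWatch bills per metric, requesting fewer groups (for example `cpu,memory`) is the simplest way to keep this surface cheap on busy repositories.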

`extras=otel` runner-side OTLP export

Add extras=otel to a job label (or to the pool runner spec) when you want the runner to ship its bootstrap logs, host metrics, and agent traces to your OTLP backend. The stack must already be configured with OtelExporterEndpoint.

This is the path forward for long-term observability — OTLP is cheaper per byte than CloudWatch and works with any OTLP-compatible backend (Grafana Cloud, SigNoz, Datadog, Honeycomb, etc.). See OpenTelemetry for the full emitted-signal inventory.

Stack-level surfaces (operator-facing)

Server-side OTLP and per-step job spans

When OtelExporterEndpoint is configured, the Flex control plane exports its own structured logs, the RunsOn server-metrics inventory (job counters, queue durations, pool state, spot circuit breaker, rate limiters), and one span per GitHub workflow step. The per-step spans are emitted automatically from the server — workflow YAML does not need extras=otel to get them.
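
One way to inspect what the stack emits is to point OtelExporterEndpoint at an OpenTelemetry Collector you operate yourself. The sketch below is a minimal Collector config that accepts logs, metrics, and traces and prints a summary of each batch. It assumes a stock Collector build that includes the `otlp` receiver, `batch` processor, and `debug` exporter. The ports are the OTLP defaults. This page does not say whether RunsOn sends OTLP over gRPC or HTTP, so both are enabled here.

```yaml
# Minimal OpenTelemetry Collector config (illustrative, not a RunsOn file)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317    # default OTLP/gRPC port
      http:
        endpoint: 0.0.0.0:4318    # default OTLP/HTTP port

processors:
  batch: {}                       # batch signals before exporting

exporters:
  debug:
    verbosity: basic              # print a one-line summary per batch

service:
  pipelines:
    logs:                         # control-plane and bootstrap logs
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:                      # server metrics and runner host metrics
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    traces:                       # per-step spans and agent traces
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```

In a real deployment you would replace the `debug` exporter with the exporter for your backend; the three pipelines mirror the three signal types described above.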

CloudWatch stack dashboard

Flex creates a CloudWatch dashboard for control-plane health: webhook ingress latency, Lambda invocations, queue depth, GitHub rate-limit pressure, and launch behavior. It is created automatically with the stack (no toggle). See CloudWatch.

Cost report and budget alarm

A daily cost-report email summarizes spend on the deployed stack, and an AWS Budgets daily alarm fires when usage crosses a configured threshold. Configured with AppBudgetDailyUsd and CostReportsEnabled. See Cost control.

SNS alerts and Slack

Flex provisions an SNS topic and an email subscription for the out-of-band situations that can’t be surfaced on the run: truly unschedulable jobs and license issues. An optional Slack webhook receives the same notifications. Configured with the EmailAddress stack parameter (required at install) and AlertTopicSlackWebhookUrl (optional). See Flex alerts.

Housekeeping watchdogs

Background reconcilers detect idle runners (10-minute timeout) and hard-limit timeouts (12 hours), retry spot interruptions, and clean up stale resources. They are mostly invisible until something goes wrong. See Housekeeping.

OTEL vs CloudWatch

When you have a choice, prefer OTLP:

OTLP ingestion is much cheaper than CloudWatch per byte, especially for metrics with attributes.
OTLP backends (Grafana Cloud, SigNoz, Datadog, Honeycomb, …) give you richer correlation between server metrics, runner host metrics, and per-step spans than CloudWatch alone.
CloudWatch is still the right answer when you don’t have an OTLP backend, when you want the AWS-native dashboard with no extra wiring, or when you specifically want the runs-on/action@v2 post-step ASCII charts.

These are not mutually exclusive — extras=otel and runs-on/action@v2 can run on the same job.
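
A job that uses both paths at once could be sketched as follows. The label and the action inputs are the ones shown earlier on this page; the job name and the build command are placeholders, and the sketch assumes the Flex stack already has OtelExporterEndpoint configured.

```yaml
jobs:
  observe-both:                     # placeholder job name
    # extras=otel opts the runner into OTLP export
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/extras=otel
    steps:
      # The action adds CloudWatch metrics and the post-step charts
      - uses: runs-on/action@v2
        with:
          metrics: cpu,memory,disk
      # Placeholder for your actual build command
      - run: make build
```

Running both doubles up on host metrics, so this is mainly useful while migrating from CloudWatch to an OTLP backend or when you want the post-step charts alongside OTLP data.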

Flex vs Fleet

Today, Fleet supports the per-job surfaces that come bundled with the shared runner agent and CLI. OTLP, the CloudWatch stack dashboard, and budgets/cost reports are still Flex-only; SNS/Slack alerting is shared and works on Fleet too. Platform teams considering Fleet should plan for control-plane observability separately (raw CloudWatch logs + your own dashboards) until these land.

| Surface | Flex | Fleet |
| --- | --- | --- |
| Built-in runner metrics (`metrics.jsonl`) | ✓ | ✓ |
| Runner metadata in job log | ✓ | ✓ |
| `runs-on/action@v2` CloudWatch | ✓ | ✓ on the runner (shared action) |
| `roc logs` / `roc connect` | ✓ | Partial (see CLI product support) |
| Housekeeping watchdogs | ✓ | ✓ (shared reconciler) |
| Runner-side OTLP (`extras=otel`) | ✓ (with stack OTLP endpoint configured) | ✗ (no OTLP endpoint variable in Fleet Terraform) |
| Server-side OTLP + per-step spans | ✓ | ✗ (Flex-only emitter) |
| CloudWatch stack dashboard | ✓ | ✗ |
| Cost report + daily budget alarm | ✓ | ✗ |
| SNS alerts + Slack | ✓ | ✓ (shared alerts module) |

OpenTelemetry: Per-job runner metrics, `extras=otel`, the OTLP endpoint, and the full server + runner metric inventory.
CloudWatch: CloudWatch dashboard for control-plane health, plus per-job metrics via runs-on/action@v2.
Cost control: Spot tuning, daily budget, cost reports, and termination safeguards.
Flex alerts: SNS email subscription and optional Slack webhook for failure notifications.
Housekeeping: Idle, hard-limit, and reconciliation watchdogs that bound runaway runners.
Fleet troubleshooting: What is available today for Fleet operators.