
Stack Metrics

Monitor your RunsOn installation with OpenTelemetry and CloudWatch

RunsOn provides comprehensive monitoring capabilities through two complementary approaches:

  • OpenTelemetry (OTEL) metrics - Export detailed metrics to your observability platform
  • CloudWatch integration - Built-in dashboard and native AWS metrics

OpenTelemetry Metrics

RunsOn exports metrics in OpenTelemetry format via OTLP (currently over HTTP only), so you can integrate with popular observability platforms such as Prometheus, Grafana Cloud, SigNoz, Datadog, New Relic, and others.

Configuration

Configure OTEL metrics export using these CloudFormation parameters:

| Parameter | Description | Example |
| --- | --- | --- |
| OtelExporterEndpoint | OTLP endpoint URL (e.g., ingest.eu.signoz.cloud:443). Only HTTP(s) protocol is supported. | ingest.eu.signoz.cloud:443 |
| OtelExporterHeaders | Headers for OTLP endpoint in W3C Baggage format: key1=value1,key2=value2 | signoz-ingestion-key=ABCD1234 |
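
To make the OtelExporterHeaders format concrete, here is a minimal Python sketch that splits a key1=value1,key2=value2 string into individual headers. The function name parse_otel_headers is hypothetical, and the percent-decoding step follows the W3C Baggage convention; how RunsOn itself parses the value is an assumption, not something stated here.

```python
from urllib.parse import unquote


def parse_otel_headers(raw: str) -> dict:
    """Parse 'key1=value1,key2=value2' into a dict of header names to values.

    Hypothetical helper for illustration; not part of RunsOn.
    """
    headers = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input and trailing commas
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header entry: {pair!r}")
        # Assumption: values may be percent-encoded, as in W3C Baggage.
        headers[key.strip()] = unquote(value.strip())
    return headers
```

Running it on the example value from the table is a quick way to check a header string for stray spaces or a missing = before putting it into the stack parameters.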

Runner Logs, Traces, and Host Metrics

If you set extras=otel on a job, RunsOn starts a local OTEL collector on the runner. It forwards runner logs, traces, and host metrics to the OTLP endpoint configured above, so you can follow a job from the moment RunsOn receives it until the runner is terminated, including spans for each job step.

jobs:
  build:
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/extras=otel
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test

You can combine it with other extras. For example, extras=s3-cache+otel enables Magic Cache and runner-side OTEL collection on the same job.
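
The label layout used above (segments separated by /, extras joined with +) can be sketched as a small Python helper, for instance when generating workflow files from a script. The function runs_on_label is hypothetical and only mirrors the label shape shown in the example; it does not validate runner or extras names.

```python
from typing import Sequence


def runs_on_label(run_id: str, runner: str, extras: Sequence[str] = ()) -> str:
    """Build a runs-on label string in the shape shown in the workflow example.

    Hypothetical helper: segments are joined with '/', extras with '+'.
    """
    parts = [f"runs-on={run_id}", f"runner={runner}"]
    if extras:
        parts.append("extras=" + "+".join(extras))
    return "/".join(parts)
```

For example, runs_on_label("${{ github.run_id }}", "2cpu-linux-x64", ["s3-cache", "otel"]) produces a label that enables both Magic Cache and runner-side OTEL collection.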

Available Metrics

Job Metrics

runs_on_jobs_total (Counter) Total number of jobs by status.

Attributes:

  • status: Job status (queued, scheduled, in_progress, completed)
  • conclusion: Job conclusion for completed status (success, failure, cancelled, skipped)
  • repo_full_name: Repository (e.g., owner/repo)
  • workflow_name: GitHub workflow name
  • instance_type: EC2 instance type (only for scheduled status)
  • instance_lifecycle: spot or on-demand (only for scheduled status)
  • pool_name: Pool name if scheduled from a pool
  • interrupted: Whether the job was interrupted
  • org: GitHub organization name
  • installation_id: GitHub App installation ID

runs_on_internal_queue_duration_seconds (Histogram) Time from job queued in RunsOn to instance scheduled.

runs_on_overall_queue_duration_seconds (Histogram) Time from job queued by GitHub to job started (includes instance launch and runner bootstrap).

runs_on_job_duration_seconds (Histogram) Time from job started to completed.

Pool Metrics

runs_on_pool_instances_total (Observable Gauge) Current number of pool instances by state.

Attributes:

  • pool_name: Pool name
  • state: Instance state (running, stopped, pending, terminating)
  • installation_id: GitHub App installation ID
  • org: GitHub organization name

Rate Limiter Metrics

runs_on_rate_limiter_tokens (Observable Gauge) Available tokens in rate limiter.

runs_on_rate_limiter_burst (Observable Gauge) Burst capacity of rate limiter.

Attributes:

  • limiter: Rate limiter name (github_api, ec2_api, etc.)

Spot Circuit Breaker Metrics

runs_on_spot_circuit_breaker_active (Observable Gauge) Whether spot circuit breaker is currently active (1 = active, 0 = inactive).

Resource Attributes

All metrics include these resource attributes:

| Attribute | Description |
| --- | --- |
| service.name | Always runs-on-server |
| app.version | RunsOn version |
| app.environment | Environment name |
| stack_name | CloudFormation stack name |
| region | AWS region |

Structured Logs

In addition to OTLP metrics export, RunsOn emits periodic structured logs (JSON) containing metric snapshots. These logs are available in CloudWatch Logs and include:

  • Job summaries (metric_type=jobs_summary): Cumulative job counts
  • Job events (metric_type=job_event): Individual job lifecycle events
  • Pool instances (metric_type=pool_instances): Current pool state
  • Rate limiters (metric_type=rate_limiter): Rate limiter state
  • Spot interruptions (metric_type=spot_interruption): Spot interruption events
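
If you export these logs (for example to a file or a stream of lines), the metric_type field is enough to sort records by kind. The sketch below is a minimal Python example under that assumption; only the metric_type key comes from the list above, and any other field in a record is treated as opaque.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_metric_type(lines: Iterable[str]) -> Counter:
    """Count structured log records per metric_type.

    Lines that are not JSON objects, or that lack metric_type, are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text log line, not a metric snapshot
        if isinstance(record, dict) and "metric_type" in record:
            counts[record["metric_type"]] += 1
    return counts
```

Passing the lines of an exported log file to count_by_metric_type gives a quick overview of how many job events, pool snapshots, and spot interruptions were logged.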

Querying Metrics

Example Prometheus queries:

# Job throughput (completed jobs/sec)
sum(rate(runs_on_jobs_total{status="completed"}[5m]))

# Job success rate
sum(rate(runs_on_jobs_total{status="completed",conclusion="success"}[5m])) /
sum(rate(runs_on_jobs_total{status="completed"}[5m]))

# Median internal queue duration (p50), aggregated across all series
histogram_quantile(0.5, sum by (le) (rate(runs_on_internal_queue_duration_seconds_bucket[5m])))

# Pool capacity by org
sum by (org, pool_name) (runs_on_pool_instances_total)

# Circuit breaker status
runs_on_spot_circuit_breaker_active
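
These queries can also be run from a script through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The Python sketch below assumes the metrics end up in a Prometheus-compatible server reachable at a base URL you provide; the function names and the base URL are placeholders, not part of RunsOn.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Job success rate, same expression as in the example queries.
SUCCESS_RATE_QUERY = (
    'sum(rate(runs_on_jobs_total{status="completed",conclusion="success"}[5m]))'
    " / "
    'sum(rate(runs_on_jobs_total{status="completed"}[5m]))'
)


def instant_query_url(base_url: str, query: str) -> str:
    """Build the URL for a Prometheus instant query."""
    params = urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def run_instant_query(base_url: str, query: str) -> list:
    """Run the query and return the 'result' list from the API response."""
    with urlopen(instant_query_url(base_url, query), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]
```

Calling run_instant_query("http://localhost:9090", SUCCESS_RATE_QUERY) against a local Prometheus would return the current success ratio as a single sample, provided the server has scraped or received the RunsOn metrics.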

CloudWatch Dashboard

RunsOn automatically creates a comprehensive CloudWatch dashboard when you deploy the stack. The dashboard provides real-time visibility into your runner operations without requiring any external tools.

Dashboard Widgets

The embedded dashboard includes:

Job Monitoring:

  • Jobs currently queued (SQS queue depth)
  • Total runners scheduled in current period
  • Runners scheduled over time (5-minute intervals)
  • Queue duration percentiles (P50/P90 for internal and overall queue times)
  • Completed jobs by conclusion (success/failure/cancelled)
  • Job status summary over time

Rate Limiter Monitoring:

  • EC2 API rate limiters (Read, Run, Terminate, Mutating operations)
  • S3 API rate limiter
  • GitHub API rate limiter
  • Real-time token availability and burst capacity

Spot Instance Management:

  • Spot circuit breaker status
  • Interruption count tracking
  • Recent spot interruptions with details

Pool Management:

  • Pool instances over time by state (hot, stopped, warming, ready, etc.)
  • Pool capacity tracking across multiple pools

Operational Monitoring:

  • Recent error messages (latest 50)
  • Webhook redelivery statistics
  • Webhook redelivery success details

Accessing the Dashboard

Open the CloudWatch console and select “Dashboards” in the left sidebar. The dashboard is named RunsOn-<StackName>-Dashboard.

Dashboard Queries

All dashboard widgets use CloudWatch Logs Insights queries on structured log data. This provides powerful filtering and aggregation capabilities without additional infrastructure.
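
You can run similar Logs Insights queries yourself. The Python sketch below builds a query that filters on metric_type and runs it through a boto3 CloudWatch Logs client passed in by the caller (start_query and get_query_results are the standard boto3 calls). The log group name is not given in this page, so it is a parameter you must supply, and the query is an illustration rather than the exact query used by the dashboard widgets.

```python
import time


def insights_query(metric_type: str, limit: int = 20) -> str:
    """Build a Logs Insights query for one kind of structured log record."""
    if '"' in metric_type:
        raise ValueError("metric_type must not contain double quotes")
    return (
        "fields @timestamp, @message"
        f' | filter metric_type = "{metric_type}"'
        " | sort @timestamp desc"
        f" | limit {limit}"
    )


def run_insights_query(logs_client, log_group, query, lookback_s=3600, poll_s=1.0):
    """Start a query over the last lookback_s seconds and wait for it to finish.

    logs_client is expected to be boto3.client("logs") or a compatible object.
    """
    now = int(time.time())
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=now - lookback_s,
        endTime=now,
        queryString=query,
    )
    while True:
        result = logs_client.get_query_results(queryId=started["queryId"])
        # Scheduled and Running are the only non-terminal statuses.
        if result["status"] not in ("Scheduled", "Running"):
            return result
        time.sleep(poll_s)
```

For example, run_insights_query(boto3.client("logs"), "<your RunsOn log group>", insights_query("spot_interruption")) would return the most recent spot interruption records from the last hour.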

CloudWatch Metrics

RunsOn publishes custom metrics to CloudWatch in the RunsOn namespace:

  • Consumed minutes: Track runner usage across multiple dimensions
  • Custom job metrics: Available when using the runs-on/action@v2 action

You can use these metrics to:

  • Set up CloudWatch alarms for budget monitoring
  • Track usage patterns by organization, repository, or workflow
  • Create custom dashboards for specific use cases
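
Before creating alarms or dashboards, it helps to see which metric names and dimensions actually exist in the RunsOn namespace of your account. The Python sketch below lists them with the boto3 CloudWatch list_metrics paginator. The client is passed in by the caller, and the function name is hypothetical; metric names are discovered at runtime rather than assumed.

```python
def list_namespace_metrics(cloudwatch_client, namespace="RunsOn"):
    """Return (metric_name, dimensions) pairs published in a namespace.

    cloudwatch_client is expected to be boto3.client("cloudwatch")
    or a compatible object.
    """
    paginator = cloudwatch_client.get_paginator("list_metrics")
    found = []
    for page in paginator.paginate(Namespace=namespace):
        for metric in page["Metrics"]:
            # Flatten the list of {"Name", "Value"} pairs into a dict.
            dims = {d["Name"]: d["Value"] for d in metric.get("Dimensions", [])}
            found.append((metric["MetricName"], dims))
    return found
```

The returned names and dimensions can then be used as inputs when defining a CloudWatch alarm or a custom dashboard widget.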

Go Runtime Metrics

When OTEL metrics are enabled, RunsOn also automatically exports Go runtime metrics, so that you can monitor the health and performance of the RunsOn service itself.