Stack Metrics - RunsOn

RunsOn provides comprehensive monitoring capabilities through two complementary approaches:

OpenTelemetry (OTEL) metrics - Export detailed metrics to your observability platform
CloudWatch integration - Built-in dashboard and native AWS metrics

OpenTelemetry Metrics

RunsOn exports metrics in OpenTelemetry format over OTLP (HTTP only for now), so you can send them to observability platforms such as Prometheus, Grafana Cloud, SigNoz, Datadog, New Relic, and others.

Configuration

Configure OTEL metrics export using these CloudFormation parameters:

Parameter	Description	Example
`OtelExporterEndpoint`	OTLP endpoint URL. Only the HTTP(S) protocol is supported.	`ingest.eu.signoz.cloud:443`
`OtelExporterHeaders`	Headers for OTLP endpoint in W3C Baggage format: `key1=value1,key2=value2`	`signoz-ingestion-key=ABCD1234`
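
As a rough illustration of the header format, the sketch below joins a dictionary of headers into the `key1=value1,key2=value2` form and wraps both settings in the `ParameterKey`/`ParameterValue` shape that the CloudFormation API accepts. The helper names are hypothetical, and percent-encoding the values is an assumption based on the W3C Baggage convention; this page does not say how RunsOn decodes them.

```python
from urllib.parse import quote


def format_otel_headers(headers: dict[str, str]) -> str:
    """Join headers as key1=value1,key2=value2 (hypothetical helper)."""
    parts = []
    for key, value in headers.items():
        # Assumption: values are percent-encoded, as in W3C Baggage, so a
        # comma or '=' inside a value is not mistaken for a separator.
        parts.append(f"{key.strip()}={quote(value.strip(), safe='')}")
    return ",".join(parts)


def otel_stack_parameters(endpoint: str, headers: dict[str, str]) -> list[dict[str, str]]:
    """Build a CloudFormation Parameters list for the two OTEL settings."""
    return [
        {"ParameterKey": "OtelExporterEndpoint", "ParameterValue": endpoint},
        {"ParameterKey": "OtelExporterHeaders", "ParameterValue": format_otel_headers(headers)},
    ]


params = otel_stack_parameters(
    "ingest.eu.signoz.cloud:443",
    {"signoz-ingestion-key": "ABCD1234"},
)
print(params)
```

The resulting list could be passed as the `Parameters` argument of a stack update; all other stack parameters are omitted here.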

Runner Logs, Traces, and Host Metrics

If you set extras=otel on a job, RunsOn starts a local OTEL collector on the runner. It forwards runner logs, traces, and host metrics to the OTLP endpoint configured above, so you can follow a job from the moment RunsOn receives it until the runner is terminated, including spans for each job step.

jobs:
  build:
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/extras=otel
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test

You can combine it with other extras. For example, extras=s3-cache+otel enables Magic Cache and runner-side OTEL collection on the same job.
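
If you generate workflow files programmatically, the label can be assembled from its segments. The following is a minimal sketch; `runs_on_label` is a hypothetical helper, not part of RunsOn, and it only covers the `runner` and `extras` segments used on this page.

```python
from typing import Optional, Sequence


def runs_on_label(runner: str, extras: Optional[Sequence[str]] = None) -> str:
    """Assemble a runs-on label; multiple extras are joined with '+'."""
    # The GitHub expression is kept as a literal string; GitHub Actions
    # expands it when the workflow runs.
    segments = ["runs-on=${{ github.run_id }}", f"runner={runner}"]
    if extras:
        segments.append("extras=" + "+".join(extras))
    return "/".join(segments)


print(runs_on_label("2cpu-linux-x64", ["s3-cache", "otel"]))
# runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/extras=s3-cache+otel
```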

Available Metrics

Job Metrics

runs_on_jobs_total (Counter) Total number of jobs by status.

Attributes:

status: Job status (queued, scheduled, in_progress, completed)
conclusion: Job conclusion for completed status (success, failure, cancelled, skipped)
repo_full_name: Repository (e.g., owner/repo)
workflow_name: GitHub workflow name
instance_type: EC2 instance type (only for scheduled status)
instance_lifecycle: spot or on-demand (only for scheduled status)
pool_name: Pool name if scheduled from a pool
interrupted: Whether the job was interrupted
org: GitHub organization name
installation_id: GitHub App installation ID

runs_on_internal_queue_duration_seconds (Histogram) Time from job queued in RunsOn to instance scheduled.

runs_on_overall_queue_duration_seconds (Histogram) Time from job queued by GitHub to job started (includes instance launch and runner bootstrap).

runs_on_job_duration_seconds (Histogram) Time from job started to completed.
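
To make the three intervals concrete, the sketch below derives them from a single job's lifecycle timestamps. The `JobTimeline` class and its field names are hypothetical and only mirror the definitions above; they are not RunsOn's internal data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobTimeline:
    github_queued_at: datetime   # job queued by GitHub
    runs_on_queued_at: datetime  # job queued in RunsOn
    scheduled_at: datetime       # instance scheduled
    started_at: datetime         # job started on the runner
    completed_at: datetime       # job completed

    def internal_queue_seconds(self) -> float:
        # Queued in RunsOn -> instance scheduled.
        return (self.scheduled_at - self.runs_on_queued_at).total_seconds()

    def overall_queue_seconds(self) -> float:
        # Queued by GitHub -> job started (includes launch and bootstrap).
        return (self.started_at - self.github_queued_at).total_seconds()

    def job_duration_seconds(self) -> float:
        # Job started -> job completed.
        return (self.completed_at - self.started_at).total_seconds()


t0 = datetime(2024, 1, 1, 12, 0, 0)
job = JobTimeline(
    github_queued_at=t0,
    runs_on_queued_at=t0 + timedelta(seconds=2),
    scheduled_at=t0 + timedelta(seconds=5),
    started_at=t0 + timedelta(seconds=40),
    completed_at=t0 + timedelta(seconds=340),
)
print(job.internal_queue_seconds(), job.overall_queue_seconds(), job.job_duration_seconds())
# 3.0 40.0 300.0
```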

Pool Metrics

runs_on_pool_instances_total (Observable Gauge) Current number of pool instances by state.

Attributes:

pool_name: Pool name
state: Instance state (running, stopped, pending, terminating)
installation_id: GitHub App installation ID
org: GitHub organization name

Rate Limiter Metrics

runs_on_rate_limiter_tokens (Observable Gauge) Available tokens in rate limiter.

runs_on_rate_limiter_burst (Observable Gauge) Burst capacity of rate limiter.

Attributes:

limiter: Rate limiter name (github_api, ec2_api, etc.)
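
For readers unfamiliar with the two gauges, here is a generic token-bucket sketch showing how "available tokens" and "burst capacity" relate. It is an illustration only: this page does not describe RunsOn's limiter implementation, and the class, refill rate, and clock handling below are assumptions.

```python
class TokenBucket:
    """Generic token bucket with an explicit clock (illustrative only)."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0) -> None:
        self.rate_per_sec = rate_per_sec  # refill rate
        self.burst = burst                # maximum tokens the bucket can hold
        self.tokens = float(burst)        # currently available tokens
        self.last = now                   # time of the last refill

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last)
        # Tokens accumulate over time but never exceed the burst capacity.
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate_per_sec)
        self.last = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; return whether the call may proceed."""
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=2.0, burst=5)
results = [bucket.allow(now=0.0) for _ in range(6)]
print(results)        # five calls pass, the sixth is throttled
print(bucket.tokens)  # 0.0
```

In this model, a tokens gauge that stays near zero while the burst gauge is unchanged suggests that calls to that API are being throttled.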

Spot Circuit Breaker Metrics

runs_on_spot_circuit_breaker_active (Observable Gauge) Whether spot circuit breaker is currently active (1 = active, 0 = inactive).

Resource Attributes

All metrics include these resource attributes:

Attribute	Description
`service.name`	Always `runs-on-server`
`app.version`	RunsOn version
`app.environment`	Environment name
`stack_name`	CloudFormation stack name
`region`	AWS region

Structured Logs

In addition to OTLP metrics export, RunsOn emits periodic structured logs (JSON) containing metric snapshots. These logs are available in CloudWatch Logs and include:

Job summaries (metric_type=jobs_summary): Cumulative job counts
Job events (metric_type=job_event): Individual job lifecycle events
Pool instances (metric_type=pool_instances): Current pool state
Rate limiters (metric_type=rate_limiter): Rate limiter state
Spot interruptions (metric_type=spot_interruption): Spot interruption events
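
If you export these logs for offline analysis, they can be grouped by `metric_type`. The sketch below counts records per type and skips lines that are not JSON. Only the `metric_type` key comes from this page; the other fields in the sample lines are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_metric_type(lines: Iterable[str]) -> Counter:
    """Count structured log records per metric_type, ignoring other lines."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if isinstance(record, dict) and "metric_type" in record:
            counts[record["metric_type"]] += 1
    return counts


# Hypothetical sample lines; the real log schema may differ.
sample = [
    '{"metric_type": "job_event", "status": "completed"}',
    '{"metric_type": "job_event", "status": "queued"}',
    '{"metric_type": "pool_instances", "pool_name": "default"}',
    "plain text line",
    '{"level": "info"}',
]
counts = count_by_metric_type(sample)
print(dict(counts))
# {'job_event': 2, 'pool_instances': 1}
```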

Querying Metrics

Example Prometheus queries:

# Completed-job throughput (jobs/sec)
sum(rate(runs_on_jobs_total{status="completed"}[5m]))

# Job success rate
sum(rate(runs_on_jobs_total{status="completed",conclusion="success"}[5m])) /
sum(rate(runs_on_jobs_total{status="completed"}[5m]))

# Median internal queue duration (p50)
histogram_quantile(0.5, sum by (le) (rate(runs_on_internal_queue_duration_seconds_bucket[5m])))

# Pool capacity by org and pool
sum by (org, pool_name) (runs_on_pool_instances_total)

# Circuit breaker status
runs_on_spot_circuit_breaker_active
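
These queries can also be run from a script through the Prometheus HTTP API. The sketch below only builds the instant-query URL for the success-rate expression; the base URL `http://localhost:9090` is an assumption, and sending the request (and any authentication) is left out.

```python
from urllib.parse import urlencode

# Same success-rate expression as in the examples above.
SUCCESS_RATE_QUERY = (
    'sum(rate(runs_on_jobs_total{status="completed",conclusion="success"}[5m]))'
    " / "
    'sum(rate(runs_on_jobs_total{status="completed"}[5m]))'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus' /api/v1/query instant-query endpoint."""
    query_string = urlencode({"query": promql})  # percent-encodes the PromQL
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


# Assumed local Prometheus address; replace with your own server.
url = instant_query_url("http://localhost:9090/", SUCCESS_RATE_QUERY)
print(url)
```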

CloudWatch Dashboard

RunsOn automatically creates a comprehensive CloudWatch dashboard when you deploy the stack. The dashboard provides real-time visibility into your runner operations without requiring any external tools.

Dashboard Widgets

The embedded dashboard includes:

Job Monitoring:

Jobs currently queued (SQS queue depth)
Total runners scheduled in current period
Runners scheduled over time (5-minute intervals)
Queue duration percentiles (P50/P90 for internal and overall queue times)
Completed jobs by conclusion (success/failure/cancelled)
Job status summary over time

Rate Limiter Monitoring:

EC2 API rate limiters (Read, Run, Terminate, Mutating operations)
S3 API rate limiter
GitHub API rate limiter
Real-time token availability and burst capacity

Spot Instance Management:

Spot circuit breaker status
Interruption count tracking
Recent spot interruptions with details

Pool Management:

Pool instances over time by state (hot, stopped, warming, ready, etc.)
Pool capacity tracking across multiple pools

Operational Monitoring:

Recent error messages (latest 50)
Webhook redelivery statistics
Webhook redelivery success details

Accessing the Dashboard

Open the CloudWatch console and select “Dashboards” in the left sidebar. You should see a dashboard named RunsOn-<StackName>-Dashboard.

Dashboard Queries

All dashboard widgets use CloudWatch Logs Insights queries on structured log data. This provides powerful filtering and aggregation capabilities without additional infrastructure.
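
You can write similar Logs Insights queries yourself. The sketch below builds a query that lists recent records of one `metric_type`, along with the arguments that the CloudWatch Logs `start_query` API expects. It is not the query used by the dashboard widgets, and the log group name is a placeholder you would need to replace.

```python
def recent_events_query(metric_type: str, limit: int = 50) -> str:
    """Build a Logs Insights query for the latest records of one metric_type."""
    return (
        "fields @timestamp, @message"
        f' | filter metric_type = "{metric_type}"'
        " | sort @timestamp desc"
        f" | limit {limit}"
    )


def start_query_kwargs(log_group: str, query: str, start_epoch: int, end_epoch: int) -> dict:
    """Arguments for the CloudWatch Logs start_query call (times in epoch seconds)."""
    return {
        "logGroupName": log_group,
        "startTime": start_epoch,
        "endTime": end_epoch,
        "queryString": query,
    }


query = recent_events_query("spot_interruption", limit=10)
# "my-runs-on-log-group" is a placeholder, not a real RunsOn log group name.
kwargs = start_query_kwargs("my-runs-on-log-group", query, 1700000000, 1700003600)
print(query)
# fields @timestamp, @message | filter metric_type = "spot_interruption" | sort @timestamp desc | limit 10
```

With boto3 installed, the dictionary could be passed as `boto3.client("logs").start_query(**kwargs)`.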

CloudWatch Metrics

RunsOn publishes custom metrics to CloudWatch in the RunsOn namespace:

Consumed minutes: Track runner usage across multiple dimensions
Custom job metrics: Available when using the runs-on/action@v2 action

You can use these metrics to:

Set up CloudWatch alarms for budget monitoring
Track usage patterns by organization, repository, or workflow
Create custom dashboards for specific use cases

Go Runtime Metrics

When OTEL metrics are enabled, RunsOn also automatically exports Go runtime metrics, so that you can monitor the health and performance of the RunsOn service itself.