Stack Metrics
Monitor your RunsOn installation with OpenTelemetry and CloudWatch
RunsOn provides comprehensive monitoring capabilities through two complementary approaches:
- OpenTelemetry (OTEL) metrics - Export detailed metrics to your observability platform
- CloudWatch integration - Built-in dashboard and native AWS metrics
OpenTelemetry Metrics
RunsOn exports metrics in OpenTelemetry format via OTLP (HTTP only for now), so you can integrate with popular observability platforms such as Prometheus, Grafana Cloud, SigNoz, Datadog, New Relic, and others.
Configuration
Configure OTEL metrics export using these CloudFormation parameters:
| Parameter | Description | Example |
|---|---|---|
| OtelExporterEndpoint | OTLP endpoint URL (e.g., ingest.eu.signoz.cloud:443). Only HTTP(s) protocol is supported. | ingest.eu.signoz.cloud:443 |
| OtelExporterHeaders | Headers for OTLP endpoint in W3C Baggage format: key1=value1,key2=value2 | signoz-ingestion-key=ABCD1234 |
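To make the header format concrete, here is a minimal Python sketch of how a key1=value1,key2=value2 string maps to individual HTTP headers. The function name parse_otlp_headers is hypothetical and this is not RunsOn's own parser; it only illustrates the expected shape of the OtelExporterHeaders value, assuming values may be percent-encoded as in W3C Baggage.

```python
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Split 'key1=value1,key2=value2' into a dict of HTTP headers.

    Hypothetical helper for illustration; not part of RunsOn.
    """
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input and trailing commas
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header entry: {pair!r}")
        # Baggage-style values may be percent-encoded, so decode them.
        headers[key.strip()] = unquote(value.strip())
    return headers


print(parse_otlp_headers("signoz-ingestion-key=ABCD1234"))
```

Running a candidate value through a check like this before deploying the stack can catch a missing = or a stray separator early.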
Runner Logs, Traces, and Host Metrics
If you set extras=otel on a job, RunsOn starts a local OTEL collector on the runner. It forwards runner logs, traces, and host metrics to the OTLP endpoint configured above, so you can follow a job from the moment RunsOn receives it until the runner is terminated, including spans for each job step.
jobs:
  build:
    runs-on: runs-on=${{ github.run_id }}/runner=2cpu-linux-x64/extras=otel
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
You can combine it with other extras. For example, extras=s3-cache+otel enables Magic Cache and runner-side OTEL collection on the same job.
Available Metrics
Job Metrics
runs_on_jobs_total (Counter)
Total number of jobs by status.
Attributes:
- status: Job status (queued, scheduled, in_progress, completed)
- conclusion: Job conclusion for completed status (success, failure, cancelled, skipped)
- repo_full_name: Repository (e.g., owner/repo)
- workflow_name: GitHub workflow name
- instance_type: EC2 instance type (only for scheduled status)
- instance_lifecycle: spot or on-demand (only for scheduled status)
- pool_name: Pool name if scheduled from a pool
- interrupted: Whether the job was interrupted
- org: GitHub organization name
- installation_id: GitHub App installation ID
runs_on_internal_queue_duration_seconds (Histogram)
Time from job queued in RunsOn to instance scheduled.
runs_on_overall_queue_duration_seconds (Histogram)
Time from job queued by GitHub to job started (includes instance launch and runner bootstrap).
runs_on_job_duration_seconds (Histogram)
Time from job started to completed.
Pool Metrics
runs_on_pool_instances_total (Observable Gauge)
Current number of pool instances by state.
Attributes:
- pool_name: Pool name
- state: Instance state (running, stopped, pending, terminating)
- installation_id: GitHub App installation ID
- org: GitHub organization name
Rate Limiter Metrics
runs_on_rate_limiter_tokens (Observable Gauge)
Available tokens in rate limiter.
runs_on_rate_limiter_burst (Observable Gauge)
Burst capacity of rate limiter.
Attributes:
- limiter: Rate limiter name (github_api, ec2_api, etc.)
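To help interpret the two gauges, the following Python sketch models a generic token-bucket limiter. It assumes, based only on the metric names, that RunsOn's limiters behave like a token bucket; the class, its parameters, and the refill rate are illustrative and not taken from RunsOn.

```python
class TokenBucket:
    """Generic token bucket (assumed model, not RunsOn's implementation)."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.tokens = float(burst)  # start full
        self.last = now             # timestamp of the last refill

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last)
        # Tokens accumulate over time but never exceed the burst capacity.
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now

    def allow(self, now: float, n: int = 1) -> bool:
        """Consume n tokens if available; otherwise reject the call."""
        self._refill(now)
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


bucket = TokenBucket(rate=2.0, burst=5)
results = [bucket.allow(now=0.0) for _ in range(6)]
print(results, bucket.tokens)
```

Under this model, a tokens gauge that stays near zero while the burst gauge is unchanged would suggest that calls to that API are being throttled.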
Spot Circuit Breaker Metrics
runs_on_spot_circuit_breaker_active (Observable Gauge)
Whether spot circuit breaker is currently active (1 = active, 0 = inactive).
Resource Attributes
All metrics include these resource attributes:
| Attribute | Description |
|---|---|
| service.name | Always runs-on-server |
| app.version | RunsOn version |
| app.environment | Environment name |
| stack_name | CloudFormation stack name |
| region | AWS region |
Structured Logs
In addition to OTLP metrics export, RunsOn emits periodic structured logs (JSON) containing metric snapshots. These logs are available in CloudWatch Logs and include:
- Job summaries (metric_type=jobs_summary): Cumulative job counts
- Job events (metric_type=job_event): Individual job lifecycle events
- Pool instances (metric_type=pool_instances): Current pool state
- Rate limiters (metric_type=rate_limiter): Rate limiter state
- Spot interruptions (metric_type=spot_interruption): Spot interruption events
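If you export these logs for offline analysis, a short script can group them by metric_type. The Python sketch below assumes one JSON object per line; apart from metric_type, the field names in the sample records (status, pool_name) are hypothetical placeholders rather than the documented log schema.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_metric_type(lines: Iterable[str]) -> Counter:
    """Count structured log records per metric_type, skipping other lines."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if isinstance(record, dict) and "metric_type" in record:
            counts[record["metric_type"]] += 1
    return counts


# Sample records: only the metric_type values come from the list above.
sample = [
    '{"metric_type": "job_event", "status": "completed"}',
    '{"metric_type": "pool_instances", "pool_name": "default"}',
    '{"metric_type": "job_event", "status": "queued"}',
    "plain text line that is not JSON",
]
print(count_by_metric_type(sample))
```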
Querying Metrics
Example Prometheus queries:
# Job throughput (jobs/sec)
rate(runs_on_jobs_total[5m])
# Job success rate
sum(rate(runs_on_jobs_total{status="completed",conclusion="success"}[5m])) /
sum(rate(runs_on_jobs_total{status="completed"}[5m]))
# Median internal queue duration (p50)
histogram_quantile(0.5, rate(runs_on_internal_queue_duration_seconds_bucket[5m]))
# Pool capacity by org
sum by (org, pool_name) (runs_on_pool_instances_total)
# Circuit breaker status
runs_on_spot_circuit_breaker_active
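The p50 query above relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below is a simplified illustration of that estimate, not Prometheus's exact implementation; the bucket boundaries and counts are made up and do not reflect RunsOn's actual histogram buckets.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative buckets: 10 jobs waited <= 1s, 30 <= 5s, 40 <= 10s.
example = [(1.0, 10), (5.0, 30), (10.0, 40), (math.inf, 40)]
print(estimate_quantile(0.5, example))
```

Because the result is interpolated, its precision depends on how finely the buckets divide the range you care about.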
CloudWatch Dashboard
RunsOn automatically creates a comprehensive CloudWatch dashboard when you deploy the stack. The dashboard provides real-time visibility into your runner operations without requiring any external tools.
Dashboard Widgets
The embedded dashboard includes:
Job Monitoring:
- Jobs currently queued (SQS queue depth)
- Total runners scheduled in current period
- Runners scheduled over time (5-minute intervals)
- Queue duration percentiles (P50/P90 for internal and overall queue times)
- Completed jobs by conclusion (success/failure/cancelled)
- Job status summary over time
Rate Limiter Monitoring:
- EC2 API rate limiters (Read, Run, Terminate, Mutating operations)
- S3 API rate limiter
- GitHub API rate limiter
- Real-time token availability and burst capacity
Spot Instance Management:
- Spot circuit breaker status
- Interruption count tracking
- Recent spot interruptions with details
Pool Management:
- Pool instances over time by state (hot, stopped, warming, ready, etc.)
- Pool capacity tracking across multiple pools
Operational Monitoring:
- Recent error messages (latest 50)
- Webhook redelivery statistics
- Webhook redelivery success details
Accessing the Dashboard
Open the CloudWatch console and select “Dashboards” from the left sidebar. You should see a dashboard named RunsOn-<StackName>-Dashboard.
Dashboard Queries
All dashboard widgets use CloudWatch Logs Insights queries on structured log data. This provides powerful filtering and aggregation capabilities without additional infrastructure.
CloudWatch Metrics
RunsOn publishes custom metrics to CloudWatch in the RunsOn namespace:
- Consumed minutes: Track runner usage across multiple dimensions
- Custom job metrics: Available when using the runs-on/action@v2 action
You can use these metrics to:
- Set up CloudWatch alarms for budget monitoring
- Track usage patterns by organization, repository, or workflow
- Create custom dashboards for specific use cases
Go Runtime Metrics
When OTEL metrics are enabled, RunsOn also automatically exports Go runtime metrics, so that you can monitor the health and performance of the RunsOn service itself.