Stack Metrics
RunsOn provides comprehensive monitoring capabilities through two complementary approaches:
- OpenTelemetry (OTEL) metrics - Export detailed metrics to your observability platform
- CloudWatch integration - Built-in dashboard and native AWS metrics
OpenTelemetry Metrics
RunsOn exports metrics in OpenTelemetry format via OTLP (HTTP for now), allowing you to integrate with popular observability platforms like Prometheus, Grafana Cloud, Signoz, Datadog, New Relic, and others.
Configuration
Configure OTEL metrics export using these CloudFormation parameters:
| Parameter | Description | Example |
|---|---|---|
| OtelExporterEndpoint | OTLP endpoint URL (e.g., ingest.eu.signoz.cloud:443). Only the HTTP(S) protocol is supported. | ingest.eu.signoz.cloud:443 |
| OtelExporterHeaders | Headers for the OTLP endpoint in W3C Baggage format: key1=value1,key2=value2 | signoz-ingestion-key=ABCD1234 |
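To check a header string before putting it into the stack parameters, it can help to see how the key1=value1,key2=value2 format breaks down. The following is a minimal sketch, not RunsOn's own parser: the function name parse_otel_headers is hypothetical, and it assumes values may be percent-encoded, as the W3C Baggage format allows.

```python
from urllib.parse import unquote


def parse_otel_headers(raw: str) -> dict[str, str]:
    """Parse 'key1=value1,key2=value2' into a dict (hypothetical helper)."""
    headers: dict[str, str] = {}
    for member in raw.split(","):
        member = member.strip()
        if not member:
            continue  # tolerate empty input and trailing commas
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = member.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header entry: {member!r}")
        # Assumption: values may be percent-encoded, so decode them.
        headers[key.strip()] = unquote(value.strip())
    return headers


print(parse_otel_headers("signoz-ingestion-key=ABCD1234"))
# prints {'signoz-ingestion-key': 'ABCD1234'}
```

A string that raises ValueError here (for example, an entry with no = sign) is likely to be rejected or misread by an OTLP exporter as well.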
Available Metrics
Job Metrics
runs_on_jobs_total (Counter)
Total number of jobs by status.
Attributes:
- status: Job status (queued, scheduled, in_progress, completed)
- conclusion: Job conclusion for completed status (success, failure, cancelled, skipped)
- repo_full_name: Repository (e.g., owner/repo)
- workflow_name: GitHub workflow name
- instance_type: EC2 instance type (only for scheduled status)
- instance_lifecycle: spot or on-demand (only for scheduled status)
- pool_name: Pool name if scheduled from a pool
- interrupted: Whether the job was interrupted
- org: GitHub organization name
- installation_id: GitHub App installation ID
runs_on_internal_queue_duration_seconds (Histogram)
Time from job queued in RunsOn to instance scheduled.
runs_on_overall_queue_duration_seconds (Histogram)
Time from job queued by GitHub to job started (includes instance launch and runner bootstrap).
runs_on_job_duration_seconds (Histogram)
Time from job started to completed.
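To make the boundaries of the three histograms concrete, the sketch below derives each duration from a set of job timestamps. The JobTimeline type and its field names are hypothetical and exist only for illustration; they are not RunsOn's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobTimeline:
    # Hypothetical field names, for illustration only.
    queued_by_github: datetime
    queued_in_runs_on: datetime
    instance_scheduled: datetime
    started: datetime
    completed: datetime


def durations_seconds(t: JobTimeline) -> dict[str, float]:
    """Map each histogram name to the interval it measures, in seconds."""
    return {
        # Job queued in RunsOn -> instance scheduled.
        "runs_on_internal_queue_duration_seconds":
            (t.instance_scheduled - t.queued_in_runs_on).total_seconds(),
        # Job queued by GitHub -> job started (includes launch and bootstrap).
        "runs_on_overall_queue_duration_seconds":
            (t.started - t.queued_by_github).total_seconds(),
        # Job started -> job completed.
        "runs_on_job_duration_seconds":
            (t.completed - t.started).total_seconds(),
    }


t0 = datetime(2024, 1, 1, 12, 0, 0)
timeline = JobTimeline(
    queued_by_github=t0,
    queued_in_runs_on=t0 + timedelta(seconds=2),
    instance_scheduled=t0 + timedelta(seconds=5),
    started=t0 + timedelta(seconds=40),
    completed=t0 + timedelta(seconds=340),
)
for name, seconds in durations_seconds(timeline).items():
    print(name, seconds)
```

In this made-up timeline the internal queue duration is only a small part of the overall queue duration, because the overall figure also covers instance launch and runner bootstrap.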
Pool Metrics
runs_on_pool_instances_total (Observable Gauge)
Current number of pool instances by state.
Attributes:
- pool_name: Pool name
- state: Instance state (running, stopped, pending, terminating)
- installation_id: GitHub App installation ID
- org: GitHub organization name
Rate Limiter Metrics
runs_on_rate_limiter_tokens (Observable Gauge)
Available tokens in rate limiter.
runs_on_rate_limiter_burst (Observable Gauge)
Burst capacity of rate limiter.
Attributes:
- limiter: Rate limiter name (github_api, ec2_api, etc.)
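The two gauges are easiest to read with a token bucket in mind: burst is the bucket's capacity, and tokens is how much of that capacity is currently available. The source does not state which algorithm RunsOn's limiters use, so the sketch below is a generic token bucket under that assumption, and the TokenBucket class is hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Generic token bucket; not RunsOn's implementation."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens the bucket can hold
        self._clock = clock
        self._tokens = float(burst)  # start full
        self._last = clock()

    def tokens(self) -> float:
        """Refill based on elapsed time, capped at burst, and return the level."""
        now = self._clock()
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate)
        self._last = now
        return self._tokens

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the call."""
        if self.tokens() >= 1:
            self._tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=2.0, burst=5)
print(bucket.allow(), bucket.burst)
```

Under this model, a tokens gauge that stays near zero while burst is unchanged would indicate that the limiter is saturated and calls are being delayed or rejected.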
Spot Circuit Breaker Metrics
runs_on_spot_circuit_breaker_active (Observable Gauge)
Whether spot circuit breaker is currently active (1 = active, 0 = inactive).
Resource Attributes
All metrics include these resource attributes:
| Attribute | Description |
|---|---|
| service.name | Always runs-on-server |
| app.version | RunsOn version |
| app.environment | Environment name |
| stack_name | CloudFormation stack name |
| region | AWS region |
Structured Logs
In addition to OTLP metrics export, RunsOn emits periodic structured logs (JSON) containing metric snapshots. These logs are available in CloudWatch Logs and include:
- Job summaries (metric_type=jobs_summary): Cumulative job counts
- Job events (metric_type=job_event): Individual job lifecycle events
- Pool instances (metric_type=pool_instances): Current pool state
- Rate limiters (metric_type=rate_limiter): Rate limiter state
- Spot interruptions (metric_type=spot_interruption): Spot interruption events
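Once these logs are exported from CloudWatch Logs, they can be split by metric_type with a few lines of code. This is a rough sketch that assumes each log line is a JSON object with a top-level metric_type key; every other field in the sample lines (status, pool_name) is invented for illustration and may not match the real log schema.

```python
import json
from typing import Iterable, Iterator

# Made-up sample lines; only the metric_type values come from the list above.
SAMPLE_LINES = [
    '{"metric_type": "job_event", "status": "completed"}',
    '{"metric_type": "pool_instances", "pool_name": "default"}',
    'plain text line that is not JSON',
    '{"metric_type": "job_event", "status": "queued"}',
]


def iter_metric_logs(lines: Iterable[str], metric_type: str) -> Iterator[dict]:
    """Yield parsed records whose metric_type matches, skipping non-JSON lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if isinstance(record, dict) and record.get("metric_type") == metric_type:
            yield record


events = list(iter_metric_logs(SAMPLE_LINES, "job_event"))
print(len(events))
# prints 2
```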
Querying Metrics
Example Prometheus queries:
    # Job throughput (jobs/sec)
    rate(runs_on_jobs_total[5m])

    # Job success rate
    sum(rate(runs_on_jobs_total{status="completed",conclusion="success"}[5m])) /
    sum(rate(runs_on_jobs_total{status="completed"}[5m]))

    # Median internal queue duration (p50)
    histogram_quantile(0.5, rate(runs_on_internal_queue_duration_seconds_bucket[5m]))

    # Pool capacity by org
    sum by (org, pool_name) (runs_on_pool_instances_total)

    # Circuit breaker status
    runs_on_spot_circuit_breaker_active

CloudWatch Dashboard
RunsOn automatically creates a comprehensive CloudWatch dashboard when you deploy the stack. The dashboard provides real-time visibility into your runner operations without requiring any external tools.
Dashboard Widgets
The embedded dashboard includes:
Job Monitoring:
- Jobs currently queued (SQS queue depth)
- Total runners scheduled in current period
- Runners scheduled over time (5-minute intervals)
- Queue duration percentiles (P50/P90 for internal and overall queue times)
- Completed jobs by conclusion (success/failure/cancelled)
- Job status summary over time
Rate Limiter Monitoring:
- EC2 API rate limiters (Read, Run, Terminate, Mutating operations)
- S3 API rate limiter
- GitHub API rate limiter
- Real-time token availability and burst capacity
Spot Instance Management:
- Spot circuit breaker status
- Interruption count tracking
- Recent spot interruptions with details
Pool Management:
- Pool instances over time by state (hot, stopped, warming, ready, etc.)
- Pool capacity tracking across multiple pools
Operational Monitoring:
- Recent error messages (latest 50)
- Webhook redelivery statistics
- Webhook redelivery success details
Accessing the Dashboard
Open the CloudWatch console and select "Dashboards" from the left sidebar. You should see a dashboard named RunsOn-<StackName>-Dashboard.
Dashboard Queries
All dashboard widgets use CloudWatch Logs Insights queries on structured log data. This provides powerful filtering and aggregation capabilities without additional infrastructure.
CloudWatch Metrics
RunsOn publishes custom metrics to CloudWatch in the RunsOn namespace:
- Consumed minutes: Track runner usage across multiple dimensions
- Custom job metrics: Available when using the runs-on/action@v2 action
You can use these metrics to:
- Set up CloudWatch alarms for budget monitoring
- Track usage patterns by organization, repository, or workflow
- Create custom dashboards for specific use cases
Go Runtime Metrics
When OTEL metrics are enabled, RunsOn also automatically exports Go runtime metrics, so that you can monitor the health and performance of the RunsOn service itself.