
Stack Metrics

RunsOn exposes monitoring data in two complementary ways:

  • OpenTelemetry (OTEL) metrics - Export detailed metrics to your observability platform
  • CloudWatch integration - Built-in dashboard and native AWS metrics

RunsOn exports metrics in OpenTelemetry format over OTLP (currently HTTP only), so you can integrate with observability platforms such as Prometheus, Grafana Cloud, SigNoz, Datadog, and New Relic.

Configure OTEL metrics export using these CloudFormation parameters:

| Parameter | Description | Example |
| --- | --- | --- |
| OtelExporterEndpoint | OTLP endpoint URL. Only HTTP(S) is supported. | ingest.eu.signoz.cloud:443 |
| OtelExporterHeaders | Headers for the OTLP endpoint in W3C Baggage format: key1=value1,key2=value2 | signoz-ingestion-key=ABCD1234 |
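
As a rough sketch of how these two parameters fit together, the Python snippet below validates a header string in the key1=value1,key2=value2 form and builds a parameter list in the shape the CloudFormation API accepts. The helper names (parse_otel_headers, build_stack_parameters) are illustrative and not part of RunsOn; only the parameter names and example values come from the table above.

```python
from urllib.parse import unquote


def parse_otel_headers(raw: str) -> dict[str, str]:
    """Parse 'key1=value1,key2=value2' into a dict (illustrative helper)."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header entry: {pair!r}")
        # W3C Baggage values may be percent-encoded.
        headers[key.strip()] = unquote(value.strip())
    return headers


def build_stack_parameters(endpoint: str, headers: str) -> list[dict[str, str]]:
    """Build a Parameters list in the shape the CloudFormation API expects."""
    parse_otel_headers(headers)  # fail early on malformed input
    return [
        {"ParameterKey": "OtelExporterEndpoint", "ParameterValue": endpoint},
        {"ParameterKey": "OtelExporterHeaders", "ParameterValue": headers},
    ]


params = build_stack_parameters(
    "ingest.eu.signoz.cloud:443", "signoz-ingestion-key=ABCD1234"
)
```

The resulting list could be passed as the Parameters argument when creating or updating the stack with an AWS SDK, for example boto3's CloudFormation client.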

runs_on_jobs_total (Counter) Total number of jobs by status.

Attributes:

  • status: Job status (queued, scheduled, in_progress, completed)
  • conclusion: Job conclusion for completed status (success, failure, cancelled, skipped)
  • repo_full_name: Repository (e.g., owner/repo)
  • workflow_name: GitHub workflow name
  • instance_type: EC2 instance type (only for scheduled status)
  • instance_lifecycle: spot or on-demand (only for scheduled status)
  • pool_name: Pool name if scheduled from a pool
  • interrupted: Whether the job was interrupted
  • org: GitHub organization name
  • installation_id: GitHub App installation ID

runs_on_internal_queue_duration_seconds (Histogram) Time from job queued in RunsOn to instance scheduled.

runs_on_overall_queue_duration_seconds (Histogram) Time from job queued by GitHub to job started (includes instance launch and runner bootstrap).

runs_on_job_duration_seconds (Histogram) Time from job started to completed.

runs_on_pool_instances_total (Observable Gauge) Current number of pool instances by state.

Attributes:

  • pool_name: Pool name
  • state: Instance state (running, stopped, pending, terminating)
  • installation_id: GitHub App installation ID
  • org: GitHub organization name

runs_on_rate_limiter_tokens (Observable Gauge) Available tokens in rate limiter.

runs_on_rate_limiter_burst (Observable Gauge) Burst capacity of rate limiter.

Attributes:

  • limiter: Rate limiter name (github_api, ec2_api, etc.)

runs_on_spot_circuit_breaker_active (Observable Gauge) Whether spot circuit breaker is currently active (1 = active, 0 = inactive).

All metrics include these resource attributes:

| Attribute | Description |
| --- | --- |
| service.name | Always runs-on-server |
| app.version | RunsOn version |
| app.environment | Environment name |
| stack_name | CloudFormation stack name |
| region | AWS region |

In addition to OTLP metrics export, RunsOn emits periodic structured logs (JSON) containing metric snapshots. These logs are available in CloudWatch Logs and include:

  • Job summaries (metric_type=jobs_summary): Cumulative job counts
  • Job events (metric_type=job_event): Individual job lifecycle events
  • Pool instances (metric_type=pool_instances): Current pool state
  • Rate limiters (metric_type=rate_limiter): Rate limiter state
  • Spot interruptions (metric_type=spot_interruption): Spot interruption events
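
A minimal sketch of consuming these snapshots after exporting them from CloudWatch Logs: it counts JSON log lines per metric_type and ignores everything else. Only the metric_type field and its values come from the list above; the other fields in the sample lines (conclusion, pool_name, count, level, msg) are assumptions for illustration, and the real log schema may differ.

```python
import json
from collections import Counter

# Hypothetical sample lines; only "metric_type" is documented above.
SAMPLE_LINES = [
    '{"metric_type": "job_event", "conclusion": "success"}',
    '{"metric_type": "job_event", "conclusion": "failure"}',
    '{"metric_type": "pool_instances", "pool_name": "default", "count": 3}',
    '{"metric_type": "spot_interruption"}',
    "not json: plain text log line",
    '{"level": "info", "msg": "no metric_type here"}',
]


def count_by_metric_type(lines):
    """Count metric snapshot lines per metric_type, skipping other logs."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if isinstance(record, dict) and "metric_type" in record:
            counts[record["metric_type"]] += 1
    return counts


counts = count_by_metric_type(SAMPLE_LINES)
print(dict(counts))
```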

Example Prometheus queries:

# Job throughput (jobs/sec)
rate(runs_on_jobs_total[5m])
# Job success rate
sum(rate(runs_on_jobs_total{status="completed",conclusion="success"}[5m])) /
sum(rate(runs_on_jobs_total{status="completed"}[5m]))
# Median internal queue duration (p50)
histogram_quantile(0.5, sum by (le) (rate(runs_on_internal_queue_duration_seconds_bucket[5m])))
# Pool capacity by org
sum by (org, pool_name) (runs_on_pool_instances_total)
# Circuit breaker status
runs_on_spot_circuit_breaker_active
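
To make the percentile query less of a black box, here is a simplified Python sketch of the linear interpolation that histogram_quantile applies to cumulative buckets. The bucket boundaries and counts are invented for the example, and Prometheus itself handles more edge cases than this version does.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch of the interpolation Prometheus applies; the last
    bucket is expected to have an upper bound of +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound  # no finite bucket to interpolate in
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for queue durations in seconds.
example_buckets = [(5, 10), (10, 30), (30, 90), (60, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, example_buckets)
p90 = histogram_quantile(0.9, example_buckets)
```

With these sample buckets, the 50th observation falls one third of the way into the 10 to 30 second bucket, so the estimate depends on the bucket boundaries rather than on exact observed values.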

RunsOn automatically creates a comprehensive CloudWatch dashboard when you deploy the stack. The dashboard provides real-time visibility into your runner operations without requiring any external tools.

The embedded dashboard includes:

Job Monitoring:

  • Jobs currently queued (SQS queue depth)
  • Total runners scheduled in current period
  • Runners scheduled over time (5-minute intervals)
  • Queue duration percentiles (P50/P90 for internal and overall queue times)
  • Completed jobs by conclusion (success/failure/cancelled)
  • Job status summary over time

Rate Limiter Monitoring:

  • EC2 API rate limiters (Read, Run, Terminate, Mutating operations)
  • S3 API rate limiter
  • GitHub API rate limiter
  • Real-time token availability and burst capacity

Spot Instance Management:

  • Spot circuit breaker status
  • Interruption count tracking
  • Recent spot interruptions with details

Pool Management:

  • Pool instances over time by state (hot, stopped, warming, ready, etc.)
  • Pool capacity tracking across multiple pools

Operational Monitoring:

  • Recent error messages (latest 50)
  • Webhook redelivery statistics
  • Webhook redelivery success details

To open it, go to the CloudWatch console and select “Dashboards” from the left sidebar. The dashboard is named RunsOn-<StackName>-Dashboard.

All dashboard widgets use CloudWatch Logs Insights queries on structured log data. This provides powerful filtering and aggregation capabilities without additional infrastructure.
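
The dashboard's own queries are not reproduced here, but a comparable ad hoc query might look like the sketch below. It assumes the structured logs carry a metric_type field that Logs Insights can filter on; the helper names are hypothetical, and the log group is left as a parameter because its name depends on your deployment.

```python
def metric_query(metric_type: str, limit: int = 50) -> str:
    """Build a Logs Insights query selecting one metric_type (sketch)."""
    if '"' in metric_type:
        raise ValueError("metric_type must not contain double quotes")
    return (
        "fields @timestamp, @message"
        f' | filter metric_type = "{metric_type}"'
        " | sort @timestamp desc"
        f" | limit {limit}"
    )


def start_metric_query(logs_client, log_group, metric_type, start, end):
    """Start the query with a boto3 CloudWatch Logs client; return its ID."""
    response = logs_client.start_query(
        logGroupName=log_group,  # deployment-specific, not documented here
        startTime=start,  # epoch seconds
        endTime=end,
        queryString=metric_query(metric_type),
    )
    return response["queryId"]
```

Results would then be fetched by polling get_query_results with the returned query ID until the query completes.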

RunsOn publishes custom metrics to CloudWatch in the RunsOn namespace:

  • Consumed minutes: Track runner usage across multiple dimensions
  • Custom job metrics: Available when using the runs-on/action@v2 action

You can use these metrics to:

  • Set up CloudWatch alarms for budget monitoring
  • Track usage patterns by organization, repository, or workflow
  • Create custom dashboards for specific use cases
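
As an illustration of the budget alarm use case, the sketch below builds the arguments for a CloudWatch alarm on a metric in the RunsOn namespace. The exact metric and dimension names are not listed in this section, so they are passed in as parameters; the values in the usage lines (EXAMPLE_METRIC_NAME, ExampleDimension) are placeholders, not real RunsOn identifiers, and the daily period and Sum statistic are assumptions.

```python
def budget_alarm_params(metric_name, threshold_minutes, dimensions=None):
    """Build keyword arguments for CloudWatch's PutMetricAlarm (sketch)."""
    return {
        "AlarmName": f"runs-on-{metric_name}-budget",
        "Namespace": "RunsOn",  # namespace stated in the text above
        "MetricName": metric_name,  # look up the real name in your account
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in (dimensions or {}).items()
        ],
        "Statistic": "Sum",
        "Period": 86400,  # evaluate one day of usage at a time
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_minutes),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no usage means no alarm
    }


# Placeholder names; replace with the metric and dimensions you actually see.
alarm = budget_alarm_params(
    "EXAMPLE_METRIC_NAME", 5000, {"ExampleDimension": "example-value"}
)
```

With boto3, the dictionary could be applied as boto3.client("cloudwatch").put_metric_alarm(**alarm), optionally with an AlarmActions entry pointing at an SNS topic.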

When OTEL metrics are enabled, RunsOn also automatically exports Go runtime metrics, so that you can monitor the health and performance of the RunsOn service itself.