Your CI runners are shared mutable state

There are two ways to run self-hosted GitHub Actions runners. A long-lived runner registers once and processes job after job on the same machine. An ephemeral runner is created for a single job and destroyed when that job finishes.

The distinction looks like an operational detail. It’s actually a decision about whether every CI job shares mutable state with every job that ran before it — and that decision shows up later as flaky tests, leaked credentials, and builds that pass on one runner and fail on the next.

What “shared state” actually means

A runner is a machine with a filesystem, a process table, a network namespace, and a set of credentials. A long-lived runner carries all of that from one job to the next, and each piece is a place where problems accumulate:

Filesystem and caches. Docker’s build cache and layer store, ~/.npm, ~/.cache, ~/.cargo, Gradle and Maven caches, /tmp — none of it is cleared between jobs. Most of the time that’s a speedup. Occasionally a job writes a poisoned cache entry (a half-downloaded artifact, a cache key collision, a node_modules from a different branch) and every subsequent job on that runner inherits it. You now have a failure whose root cause is a job that finished hours ago, on a different PR.

Credentials. A job that runs aws configure, writes a ~/.docker/config.json, exports a token into the environment, or drops a deploy key in ~/.ssh leaves that material on disk. The next job — possibly from a different pull request — starts with read access to it. On a persistent runner, your secret-isolation story is only as good as every cleanup step in every workflow that ever touched the machine.

Orphaned processes and ports. A test suite that spins up Postgres, a dev server, or a Docker container and doesn’t tear it down leaves it running. The next job hits “port 5432 already in use” or, worse, silently talks to the previous job’s database.

Resource drift. Disks fill with images and build artifacts until jobs fail with ENOSPC. Leftover processes keep holding memory. A runner that behaved like a fresh machine on Monday is swapping and OOM-killing by Friday, and the failure looks random because it depends on how many jobs happened to land there first.

Agent and toolchain drift. The runner agent auto-updates. Someone SSHes in to “quickly fix” a toolchain version. Six weeks later runner-07 is subtly different from runner-03, and “works on one runner, fails on another” becomes a category of bug you can’t reproduce locally.

None of these are exotic. They’re the normal entropy of any machine that stays up and does work. The problem is that CI is supposed to be the thing that tells you whether your code is correct, and a runner accumulating hidden state quietly turns it into a source of false signals.
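On a long-lived runner, some of this inherited state can at least be detected before a job starts. The following is a minimal Python sketch, not a hardening tool: the function names, the checked paths, the default port 5432, and the 10% free-disk threshold are illustrative assumptions taken from the examples above.

```python
import shutil
import socket
from pathlib import Path

# Credential locations mentioned above (assumed list; extend for your tooling).
# An existing ~/.ssh directory is only a hint that keys may have been left behind.
CREDENTIAL_PATHS = (".aws/credentials", ".docker/config.json", ".ssh")


def leftover_credentials(home: Path) -> list[Path]:
    """Return the credential files or directories that already exist under home."""
    return [home / rel for rel in CREDENTIAL_PATHS if (home / rel).exists()]


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def disk_free_fraction(path: Path) -> float:
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def audit_runner(home: Path, ports: tuple[int, ...] = (5432,),
                 min_free: float = 0.10) -> list[str]:
    """Collect findings; an empty list means none of these checks fired."""
    findings = [f"leftover credential: {p}" for p in leftover_credentials(home)]
    findings += [f"port {port} already in use" for port in ports if port_in_use(port)]
    if disk_free_fraction(home) < min_free:
        findings.append(f"less than {min_free:.0%} disk free")
    return findings
```

Running audit_runner(Path.home()) as the first step of a job would surface these three kinds of leftovers. It only finds what it knows to look for, which is the limitation of every cleanup or detection step on a shared machine.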

The security boundary

The state problem is also a security problem, and GitHub is explicit about it: their guidance is to use self-hosted runners only with private repositories, because “forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.”

A persistent runner makes that worse. If a job can run arbitrary code — which, by definition, CI does — then on a long-lived runner it can also leave something behind: a modified binary on PATH, a poisoned build cache, a cron entry, credentials it harvested from an earlier job. The trust boundary isn’t a single job; it’s every job that has ever run on that machine and every job that will. An ephemeral runner collapses that boundary back to one job. When the job ends, anything it did to the machine is destroyed with it.

Ephemeral runners as pure functions

An ephemeral runner inverts the model: each job gets a freshly provisioned machine built from a known image, runs, and is terminated. The same job run twice starts from the same state twice.

That’s the property that makes CI trustworthy. A clean machine per job means a test failure is a fact about your code, not an artifact of what ran before it. Builds become reproducible because the machine state they start from is fixed. Secrets don’t outlive the job that used them. There’s no cleanup step to forget, because there’s nothing to clean up: the disk is gone.
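The pure-function analogy can be made concrete with a toy model. In this Python sketch a runner’s disk is a plain dict and a job is a function that reads and writes it; PersistentRunner, EphemeralRunner, and both jobs are hypothetical names invented for illustration, not part of any real runner software.

```python
from dataclasses import dataclass, field


def install_deps(disk: dict[str, str]) -> str:
    """Toy job: reuse whatever is in the cache, otherwise fetch a clean copy."""
    if "cache/deps" not in disk:
        disk["cache/deps"] = "deps@lockfile"
    return disk["cache/deps"]


def poison_cache(disk: dict[str, str]) -> str:
    """Toy job from another branch that leaves a bad cache entry behind."""
    disk["cache/deps"] = "deps@other-branch"
    return "ok"


@dataclass
class PersistentRunner:
    """One disk reused by every job: results depend on job history."""
    disk: dict[str, str] = field(default_factory=dict)

    def run(self, job):
        return job(self.disk)


@dataclass
class EphemeralRunner:
    """Each job gets a fresh copy of the image, discarded when the job ends."""
    image: dict[str, str] = field(default_factory=dict)

    def run(self, job):
        disk = dict(self.image)  # fresh disk built from the known image
        return job(disk)         # nothing written to `disk` survives this call


persistent = PersistentRunner()
persistent.run(poison_cache)
print(persistent.run(install_deps))  # deps@other-branch (inherited state)

ephemeral = EphemeralRunner()
ephemeral.run(poison_cache)
print(ephemeral.run(install_deps))   # deps@lockfile (same result every run)
```

The same install_deps job returns different results on the persistent runner depending on what ran before it, while on the ephemeral runner its output depends only on the image and the job itself.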

It also removes a whole category of operational work. No cron jobs to prune Docker images, no dashboards watching runner disk usage, no pager alert about runner-14 being unhealthy. A runner is either running a job or it doesn’t exist.

The honest tradeoff: cold start

The real objection to ephemeral runners isn’t philosophical, it’s latency. A persistent runner has its caches warm and its base images already pulled. A fresh machine has to boot, and a naive implementation re-downloads everything every time.

This is worth taking seriously rather than waving away — and it’s a solved problem, not a free one. The fix is to move the warm state off the runner and make provisioning fast:

Prebaked images. Bake the OS, language toolchains, and common base images into the machine image so they’re present at boot instead of downloaded per job.
Externalized caches. Keep dependency and build caches in fast shared storage (S3, EBS snapshots) and restore them per job. You get cache reuse without the cache living on a shared machine.
Fast provisioning. With an optimized image and adequate network throughput, a runner can be booted and registered in well under a minute.
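One common way to make an externalized cache safe to share is to key it on the content that determines it. The Python sketch below assumes a lockfile-driven dependency cache; cache_key, restore_or_build, and the key layout are hypothetical, and the dict stands in for shared storage such as S3.

```python
import hashlib
import platform
from pathlib import Path


def cache_key(lockfile: Path, toolchain: str, prefix: str = "deps") -> str:
    """Derive a key that changes whenever the lockfile, toolchain, or platform changes."""
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()[:16]
    # OS and CPU architecture are part of the key so artifacts built for one
    # platform are never restored onto another.
    target = f"{platform.system().lower()}-{platform.machine()}"
    return f"{prefix}/{target}/{toolchain}/{digest}"


def restore_or_build(store: dict[str, bytes], key: str, build) -> tuple[bytes, bool]:
    """Return (archive, cache_hit). `store` is a stand-in for shared storage."""
    if key in store:
        return store[key], True
    archive = build()      # cold path: build dependencies from scratch
    store[key] = archive   # publish for later jobs that share the same key
    return archive, False
```

Because the key is derived from the lockfile contents, a job on a different branch with different dependencies computes a different key and never restores another branch’s entry.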

The result is a small, bounded, predictable cost at the start of each job — versus the unbounded, unpredictable cost of debugging a state bug that only reproduces on one runner after a specific sequence of prior jobs. Thirty seconds you can plan around beats three hours you can’t.

The economics

Ephemeral runners also change the cost shape. A long-lived fleet is sized for peak and billed 24/7, so you pay for idle capacity nights and weekends. Ephemeral runners scale to zero: no queued jobs, no running machines, no cost. You pay for the compute a job actually uses, and because each job is independent and interruption-tolerant, that compute can run on spot/discounted capacity. The savings are real, but the more important property is that cost tracks usage instead of provisioned capacity.
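The difference in cost shape can be sketched with simple arithmetic. Every number below (fleet size, hourly rate, job count, job length) is a made-up input chosen for illustration, and the model ignores billing minimums, storage, and data transfer; only the 30-second boot figure comes from this article.

```python
def always_on_cost(peak_runners: int, hourly_rate: float, hours: float = 24 * 7) -> float:
    """Weekly cost of a fleet sized for peak and billed around the clock."""
    return peak_runners * hourly_rate * hours


def ephemeral_cost(jobs: int, avg_job_minutes: float, hourly_rate: float,
                   boot_seconds: float = 30.0, spot_discount: float = 0.0) -> float:
    """Weekly cost when each job pays only for its own boot time and run time."""
    billed_hours = jobs * (avg_job_minutes * 60 + boot_seconds) / 3600
    return billed_hours * hourly_rate * (1 - spot_discount)


# Hypothetical week: 10 runners at peak, $0.20/hour, 2000 jobs of 6 minutes each.
print(f"{always_on_cost(10, 0.20):.2f}")       # 336.00
print(f"{ephemeral_cost(2000, 6, 0.20):.2f}")  # 43.33
```

The always-on figure stays the same whether the fleet runs 2000 jobs or none, while the ephemeral figure is proportional to the job count and drops to zero with it.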

How RunsOn does it

RunsOn runs this model inside your own AWS account: one ephemeral runner per job, built from regularly-updated AMIs, booted in about 30 seconds, on spot instances when you want the discount, scaling to zero when the queue is empty. Because everything runs in your account, your code and secrets never leave your infrastructure — and because every runner is single-use, there’s no persistent state to manage, clean up, or get paged about.

If you’re running long-lived self-hosted runners today, the migration is mostly deletion: the cleanup scripts, the monitoring, and the “why did runner-07 fail” Slack thread all go away with the machines that caused them.
