Job retries and housekeeping
RunsOn performs the following housekeeping tasks:
- Detect idle runners (those that did not start a job after 10 minutes) and remove them. This is handled automatically by the RunsOn agent running on the idle runner.
- Detect jobs that did not get a runner assigned because of AWS server errors (which can happen under high load). In this case, the RunsOn server launches a new runner.
- Detect failure modes where the agent did not launch on the assigned runner instance. Because the root cause is unknown (for example, AWS network issues or an improperly configured custom image), the runner is terminated and no new runner is launched, to avoid blindly re-spawning instances.
- Detect spot interruptions, so that the job is properly shut down and the GitHub UI is updated. Since v2.6.0, the job is also automatically retried once, unless you opt out of this behaviour. Non-spot instances are never retried.
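The four housekeeping rules above can be read as a small decision table. The following Python sketch is only an illustration of that reading, not RunsOn's actual implementation: the names `Condition`, `Action`, `is_idle`, and `decide` are hypothetical, and the point from which the 10 minutes are counted is an assumption.

```python
from enum import Enum, auto

# From the text: a runner is idle if it did not start a job after 10 minutes.
IDLE_TIMEOUT_SECONDS = 10 * 60
# From the text: an interrupted spot job is retried once.
MAX_SPOT_RETRIES = 1


class Condition(Enum):
    """Hypothetical labels for the four situations listed above."""
    IDLE_RUNNER = auto()
    AWS_SERVER_ERROR = auto()
    AGENT_NOT_STARTED = auto()
    SPOT_INTERRUPTION = auto()


class Action(Enum):
    """Hypothetical labels for the resulting housekeeping actions."""
    REMOVE_RUNNER = auto()
    LAUNCH_NEW_RUNNER = auto()
    TERMINATE_WITHOUT_RELAUNCH = auto()
    SHUT_DOWN_AND_RETRY_JOB = auto()
    SHUT_DOWN_JOB_ONLY = auto()


def is_idle(seconds_waiting: float, job_started: bool) -> bool:
    """Assumption: the timer measures how long the runner has waited for a job."""
    return not job_started and seconds_waiting >= IDLE_TIMEOUT_SECONDS


def decide(condition: Condition, retries_so_far: int = 0,
           retry_opt_out: bool = False) -> Action:
    """Map a detected condition to the action described in the list above."""
    if condition is Condition.IDLE_RUNNER:
        return Action.REMOVE_RUNNER
    if condition is Condition.AWS_SERVER_ERROR:
        return Action.LAUNCH_NEW_RUNNER
    if condition is Condition.AGENT_NOT_STARTED:
        # Root cause unknown, so no replacement runner is launched.
        return Action.TERMINATE_WITHOUT_RELAUNCH
    if condition is Condition.SPOT_INTERRUPTION:
        # Only spot instances can hit this branch; the retry happens at most once.
        if retry_opt_out or retries_so_far >= MAX_SPOT_RETRIES:
            return Action.SHUT_DOWN_JOB_ONLY
        return Action.SHUT_DOWN_AND_RETRY_JOB
    raise ValueError(f"unknown condition: {condition!r}")
```

For example, `decide(Condition.SPOT_INTERRUPTION)` yields a retry on the first interruption, while the same call with `retries_so_far=1` or `retry_opt_out=True` only shuts the job down.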
AWS Server termination errors
These errors are detected after an instance has been launched, and trigger the launch of a new instance:
- Server.InsufficientInstanceCapacity: There was insufficient capacity available to satisfy the launch request.
- Server.InternalError: An internal error caused the instance to terminate during launch.
- Server.ScheduledStop: The instance was stopped due to a scheduled retirement.
- Server.SpotInstanceShutdown: The instance was stopped because the number of Spot requests with a maximum price equal to or higher than the Spot price exceeded available capacity or because of an increase in the Spot price.
- Server.SpotInstanceTermination: The instance was terminated because the number of Spot requests with a maximum price equal to or higher than the Spot price exceeded available capacity or because of an increase in the Spot price.
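As a rough illustration of how such a check might look, the sketch below tests whether an instance's state reason code is one of the five listed above. It assumes the instance is described by a dictionary shaped like an entry returned by the EC2 `DescribeInstances` API, where `StateReason` carries a `Code` and a `Message`. The names `RELAUNCH_STATE_REASONS` and `should_relaunch` are hypothetical, and this is not taken from the RunsOn source.

```python
# State reason codes listed above that trigger a new instance launch.
RELAUNCH_STATE_REASONS = frozenset({
    "Server.InsufficientInstanceCapacity",
    "Server.InternalError",
    "Server.ScheduledStop",
    "Server.SpotInstanceShutdown",
    "Server.SpotInstanceTermination",
})


def should_relaunch(instance: dict) -> bool:
    """Return True if the instance's state reason code is in the list above.

    `instance` is assumed to look like one entry from DescribeInstances,
    e.g. {"StateReason": {"Code": "...", "Message": "..."}}.
    """
    state_reason = instance.get("StateReason") or {}
    return state_reason.get("Code", "") in RELAUNCH_STATE_REASONS
```

An instance with code `Server.InternalError` would be relaunched under this check, whereas one with a client-side code such as `Client.UserInitiatedShutdown`, or with no state reason at all, would not.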