Job retries and housekeeping
RunsOn performs the following housekeeping tasks:
- Detect idle runners (those that have not started a job within 10 minutes) and remove them. The RunsOn agent running on the idle runner handles this automatically.
- Detect jobs that did not get a runner assigned because of AWS server errors (which can happen under high load). In this case, the RunsOn server launches a new runner.
- Detect failure modes where the agent did not launch on the assigned runner instance. Because the root cause is unknown, the runner is terminated and no new runner is launched. This can happen when AWS has network issues or when a custom image is not properly configured, so instances are not re-spawned blindly.
- Detect spot interruptions, so that the job is properly shut down and the GitHub UI is updated. Currently, the interrupted job is not automatically retried.
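The four cases above can be read as a small decision table. The following Python sketch is only an illustration of that reading, not RunsOn's actual implementation: `RunnerState`, `Action`, `housekeeping_action`, all field names, and the order in which the checks run are assumptions made for the example. It also folds the agent-side idle check and the server-side checks into one function for brevity.

```python
from dataclasses import dataclass
from enum import Enum

# The 10-minute idle window comes from the list above.
IDLE_TIMEOUT_SECONDS = 10 * 60


class Action(Enum):
    NONE = "none"
    TERMINATE = "terminate"                      # remove the runner, launch nothing new
    RELAUNCH = "relaunch"                        # launch a replacement runner
    SHUTDOWN_AND_REPORT = "shutdown_and_report"  # stop the job, update the GitHub UI, no retry


@dataclass
class RunnerState:
    # Hypothetical fields; they only mirror the four cases described above.
    seconds_since_launch: float
    job_started: bool
    agent_failed_to_start: bool
    aws_server_error: bool
    spot_interrupted: bool


def housekeeping_action(state: RunnerState) -> Action:
    """Map a runner's observed state to one housekeeping action."""
    if state.aws_server_error:
        # No runner was assigned because of an AWS server error: try again.
        return Action.RELAUNCH
    if state.spot_interrupted:
        # Shut the job down cleanly; it is not retried automatically.
        return Action.SHUTDOWN_AND_REPORT
    if state.agent_failed_to_start:
        # Root cause unknown: terminate without re-spawning.
        return Action.TERMINATE
    if not state.job_started and state.seconds_since_launch >= IDLE_TIMEOUT_SECONDS:
        # Idle runner: no job picked up within the idle window.
        return Action.TERMINATE
    return Action.NONE
```

For example, a runner that has been up for 11 minutes without picking up a job maps to `Action.TERMINATE`, while a runner that is 30 seconds old and still waiting maps to `Action.NONE`.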
AWS Server termination errors
These errors are detected after an instance is launched, and they trigger a new instance launch:
- Server.InsufficientInstanceCapacity: There was insufficient capacity available to satisfy the launch request.
- Server.InternalError: An internal error caused the instance to terminate during launch.
- Server.ScheduledStop: The instance was stopped due to a scheduled retirement.
- Server.SpotInstanceShutdown: The instance was stopped because the number of Spot requests with a maximum price equal to or higher than the Spot price exceeded available capacity or because of an increase in the Spot price.
- Server.SpotInstanceTermination: The instance was terminated because the number of Spot requests with a maximum price equal to or higher than the Spot price exceeded available capacity or because of an increase in the Spot price.
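As a rough illustration of how such a check might look, the sketch below tests an instance description against the five codes listed above. It assumes the instance is a dictionary shaped like an entry in the EC2 `DescribeInstances` response, where the reason is reported under `StateReason.Code`. The names `RETRYABLE_STATE_REASONS` and `should_relaunch` are hypothetical and not part of RunsOn.

```python
# State reason codes listed above that trigger a new instance launch.
RETRYABLE_STATE_REASONS = frozenset({
    "Server.InsufficientInstanceCapacity",
    "Server.InternalError",
    "Server.ScheduledStop",
    "Server.SpotInstanceShutdown",
    "Server.SpotInstanceTermination",
})


def should_relaunch(instance: dict) -> bool:
    """Return True if the instance's state reason is one of the listed codes.

    `instance` is assumed to look like one element of the `Instances` list
    in an EC2 DescribeInstances response. A missing or empty `StateReason`
    is treated as "not retryable".
    """
    state_reason = instance.get("StateReason") or {}
    return state_reason.get("Code", "") in RETRYABLE_STATE_REASONS
```

With this sketch, `should_relaunch({"StateReason": {"Code": "Server.InternalError"}})` evaluates to `True`, while an instance stopped with a `Client.*` code, or one with no `StateReason` at all, evaluates to `False`.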