We observed a gradual buildup of orphaned runners and EC2 instances that ultimately caused our CI to grind to a halt.
The problem seems to be that the `start` job is not marked as `always()`. This opens a race condition in which an EC2 instance is started, but the `start` job gets cancelled before it reports back. In that case, the `stop` job can't terminate the instance because it has not yet received its name. Similarly, a runner can be left orphaned.
It seems that the `start` job in an `ec2-github-runner`-based workflow must be marked `always()` so that it cannot be cancelled and the race above cannot happen.
Note that cancellation of `start` jobs is common if the workflow is part of a concurrency group. For example, if the workflow is triggered by updates to a given PR, occasional fast back-to-back updates to that PR would trigger the race and lead to a buildup of orphaned runners and instances.
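For illustration, a concurrency setup of the kind described above might look like the following. This is a minimal sketch, not our exact workflow: the group expression is an assumption and only shows a common way to key the group on the PR.

```yaml
# Hypothetical trigger and concurrency settings (assumed, not taken from our workflow).
on:
  pull_request:

concurrency:
  # One group per workflow and PR ref, so a new push to the same PR
  # lands in the same group as the run that is still in progress.
  group: ${{ github.workflow }}-${{ github.ref }}
  # Cancel the in-progress run when a newer one starts. This is what can
  # cancel the start job after the instance has already been launched.
  cancel-in-progress: true
```

With `cancel-in-progress: true`, two quick pushes to the same PR are enough to cancel the first run while its `start` job is still waiting on the instance.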
Suggestion: Change the `README.md` example to mark `start` as `always()`, and document why this is important.
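A rough sketch of what the suggested change could look like is below. The job names, action reference and version tag, secret name, and all AWS identifiers are placeholders I am assuming for illustration; the AWS credentials step and the job that does the actual work are left out for brevity.

```yaml
# Sketch only: the action reference, version tag, secret name, and AWS IDs
# below are assumed placeholders, not values taken from this issue.
jobs:
  start:
    runs-on: ubuntu-latest
    # Proposed change: keep this job running even if the workflow run is
    # cancelled, so the label and instance ID are always reported back.
    if: ${{ always() }}
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-123
          ec2-instance-type: t3.nano
          subnet-id: subnet-123
          security-group-id: sg-123

  stop:
    # In a real workflow this would also list the job that does the work.
    needs: [start]
    runs-on: ubuntu-latest
    # Already the usual pattern for the stop job: run even after failure or cancellation.
    if: ${{ always() }}
    steps:
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          # Both values come from the start job; if start is cancelled before
          # it sets its outputs, these are empty and nothing can be cleaned up.
          label: ${{ needs.start.outputs.label }}
          ec2-instance-id: ${{ needs.start.outputs.ec2-instance-id }}
```

The only difference from a typical start/stop layout is the `if` on `start`. I have not verified that this closes the race in every case, only that it matches the behavior we observed.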