Description
A user in my org contacted me with a job that never ran.
I found an error message like this:
2024/05/23 21:16:34 failed to process job: failed to check to register runner (target ID: 10ed1fec-041c-4829-ab1c-b7de7ff9e673, job ID: 6bbe1e7e-9b8c-49c7-bbc3-623eee4ca54c): failed to check existing runner in GitHub: failed to get list of runners: failed to list runners: failed to list organization runners: GET https://REDACTED/actions/runners?per_page=100: 503 OrgTenant service unavailable []
I tracked that error down to the call that lists runners for the org.
Line 48 in cbe7eda:
    runners, resp, err := listRunners(ctx, client, owner, repo, opts)
In this particular case, the trace starts in starter.go, in the function ProcessJob, where the "Strict" config is true on a call to "checkRegisteredRunner".
As a result of this 503, ProcessJob calls deleteInstance.
The overall impact of the error is that the runner is deleted, which led to the job never being worked on.
I contacted GitHub Enterprise support and they responded with the following suggestion:
Encountering a 503 error may occur when the server is temporarily overwhelmed and requires a moment to stabilize. This situation could be attributed to high traffic, maintenance activities, or a brief interruption.
In your specific case, the appearance of the error message "OrgTenant service unavailable" indicates a temporary disruption with the service responsible for managing organization actions/runners.
When confronted with a 503 error, it is advisable to establish a retry mechanism. It is important not to attempt immediate retries but rather consider implementing an exponential backoff strategy. This approach involves increasing the wait time between each retry to allow the server sufficient time to recover and mitigate potential complications.
I'll add a comment describing how I mitigated this with a code change.