-
Notifications
You must be signed in to change notification settings - Fork 15
Description
Took quite a bit of digging to get a reliable test case that got me to a smoking gun...
First, the symptom....
For very short jobs, as in jobs that only ran for a 20 - 50 seconds (for example a job that errors out super fast on tag check), these jobs would produce a runner, great! The job would complete, good! Buutt... I'd end up eventually getting another runner that would replace that runner. When I'd check in the db i'd find a job in the jobs table for that same job which already ran.
As I turned debug on I kept finding these over and over...
2024/05/13 15:30:36 [DEBUG] myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... (second: 70s)
and these were getting produced despite the fact that I was watch with a gh api ... /repos/$org/$repo/actions/runners cli script the runner get registered, be online, then be removed.
yet still I kept seeing that message saying myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry...
So I threw a bunch of logging in and finally nailed down to a test with to a minimum of this call
func Test(t *testing.T) {
config.Load()
config.Config.Debug = true
client, err := NewClient(os.Getenv("TEST_PAT"))
if err != nil {
t.Fatalf("%v", err)
}
count := 0
for count < 300 {
count++
ctx := context.Background()
var opts = &github.ListOptions{
Page: 0,
PerPage: 100,
}
runners, _, err := client.Actions.ListRunners(ctx, "REDACTED", "REDACTEDREPO", opts)
if err != nil {
t.Fatalf("failed to list repository runners %v", err)
}
for _, r := range runners.Runners {
fmt.Println("API Response:" + *r.Name + "\n")
}
time.Sleep(400 * time.Millisecond)
}
}
Running against my shell script...
until [ "" == "!" ]; do
echo "### SSSTTTAAARRRTTT"
date
gh api \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
--hostname REDACTED \
-q '.runners[] | "\(.name, .status)"' \
/repos/$org/$repo/actions/runners
echo "### EEENNNDDD"
sleep 2
done
and sure enough! I saw that the bash process was accurate but the client you are producing was reporting stale information.
So, I dug deeper! and found this...
func NewClient(token string) (*github.Client, error) {
oauth2Transport := &oauth2.Transport{
Source: oauth2.StaticTokenSource(
&oauth2.Token{AccessToken: token},
),
}
transport := &httpcache.Transport{
Transport: oauth2Transport,
Cache: httpCache,
MarkCachedResponses: true,
}
So, I decide, hey lets deactivate that and see what happens...
this
return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: transport})
became this
return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: oauth2Transport})
difference is, I'm not using the cache transport, I'm just using the outh2Transport now.
and when I put bash against the go test, they matched up! Good.
So I'm thinking is happening is the following...
Fast Job Race Condition
GHE -> MS (new job)
MS -> AWS (new EC2) 2024/05/13 14:02:50 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-0a3d5b43ed7e398ad)
MS -> GHE (poll) is not found in GitHub, will retry...
EC2 -> GHE (register)
MS -> GHE (poll) is not found in GitHub, will retry...
GHE -> GHE (schedules on EC2)
MS -> GHE (poll) is not found in GitHub, will retry...
GHE -> GHE (unregisters worker)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (starts working on job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (finishes the job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
... 5 minutes of this ...
... myshoes now thiks the job never got picked up and tries to create a new runner ---
MS -> AWS (new EC2) 2024/05/13 14:07:56 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-03b972b4d9ec4ca14)
... if your shoes-aws implementation doesn't catch the return of "80813ad7-dbad-4f4f-aaba-147bbe4057fa" you will get a new ec2 instance that sits idle.
MS -> MS (polling through function sanitizeGitHubRunner) "{...} is idle and not running 6h0m0s, so not will delete ..."
So eventually, after paying for a runner for 6 hours this gets cleaned up.