
Httpcache is causing excessive runner creation #205

@alfred-stokespace

Description


Took quite a bit of digging to get a reliable test case that got me to a smoking gun...

First, the symptom....

For very short jobs, as in jobs that only ran for 20 to 50 seconds (for example a job that errors out almost immediately on a tag check), these jobs would produce a runner, great! The job would complete, good! But then I'd eventually get another runner that replaced the first one. When I checked the db, I'd find a row in the jobs table for that same job, which had already run.

After turning debug on, I kept finding these over and over...

2024/05/13 15:30:36 [DEBUG] myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... (second: 70s)

and these were getting produced despite the fact that I was watching, with a gh api ... /repos/$org/$repo/actions/runners CLI script, the runner get registered, come online, then be removed.
Yet I still kept seeing that message saying myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry...
So I threw in a bunch of logging and finally narrowed it down to a minimal test case around this call:

func Test(t *testing.T) {
	config.Load()
	config.Config.Debug = true

	client, err := NewClient(os.Getenv("TEST_PAT"))
	if err != nil {
		t.Fatalf("%v", err)
	}
	count := 0
	for count < 300 {
		count++
		ctx := context.Background()
		var opts = &github.ListOptions{
			Page:    0,
			PerPage: 100,
		}
		runners, _, err := client.Actions.ListRunners(ctx, "REDACTED", "REDACTEDREPO", opts)
		if err != nil {
			t.Fatalf("failed to list repository runners %v", err)
		}
		for _, r := range runners.Runners {
			fmt.Println("API Response:" + *r.Name + "\n")
		}
		time.Sleep(400 * time.Millisecond)
	}
}

Running against my shell script...

until [ "" == "!" ]; do
    echo "### SSSTTTAAARRRTTT"
    date
    gh api \
      -H "Accept: application/vnd.github+json" \
      -H "X-GitHub-Api-Version: 2022-11-28" \
      --hostname REDACTED \
      -q '.runners[] | "\(.name, .status)"' \
      /repos/$org/$repo/actions/runners
    echo "### EEENNNDDD"
    sleep 2
done

and sure enough! The bash process was accurate, but the client you are producing was reporting stale information.

So I dug deeper and found this...

func NewClient(token string) (*github.Client, error) {
	oauth2Transport := &oauth2.Transport{
		Source: oauth2.StaticTokenSource(
			&oauth2.Token{AccessToken: token},
		),
	}
	transport := &httpcache.Transport{
		Transport:           oauth2Transport,
		Cache:               httpCache,
		MarkCachedResponses: true,
	}

So I decided, hey, let's deactivate that and see what happens...

this

return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: transport})

became this

return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: oauth2Transport})

the difference is, I'm not using the cache transport anymore, I'm just using the oauth2Transport directly.

and when I ran the bash script against the Go test, they matched up! Good.

So what I think is happening is the following...

Fast Job Race Condition
GHE -> MS (new job)
MS -> AWS (new EC2) 2024/05/13 14:02:50 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-0a3d5b43ed7e398ad)
MS -> GHE (poll) is not found in GitHub, will retry... 
EC2 -> GHE (register)
MS -> GHE (poll) is not found in GitHub, will retry... 
GHE -> GHE (schedules on EC2)
MS -> GHE (poll) is not found in GitHub, will retry... 
GHE -> GHE (unregisters worker)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (starts working on job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (finishes the job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
... 5 minutes of this ...
... myshoes now thinks the job never got picked up and tries to create a new runner ...
MS -> AWS (new EC2) 2024/05/13 14:07:56 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-03b972b4d9ec4ca14)
... if your shoes-aws implementation doesn't catch the return of "80813ad7-dbad-4f4f-aaba-147bbe4057fa", you will get a new EC2 instance that sits idle.
MS -> MS (polling through function sanitizeGitHubRunner) "{...} is idle and not running 6h0m0s, so not will delete ..."

So eventually, after paying for a runner for 6 hours, this gets cleaned up.
