Httpcache is causing excessive runner creation

Took quite a bit of digging to get a reliable test case that got me to a smoking gun...

First, the symptom....

For very short jobs, as in jobs that only ran for a 20 - 50 seconds (for example a job that errors out super fast on tag check), these jobs would produce a runner, great!  The job would complete, good!  Buutt... I'd end up eventually getting another runner that would replace that runner. When I'd check in the db i'd find a job in the jobs table for that same job which already ran.

As I turned debug on I kept finding these over and over...
```
2024/05/13 15:30:36 [DEBUG] myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... (second: 70s)
```
and these were getting produced despite the fact that I was watch with a ` gh api ... 
 /repos/$org/$repo/actions/runners` cli script the runner get registered, be online, then be removed.
yet still I kept seeing that message saying `myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... `
So I threw a bunch of logging in and finally nailed down to a test with to a minimum of this call 

```
func Test(t *testing.T) {
	config.Load()
	config.Config.Debug = true

	client, err := NewClient(os.Getenv("TEST_PAT"))
	if err != nil {
		t.Fatalf("%v", err)
	}
	count := 0
	for count < 300 {
		count++
		ctx := context.Background()
		var opts = &github.ListOptions{
			Page:    0,
			PerPage: 100,
		}
		runners, _, err := client.Actions.ListRunners(ctx, "REDACTED", "REDACTEDREPO", opts)
		if err != nil {
			t.Fatalf("failed to list repository runners %v", err)
		}
		for _, r := range runners.Runners {
			fmt.Println("API Response:" + *r.Name + "\n")
		}
		time.Sleep(400 * time.Millisecond)
	}
}
```
Running against my shell script...
```
until [ "" == "!" ]; do
    echo "### SSSTTTAAARRRTTT"
    date
    gh api \
      -H "Accept: application/vnd.github+json" \
      -H "X-GitHub-Api-Version: 2022-11-28" \
      --hostname REDACTED \
      -q '.runners[] | "\(.name, .status)"' \
      /repos/$org/$repo/actions/runners
    echo "### EEENNNDDD"
    sleep 2
done
```
and sure enough! I saw that the bash process was accurate but the client you are producing was reporting stale information.

So, I dug deeper! and found this...
```
func NewClient(token string) (*github.Client, error) {
	oauth2Transport := &oauth2.Transport{
		Source: oauth2.StaticTokenSource(
			&oauth2.Token{AccessToken: token},
		),
	}
	transport := &httpcache.Transport{
		Transport:           oauth2Transport,
		Cache:               httpCache,
		MarkCachedResponses: true,
	}
```
So, I decide, hey lets deactivate that and see what happens...

this 
```
return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: transport})
```
became this

```
return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: oauth2Transport})
```
difference is, I'm not using the cache transport, I'm just using the outh2Transport now.

and when I put bash against the go test, they matched up!  Good.

So I'm thinking is happening is the following...

```
Fast Job Race Condition
GHE -> MS (new job)
MS -> AWS (new EC2) 2024/05/13 14:02:50 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-0a3d5b43ed7e398ad)
MS -> GHE (poll) is not found in GitHub, will retry... 
EC2 -> GHE (register)
MS -> GHE (poll) is not found in GitHub, will retry... 
GHE -> GHE (schedules on EC2)
MS -> GHE (poll) is not found in GitHub, will retry... 
GHE -> GHE (unregisters worker)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (starts working on job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (finishes the job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
... 5 minutes of this ...
... myshoes now thiks the job never got picked up and tries to create a new runner ---
MS -> AWS (new EC2) 2024/05/13 14:07:56 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-03b972b4d9ec4ca14)
... if your shoes-aws implementation doesn't catch the return of "80813ad7-dbad-4f4f-aaba-147bbe4057fa" you will get a new ec2 instance that sits idle.
MS -> MS (polling through function sanitizeGitHubRunner) "{...} is idle and not running 6h0m0s, so not will delete ..."
```
So eventually, after paying for a runner for 6 hours this gets cleaned up. 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Httpcache is causing excessive runner creation #205

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Httpcache is causing excessive runner creation #205

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions