Delay in starting an EC2 instance #2045
Hi,

We've noticed lately that some runs are staying in this state for a very long time. The job ultimately does start, but I'm wondering what the typical cause of such an issue could be.

In CloudWatch I do see the following:

- Very first occurrence of a log associated with the run I started
- Corresponding EC2 instance being started

That is a bit more than 3 hours between the job first being received and the EC2 instance being started. It does not happen on all jobs; we've only noticed it recently, and our initial investigation seems to point towards matrix and/or nightly jobs. We are using these settings:

We are not hitting the maximum runner count. For example, I'm seeing the issue on a job right now and there is only one GitHub runner started.

Any pointers to help us understand where the issue could come from would be greatly appreciated. Thanks,
Replies: 1 comment
Most likely a case of a runner being stolen by another job. I.e. imagine job A and B are launched, and use the same `runs-on` labels. If runner A fails to start, runner B might be assigned to job A, while job B hangs for a while, until job C with the same labels is started. At this point job B might start executing, while job C hangs, etc.

Best way to debug this would be to assign `${{ github.run_id }}` in your `runs-on` labels to force the runner to process the job it was started for, but I don't think this project supports that.
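For illustration, here is a minimal sketch of what the workflow side of that idea could look like, assuming the runner launched for a given run were also registered with a matching per-run label (which, as noted above, this project may not support). The `run-${{ github.run_id }}` label and the workflow name are made up for this example:

```yaml
name: pin-job-to-its-runner   # hypothetical example workflow
on: workflow_dispatch

jobs:
  build:
    # Including the run ID in the runs-on labels means only a runner
    # registered with the same "run-<id>" label can pick this job up,
    # so it cannot be "stolen" by a runner started for another run.
    runs-on: [self-hosted, linux, "run-${{ github.run_id }}"]
    steps:
      - uses: actions/checkout@v4
      - run: echo "Handled by the runner dedicated to run ${{ github.run_id }}"
```

The same label would have to be attached to the runner at registration time for the job to ever be picked up, which is the part the commenter doubts this project currently supports.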