-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GitHub workflow action failures for large feeds #1304
Comments
As a follow-up, I've verified that heap usage doesn't look like it's changed significantly when validating against But it still takes a lot of heap to process these feeds. We max out at 8GB heap used for So what do we do about this? A couple of options:
@isabelle-dr I think I'm going to need your input here. I specifically don't have access to the MobilityData orgs action usage stats but maybe you do? Could be useful to give us an estimate of what this might cost to migrate to the larger runners. We might be able to limit spend by only using large runners for acceptance tests and limiting their execution? |
So I've looked into our action usage stats, but there is still nothing. I believe it is due to MobilityData's status as a non-profit which gives us a free Team plan, therefore they don't cumulate any stats. I've registered for the large runners, it's a beta program, and I'm waiting for a confirmation before these runners can be accessed. We'll have to see what costs it incurs I guess. Did you try the Alternatively, we could host our own runner by creating a VM in Digital Ocean just for these tasks, but the costs could be important, between 48$ and 96$/month depending on the setup. I could check if we don't already have a running VM that could be used for this, since it's ephemeral in usage. |
Alright, that was fast. The large runners are ready, I believe the one you need is called Runner group: Default Larger Runners |
@fredericsimard I can't access the link you posted. Just to confirm, is |
And I've removed the larger ones, leaving the classic one and the 4-core VM. |
…ning out of heap memory when processing large feeds. Closes #1304.
* Bump validation workflows to use large machine instances to avoid running out of heap memory when processing large feeds. Closes #1304. * Try using a label instead. * label => labels * Try yet another runner syntax.
I'm going to reopen this for some follow-up discussion. In #1320, we modified our workflows to use runners with more memory to help process larger feeds. A couple thoughts from that update:
Thoughts @isabelle-dr @fredericsimard ? |
I'm willing to try it. There's only one large runner in the runner group, so perhaps adding one more would help. But wouldn't it be the same cost? Two machines doing the same work in half the time as one?
That's a lot. Even a fully dedicated VM for this kind of task on DO would be 580..1200$/year. What I mean by that is we can run our own runner instead of GitHub's. For the price difference, I think it's worth it. There might be some work to make it run with Public repos I believe, I've seen something to that matter last night, would need to be confirmed. |
Interesting, there does appear to be /some/ parallelization of jobs with the existing single runner, maybe using each of the 4 cores to run individual task?
That's my interpretation as well, but we'd want to double check the billing numbers after enabling. But I wouldn't stop at two runners, I'd like 4 or 5 to get our runtime numbers down.
I'll ultimately leave that call up to you, but a couple of counterpoints:
|
If we decided to run the acceptance tests run on demand before merging instead of on every commit, would this make a difference? |
It's possible. I haven't seen anything about that in the documentation I've read so far.
We could test that yes. Let me know when you're ready to test this.
Yes, I've read that to. Wouldn't the PRs requiring a reviewer be our line of defence in this case? I'm sure there is a way to wipe the VM after a run, surely we can replicate this behaviour in some form? Also I've looked into the costs for the runner and its usage so far, we have: At that price I could've had a 2-core VM running for a month. This is an expensive service from GH. |
It's not so simple. Our workflows are defined in our repository and in practice, a user can submit a pull request against those workflows and any changes in those workflows will be reflected immediately in the execution of those workflows, whether they have been reviewed or not. If this sounds like a flaw in GitHub, I'm inclined to agree with you. So long-story-short, if we have a self-hosted runners associated with our public repository, then we cannot prevent someone from executing arbitrary commands on them. This is why GitHub advices against this setup. |
Yeah, makes sense. Let's do the test with multiple machines. Although the cost is really high for what it is, I'd need to get an approval for such a high amount. And demonstrate it fixes an issue we can't fix any other way, or for cheaper. |
As a follow-up, @fredericsimard bumped the max runner pool size from 10 => 20 yesterday. I ran a new instance of the Rule acceptance workflow (run link) with the updated runner count. Results:
That's roughly the same billable time as the previous test runs in 43% less actual run time (not quite half, but close). So main conclusion is that we can bump the size of the runner pool to get faster execution at about the same cost. I'd propose bumping to 30 :) I also want to point out that there was some confusion earlier about how many runners we had. I had interpreted @fredericsimard 's comment above to mean we only had a single runner for our pool, when in fact we had 10. So @fredericsimard to your point about self-hosting a VM, I'd actually be asking you to host 10-20 runners, not just 1-2. I feel like that makes self-hosted runners less competitive on price. Combine that with the security concerns and it seems like a non-starter to me. But maybe you disagree? My take is that we are left with two options:
Again, I'm going to defer to @fredericsimard and @isabelle-dr on the cost + quality trade-offs here. Personally, I'm inclined to just go with #2 for now and potentially explore #3 as a follow-up. Thoughts? |
@bdferris-v2 I just want to understand why these tests were required in the first place: isn't it because the classic runners are just not up to the task and some automated tests never got completed? So the question is, we need large runners to complete those tests, or remove those tests, which is not the desired outcome, correct? I can try and get a budget approved for ~5000$/year for this if we can justify it. So is it absolutely required, and if yes, why? |
There are two key workflows here:
These tests used to run more reliably but either due to changes in runner memory configuration (see discussion in opening comment) or increase in feed size, they were no longer reliably running. Thus the switch to larger runners. These tests are important and we do not want to remove them entirely. However, as discussed above, we could remove some of the larger feeds from our test corpus, allowing us to go back to smaller runners, at the expense of reduced test coverage. |
Alright, I have my bi-weekly meeting tomorrow AM, will bring this up. Thanks! |
Any update on this? |
Oh yeah, sorry. I have a pre-approval, but we need more data in order to get a monthly spending limit set. Last time I check the bill was at 116 h, so less than 5 days, and 109$ USD has been accrued. I've set the limit to 500$ and I'll keep on checking on it once every few days. For now, keep going as normal so we can get real numbers. |
…tyData#1320) * Bump validation workflows to use large machine instances to avoid running out of heap memory when processing large feeds. Closes MobilityData#1304. * Try using a label instead. * label => labels * Try yet another runner syntax.
…tyData#1320) * Bump validation workflows to use large machine instances to avoid running out of heap memory when processing large feeds. Closes MobilityData#1304. * Try using a label instead. * label => labels * Try yet another runner syntax.
@bdferris-v2 @fredericsimard I think we can close this as a few actions have occurred since it was open, and for the past year, the E2E and acceptance tests actions have been executed reliably. We will continue monitoring the GitHub actions and bills. cc: @emmambd
|
@davidgamez Completely agree. Closing! |
I've noticed an increased incidence of GitHub workflow action failures (link) for our
end_to_end
andacceptance_test
workflows. Digging deeper into these failures, the validator itself isn't failing. Instead, our workflow runners are being terminated with the following message:The process itself includes the following in its logs:
This is consistent with the process being terminated externally via SIGTERM.
So what's actually going on? Other folks are reporting the same issue to GitHub and it's pointed out that most of these runners are using
ubuntu-latest
, which has been upgraded to22.04
over the past week (link). That potentially changes the resource envelope of the running worker, which might lead to resource exhaustion at different points than the previous runner.Indeed, looking at our action failures, it seems to occur most often for large feeds such as
no-unknown-agder-kollektivtrafikk-as-gtfs-1078
(11M stop-time entries) andde-unknown-ulmer-eisenbahnfreunde-gtfs-1081
(huge shapes.txt file).So what do we do about this? Still digging into that.
But just to check, @aababilov and @asvechnikov2, your recent changes over the past month should have reduced memory usage for large feeds, not increased it, correct?
The text was updated successfully, but these errors were encountered: