# RFC 32 - taskgraph: Detect when builds from a previous push can be reused

* Comments: [#32](https://api.github.com/repos/mozilla-releng/releng-rfcs/issues/32)
* Proposed by: @JohanLorenzo

# Summary

In the decision task, Taskgraph should be able to reuse builds from a previous push when the changeset is a test-only one.
## Motivation

Per [1]:

> Approximately 13-14% of pushes to the try and autoland branches are test-only pushes. If test repackaging can be done without running builds, this can reduce the CI costs.
# Details

## In a few words

The decision task looks at the push it's dealing with and gets:

* the revision before this push
* the files that were changed by this push.

If none of the changed files is a source file, the decision task will reuse the builds made in the previous push.
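The "test-only" check above can be sketched as a simple path filter. This is a hypothetical helper: the function name and the glob patterns are illustrative, and the real list of non-source patterns would live in taskgraph's own configuration.

```python
import fnmatch

# Files matching these globs do NOT require a rebuild. Illustrative only;
# note fnmatch's "*" also matches "/", so "testing/*" covers subdirectories.
NON_SOURCE_PATTERNS = [
    "testing/*",
    "*.md",
    "*.rst",
]

def is_test_only_push(changed_files):
    """Return True if every changed file matches a non-source pattern,
    meaning builds from the previous push can be reused."""
    return all(
        any(fnmatch.fnmatch(path, pattern) for pattern in NON_SOURCE_PATTERNS)
        for path in changed_files
    )
```

For example, a push touching only `testing/web-platform/meta/foo.ini` and `docs/bar.md` would qualify, while any push touching a `.cpp` file would not.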
## Workflows

### Happy path

Conditions:

a. The previous push produced builds.
b. All builds from the previous push are completed.
c. No platform is missing in the previous push.
d. The previous decision task has already run.
e. The previous decision task is green.
f. We are confident the previous builds are good.
The decision task runs. Everything goes as usual up until the optimization phase. During this phase, taskgraph gets the push metadata from the `json-pushes` endpoint[2], including the list of changed files. If a source file is present, the build tasks in the graph remain unchanged. Otherwise, taskgraph gets the hash of the parent of the last changeset (no need to make a new request, it's available in the previous response). It then queries the Taskcluster index for the corresponding builds. The indexes will contain (if they don't already) the revision, so taskgraph can easily look them up. Taskgraph then optimizes away the build tasks, just like we do with Docker images. Finally, it injects some `insert-indexes` tasks to point this revision to the builds it reuses. This will be important for the next case.
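The first half of this phase can be sketched as follows. The function names are hypothetical; the parsing assumes the field layout of the `version=2&full=1` response shape shown in [2] (one push keyed by push id, each changeset carrying `node`, `parents`, and `files`).

```python
import json
from urllib.request import urlopen

JSON_PUSHES_URL = (
    "https://hg.mozilla.org/{repo}/json-pushes"
    "?version=2&changeset={rev}&full=1"
)

def fetch_push(repo, rev):
    """Fetch the push containing `rev` (requires network access)."""
    with urlopen(JSON_PUSHES_URL.format(repo=repo, rev=rev)) as response:
        return json.load(response)

def extract_push_metadata(payload):
    """From a version=2, full=1 json-pushes payload, return the set of
    files changed by the push and the parent of its first changeset
    (i.e. the tip of the previous push)."""
    # Querying by changeset returns exactly one push in the response.
    (push,) = payload["pushes"].values()
    changesets = push["changesets"]
    changed_files = {f for cset in changesets for f in cset["files"]}
    previous_revision = changesets[0]["parents"][0]
    return changed_files, previous_revision
```

Note that the parent revision comes for free from the same response, which is why no second request is needed.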
### The previous push also reused builds

Condition a. is changed. A new push comes in. Taskgraph processes it the same way as described above: it gets the hash of the parent of the last changeset of the current push. It finds existing builds on the Index thanks to this very hash, because it was populated by the `insert-indexes` task. Thanks to this logic, condition a. doesn't matter anymore; both cases are handled the same way. Let's ignore it from now on.
### Builds of the previous push aren't completed yet

Condition b. is changed. A new push comes in. By the time the decision task runs, the builds are not done yet (they may even have failed). In the current way we use indexes, we have no way to tell that a given route is being populated by another task. That's something Callek calls "eager indexes", and he filed bug 1653058 for it. Thus, this current bug depends on it. Once that is solved, taskgraph will just have to listen to the right indexes to be sure to wait on upcoming builds.
> **Contributor:** This looks like we're polling indexes at runtime, rather than expanding indexes at decision time? The former is more flexible but less secure. I'm wondering if we can: a. insert an index pointing at taskID before taskID runs
>
> **Contributor:** Then again, perhaps we don't care as much for tests. We could go with the index solution, but prefer not to use it for any tasks that create artifacts we plan to ship.
>
> **Contributor:** My understanding is the intent is to resolve the indexes in the decision task still. They'll just not be complete yet, which is a change from previously.
>
> **Contributor (Author):** I agree with Tom. I'd like indexes to be expanded at decision time.
### The previous push didn't build all platforms

Condition c. is changed. We don't care about conditions a. and b. here. We may end up in situations where just a single platform was built in the previous push. In this case, the previous push will add `insert-indexes` tasks for any optimized-away platform.
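The per-platform decision can be sketched like this. The helper name and the index route layout are assumptions for illustration; the real routes would follow taskgraph's existing index namespaces.

```python
def builds_to_reuse(platforms, index_lookup, previous_revision):
    """Split `platforms` into those whose builds can be reused from the
    previous revision and those that must be (re)built.

    `index_lookup` is any callable mapping an index route to a taskId,
    or None when the route does not exist (a stand-in for the
    Taskcluster index service).
    """
    reusable, missing = {}, []
    for platform in platforms:
        # Hypothetical route layout: one route per (revision, platform).
        route = f"gecko.v2.autoland.revision.{previous_revision}.firefox.{platform}"
        task_id = index_lookup(route)
        if task_id is not None:
            reusable[platform] = task_id
        else:
            missing.append(platform)
    return reusable, missing
```

Platforms in `missing` keep their build tasks in the graph, and those builds get `insert-indexes` tasks so later pushes can find them.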
### The previous decision task is still running

Condition d. is changed. We may end up in situations where the current decision task starts while the previous one is still running. That can easily happen on try or on autoland. We cannot use the index of the previous decision task because it's yet to be populated. This means, for any missing previous build, we need to make sure the previous decision task completed in the first place. If it didn't, we need to make sure it's not running. We could use Treeherder to learn this piece of data, but that ties taskgraph to another service. We could also change `.taskcluster.yml` to spin up a dummy task alongside the decision one to "eager index" it. I kind of like the latter, but I wonder if Chain of Trust may be broken because of it.
> **Contributor:** This could potentially be something that is handled by the We definitely should not use treeherder for this.
>
> **Contributor (Author):** I agree for Treeherder. Just to make sure I understand your first sentence: do you suggest that the dummy task should be spun up by the decision task?
>
> **Contributor:** I had been suggesting it be created by build-decision (alternatively, it could create the indexes directly, rather than requiring a separate task). That said, we already wait for the on-push decision task in cron, and it may be reasonable to do the same here. If we are waiting, I'm not sure there is much difference between waiting on a task, and waiting for an index to appear.
>
> **Contributor (Author):** I'm sorry, I'm not sure I follow your proposal. Let me try to rephrase it. Should we change I don't have an opinion at the moment, I'm just making sure I grasp this right.
### The previous decision task has failed and a rerun won't fix it

Condition e. is changed. The previous push may just turn red, and we can only salvage it with another push. The "eager index" created in the previous paragraph tells taskgraph the previous task has started. Sadly, since the decision task remains red, there is no way to populate the index with that broken task. So taskgraph needs to know when to stop waiting for the previous decision task. A simple solution could be to bake the `taskId` of the previous decision task into the payload of the previous dummy task. It's quite easy to do with `.taskcluster.yml`. This way, taskgraph knows where the dummy task is, thanks to "eager indexes", and then it knows the `taskId` of the previous decision task so it can monitor it.
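Once taskgraph knows the previous decision task's `taskId`, deciding whether to keep waiting can be sketched as below. The `state` values follow the Taskcluster queue's task-status response (`GET /api/queue/v1/task/<taskId>/status`); the exact handling of each state is an assumption of this sketch, not settled behavior.

```python
def should_keep_waiting(status_payload):
    """Given a Taskcluster queue task-status payload for the previous
    decision task, return (keep_waiting, can_reuse_builds)."""
    state = status_payload["status"]["state"]
    if state in ("pending", "running"):
        # Still going: poll again later.
        return True, False
    if state == "completed":
        # Green: its indexes will be populated, so we can reuse builds.
        return False, True
    # "failed" or "exception": stop waiting and schedule our own builds.
    return False, False
```

In the red-push case described above, the loop terminates as soon as the previous decision task reaches a terminal state, so the current decision task never waits forever.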
### Sheriffs want to respin builds on a test-only changeset

Condition f. is changed. For whatever reason, we want to make builds on a revision that got its builds optimized away. This is going to be handled in an action task. I don't know whether `rebuild-kinds` will work out of the box, or if we need to pass a custom parameter to tell taskgraph not to optimize them away once more. In any case, it should be a matter of guarding the optimization code with an `if` statement.
### Any other edge cases?

I currently don't see any edge case that isn't covered in the hg + hg.mozilla.org setup. Feel free to tell me if you see something missing :)
> **Contributor:** One case that I don't see discussed is when the decision task is successful, but the builds being pointed at failed.
>
> **Contributor:** This is a case that is known (and should be captured here) -- in theory if the decision task was green but the build failed it is one of three things: an intermittent failure, a failure that would also happen on this current test-only push, or needing a new build on this cset anyway [whether or not we recognized that with a bug in our logic]. This is partly why sheriffs want some extra Treeherder UX to identify "depends on previous push" type stuff.
>
> **Contributor (Author):** I'll add it explicitly. In my opinion, the current decision task shouldn't wait for previous builds, otherwise we may end up with decision tasks that run for too long. So I think the current decision task should reuse previous builds no matter their state. Then, if something goes wrong, sheriffs will be able to backfill builds on top of the current decision task. These builds will overwrite the indexes tied to this revision, so any future decision task will backtrack up to the current revision instead of the previous one. How does that sound to you all?
>
> **Contributor:** Yeah, I don't think we should wait on the tasks to complete. It may be worth seeing if they are complete, and doing something different if they have already failed, though. This does not need to be handled in the initial implementation.
>
> **Contributor (Author):** Good idea! I agree we can do something if the builds have already failed!
## Local runs

Callek pointed out something really important when we chatted: people must be able to run taskgraph while being offline. So any calls made to hg.mozilla.org must be translated into hg commands that run locally. I know `hg wip` provides something close to `json-pushes`, in the sense that you know what your current head was based off of. So we can get the last public changeset, and taskgraph will point optimized-away builds at this repo/revision. I just need to figure out a more machine-friendly command. If any of you know what this hg command is, I'll be happy to hear from you :)
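One machine-friendly candidate, sketched below, is the revset `last(public() and ancestors(.))`: the most recent public ancestor of the working-directory parent. Treat the revset choice as an assumption, not a settled answer to the open question above.

```python
import subprocess

# Candidate definition of "the last public changeset": the most recent
# public ancestor of the working directory parent. An assumption of this
# sketch, not a settled choice.
LAST_PUBLIC_REVSET = "last(public() and ancestors(.))"

def hg_last_public_command(repo_path):
    """Build an `hg` invocation that prints only the 40-char node hash."""
    return [
        "hg", "--repository", repo_path,
        "log", "--rev", LAST_PUBLIC_REVSET, "--template", "{node}",
    ]

def last_public_changeset(repo_path, run=subprocess.run):
    """Run hg locally (offline-friendly) and return the node hash.
    `run` is injectable so the lookup can be faked in tests."""
    result = run(hg_last_public_command(repo_path),
                 capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Since the command only reads local repository state, it works with no network access, which is the whole point of this section.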
> **Contributor:** Depending on where the code gets added, this might not be necessary,
>
> **Contributor:** I should note my discussion was
>
> **Contributor (Author):** Okay! For the record,
>
> **Contributor:** I'm less eager to require a token for local use, but we can probably fudge it a bit if necessary. Besides, we're not likely to rely on github for local gecko decision task stuff to the extent we need to for actual pushes. We can also probably have decent fallbacks for local that, while not exactly accurate, are "close enough".
I see one issue with that: we won't be able to know what revisions were already pushed to try. In fact, pushing there doesn't make any changeset public (which is good). So the result you have locally may not be the same as the one the decision task created. The decision task will be able to optimize 2 try pushes away thanks to the `json-pushes` endpoint.
## Translating the proposal to git

Rail made a point when we chatted: even though we want this to work on hg (and more precisely hg.mozilla.org), we still want to support git (and github.com) workflows for projects like Fenix. Good news: GitHub now exposes push metadata thanks to this new API[3]. So we should be able to implement logic similar to the `json-pushes` one. The main issue, though: we need to give a GitHub token to the decision task. The unauthenticated rate limit is way too low for the scale of Taskcluster. Authenticated requests may not be enough either, because a single user can only make a limited number of requests. So we may need several bot accounts to absorb the full load.
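A Python equivalent of the GraphQL lookup (the same query as the curl example in the review discussion below) could look like this. The `previousSha` field sits behind the `antiope-preview` media type at the time of writing; the function names are illustrative.

```python
import json
from urllib.request import Request, urlopen

GRAPHQL_URL = "https://api.github.com/graphql"

# Same query as the curl example in the review discussion: the commit's
# checkSuites expose the `push` object, which carries previousSha.
QUERY = """
query {
  repository(name: "fenix", owner: "mozilla-mobile") {
    object(oid: "%s") {
      ... on Commit {
        oid
        checkSuites(first: 1) { nodes { push { previousSha } } }
      }
    }
  }
}
"""

def fetch_previous_sha(oid, token):
    """Ask GitHub's GraphQL API for the SHA the push was based on
    (requires network access and a token)."""
    request = Request(
        GRAPHQL_URL,
        data=json.dumps({"query": QUERY % oid}).encode(),
        headers={
            "Authorization": f"bearer {token}",
            "Accept": "application/vnd.github.antiope-preview+json",
        },
    )
    with urlopen(request) as response:
        return extract_previous_sha(json.load(response))

def extract_previous_sha(payload):
    """Pull previousSha out of the GraphQL response."""
    nodes = payload["data"]["repository"]["object"]["checkSuites"]["nodes"]
    return nodes[0]["push"]["previousSha"]
```

With the head and `previousSha` in hand, the push range can then be walked with plain local git commands, mirroring the hg workflow.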
> **Commenter:** I believe there are two alternatives to minimize the number of bot accounts to manage:
>
> One potential bot avoidance (if suitable for this case) would be to do a "very light weight" fetch (or clone) to see if there are new commits-of-interest (aka pushes).
>
> **Contributor:** Regarding the
>
> **Contributor (Author):** Thanks for the input @hwine! I'm not too sure if we could make a GitHub app in this case. The code will run inside the decision task (or on someone's computer). I get the feeling a regular token is what we need. What do you think? In any case, I like the idea about the limit increase! If we confirm the GitHub app isn't suitable, then I'll be happy to write this down in the PR. Good call on the
>
> ```shell
> curl -X POST -H "Authorization: bearer $GITHUB_TOKEN" -H 'Accept: application/vnd.github.antiope-preview+json' 'https://api.github.com/graphql' --data '{"query": "query { repository(name: \"fenix\", owner: \"mozilla-mobile\") { object(oid: \"aaae70f3dc9db8e58b70a4220bc84c513a64750f\") { ... on Commit { oid, checkSuites(first:1) { nodes { push { previousSha } } } } } } }" }'
> ```
>
> returns
>
> ```json
> {
>   "data": {
>     "repository": {
>       "object": {
>         "oid": "aaae70f3dc9db8e58b70a4220bc84c513a64750f",
>         "checkSuites": {
>           "nodes": [{
>             "push": {
>               "previousSha": "0e3acfcd9467060c9ae056a715c777516c7d12e8"
>             }
>           }]
>         }
>       }
>     }
>   }
> }
> ```
>
> This means we're able to get the range of a push if we know its head commit. Proof that the returned values are correct: https://treeherder.mozilla.org/#/jobs?repo=fenix&revision=aaae70f3dc9db8e58b70a4220bc84c513a64750f
>
> Then, we can run git locally based on this range: outputs
# Open Questions

* Will Chain of Trust break if `.taskcluster.yml` spins up 2 tasks instead of a single one?
> **Contributor:** As I read it, we compare the parent task (decision/action/cron) to every task generated from `.taskcluster.yml`, and fail if it doesn't match any. We should verify.
>
> **Contributor (Author):** Sounds good! I'll create a small test case to be 100% sure, then!
>
> **Contributor (Author):** I confirm it works! I created this "eager index" task on my Fenix fork (it already has everything set up, so it was easier for me to use it): JohanLorenzo/fenix@8807e3a. A (level-1) signing task ran and successfully passed: https://firefox-ci-tc.services.mozilla.com/tasks/C0uKmk5GTA2EnRW87IJDcQ/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FC0uKmk5GTA2EnRW87IJDcQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Fchain_of_trust.log#L158
* Does `rebuild-kinds` get optimized away?
* Any other edge cases?
# Implementation

<once the RFC is decided, these links will provide readers a way to track the
implementation through to completion>

* <link to tracker bug, issue, etc.>
* <...>
[1] https://docs.google.com/document/d/1dVhDPCe3ibEJzD9VZ47VUMCy4s18NbUgeLnYJFNQtTU/edit
[2] E.g.: https://hg.mozilla.org/mozilla-central/json-pushes?version=2&changeset=56082fc4acfacba40993e47ef8302993c59e264e&full=1
[3] https://developer.github.com/v4/object/push/
[4] https://developer.github.com/v4/guides/resource-limitations/#rate-limit
> **Commenter:** Another alternative would be to use hg locally to find the appropriate revision to get builds from, and then query hg.m.o to find the corresponding push. That would also take care of the next case.
>
> **Commenter:** I think if we did so, then a future decision task would be able to reuse back-filled builds (see #32 (comment) for the use-case). But maybe I'm missing something. Could you expand on your proposal?
>
> **Commenter:** My idea was to use hg locally to find the commit to get builds from, and then turn that into a push using hg.m.o. I was thinking that might be easier than walking `json-pushes` an arbitrary amount. That said, if builds are run on a later push, we'd always ignore them, so it may not be a good idea.
>
> Thinking out loud, I wonder if `json-automationrelevance` could be enhanced to provide the information we want? Putting it on the server does make it harder to change or experiment with the logic, but it does move the data closer to the logic. I wonder if there is a middle path to offload the work to the server, while keeping the logic in-tree. I wonder if @cgsheeh has any thoughts here.
>
> **Commenter:** hg.m.o would work for gecko, but not for non-gecko. Are we looking for a gecko-only solution or a taskgraph-wide solution? (Edit: ah, I see the git thread above.)
>
> An aside about indexes: are we going to expand them and use the taskIds, or bake the index into the task definition? The former is more explicit and less vulnerable to manipulation at runtime.
>
> **Commenter:** My apologies, I don't think my explanations were clear on this. We don't walk `json-pushes`. Everything is known thanks to a single request to `json-pushes`. This request tells taskgraph what files were changed in the current push and what the revision before this push was. That revision points (thanks to TC indexes) to builds, no matter if the decision task actually made them or if it reused previous ones. This is how we don't need to walk `json-pushes`.
>
> Does it make sense to you @tomprince? If so, do you think we should expand `json-automationrelevance`? If not, I'm happy to chat on a call to clarify our respective thoughts.
>
> **Commenter:** That is a good question! I believe the decision task should do the former, just like it does with in-tree docker images.