Conversation

@JohanLorenzo JohanLorenzo commented Jul 27, 2020

Callek commented Jul 27, 2020

Offhand commentary:

  • The optimization phase of a local taskgraph run doesn't need to work without network access, but it does need to be able to run locally.
  • try pushes should only consult previous builds based on their own parents, not on whoever happened to push to try just before.

(we may want to allow try to depend on already built builds on its parent branch, and thus reuse a bit of the artifact-build logic -- but that is far less certain to work)


## Translating the proposal to git

Rail made a point when we chatted: even though we want this to work on hg (and more precisely hg.mozilla.org), we still want to support git (and github.com) workflows for projects like Fenix. Good news is: GitHub now exposes push metadata thanks to this new API[3]. So we should be able to implement logic similar to the one built on `json-pushes`. The main issue, though: we need to give a GitHub token to the decision task. The unauthenticated request limit is way too low for the scale of Taskcluster. Authenticated requests may not be enough either, because each user can only make a certain number of requests. So we may need several bot accounts to absorb the full load.
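The token-rotation idea can be sketched mechanically. This is a hypothetical helper (the function names are mine); the only real API detail assumed is GitHub's documented `X-RateLimit-Remaining` response header:

```python
def remaining_requests(headers):
    """Read GitHub's documented X-RateLimit-Remaining header from a
    REST or GraphQL response; a missing header counts as exhausted."""
    return int(headers.get("X-RateLimit-Remaining", "0"))


def should_rotate_token(headers, reserve=100):
    """Crude sketch of the several-bot-accounts idea: switch to the
    next bot token once the current one dips below a reserve of
    remaining requests."""
    return remaining_requests(headers) < reserve
```

A decision task holding a pool of tokens could check `should_rotate_token` after each response and fail over to the next account before hitting the hard limit.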
@hwine hwine Jul 27, 2020

I believe there are two alternatives to minimize the number of bot accounts to manage:

  • write the bot as a GitHub app <= much higher limits
  • explore a limit increase for a smaller number of bot accounts by opening a support ticket with GitHub

One potential bot avoidance (if suitable for this case) would be to do a "very lightweight" fetch (or clone) to see if there are new commits-of-interest (aka pushes).

Contributor

Regarding the push object in the GraphQL API, it looks like it is only available via the checks associated with that push. I'm not sure how accessible that is directly from a commit or other repository object.

Contributor Author

Thanks for the input @hwine! I'm not too sure we could make it a GitHub app in this case. The code will run inside the decision task (or on someone's computer). I get the feeling a regular token is what we need. What do you think?

In any case, I like the idea of a limit increase! If we confirm the GitHub app isn't suitable, then I'll be happy to write this down in the PR.


Good call on the push object, @tomprince! I hadn't noticed it was tied to the GitHub checks. Good news: I found a way to get the required data. I'm dumping it here so I can remember what I did next time.

```shell
curl -X POST -H "Authorization: bearer $GITHUB_TOKEN" -H 'Accept: application/vnd.github.antiope-preview+json' 'https://api.github.com/graphql' --data '{"query": "query { repository(name: \"fenix\", owner: \"mozilla-mobile\") { object(oid: \"aaae70f3dc9db8e58b70a4220bc84c513a64750f\") { ... on Commit { oid, checkSuites(first:1) { nodes { push { previousSha } } } } } } }" }'
```

returns

```json
{
    "data": {
        "repository": {
            "object": {
                "oid": "aaae70f3dc9db8e58b70a4220bc84c513a64750f",
                "checkSuites": {
                    "nodes": [{
                        "push": {
                            "previousSha": "0e3acfcd9467060c9ae056a715c777516c7d12e8"
                        }
                    }]
                }
            }
        }
    }
}
```

This means we're able to get the range of a push if we know its head commit. Proof that the returned values are correct: https://treeherder.mozilla.org/#/jobs?repo=fenix&revision=aaae70f3dc9db8e58b70a4220bc84c513a64750f

Then, we can run git locally based on this range:

```shell
git log --name-only --format='' 0e3acfcd9467060c9ae056a715c777516c7d12e8..aaae70f3dc9db8e58b70a4220bc84c513a64750f
```

outputs

```
buildSrc/src/main/java/AndroidComponents.kt
app/src/main/java/org/mozilla/fenix/browser/BaseBrowserFragment.kt
app/src/main/java/org/mozilla/fenix/browser/BaseBrowserFragment.kt
```

⚠️ The push object is still unstable. GitHub may change this API without notice.

@tomprince tomprince left a comment

I think we probably want to ensure that this optimization does not happen on release branches, at least for shippable builds.



## Local runs

Callek pointed out something really important when we chatted: people must be able to run taskgraph while offline. So any calls made to hg.mozilla.org must be translated into hg commands that run locally. I know `hg wip` provides something close to `json-pushes`, in the sense that you know what your current head was based off of. So we can get the last public changeset, and taskgraph will point optimized-away builds to this repo/revision. I just need to figure out a more machine-friendly command. If any of you know what this hg command is, I'll be happy to hear from you :)
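For what it's worth, one machine-friendly candidate (an assumption on my part, not a confirmed replacement for `hg wip`) is an `hg log` revset asking for the newest public ancestor of the working parent, printed as a bare node hash:

```python
def last_public_ancestor_cmd(repo_path="."):
    """Candidate offline hg command: the most recent public changeset
    the current working parent is based on. last(), public() and
    ancestors() are standard Mercurial revset functions."""
    revset = "last(public() and ancestors(.))"
    return ["hg", "log", "--repository", repo_path, "-r", revset, "-T", "{node}"]


# Example usage (requires a local hg checkout):
# import subprocess
# node = subprocess.run(last_public_ancestor_cmd(), capture_output=True,
#                       text=True, check=True).stdout.strip()
```

The `-T "{node}"` template keeps the output to a single hash, which is easy for taskgraph to consume.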
Contributor

Depending on where the code gets added, this might not be necessary: `mach taskgraph optimized` already needs network access, so if the code runs as part of that, I don't know that we need to worry about being able to run it offline.

Contributor

I should note my discussion was offline for `full` and `target`, but only "able to run locally" for `optimized` and `morphed` (i.e. network is ok for them).

Contributor Author

Okay! For the record, `optimized` may require a GitHub token, in addition to network access. Does this sound good to you all?

Contributor

I'm less eager to require a token for local use, but we can probably fudge it a bit if necessary. Besides, we're not likely to rely on GitHub for local gecko decision task stuff to the extent we need to for actual pushes.

We can also probably have decent fallbacks for local use that, while not exactly accurate, are "close enough".

f. We are confident the previous builds are good.


The decision task runs. Everything goes as usual up until the optimization phase. During this phase, taskgraph gets the push metadata from the `json-pushes` endpoint[2], including the list of changed files. If a source file is present in that list, the build tasks in the graph remain unchanged. Otherwise, it gets the hash of the parent of the last changeset (no need to make a new request; it's available in the previous response). Then taskgraph queries the Taskcluster index for the corresponding builds. The indexes will contain (if they don't already) the revision, so that's how taskgraph will easily look them up. Then taskgraph optimizes away the build tasks, just like we do with Docker images. Finally, it injects some `insert-indexes` tasks to point this revision to the builds it uses. This will be important for the next case.
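The decision described above can be sketched as a small pure function. This is only a sketch: the source-file patterns are invented for illustration, the `version=2&full=1` json-pushes response shape is assumed, and real taskgraph optimization strategies are structured differently:

```python
import fnmatch

# Hypothetical patterns describing files whose changes require a rebuild.
SOURCE_PATTERNS = ["app/src/*", "buildSrc/*", "*.gradle"]


def can_reuse_previous_builds(push, source_patterns=SOURCE_PATTERNS):
    """Given one json-pushes entry (version=2, full=1 shape), decide
    whether build tasks can be optimized away. Returns (reuse, parent_rev):
    reuse is False as soon as any changed file matches a source pattern;
    parent_rev is the revision whose indexed builds would be reused.
    It is the first parent of the earliest changeset in the push, so it
    is already present in the response -- no extra request needed."""
    changesets = push["changesets"]
    changed = {f for cset in changesets for f in cset["files"]}
    parent_rev = changesets[0]["parents"][0]
    reuse = not any(
        fnmatch.fnmatch(f, pat) for f in changed for pat in source_patterns
    )
    return reuse, parent_rev
```

When `reuse` is True, the graph's build tasks would be replaced by index lookups against `parent_rev`, mirroring the Docker-image optimization.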
Contributor

Another alternative would be to use hg locally to find the appropriate revision to get builds from, and then query hg.m.o to find the corresponding push. That would also take care of the next case.

Contributor Author

I think if we did so, then a future decision task would be able to reuse back-filled builds (see #32 (comment) for the use-case). But maybe I'm missing something. Could you expand on your proposal?

Contributor

My idea was to use hg locally to find the commit to get builds from, and then turn that into a push using hg.m.o. I was thinking that might be easier than walking json-pushes an arbitrary amount. That said, if builds are run on a later push, we'd always ignore them, so it may not be a good idea.

Thinking out loud, I wonder if json-automationrelevance could be enhanced to provide the information we want? Putting it on the server does make it harder to change or experiment with the logic, but it does move the data closer to the logic. I wonder if there is a middle path to offload the work to the server, while keeping the logic in-tree. I wonder if @cgsheeh has any thoughts here.

@escapewindow escapewindow Aug 3, 2020

hg.m.o would work for gecko, but not for non-gecko. Are we looking for a gecko-only solution or a taskgraph-wide solution? (Edit: ah, I see the git thread above.)

An aside about indexes: are we going to expand them and use the taskIds, or bake the index into the task definition? The former is more explicit and less vulnerable to manipulation at runtime.

Contributor Author

My idea was to use hg locally to find the commit to get builds from, and then turn that into a push using hg.m.o. I was thinking that might be easier that walking json-pushes an arbitrary amount.

My apologies, I don't think my explanations were clear on this. We don't walk json-pushes; everything is known thanks to a single request to json-pushes. This request tells taskgraph what files were changed in the current push and what the revision before this push was. That revision points (thanks to TC indexes) to builds, no matter whether the decision task actually made them or reused previous ones. This is how we avoid walking json-pushes.

Does it make sense to you @tomprince? If so, do you think we should expand json-automationrelevance? If not, I'm happy to chat on a call to clarify our respective thoughts.

An aside about indexes: are we going to expand them and use the taskIds, or bake the index into the task definition? The former is more explicit and less vulnerable to manipulation at runtime.

That is a good question! I believe the decision task should do the former, just like it does with in-tree docker images.


### Any other edge cases?

I currently don't see any edge case that isn't covered in the hg + hg.mozilla.org setup. Feel free to tell me if you see something missing :)
Contributor

One case that I don't see discussed is when the decision task is successful, but the builds being pointed at failed.

Contributor

This is a known case (and should be captured here) -- in theory, if the decision task was green but the build failed, it is one of three things: an intermittent failure, a failure that would also happen on this current test-only push, or a need for a new build on this cset anyway (whether or not we recognized that, due to a bug in our logic).

This is partly why sheriffs want some extra treeherder UX to identify "depends on previous push" type stuff.

Contributor Author

I'll add it explicitly. In my opinion, the current decision task shouldn't wait for previous builds; otherwise we may end up with decision tasks that run for too long. So I think the current decision task should reuse previous builds no matter their state. Then, if something goes wrong, sheriffs will be able to backfill builds on top of the current decision task. These builds will overwrite the indexes tied to this revision, so any future decision task will backtrack to the current revision instead of the previous one.

How does that sound to you all?

Contributor

Yeah, I don't think we should wait on the tasks to complete. It may be worth seeing if they are complete, and doing something different if they have already failed, though. That does not need to be handled in the initial implementation.

Contributor Author

Good idea! I agree we can do something if the builds have already failed!


### The previous decision task is still running

Condition d. is changed. We may end up in situations where the current decision task starts while the previous one is still running. That can easily happen on try or on autoland. We cannot use the index of the previous decision task because it's yet to be populated. This means, for any missing previous build, we need to make sure the previous decision task completed in the first place, and if it didn't, we need to make sure it's not still running. We could use Treeherder to know this piece of data, but that ties taskgraph to another service. We could also change `.taskcluster.yml` to spin up a dummy task alongside the decision task to "eager index" it. I kind of like the latter, but I wonder if Chain of Trust may be broken because of it.
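Whichever task ends up populating the index, the consuming side reduces to a bounded poll. A sketch with the index lookup injected (it could wrap a Taskcluster index query) rather than tied to a concrete client; all names here are hypothetical:

```python
import time


def wait_for_index(lookup, route, timeout=600, interval=30,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll an index route until the previous decision task (or its
    'eager index' companion task) has populated it. `lookup` takes a
    route and returns a taskId, or None while the route is empty.
    Raises TimeoutError if the route never appears within `timeout`
    seconds, so a stuck previous push can't block us forever."""
    deadline = clock() + timeout
    while clock() < deadline:
        task_id = lookup(route)
        if task_id is not None:
            return task_id
        sleep(interval)
    raise TimeoutError(f"index {route} never appeared")
```

Injecting `clock` and `sleep` keeps the loop testable; in production the defaults apply and `lookup` would do the actual network call.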
Contributor

This could potentially be something that is handled by the build-decision task directly.

We definitely should not use treeherder for this.

Contributor Author

I agree for Treeherder.

Just to make sure I understand your first sentence: do you suggest that the dummy task should be spun up by the decision task?

Contributor

I had been suggesting it be created by build-decision (alternatively, it could create the indexes directly, rather than requiring a separate task). That said, we already wait for the on-push decision task in cron, and it may be reasonable to do the same here. If we are waiting, I'm not sure there is much difference between waiting on a task and waiting for an index to appear.

Contributor Author

I'm sorry, I'm not sure I follow your proposal. Let me try to rephrase it: should we change `.taskcluster.yml` so it does not schedule a regular decision task on push, but instead schedules a build-decision task, so we can wait for the previous decision task to finish and spin up the real decision task at the right time?

I don't have an opinion at the moment; I'm just making sure I grasp this right.


# Open Questions

* Will Chain of Trust break if `.taskcluster.yml` spins up 2 tasks instead of a single one?
Contributor

As I read it, we compare the parent task (decision/action/cron) to every task generated from `.taskcluster.yml`, and fail if it doesn't match any. We should verify.

Contributor Author

Sounds good! I'll create a small test case to be 100% sure, then!

Contributor Author

I confirm it works! I created this "eager index" task on my Fenix fork (it already has everything set up so it was easier for me to use it): JohanLorenzo/fenix@8807e3a. A (level-1) signing task ran and successfully passed: https://firefox-ci-tc.services.mozilla.com/tasks/C0uKmk5GTA2EnRW87IJDcQ/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FC0uKmk5GTA2EnRW87IJDcQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Fchain_of_trust.log#L158.



### Builds of the previous push aren't completed yet

Condition b. is changed. A new push comes in. By the time the decision task runs, builds are not done yet (they may have even failed). In the current way we use indexes, we have no way to tell that a given route is about to be populated by another task. That's something Callek calls "eager indexes", and he filed bug 1653058 for it. Thus, this current bug depends on it. Once this is solved, taskgraph will just have to listen to the right indexes to be sure to wait on upcoming builds.
Contributor

This looks like we're polling indexes at runtime, rather than expanding indexes at decision time? The former is more flexible but less secure. I'm wondering if we can:

a. insert an index pointing at taskID before taskID runs
b. create an ordered list of taskIDs, from those indexes, of builds to point at. The first would be the most recent task, but might not have completed yet; the last would be the most recent successfully completed task. If there are multiple unfinished tasks in between, then we have those in our list as well.
c. add logic to use the build from the most recent successful task in that list.
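Step c. above is a small pure function once the ordered taskId list from step b. exists. A sketch with hypothetical names; `get_status` would wrap a Taskcluster queue status call:

```python
def pick_build(task_ids, get_status):
    """Given an ordered list of taskIds (most recent first), expanded
    from eager indexes at decision time, return the most recent task
    whose build already completed successfully. `get_status` takes a
    taskId and returns its state, e.g. 'completed', 'failed',
    'running', or None for an unknown task."""
    for task_id in task_ids:
        if get_status(task_id) == "completed":
            return task_id
    # No finished build yet: the caller decides whether to wait or
    # schedule a fresh build.
    return None
```

Because the list is ordered, skipping unfinished or failed entries naturally falls back to the most recent successfully completed build.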

Contributor

Then again, perhaps we don't care as much for tests. We could go with the index solution, but prefer not to use it for any tasks that create artifacts we plan to ship.

Contributor

My understanding is the intent is to resolve the indexes in the decision task still. They'll just not be complete yet, which is a change from previously.

Contributor Author

I agree with Tom. I'd like indexes to be expanded at decision time.
