Conversation

@JohanLorenzo JohanLorenzo commented Jul 27, 2020

Callek commented Jul 27, 2020

Offhand commentary:

  • The optimization phase of a local taskgraph run doesn't need to work without network access, but it does need to be able to run locally.
  • try pushes should only consult previous builds based on their own parents, not on whoever happened to push to try just before.

(we may want to allow try to depend on already built builds on its parent branch, and thus reuse a bit of the artifact-build logic -- but that is far less certain to work)


## Translating the proposal to git

Rail made a point when we chatted: even though we want this to work on hg (and more precisely hg.mozilla.org), we still want to support git (and github.com) workflows for projects like Fenix. Good news is: GitHub now exposes push metadata thanks to this new API[3]. So we should be able to implement logic similar to the one built on `json-pushes`. The main issue, though: we need to give a GitHub token to the decision task. The unauthenticated request limit is way too low for the scale of Taskcluster. Authenticated requests may not be enough either, because each user can only make a certain number of requests. So we may need several bot accounts to absorb the full load.
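The token-rotation idea can be sketched mechanically. This is a hypothetical helper (the function names are mine); the only real API detail assumed is GitHub's documented `X-RateLimit-Remaining` response header:

```python
def remaining_requests(headers):
    """Read GitHub's documented X-RateLimit-Remaining header from a
    REST or GraphQL response; a missing header counts as exhausted."""
    return int(headers.get("X-RateLimit-Remaining", "0"))


def should_rotate_token(headers, reserve=100):
    """Crude sketch of the several-bot-accounts idea: switch to the
    next bot token once the current one dips below a reserve of
    remaining requests."""
    return remaining_requests(headers) < reserve
```

A decision task holding a pool of tokens could check `should_rotate_token` after each response and fail over to the next account before hitting the hard limit.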
@hwine hwine Jul 27, 2020

I believe there are two alternatives to minimize the number of bot accounts to manage:

  • write the bot as a GitHub app <= much higher limits
  • explore a limit increase for a smaller number of bot accounts by opening a support ticket with GitHub

One potential bot avoidance (if suitable for this case) would be to do a "very lightweight" fetch (or clone) to see if there are new commits-of-interest (aka pushes).

Contributor

Regarding the push object in the GraphQL API, it looks like it is only available via the checks associated with that push. I'm not sure how accessible that is directly from a commit or other repository object.

Contributor Author

Thanks for the input @hwine! I'm not too sure we could make it a GitHub app in this case. The code will run inside the decision task (or on someone's computer). I get the feeling a regular token is what we need. What do you think?

In any case, I like the idea of a limit increase! If we confirm the GitHub app isn't suitable, then I'll be happy to write this down in the PR.


Good call on the push object, @tomprince! I hadn't noticed it was tied to the GitHub checks. Good news: I found a way to get the required data. I'm dumping it here so I can remember what I did next time.

```shell
curl -X POST -H "Authorization: bearer $GITHUB_TOKEN" -H 'Accept: application/vnd.github.antiope-preview+json' 'https://api.github.com/graphql' --data '{"query": "query { repository(name: \"fenix\", owner: \"mozilla-mobile\") { object(oid: \"aaae70f3dc9db8e58b70a4220bc84c513a64750f\") { ... on Commit { oid, checkSuites(first:1) { nodes { push { previousSha } } } } } } }" }'
```

returns

```json
{
    "data": {
        "repository": {
            "object": {
                "oid": "aaae70f3dc9db8e58b70a4220bc84c513a64750f",
                "checkSuites": {
                    "nodes": [{
                        "push": {
                            "previousSha": "0e3acfcd9467060c9ae056a715c777516c7d12e8"
                        }
                    }]
                }
            }
        }
    }
}
```

This means we're able to get the range of a push if we know its head commit. Proof that the returned values are correct: https://treeherder.mozilla.org/#/jobs?repo=fenix&revision=aaae70f3dc9db8e58b70a4220bc84c513a64750f

Then, we can run git locally based on this range:

```shell
git log --name-only --format='' 0e3acfcd9467060c9ae056a715c777516c7d12e8..aaae70f3dc9db8e58b70a4220bc84c513a64750f
```

outputs

```
buildSrc/src/main/java/AndroidComponents.kt
app/src/main/java/org/mozilla/fenix/browser/BaseBrowserFragment.kt
app/src/main/java/org/mozilla/fenix/browser/BaseBrowserFragment.kt
```

⚠️ The push object is still unstable. GitHub may change this API without notice.

@tomprince tomprince left a comment

I think we probably want to ensure that this optimization does not happen on release branches, at least for shippable builds.



## Local runs

Callek pointed out something really important when we chatted: people must be able to run taskgraph while offline. So any calls made to hg.mozilla.org must be translated into hg commands that run locally. I know `hg wip` provides something close to `json-pushes`, in the sense that you know what your current head was based off of. So we can get the last public changeset, and taskgraph will point optimized-away builds to this repo/revision. I just need to figure out a more machine-friendly command. If any of you know what this hg command is, I'll be happy to hear from you :)
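For what it's worth, one machine-friendly candidate (an assumption on my part, not a confirmed replacement for `hg wip`) is an `hg log` revset asking for the newest public ancestor of the working parent, printed as a bare node hash:

```python
def last_public_ancestor_cmd(repo_path="."):
    """Candidate offline hg command: the most recent public changeset
    the current working parent is based on. last(), public() and
    ancestors() are standard Mercurial revset functions."""
    revset = "last(public() and ancestors(.))"
    return ["hg", "log", "--repository", repo_path, "-r", revset, "-T", "{node}"]


# Example usage (requires a local hg checkout):
# import subprocess
# node = subprocess.run(last_public_ancestor_cmd(), capture_output=True,
#                       text=True, check=True).stdout.strip()
```

The `-T "{node}"` template keeps the output to a single hash, which is easy for taskgraph to consume.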
Contributor

Depending on where the code gets added, this might not be necessary: `mach taskgraph optimized` already needs network access, so if the code runs as part of that, I don't know that we need to worry about being able to run it offline.

Contributor

I should note my discussion was offline for `full` and `target`, but only "able to run locally" for `optimized` and `morphed` (i.e. network is ok for them).

Contributor Author

Okay! For the record, `optimized` may require a GitHub token, in addition to network access. Does this sound good to you all?

Contributor

I'm less eager to require a token for local use, but we can probably fudge it a bit if necessary. Besides, we're not likely to rely on GitHub for local gecko decision task stuff to the extent we need to for actual pushes.

We can also probably have decent fallbacks for local use that, while not exactly accurate, are "close enough".

f. We are confident the previous builds are good.


The decision task runs. Everything goes as usual up until the optimization phase. During this phase, taskgraph gets the push metadata from the `json-pushes` endpoint[2], including the list of changed files. If a source file is present in that list, the build tasks in the graph remain unchanged. Otherwise, it gets the hash of the parent of the last changeset (no need to make a new request; it's available in the previous response). Then taskgraph queries the Taskcluster index for the corresponding builds. The indexes will contain (if they don't already) the revision, so that's how taskgraph will easily look them up. Then taskgraph optimizes away the build tasks, just like we do with Docker images. Finally, it injects some `insert-indexes` tasks to point this revision to the builds it uses. This will be important for the next case.
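The decision described above can be sketched as a small pure function. This is only a sketch: the source-file patterns are invented for illustration, the `version=2&full=1` json-pushes response shape is assumed, and real taskgraph optimization strategies are structured differently:

```python
import fnmatch

# Hypothetical patterns describing files whose changes require a rebuild.
SOURCE_PATTERNS = ["app/src/*", "buildSrc/*", "*.gradle"]


def can_reuse_previous_builds(push, source_patterns=SOURCE_PATTERNS):
    """Given one json-pushes entry (version=2, full=1 shape), decide
    whether build tasks can be optimized away. Returns (reuse, parent_rev):
    reuse is False as soon as any changed file matches a source pattern;
    parent_rev is the revision whose indexed builds would be reused.
    It is the first parent of the earliest changeset in the push, so it
    is already present in the response -- no extra request needed."""
    changesets = push["changesets"]
    changed = {f for cset in changesets for f in cset["files"]}
    parent_rev = changesets[0]["parents"][0]
    reuse = not any(
        fnmatch.fnmatch(f, pat) for f in changed for pat in source_patterns
    )
    return reuse, parent_rev
```

When `reuse` is True, the graph's build tasks would be replaced by index lookups against `parent_rev`, mirroring the Docker-image optimization.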
Contributor

Another alternative would be to use hg locally to find the appropriate revision to get builds from, and then query hg.m.o to find the corresponding push. That would also take care of the next case.

Contributor Author

I think if we did so, then a future decision task would be able to reuse back-filled builds (see #32 (comment) for the use-case). But maybe I'm missing something. Could you expand on your proposal?

Contributor

My idea was to use hg locally to find the commit to get builds from, and then turn that into a push using hg.m.o. I was thinking that might be easier than walking json-pushes an arbitrary amount. That said, if builds are run on a later push, we'd always ignore them, so it may not be a good idea.

Thinking out loud, I wonder if json-automationrelevance could be enhanced to provide the information we want? Putting it on the server does make it harder to change or experiment with the logic, but it does move the data closer to the logic. I wonder if there is a middle path to offload the work to the server, while keeping the logic in-tree. I wonder if @cgsheeh has any thoughts here.

@escapewindow escapewindow Aug 3, 2020

hg.m.o would work for gecko, but not for non-gecko. Are we looking for a gecko-only solution or a taskgraph-wide solution? (Edit: ah, I see the git thread above.)

An aside about indexes: are we going to expand them and use the taskIds, or bake the index into the task definition? The former is more explicit and less vulnerable to manipulation at runtime.

Contributor Author

My idea was to use hg locally to find the commit to get builds from, and then turn that into a push using hg.m.o. I was thinking that might be easier that walking json-pushes an arbitrary amount.

My apologies, I don't think my explanations were clear on this. We don't walk json-pushes; everything is known thanks to a single request to json-pushes. This request tells taskgraph what files were changed in the current push and what the revision before this push was. That revision points (thanks to TC indexes) to builds, no matter whether the decision task actually made them or reused previous ones. This is how we avoid walking json-pushes.

Does it make sense to you @tomprince? If so, do you think we should expand json-automationrelevance? If not, I'm happy to chat on a call to clarify our respective thoughts.

An aside about indexes: are we going to expand them and use the taskIds, or bake the index into the task definition? The former is more explicit and less vulnerable to manipulation at runtime.

That is a good question! I believe the decision task should do the former, just like it does with in-tree docker images.


### Any other edge cases?

I currently don't see any edge case that isn't covered in the hg + hg.mozilla.org setup. Feel free to tell me if you see something missing :)
Contributor

One case that I don't see discussed is when the decision task is successful, but the builds being pointed at failed.

Contributor

This is a known case (and should be captured here) -- in theory, if the decision task was green but the build failed, it is one of three things: an intermittent failure, a failure that would also happen on this current test-only push, or a need for a new build on this cset anyway (whether or not we recognized that, due to a bug in our logic).

This is partly why sheriffs want some extra treeherder UX to identify "depends on previous push" type stuff.

Contributor Author

I'll add it explicitly. In my opinion, the current decision task shouldn't wait for previous builds; otherwise we may end up with decision tasks that run for too long. So I think the current decision task should reuse previous builds no matter their state. Then, if something goes wrong, sheriffs will be able to backfill builds on top of the current decision task. These builds will overwrite the indexes tied to this revision, so any future decision task will backtrack to the current revision instead of the previous one.

How does that sound to you all?

Contributor

Yeah, I don't think we should wait on the tasks to complete. It may be worth seeing if they are complete, and doing something different if they have already failed, though. That does not need to be handled in the initial implementation.

Contributor Author

Good idea! I agree we can do something if the builds have already failed!


### The previous decision task is still running

Condition d. is changed. We may end up in situations where the current decision task starts while the previous one is still running. That can easily happen on try or on autoland. We cannot use the index of the previous decision task because it's yet to be populated. This means, for any missing previous build, we need to make sure the previous decision task completed in the first place, and if it didn't, we need to make sure it's not still running. We could use Treeherder to know this piece of data, but that ties taskgraph to another service. We could also change `.taskcluster.yml` to spin up a dummy task alongside the decision task to "eager index" it. I kind of like the latter, but I wonder if Chain of Trust may be broken because of it.
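Whichever task ends up populating the index, the consuming side reduces to a bounded poll. A sketch with the index lookup injected (it could wrap a Taskcluster index query) rather than tied to a concrete client; all names here are hypothetical:

```python
import time


def wait_for_index(lookup, route, timeout=600, interval=30,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll an index route until the previous decision task (or its
    'eager index' companion task) has populated it. `lookup` takes a
    route and returns a taskId, or None while the route is empty.
    Raises TimeoutError if the route never appears within `timeout`
    seconds, so a stuck previous push can't block us forever."""
    deadline = clock() + timeout
    while clock() < deadline:
        task_id = lookup(route)
        if task_id is not None:
            return task_id
        sleep(interval)
    raise TimeoutError(f"index {route} never appeared")
```

Injecting `clock` and `sleep` keeps the loop testable; in production the defaults apply and `lookup` would do the actual network call.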
Contributor

This could potentially be something that is handled by the build-decision task directly.

We definitely should not use treeherder for this.

Contributor Author

I agree for Treeherder.

Just to make sure I understand your first sentence: do you suggest that the dummy task should be spun up by the decision task?

Contributor

I had been suggesting it be created by build-decision (alternatively, it could create the indexes directly, rather than requiring a separate task). That said, we already wait for the on-push decision task in cron, and it may be reasonable to do the same here. If we are waiting, I'm not sure there is much difference between waiting on a task and waiting for an index to appear.

Contributor Author

I'm sorry, I'm not sure I follow your proposal. Let me try to rephrase it: should we change `.taskcluster.yml` so it does not schedule a regular decision task on push, but instead schedules a build-decision task, so we can wait for the previous decision task to finish and spin up the real decision task at the right time?

I don't have an opinion at the moment; I'm just making sure I grasp this right.


# Open Questions

* Will Chain of Trust break if `.taskcluster.yml` spins up 2 tasks instead of a single one?
Contributor

As I read it, we compare the parent task (decision/action/cron) to every task generated from `.taskcluster.yml`, and fail if it doesn't match any. We should verify.

Contributor Author

Sounds good! I'll create a small test case to be 100% sure, then!

Contributor Author

I confirm it works! I created this "eager index" task on my Fenix fork (it already has everything set up so it was easier for me to use it): JohanLorenzo/fenix@8807e3a. A (level-1) signing task ran and successfully passed: https://firefox-ci-tc.services.mozilla.com/tasks/C0uKmk5GTA2EnRW87IJDcQ/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FC0uKmk5GTA2EnRW87IJDcQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Fchain_of_trust.log#L158.



### Builds of the previous push aren't completed yet

Condition b. is changed. A new push comes in. By the time the decision task runs, builds are not done yet (they may have even failed). In the current way we use indexes, we have no way to tell that a given route is about to be populated by another task. That's something Callek calls "eager indexes", and he filed bug 1653058 for it. Thus, this current bug depends on it. Once this is solved, taskgraph will just have to listen to the right indexes to be sure to wait on upcoming builds.
Contributor

This looks like we're polling indexes at runtime, rather than expanding indexes at decision time? The former is more flexible but less secure. I'm wondering if we can:

a. insert an index pointing at taskID before taskID runs
b. create an ordered list of taskIDs, from those indexes, of builds to point at. The first would be the most recent task, but might not have completed yet; the last would be the most recent successfully completed task. If there are multiple unfinished tasks in between, then we have those in our list as well.
c. add logic to use the build from the most recent successful task in that list.
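Step c. above is a small pure function once the ordered taskId list from step b. exists. A sketch with hypothetical names; `get_status` would wrap a Taskcluster queue status call:

```python
def pick_build(task_ids, get_status):
    """Given an ordered list of taskIds (most recent first), expanded
    from eager indexes at decision time, return the most recent task
    whose build already completed successfully. `get_status` takes a
    taskId and returns its state, e.g. 'completed', 'failed',
    'running', or None for an unknown task."""
    for task_id in task_ids:
        if get_status(task_id) == "completed":
            return task_id
    # No finished build yet: the caller decides whether to wait or
    # schedule a fresh build.
    return None
```

Because the list is ordered, skipping unfinished or failed entries naturally falls back to the most recent successfully completed build.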

Contributor

Then again, perhaps we don't care as much for tests. We could go with the index solution, but prefer not to use it for any tasks that create artifacts we plan to ship.

Contributor

My understanding is the intent is to resolve the indexes in the decision task still. They'll just not be complete yet, which is a change from previously.

Contributor Author

I agree with Tom. I'd like indexes to be expanded at decision time.
