-
Notifications
You must be signed in to change notification settings - Fork 10
Release for Mozilla: Shared Trust Domain & Workers #33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
d9858ff to
c1c9b7e
Compare
Co-authored-by: Hal Wine <[email protected]>
…rfcs into generic-worker-pools
rfcs/0033-shared-trust-domain.md
Outdated
| * A new Trust Domain (`mozilla`) that is not tied to a specific project or product | ||
| * New Workers for builds on Linux, macOS 11.0, and Windows Server 2012 | ||
| * These will be created under a new `mozilla-1` provisioner | ||
| * New Workers for tests on Linux (through developer provided Docker images), macOS 11.0, and Windows 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about these separated build & test pools a bit more, specifically in the context of the lack of separation of these things proposed in #36. Specifically, I'm wondering if we should do away with that distinction here and just share general purpose pools. Something like:
Pool ID | Purpose
====================================================================================
mozilla-1/linux | Linux jobs
mozilla-1/linux-highcpu | Linux jobs requiring more CPU resources
mozilla-1/win2012 | Windows Server 2012 jobs
mozilla-1/win2012-highcpu | Windows Server 2012 jobs requiring more CPU resources
mozilla-1/win10 | Windows 10 jobs
mozilla-1/macos-bigsur | macOS Big Sur jobs
I can see why we keep them separate for Gecko (it lets us share test machines for level 1 & level 3 jobs), but I can't think of a reason why we need to do it here. @escapewindow - do you have any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to push a change with the proposed pools; this comment may disappear when I do so, but we can re-open or talk in a new thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why we keep them separate for Gecko (it lets us share test machines for level 1 & level 3 jobs), but I can't think of a reason why we need to do it here. @escapewindow - do you have any thoughts on this?
Another reason for the separate pools is so we can trust that builders haven't been corrupted by a rogue test run. At least in the case of "release builds" (whatever that means in this context), it would be good to have a high level of trust in the builder.
I know there are other ways to achieve that goal than a separate pool, but if there's a spot we can capture that requirement, that would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can "rogue builds" corrupt in the same way that "rogue tests" can? Or is it about minimizing the things that can influencing builds?
Either way, separating these out is not terribly difficult, so I don't strong feelings here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! we do care about rogue builds in the "accept outside PRs" world. If there is a concept of a "release build", it should include purging the cache of non-release-build artifacts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, keeping them compute pools rather than build/test pools may make sense for now, as long as we're limited to level 1 (in level 3, we need to guard against tests poisoning release build caches and workers). If we need separation in the future, we can add b- and t- prefixes as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it sounds like we're all OK with shared pools for level-1, but (almost?) certainly not for level-3 -- which are not part of this proposal.
5f33541 to
5c21de6
Compare
|
I think we're getting close to consensus here. I think the two things left are:
If there's anything else I'm missing, please let me know. |
@ajvb -- I'll defer to you on the |
|
@bhearsum In regards to the My (little) familiarity with indexes in Taskcluster leads me to believe there isn't much of security concern unless we are taking about one project polluting another project, which isn't really related to the index format. |
|
@bhearsum Ok, Hal gave me some more context about this over slack. As to "indexes being used to create segmentation" from a security context, I don't think I know enough to answer whether that's good enough. I'd imagine it'd be scopes + the |
The |
Oh I may have misunderstood then. BTW, you should capture the |
I think so. I think this will have multiple parts:
As mentioned elsewhere, I think we could have a similar process to the level 1 commit privileges agreement that allows us to know what github usernames should be enabled. |
This is not something I particularly want to include in an index, as they are meant to human readable. Is it a requirement? |
It's a race condition -- the "full name" (org/repo) is only good until the name changes, then it may be impossible to relink. So "requirement" depends on your use cases. |
|
Putting some answers inline to make sure we're generally aligned before updating the RFC.
Explicitly not migrating Gecko. There's no reason to, so it would just be scope creep IMO. We should do it at some point, when it becomes important.
Same as Gecko - not migrating until there's a need to.
I'm a little unsure how to respond to this part, so let me know if I'm missing something... We're talking about adding one new index format, not N, so I don't see this as terribly complex. Any tool that needs to generally look for things will need to:
Even if we were migrating everything to v3 indexes, we'd still need support for both to do a graceful transition. I imagine we'll eventually move everything over to v3 and drop v2 support - I just don't think it makes sense to include it in the scope of this.
I agree that we should do whatever we currently do for L1 here. For some reason I thought that we automatically granted L1 to new engineers, but I could be wrong about that. I'll look into it. |
IMO this is a caveat we can live with. If a repo moves or renames, I think this sort of "bustage" is to be expected. The benefits of having a user-readable index (which is the point of the index in the first place) outweigh the benefits of stability in a rename/move. (If you disagree with this, please say so!) |
Ah, I thought we were going to spin the specific v3 index format out into its own RFC. I have concerns with the v3 index as proposed in this comment.
I agreed with you here, Ben, in that I wanted human-readable indexes, and then I realized we could make it less human-unfriendly with tooling and a map. The index format is worth further discussion, since every version adds complexity we have to maintain or migrate to. Do we want to continue discussing that here, or resolve this point with "we'll decide what that index looks like in FUTURE-RFC"? |
I think it's pretty clear that a separate RFC makes sense for this. It's a big enough thing on its (more than I expected), and has a separate set of stakeholders. Do you want to block resolution of this on that being resolved? I could go either way. |
I think I'd be happy here if we say, "we need a new index format, schema and specifics TBD, but our requirements here are that we don't have index collisions between repositories of the same name but different orgs" or something. It sounds like we may need to open the v3 index discussion to allow for standalone taskgraph and gecko taskgraph to live together, and other consumers may have use cases that help us decide what that thing looks like. Hopefully we can get this hashed out soonish, but we may want to take enough time to get a v3 index format that will stick for a while, and not have to resort to a v3.1 or v4 soon after. I'm hoping it's clear that I'm raising concerns that will help improve the project, and I'm not trying to throw arbitrary obstacles in your way. I'd love for this to be a simple problem. Thank you for tackling this! |
OK, let's not block moving this along in that case, and I will make sure that I push along index discussions elsewhere in short order.
You're just pointing out obstacles that are already there, it's super helpful! |
|
Some thoughts I had last night on indexes: Aiui, mshal spent multiple months and/or quarters refining and implementing the v2 index format for Gecko. This was a complex problem. We may be able to bug spelunk to determine his requirements and solutions, though we may have shoehorned additional use cases on top of Gecko v2 index since then. However, we then also implemented another v2 index format for standalone taskgraph, and I suspect that the two v2 index formats are not fully compatible, which can cause issues when we use standalone taskgraph in Gecko. We may want to define the requirements here: finding decision tasks, cached tasks, toolchain + docker-image formats, treeherder, etc. as well as existing indexes for current standalone taskgraph projects. Plus, with the new Release for Mozilla RFCs, we want to avoid index collisions between projects of the same name but different organizations. I suspect if we gather all of the above, we'll have a good idea of the requirements for a consolidated v3 index format. If we look at the audience involved in the v2 Gecko index format, we'll have part of the answer of "who is affected by an index format change", although we'll need to answer that question for standalone taskgraph projects as well. I suspect we'll want that shared and standardized v3 index format to solve all the above use cases. We probably want to send this proposal out to a wider audience, since this will affect downstream consumers. However, I understand that while a consolidated v3 index format that solves all our combined use cases would be preferable, it's also a lot of work. I could also see proposing adding a v3 index format specifically for the |
|
Thanks for all of this, it's great to have some of the historical perspective here.
Regardless of what we do here, looking into what we did for the last change to the index format seems like a good idea if we're proposing a new one :).
IMO, the only viable path forward in the medium-term is a new that's only used by either the There's also another option that we haven't given serious consideration, which is: don't change the index format at all. The only driver for it is avoiding collisions, and when I step back and look at it, the likelyhood of two repos having the exact same name is not terribly likely, so it may be worth thinking a bit more about the cost/benefit, and whether or not this is something we must deal with now. |
+1 :)
Sure, I mentioned that's a possible stopgap v3, and then @ahal can do the v4 index compatibility work as a blocker to using standalone taskgraph in Gecko. Unless the standalone and gecko indexes are already compatible, which I think we're all hoping for.
This is interesting, and I gave it some thought. We're helped here by the fact that the current proposal only targets level 1, and we've historically been ok with, say, Try tasks being capable of polluting other Try tasks, as long as Release tasks are secure. Some thoughts here: First, this would be easier to accept if, say, we knew that there were a Second, I'm not sure Third, I'm not sure how uncommon name collisions will be. Maybe we don't care from a security perspective if [EDIT] Automation may make this worse: if we go find the decision task via the |
|
I'll note that this collision problem predates this RFC: When enabling the staging repos for fenix, android-components, xpi-manifest, and adhoc-signing, I intentionally added a However, the [EDIT] I suppose we could put in a check that verifies the repo names in the auto-projects module are unique, and fails out, and we prevent an autoland with a failed check... could work? I'm not sure that's a good user experience, especially having to rename your fork, but that might be a short- or intermediate-term solution. [Edit 2] I'm not sure what happens here on repo rename. (Sorry for thinking out loud in Github comments, rubber ducky) |
|
Another thought, sorry: we could limit the [Edit, because of course] Then I thought, maybe a |
|
For the collision problem couldn't we make sure the org is part of the
index? GitHub projects are always identified by `<org>/<repo>` everywhere
that I've seen.
…On Thu., May 13, 2021, 8:51 p.m. Aki Sasaki, ***@***.***> wrote:
Another thought, sorry: we could limit the mozilla trust domain to a
single org, no user repos, no 2nd org, which prevents name collisions. This
reduces its functionality in that it's useful to enable TC on your fork to
test out automation changes before creating a PR, but this could be a
solution.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#33 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACC2RXMQTE232QROMNHOP3TNRXYFANCNFSM4236M5JQ>
.
|
Yes :) This echoes what I said in April :) |
|
|
||
| ## `v3` Taskcluster index format | ||
|
|
||
| The current `v2` index format only includes repository names as an identifier (not user or organization). This is generally not an issue for any of current trust domains (because they generally only support one project), but in this new pool where we have N projects, it introduces the potential for collisions or pollution between them. To ensure this isn't an issue we will introduce a new `v3` index format that includes the repository location in its path as well - including both domain and path. Examples include: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mobile repos define a trust_project in their project.yml:
https://hg.mozilla.org/ci/ci-configuration/file/73a8d7921adb15a8ca3b726225655fc9114524a7/projects.yml#l485
This is then used in the index route. I think we could simply define this as org/repo for any projects that use the mozilla trust domain instead of a creating a new format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds fine to me - good idea!
| * `index.gecko.v3.hg.mozilla.org.releases.mozilla-beta.latest.firefox.decision` | ||
|
|
||
| This will require changes to a few things to support the new format: | ||
| * build-decision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What build-decision changes were you anticipating? I'd guess any projects using this will be on Github.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I...I have no idea at this point. Maybe I was thinking that some new projects might not be on Git. We can probably remove/ignore this.
| * New Workers for tests on Linux, macOS 10.15, and Windows 10 | ||
| * A RelEng maintained Docker image will be provided for Linux | ||
| * These will be created under a new `mozilla-t` provisioner | ||
| * New Scriptworker instances for signing and mac-signing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit on the fence about this. On the one hand, I realize that the intent here is to make it easier for us to stand things up, especially for things like mac signers. On the other, maybe when a project reaches the point of needing scriptworkers, then that is a signal that it is time for them to graduate to their own trust domain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we're only supporting L1 scriptworkers on this pool, then is it really saving us any work? We'll still need to wrangle mac signers for the L3 pool anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this loses a fair amount of value if we can't have signing as part of this. We'd have to do one of the following pretty quickly:
- Manually sign (avoiding this was part of the movitation behind this idea)
- Move projects very quickly from the shared pool to their own exclusive one
- Have automation to sign outside of scriptworkers (eg: a task that signs through autograph-edge). I don't even know if SecEng would allow this though :).
A potentially crazier idea could be to allow most tasks to run in the shared pool, but somehow put signing in its own pool (which means tasks in the same graph would be have two trust domains...which would probably means CoT changes). This would at least mean that signing would be purely adding to existing things, rather than getting blocked on a migration to a different pool.
| * Workers will be prefixed with `mozilla-t-` | ||
| * mac-signing will run 10.14, like our other mac-signing workers (there's no known reason to upgrade) | ||
|
|
||
| Notably, we are only concerned with level 1 workers at this time, which means we can ignore things like scriptworkers that are only used when shipping. Level 3 workers will be dealt with at a later stage, and most likely will not use a shared trust domain or workers across projects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a sort of follow-on to my previous comment, what if instead of disallowing L3 on the shared pools, we instead disallow scriptworkers? Then once the project is a bit more mature, we would graduate them to their own trust domain and add the scriptworker tasks.
This way they could get started with Taskcluster right away, and have a proper PR == L1 and push == L3 setup. Security wise, the setup wouldn't be ideal.. but it would still be better than GH Actions, or doing signing on personal dev machines.
It kind of reminds me of the autonomous driving quote. It doesn't have to be perfect, it just has to be better than manual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a sort of follow-on to my previous comment, what if instead of disallowing L3 on the shared pools, we instead disallow scriptworkers?
I think this is a great idea, independent of what we do with any other part of this! I can't think of any downsides to separating PRs and pushes.
|
@ahal - is this RFC still relevant given your recent work on this? |
|
Hmm, yeah I guess I should have merged this first. There might still be some open issues around non-level 1 / scriptworker pools.. but yeah, it's mostly implemented already. |
rendered