48 changes: 48 additions & 0 deletions rfcs/0033-shared-trust-domain.md
@@ -0,0 +1,48 @@
# RFC 33 - Release for Mozilla: Shared Trust Domain & Workers
* Comments: [#33](https://github.com/mozilla-releng/releng-rfcs/pull/33)
* Proposed by: @bhearsum

# Summary

Build and maintain a shared Trust Domain, Workers, and Scriptworkers on the Firefox CI cluster that any Mozilla project can use. (Browser products will remain in their existing, separate trust domain.)

# Motivation

One of the barriers to entry for using Taskcluster is waiting on RelEng to create and deploy a new Trust Domain and Worker for a new project. Even when this takes less than a day (and it often takes longer), the project is still blocked on RelEng, which makes getting started slower than with CircleCI or GitHub Actions.

# Details

We will create a new trust domain and workers that are generally available for Mozilla employees and trusted volunteers to use. Specifically:

* A new Trust Domain (`mozilla`) that is not tied to a specific project or product
* New Workers for builds on Linux, macOS 11.0, and Windows Server 2012
  * These will be created under a new `mozilla-1` provisioner
* New Workers for tests on Linux (through developer provided Docker images), macOS 11.0, and Windows 10

Contributor Author

I was thinking about these separated build & test pools a bit more, particularly given that #36 proposes no such separation. I'm wondering if we should do away with that distinction here and just share general-purpose pools. Something like:

| Pool ID                   | Purpose                                              |
| ------------------------- | ---------------------------------------------------- |
| mozilla-1/linux           | Linux jobs                                           |
| mozilla-1/linux-highcpu   | Linux jobs requiring more CPU resources              |
| mozilla-1/win2012         | Windows Server 2012 jobs                             |
| mozilla-1/win2012-highcpu | Windows Server 2012 jobs requiring more CPU resources |
| mozilla-1/win10           | Windows 10 jobs                                      |
| mozilla-1/macos-bigsur    | macOS Big Sur jobs                                   |

I can see why we keep them separate for Gecko (it lets us share test machines for level 1 & level 3 jobs), but I can't think of a reason why we need to do it here. @escapewindow - do you have any thoughts on this?

Contributor Author

I'm going to push a change with the proposed pools; this comment may disappear when I do so, but we can re-open or talk in a new thread.

> I can see why we keep them separate for Gecko (it lets us share test machines for level 1 & level 3 jobs), but I can't think of a reason why we need to do it here. @escapewindow - do you have any thoughts on this?

Another reason for the separate pools is so we can trust that builders haven't been corrupted by a rogue test run. At least in the case of "release builds" (whatever that means in this context), it would be good to have a high level of trust in the builder.

I know there are other ways to achieve that goal than a separate pool, but if there's a spot we can capture that requirement, that would be good.

Contributor Author

Can "rogue builds" corrupt in the same way that "rogue tests" can? Or is it about minimizing the things that can influence builds?

Either way, separating these out is not terribly difficult, so I don't have strong feelings here.

Good catch! We do care about rogue builds in the "accept outside PRs" world. If there is a concept of a "release build", it should include purging the cache of non-release-build artifacts.

Contributor

Hm, keeping them as compute pools rather than build/test pools may make sense for now, as long as we're limited to level 1 (in level 3, we need to guard against tests poisoning release build caches and workers). If we need separation in the future, we can add b- and t- prefixes as needed.

Contributor Author

So it sounds like we're all OK with shared pools for level-1, but (almost?) certainly not for level-3 -- which are not part of this proposal.

  * These will be created under a new `mozilla-t` provisioner
* New Scriptworker instances for signing and mac-signing
  * These will be created under the existing `scriptworker-k8s` and `scriptworker-prov-v1` provisioners
  * Workers will be prefixed with `mozilla-1-`
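
As a rough illustration of the layout listed above, the pools could be written down as data along the following lines. This is only a sketch: the worker type names (`b-linux`, `t-win10`, and so on) are hypothetical and not decided by this RFC, and which signing worker lands under which scriptworker provisioner is an assumption made for the example. Only the provisioner names and the `mozilla-1-` prefix come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPool:
    provisioner: str
    worker_type: str  # hypothetical names, for illustration only
    purpose: str

    @property
    def pool_id(self) -> str:
        # Pool IDs take the form "<provisioner>/<worker type>".
        return f"{self.provisioner}/{self.worker_type}"


POOLS = [
    # Build workers under the new `mozilla-1` provisioner.
    WorkerPool("mozilla-1", "b-linux", "Linux builds"),
    WorkerPool("mozilla-1", "b-macos-11", "macOS 11.0 builds"),
    WorkerPool("mozilla-1", "b-win2012", "Windows Server 2012 builds"),
    # Test workers under the new `mozilla-t` provisioner.
    WorkerPool("mozilla-t", "t-linux-docker", "Linux tests (developer-provided Docker images)"),
    WorkerPool("mozilla-t", "t-macos-11", "macOS 11.0 tests"),
    WorkerPool("mozilla-t", "t-win10", "Windows 10 tests"),
    # Scriptworkers under the existing provisioners, prefixed with `mozilla-1-`.
    # The pairing of worker to provisioner below is an assumption.
    WorkerPool("scriptworker-k8s", "mozilla-1-signing", "signing"),
    WorkerPool("scriptworker-prov-v1", "mozilla-1-mac-signing", "mac-signing"),
]


def pools_for(provisioner: str) -> list[str]:
    """Return the pool IDs defined under one provisioner, in declaration order."""
    return [p.pool_id for p in POOLS if p.provisioner == provisioner]
```

For example, `pools_for("mozilla-t")` lists the three test pools. Adding a platform later would mean appending one more entry rather than creating a new trust domain.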

Notably, we are only concerned with level 1 workers at this time, which means we can ignore things like scriptworkers that are only used when shipping. Level 3 workers will be dealt with at a later stage.

Access to create and manage tasks on these new workers will be granted to anyone with `scm_level_1`.
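
To make the access rule a little more concrete, here is a simplified sketch of how a scope check for these pools might behave. It assumes that `scm_level_1` membership expands to wildcard scopes like the ones in `LEVEL_1_SCOPES`; those scope strings, the `low` priority, and the helper names are assumptions for illustration, not the actual Firefox CI configuration. The one rule relied on is that a Taskcluster scope ending in `*` satisfies any scope that starts with the same prefix.

```python
def scope_satisfies(held: str, required: str) -> bool:
    """A held scope ending in '*' matches any required scope with that prefix."""
    if held.endswith("*"):
        return required.startswith(held[:-1])
    return held == required


def can_create_task(held_scopes: list[str], pool_id: str, priority: str = "low") -> bool:
    """Check only the create-task scope for a pool (real task creation needs more)."""
    required = f"queue:create-task:{priority}:{pool_id}"
    return any(scope_satisfies(held, required) for held in held_scopes)


# Hypothetical scopes granted to everyone with `scm_level_1`.
LEVEL_1_SCOPES = [
    "queue:create-task:low:mozilla-1/*",
    "queue:create-task:low:mozilla-t/*",
]
```

Under these assumptions, `can_create_task(LEVEL_1_SCOPES, "mozilla-1/b-linux")` is true, while the same check against a pool outside the shared trust domain is false.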

Going forward, we will ensure workers for other supported build or target platforms are added to this pool. (For example, when we add support for scheduling iOS tests in Taskcluster, that will be made available in the `mozilla-t` provisioner as well.)

# Open Questions

* Are we happy with the new trust domain name & provisioners for the workers?
* Where do we get the macOS hardware for the build, test, and signing pools?
  * New or pull from existing pools?
* How many machines do we need in each hardware pool?
* Is macOS 11.0 the right version to use for build and test?
* Are there other test platforms or scriptworkers we should support?
* Is `scm_level_1` the right group to use, or do we need a new one for this purpose?

# Implementation

<once the RFC is decided, these links will provide readers a way to track the
implementation through to completion>

* <link to tracker bug, issue, etc.>
* <...>