Skip to content

Commit 368a9e0

Browse files
committed
add more lessons learned
1 parent a59d59c commit 368a9e0

File tree

1 file changed

+38
-28
lines changed

1 file changed

+38
-28
lines changed

src/release/rollups.md

Lines changed: 38 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -2,16 +2,15 @@
22

33
## Background
44

5-
The Rust project has a policy that every pull request must be tested after merge
6-
before it can be pushed to master. As PR volume increases, this can scale poorly,
7-
especially given the long (~3.5hr) current CI duration for Rust.
5+
The Rust project has a policy that every pull request must be tested before it can be merged to master.
6+
As PR volume increases, this can scale poorly, especially given the long (~3.5hr) current CI duration for Rust.
87

98
Enter rollups! Changes that are small, not performance sensitive, or not platform
109
dependent are marked with the `rollup` command to bors (`@bors r+ rollup` to
1110
approve a PR and mark it as a rollup, `@bors rollup` to mark a previously approved
1211
PR, `@bors rollup-` to un-mark as a rollup). 'Performing a Rollup' then means
13-
collecting these changes into one PR and merging them all at once. The rollup
14-
command accepts four values: `always`, `maybe`, `iffy`, and `never`. See [the
12+
collecting these independent changes into one bigger PR with the goal of merging them all at once.
13+
The rollup command accepts four values: `always`, `maybe`, `iffy`, and `never`. See [the
1514
Rollups section] of the review policies for guidance on what these different
1615
statuses mean.
1716

@@ -48,40 +47,40 @@ queue has been merged.
4847
4948
Tools and subtree syncs should use `p=5`, like rollups, so they interleave between rollups, as
5049
tools and subtree syncs are typically more tricky to fix than rebasing PRs.
50+
As the only exception, you shall chose a minimally higher priority (`x+1`) if a PR you include (such as a subtree sync) is already of `p=x`,
51+
with the purpose of ensuring that the rollup is tested earlier than subtree sync it includes.
5152
52-
3. If the rollup fails, use the logs rust-log-analyzer
53-
provides to bisect the failure to a specific PR and
54-
`@bors r-` the PR. If the PR is running, you need to do `@bors r- retry`. Otherwise,
55-
your rollup succeeded. If it did, proceed to the next rollup (every now and
56-
then let `rollup=never/iffy` and toolstate PRs progress).
53+
3. If the rollup fails, use the rust-log-analyzer or ci logs
54+
to pinpoint the failure to a specific PR and unapprove (`@bors r-`) the PR.
55+
If the PR is running, you need to do `@bors r- retry`. Otherwise,
56+
your rollup succeeded, great! If it did, proceed to the next rollup (every other time, let the `rollup=never/iffy` and toolstate PRs progress).
5757
4. Recreate the rollup without the offending PR starting again from **1.**.
5858
There's a link in the rollup PR's body to automatically prefill the rollup UI
5959
with the existing PRs (minus any PRs that have been `r-`d)
6060
Try avoiding adding any additional PR to the current "batch" as this
61-
unnecessarily increases the chance of test failures.
62-
Rollups should not overlap, if a PR is already contained in a rollup that is not closed,
63-
it should not be added to another different rollup at the same time.
61+
unnecessarily increases the chance of test failures: select a batch of hopefully conflict-free PRs, weed out the failures, merge, repeat!
62+
Rollups should not overlap, if a PR is already contained in a rollup that waiting in the queue, it should not be added to another different rollup at the same time.
6463
If a rollup fails and you are not sure which PR caused the problem,
6564
you may bisect the rollup and split it up into two rollups until the offending PR becomes clear.
6665
6766
## Selecting Pull Requests
6867
6968
The queue is sorted by rollup status. In general:
7069
71-
- A good rollup contains a bunch of `maybe` (unmarked) PRs, and a large pile of `always` PRs. You
72-
can include one or two `iffy` PRs to amortize the cost of testing full CI.
70+
- A good rollup contains a bunch of `maybe` (unmarked) PRs, and a large pile of `always` PRs.
71+
You can include one or two `iffy` PRs to amortize the cost of testing full CI.
7372
- If you have a lot of time, you can also make a rollup just from `iffy` PRs (if there are enough of
7473
them) and weed out the failures one by one.
7574
- A rollup must **never** include `rollup=never` PRs.
7675
7776
The actual absolute size of the rollup can depend based on experience and current length of the queue.
7877
People new to making rollups might start with including 1 `iffy`, 2 `maybe`s, and 4 `always`s. Usually 6-8 PRs per rollup is a good compromise.
79-
There is rarely a need to roll up more than 10 PRs at once (unless there are >30 PRs waiting the queue), keep in mind that we also try to minimize regressions per merge.
78+
There is rarely a need to roll up more than 15 PRs at once (unless there are >40 PRs waiting the queue), keep in mind that we also try to minimize regressions per merge.
8079
8180
Limit the size of rollups, *even if* the queue is backed up. Large rollups run the risk of failing
8281
or merge conflicts, and smaller rollups keep potential regressions [bisectable]. On average, rollups
83-
are 7 PRs large, often varying from 5 to 10 depending on the number of `rollup=always` PRs that can
84-
be included.
82+
are 7 PRs large, often varying from 5 to 12 depending on the number of `rollup=always` PRs that can
83+
be included. Rollups of 15+ prs can be made for the sake of weeding out failures via try-jobs but should not be put into the merge queue as to not block other prs from being merged in the meantime
8584
8685
Don't hesitate to downgrade the rollup status of a PR! If your intuition tells you that a `rollup=always` PR has some chances for failures, mark it `rollup=maybe` or better `rollup=iffy`. A lot of the unmarked `maybe` PRs are categorized as such because the reviewer may not have considered rollupability, so it's always worth picking them with a critical eye. Similarly, if a PR causes your rollup to fail, it's worth considering changing its rollup status.
8786
@@ -90,14 +89,14 @@ Generally, PRs that touch CI configuration or the bootstrapping process are prob
9089
Avoid having too many PRs with large diffs or subtree changes in the same rollup. Self-contained submodule changes (such as `miri` updates) on the other hand may be fine to be rolled up with other unrelated PRs.
9190
Also avoid having PRs you suspect will have large perf impacts and mark them as `rollup=never`.
9291
93-
If an `iffy` PR keeps failing in a rollups, it should be marked `never` to prevent it from causing further problems inside unrelated rollups. This will also cause it to bump up in front of all `maybe`s in the queue and the author will get feedback quicker in case of subsequent failures.
92+
If an `iffy` PR keeps failing in a rollup, it should be marked `never` to prevent it from causing further problems inside unrelated rollups. This will also cause it to bump up in front of all `maybe`s in the queue and the author will get feedback quicker in case of subsequent failures.
9493
It should be noted which runner the PR failed on, to run this runner as a `try-job` job and make sure it succeeds there before another merge is attempted (example on syntax [here]).
9594
In general, if possible, try to test a failed PR via a handful of carefully selected try-jobs instead of having to run the full battery of all 60 runners on if it's likely a PR may fail again.
9695
97-
To not have `never` or `iffy` PRs stuck in the queue indefinitely, it is recommended to alternate
96+
To not have `never` or `iffy` PRs stuck in the queue indefinitely, we should alternate
9897
between rollup and non-rollup prs, so one `never`, one rollup, one `iffy`, one `rollup`, one `never`
9998
etc, until most of the `never`s are merged. You can selectively bump the priority on certain
100-
`iffy`/`never` PRs to `p=1` (or even `p=5`) if you want to interleave them between rollups.
99+
`iffy`/`never` PRs to `p=1` (up to `p=5`) if you want to interleave them between rollups, while keeping a rollup / non-rollup alternation.
101100
102101
If you are the only person making rollups, you can also leave a couple of `never`/`iffy`s for a time
103102
where you know nobody will be doing rollups actively, or for weekends which generally see a lower
@@ -115,14 +114,15 @@ repeatedly, to have the same priority as rollups or sandwich between rollups.
115114
116115
Ultimately, we want to keep the number of regressions per merge at a minimum while also minimizing
117116
the amount of time between approval and the final merge of a PR, to avoid unnecessary merge
118-
conflicts and rebases.
117+
conflicts and rebases. Aim for a steady, continuous and even flow of similarly-sized rollups alternated by non-rollups.
118+
119119
120120
## Failed rollups
121121
122-
If the rollup has failed, run the `@bors retry` command if the
122+
If a rollup has failed, run the `@bors retry` command if the
123123
failure was spurious (e.g. due to a network problem or a timeout).
124-
There may be a matching `CI-spurious-fail-.*` label that you can use to tag the PR as such, to help
125-
discover common fail patterns.
124+
There may be a matching `CI-spurious-fail-.*` [label] that you can use to tag the PR as such, to help
125+
discover common failure patterns.
126126
127127
If it wasn't spurious, find the offending PR and return it to the author to be fixed by copying a
128128
link to the `rust-log-analyzer` comment, and leaving a comment like
@@ -138,25 +138,35 @@ respective line in the log output. Hopefully, the author or reviewer will give f
138138
PR fixed or confirm that it's not at fault. The failed rollup PR should then be closed.
139139

140140
Once you've removed the offending PR, recreate your rollup without it (see 1.).
141-
Merge one batch of PRs by throwing out the failures one by one instead of adding new PRs to it, as this may introduce additional points of failure.
141+
Merge one batch of PRs by throwing out the failures one by one instead of adding new PRs to it, as this may introduce additional points of failure,
142+
as don't want to block the queue forever until we have identified every single failing pr in it.
142143

143144
Sometimes it is hard to find the offending PR. There are usually three strategies to figure out the
144145
offending PR:
145146

146147
1. **Avoid the problem**: Use your intuition to avoid the PRs that you suspect are the problem and
147-
recreate the rollup.
148+
recreate the rollup without them
148149
2. **Test suspected PRs alone**: Raise the priority of the PRs you suspect, mark them as
149150
`rollup=never` (or `iffy`) and let bors test them standalone to dismiss or confirm your
150-
hypothesis.
151+
hypothesis. Or even better, run a `try job` on the pr to confirm it is cause of the failure.
152+
Note: you will have to unapprove the pr before running a try job, do not forget to re-approve if the pr turns out to not be the cause.
151153
3. **Divide and conquer**: Split rollup into 2 smaller ones until you are certain of the failure
152154
cause. If a PR was found to be the cause and other PRs were "wrongfully" marked `iffy`, they can
153155
of course be marked as `maybe` again with `@bors rollup=maybe` or `@bors rollup-`.
154156

155157
If a PR in a rollup continues to fail you can run the `@bors rollup=never` command to ensure the PR
156158
gets tested independently, since it's likely it will fail again in the future.
157159

160+
Make sure you do not forget to triage your failed rollups, don't just close them!
161+
Otherwise someone else might roll up the same failing pr again which is just a waste of time.
162+
Likewise, its often a good idea to quickly check the latest closed rollup to make sure it was triaged properly.
163+
164+
If you don't have time to triage your rollups after failure, consider leaving rollup creation
165+
to someone that will have time to do the triage later on.
166+
158167

159168
[Homu queue]: https://bors.rust-lang.org/queue/rust
160169
[the Rollups section]: ../compiler/reviews.md#rollups
161170
[here]: https://github.com/rust-lang/rust/pull/132434#issue-2628063878
162171
[bisectable]: https://rust-lang.github.io/cargo-bisect-rustc/
172+
[label]: https://github.com/rust-lang/rust/labels?q=CI-spurious

0 commit comments

Comments
 (0)