Use `ubuntu-22.04-arm` for the ARM64 `test-fast` job #1802

EliahKagan · 2025-01-24T00:46:48Z

In the AArch64/ARM64 (64-bit, non-containerized) test-fast job, this uses the ubuntu-22.04-arm runner instead of the ubuntu-24.04-arm runner. This is to avoid the errors described in #1790, i.e., to work around rust-lang/rust#135867.

Such problems have not been observed on the 22.04 runner, including in tests intended to find them, and switching to it seems to be a complete workaround for the problem. In contrast, continuing to use the 24.04 runner, but attempting to work around the problem by switching from the stable to the beta channel, looks like it would greatly decrease the frequency of the errors but not eliminate them. A problem with actions/checkout failing is likewise observed on the 24.04 runner only, so using 22.04 avoids that too.

Because that seems like a complete workaround, this also reverts 50da7cb (#1792). That is to say that the ARM64 test-fast job is again in the test-fast matrix. It is capable of cancelling or being cancelled by the other test-fast checks. Code duplication in the workflow is somewhat decreased. The job will again block PR auto-merge.

Similar errors do not seem to have occurred in the test-32bit job that runs an arm32v7 Docker image in ubuntu-24.04-arm, and it is not clear that changing the runner image would help with #1780, nor even if that issue is still happening. Therefore, it is not changed there at this time.

This affects only ARM Linux runners. The x86-64 runners continue to use ubuntu-latest, which is currently resolved to ubuntu-24.04, and that does not need to be changed. Likewise, the macos-latest runners use ARM processors (Apple Silicon) and they are fine.
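
For illustration, the runner change can be pictured as a one-line edit to a job matrix. This is a minimal sketch and not the repository's actual workflow: the `os` key name and the overall layout are assumptions, and only runner labels named in this description are listed.

```yaml
# Hypothetical sketch of a test-fast matrix; key names are assumptions.
jobs:
  test-fast:
    strategy:
      matrix:
        os:
          - ubuntu-latest     # x86-64, currently resolves to ubuntu-24.04 (unchanged)
          - macos-latest      # Apple Silicon (unchanged)
          - ubuntu-22.04-arm  # ARM64 Linux, previously ubuntu-24.04-arm
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
```

Keeping the ARM64 entry inside the matrix, rather than in a separate job, is what lets it share steps with the other `test-fast` checks.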

Various experiments were done in a separate workflow. Those experiments comprise all but the last commit here. I often drop or squash such things, but here they seem sufficiently valuable and interesting to keep. However, if that is not preferred, another option is to squash all but the last commit into one commit. The last experiment is the most useful and comprehensive, and thus the most important to have in the history. My preference for keeping the workflow details for all of them, rather than just that last one, is slight. So I would be pleased to squash those if desired.

The last commit includes the removal of the experiment workflow. Irrespective of what is done with its history, I don't think there is value in having that file in current commits. In addition, if it were kept, it would have to be modified to avoid running hundreds of extra checks on each and every push.

This is to investigate the problem on the `test-fast` job with the new ARM64 runner described in GitoxideLabs#1790. This experiment does not produce useful results yet, because it has no way to distinguish happenstance from correlation. To do that, I need either to rerun each job repeatedly, or further parameterize the matrix to do that. I'll be doing the latter, but right now this dimension has size 1 (i.e., the only value of `number` is `0`) so I don't start a large number of jobs when something is broken due to a mistake in the workflows.

This makes two changes, with the intent of producing a usable test:

- Removes `nightly`, since a test is currently failing on it. It can be tested later in case it fixes the SIGSEGV bug, if other changes don't help.
- Makes `number` take on 16 values instead of just one. This is to make it possible to figure something out about how often the failure happens with the other variables and whether the other variables make a difference. This is needed because the failures are nondeterministic, may not even usually happen, or may happen less often but still happen for some combination of the other variables.

(See GitoxideLabs#1790 for context.)

The previous experiment[1][2] didn't have enough memory-related errors to clearly show which values of the variables have an effect, though it *looked* like the memory-related errors in `rustc` only happened in Ubuntu 24.04 (not 22.04) and only happened on the stable channel (not beta). That's one reason to increase the total number of jobs in the experiment.

Another reason is that the memory-related errors are more varied. Not all were true memory errors involving SIGSEGV and SIGBUS anymore. Some were, same as reported in [3]. But some others were panics, looking like this (the index and slice vary but, in each, the start index is much larger than the length):

    thread 'rustc' panicked at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/compiler/rustc_serialize/src/opaque.rs:269:45:
    range start index 159846347648097871 out of range for slice of length 39963722

Since the distribution of errors across jobs might also have been related to the order and times in which jobs started, for example if there are inadvertent differences between different hosts (the ARM64 Linux runners are in preview, so this seems plausible, though fairly unlikely), this now expresses the repetition with two variables: a high-order one, listed first in the matrix, and a low-order one, listed last in the matrix.

Besides allowing more reps with the same values of the meaningful variables, the reason to stop testing with `RUST_MIN_STACK` is that it didn't seem to make a difference other than to change the message shown, which suggests setting it to an even higher value.

[1]: e71b0cf
[2]: https://github.com/EliahKagan/gitoxide/actions/runs/12903958398
[3]: GitoxideLabs#1790

When using `dtolnay/rust-toolchain` with the `toolchain` key to specify a channel, the action version should be given as `@master`. But I accidentally kept it at `@stable`! This caused `beta` and `nightly` to refer to the most recent beta and nightly builds *prior* to the current stable version. That made the conclusions about beta and nightly builds inaccurate. This rectifies that error and repeats the experiment. See e71b0cf (1f3f6b5), GitoxideLabs#1790, and rust-lang/rust#135867 for context. (I made this mistake in both experiment 1 and experiment 2, having wrongly thought I'd changed `@stable` to `@master` for experiment 1. This commit just repeats experiment 1, but experiment 2 should also be repeated for the same reason.)
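
As a minimal sketch of the distinction described above (not the experiment workflow's actual contents), a step that selects the channel through the `toolchain` input pins the action itself at `@master`. The surrounding `steps:` layout and the choice of `beta` are illustrative.

```yaml
steps:
  # Pin the action at @master when the channel is given via the `toolchain` input.
  # With @stable here instead, `toolchain: beta` would not track the current beta.
  - uses: dtolnay/rust-toolchain@master
    with:
      toolchain: beta
```

The action's README also documents selecting the channel through the revision alone, as in `dtolnay/rust-toolchain@beta`, which avoids the mismatch when no `toolchain` input is passed.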

As noted in the preceding commit, when I ran experiments 1 and 2 the first time, I accidentally used `dtolnay/rust-toolchain@stable` instead of `dtolnay/rust-toolchain@master`, even though the latter is needed to use current values of the `toolchain` key rather than the builds they referred to at the time the most recent stable build was updated. The preceding commit redid experiment 1 with that fixed. This commit redoes experiment 2 with the same fix. See 5a71963 (1b3e2cd), GitoxideLabs#1790, and rust-lang/rust#135867 for context.

In case the installation method makes a difference. Also, this brings back testing of the unstable toolchain. This has just one job for each meaningful combination, so mistakes in the experiment workflow can be found before doing nine times as much work. The experiment this prepares should hopefully shed more light on GitoxideLabs#1790 (or increase confidence in the observations so far), but this is just preparation: variation across runs will likely be due to the bug being nondeterministic.

This varies:

- `ubuntu-22.04-arm` vs. `ubuntu-24.04-arm` GHA runner.
- Installing Rust via the `rust-toolchain` action vs. with curl.sh.
- Installing the stable vs. beta Rust toolchain.
- Installing nextest via `install-action` quickinstall/binstall.

*If* this also confirms that the only fully consistent factor in whether errors happen is `ubuntu-22.04-arm` vs. `ubuntu-24.04-arm`, then that will make it clearer that the problem is likely specific to the `ubuntu-24.04-arm` runner. See GitoxideLabs#1790 and rust-lang/rust#135867 for context.
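
The kind of matrix described above could be sketched roughly as follows. This is a hypothetical illustration, not the experiment workflow itself: the job name, the matrix key names (`os`, `rust_install`, `channel`, `number`), the step names, and the final build command are all assumptions, and the nextest installation dimension is left out.

```yaml
# Hypothetical sketch; names and layout are illustrative assumptions.
jobs:
  experiment:
    strategy:
      fail-fast: false  # Let every combination finish so failures can be counted.
      matrix:
        os: [ubuntu-22.04-arm, ubuntu-24.04-arm]
        rust_install: [action, script]
        channel: [stable, beta]
        number: [0]  # Add more values to repeat each combination.
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust via the rust-toolchain action
        if: matrix.rust_install == 'action'
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ matrix.channel }}
      - name: Install Rust via the rustup script
        if: matrix.rust_install == 'script'
        run: curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain ${{ matrix.channel }}
      - name: Build (stand-in for the real test steps)
        run: cargo build --workspace
```

Because the failures are nondeterministic, a single run of each combination says little; widening `number` multiplies the job count without changing any meaningful variable.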

In the AArch64/ARM64 (64-bit, non-containerized) test-fast job, this uses the `ubuntu-22.04-arm` runner instead of the `ubuntu-24.04-arm` runner. This is to avoid the errors described in GitoxideLabs#1790, i.e., to work around rust-lang/rust#135867.

Such problems have not been observed on the 22.04 runner, including in tests intended to find them, and switching to it seems to be a complete workaround for the problem. In contrast, continuing to use the 24.04 runner, but attempting to work around the problem by switching from the stable to the beta channel, looks like it would greatly decrease the frequency of the errors but not eliminate them. A problem with `actions/checkout` failing is likewise observed on the 24.04 runner only, so using 22.04 avoids that too.

Because that seems like a complete workaround, this also reverts 50da7cb (GitoxideLabs#1792). That is to say that the ARM64 test-fast job is again in the `test-fast` matrix. It is capable of cancelling or being cancelled by the other `test-fast` checks. Code duplication in the workflow is somewhat decreased. The job will again block PR auto-merge.

Similar errors do not seem to have occurred in the `test-32bit` job that runs an arm32v7 Docker image in `ubuntu-24.04-arm`, and it is not clear that changing the runner image would help with GitoxideLabs#1780, nor even if that issue is still happening. Therefore, it is not changed there at this time.

This affects only ARM Linux runners. The x86-64 runners continue to use `ubuntu-latest`, which is currently resolved to `ubuntu-24.04`, and that does not need to be changed. Likewise, the `macos-latest` runners use ARM processors (Apple Silicon) and they are fine.

Various experiments were done in a separate workflow. This commit also removes that workflow, because it is not actively needed anymore, and because, if kept, it would have to be modified to avoid running hundreds of extra checks on each and every push.

Byron

This is great, and a huge improvement! Thank you!

From ubuntu-24.04-arm, which seems to be having some problems there too, as discussed in GitoxideLabs#1828. This can be viewed as a follow-up on GitoxideLabs#1802, which made the analogous change for the non-containerized test job only.

EliahKagan added 8 commits January 23, 2025 07:04

EliahKagan marked this pull request as ready for review January 24, 2025 00:54

This was referenced Jan 24, 2025

Container creation sometimes fails in the 32-bit ARM test job #1780

Closed

test-fast on ubuntu-24.04-arm intermittent SIGSEGV or SIGBUS in rustc #1790

Closed

Byron approved these changes Jan 24, 2025

Byron merged commit f58f3ea into GitoxideLabs:main Jan 24, 2025
21 checks passed

EliahKagan deleted the arm-segv branch January 24, 2025 07:34

EliahKagan mentioned this pull request Feb 4, 2025

build(deps): bump openssl from 0.10.68 to 0.10.70 in the cargo group across 1 directory #1828

Merged

EliahKagan mentioned this pull request Feb 4, 2025

Change test-32bit arm32v7 runner to ubuntu-22.04-arm #1830

Merged

EliahKagan mentioned this pull request Feb 26, 2025

ubuntu-24.04-arm can probably be used again #1866

Closed
