test-fast on ubuntu-24.04-arm intermittent SIGSEGV or SIGBUS in rustc #1790

Closed
EliahKagan opened this issue Jan 22, 2025 · 7 comments · Fixed by #1802
Labels
acknowledged, help wanted

Comments

@EliahKagan
Member

EliahKagan commented Jan 22, 2025

Current behavior 😯

Since #1777, a test-fast CI job runs on ubuntu-24.04-arm, one of the 64-bit ARM (AArch64/ARM64) Linux runners that recently became more widely available. This job initially worked with no problems: the ARM failures mentioned in #1777 and #1778 and tracked in #1780 apply to a test-32bit job, occur only with Docker, and do not affect test-fast.

However, the ARM64 test-fast job now intermittently fails with SIGBUS or SIGSEGV in rustc. This is probably a bug in rustc or another component of the Rust toolchain for ARM64, but I have not reproduced it locally or otherwise ruled out a problem on the runner image. The job, like the other test-fast jobs, uses a stable toolchain.

Expected behavior 🤔

The compilation should complete, or give an error, but not crash. SIGSEGV and SIGBUS should not occur.

The underlying bug is not in gitoxide, but I'm opening this issue in gitoxide to track the problem with the affected CI job here, which may need to be removed, skipped, or made continue-on-error.

Git behavior

Not applicable.

Steps to reproduce 🕹

Run or rerun the test-fast (ubuntu-24.04-arm) job on any commit. It seems less likely to happen if rust-cache is able to retrieve cached dependencies, since there is less to build, but it happens even when caching retrieves everything except what is built from this repository's workspace. It may be necessary to rerun a job multiple times to observe the problem.

A few runs that show this are:

The first link is to a run on the main branch in my fork. The following are relevant pieces of the output of that one run (not separate runs):

Compiling gix-date v0.9.3 (/home/runner/work/gitoxide/gitoxide/gix-date)
error: rustc interrupted by SIGSEGV, printing backtrace

/home/runner/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-bedc4a794a543ce8.so(+0xbf78ec)[0xff6a483f78ec]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xff6a517ba7e0]

note: we would appreciate a report at https://github.com/rust-lang/rust
help: you can increase rustc's stack size by setting RUST_MIN_STACK=16777216
error: could not compile `gix-trace` (lib)

Caused by:
  process didn't exit successfully: `/home/runner/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/rustc --crate-name gix_trace --edition=2021 gix-trace/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 '--warn=clippy::pedantic' '--allow=clippy::wildcard_imports' '--allow=clippy::used_underscore_binding' '--allow=clippy::unused_self' '--allow=clippy::unreadable_literal' '--allow=clippy::unnecessary_wraps' '--allow=clippy::unnecessary_join' '--allow=clippy::trivially_copy_pass_by_ref' '--allow=clippy::transmute_ptr_to_ptr' '--allow=clippy::too_many_lines' '--allow=clippy::too_long_first_doc_paragraph' '--allow=clippy::struct_field_names' '--allow=clippy::struct_excessive_bools' '--allow=clippy::stable_sort_primitive' '--allow=clippy::single_match_else' '--allow=clippy::similar_names' '--allow=clippy::should_panic_without_expect' '--allow=clippy::return_self_not_must_use' '--allow=clippy::redundant_else' '--allow=clippy::range_plus_one' '--allow=clippy::option_option' '--allow=clippy::no_effect_underscore_binding' '--allow=clippy::needless_raw_string_hashes' '--allow=clippy::needless_pass_by_value' '--allow=clippy::needless_for_each' '--allow=clippy::naive_bytecount' '--allow=clippy::mut_mut' '--allow=clippy::must_use_candidate' '--allow=clippy::module_name_repetitions' '--allow=clippy::missing_panics_doc' '--allow=clippy::missing_errors_doc' '--allow=clippy::match_wildcard_for_single_variants' '--allow=clippy::match_wild_err_arm' '--allow=clippy::match_same_arms' '--allow=clippy::match_bool' '--allow=clippy::many_single_char_names' '--allow=clippy::manual_string_new' '--allow=clippy::manual_let_else' '--allow=clippy::manual_is_variant_and' '--allow=clippy::manual_assert' '--allow=clippy::large_stack_arrays' '--allow=clippy::iter_without_into_iter' '--allow=clippy::iter_not_returning_iterator' '--allow=clippy::items_after_statements' 
'--allow=clippy::inline_always' '--allow=clippy::inefficient_to_string' '--allow=clippy::inconsistent_struct_constructor' '--allow=clippy::implicit_clone' '--allow=clippy::ignored_unit_patterns' '--allow=clippy::if_not_else' '--allow=clippy::from_iter_instead_of_collect' '--allow=clippy::fn_params_excessive_bools' '--allow=clippy::filter_map_next' '--allow=clippy::explicit_iter_loop' '--allow=clippy::explicit_into_iter_loop' '--allow=clippy::explicit_deref_methods' '--allow=clippy::enum_glob_use' '--allow=clippy::empty_docs' '--allow=clippy::doc_markdown' '--allow=clippy::default_trait_access' '--allow=clippy::copy_iterator' '--allow=clippy::checked_conversions' '--allow=clippy::cast_sign_loss' '--allow=clippy::cast_precision_loss' '--allow=clippy::cast_possible_wrap' '--allow=clippy::cast_possible_truncation' '--allow=clippy::cast_lossless' '--allow=clippy::borrow_as_ptr' '--allow=clippy::bool_to_int_with_if' --cfg 'feature="default"' --cfg 'feature="tracing"' --cfg 'feature="tracing-detail"' --check-cfg 'cfg(docsrs)' --check-cfg 'cfg(feature, values("default", "document-features", "tracing", "tracing-detail"))' -C metadata=70615eb612664f03 -C extra-filename=-70615eb612664f03 --out-dir /home/runner/work/gitoxide/gitoxide/target/debug/deps -L dependency=/home/runner/work/gitoxide/gitoxide/target/debug/deps --extern tracing_core=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libtracing_core-90c3db60dd8d72e4.rmeta` (signal: 11, SIGSEGV: invalid memory reference)
error: could not compile `gix-hash` (lib)

Caused by:
  process didn't exit successfully: `/home/runner/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/rustc --crate-name gix_hash --edition=2021 gix-hash/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no -C debuginfo=2 '--warn=clippy::pedantic' '--allow=clippy::wildcard_imports' '--allow=clippy::used_underscore_binding' '--allow=clippy::unused_self' '--allow=clippy::unreadable_literal' '--allow=clippy::unnecessary_wraps' '--allow=clippy::unnecessary_join' '--allow=clippy::trivially_copy_pass_by_ref' '--allow=clippy::transmute_ptr_to_ptr' '--allow=clippy::too_many_lines' '--allow=clippy::too_long_first_doc_paragraph' '--allow=clippy::struct_field_names' '--allow=clippy::struct_excessive_bools' '--allow=clippy::stable_sort_primitive' '--allow=clippy::single_match_else' '--allow=clippy::similar_names' '--allow=clippy::should_panic_without_expect' '--allow=clippy::return_self_not_must_use' '--allow=clippy::redundant_else' '--allow=clippy::range_plus_one' '--allow=clippy::option_option' '--allow=clippy::no_effect_underscore_binding' '--allow=clippy::needless_raw_string_hashes' '--allow=clippy::needless_pass_by_value' '--allow=clippy::needless_for_each' '--allow=clippy::naive_bytecount' '--allow=clippy::mut_mut' '--allow=clippy::must_use_candidate' '--allow=clippy::module_name_repetitions' '--allow=clippy::missing_panics_doc' '--allow=clippy::missing_errors_doc' '--allow=clippy::match_wildcard_for_single_variants' '--allow=clippy::match_wild_err_arm' '--allow=clippy::match_same_arms' '--allow=clippy::match_bool' '--allow=clippy::many_single_char_names' '--allow=clippy::manual_string_new' '--allow=clippy::manual_let_else' '--allow=clippy::manual_is_variant_and' '--allow=clippy::manual_assert' '--allow=clippy::large_stack_arrays' '--allow=clippy::iter_without_into_iter' '--allow=clippy::iter_not_returning_iterator' 
'--allow=clippy::items_after_statements' '--allow=clippy::inline_always' '--allow=clippy::inefficient_to_string' '--allow=clippy::inconsistent_struct_constructor' '--allow=clippy::implicit_clone' '--allow=clippy::ignored_unit_patterns' '--allow=clippy::if_not_else' '--allow=clippy::from_iter_instead_of_collect' '--allow=clippy::fn_params_excessive_bools' '--allow=clippy::filter_map_next' '--allow=clippy::explicit_iter_loop' '--allow=clippy::explicit_into_iter_loop' '--allow=clippy::explicit_deref_methods' '--allow=clippy::enum_glob_use' '--allow=clippy::empty_docs' '--allow=clippy::doc_markdown' '--allow=clippy::default_trait_access' '--allow=clippy::copy_iterator' '--allow=clippy::checked_conversions' '--allow=clippy::cast_sign_loss' '--allow=clippy::cast_precision_loss' '--allow=clippy::cast_possible_wrap' '--allow=clippy::cast_possible_truncation' '--allow=clippy::cast_lossless' '--allow=clippy::borrow_as_ptr' '--allow=clippy::bool_to_int_with_if' -C debug-assertions=on --cfg 'feature="serde"' --check-cfg 'cfg(docsrs)' --check-cfg 'cfg(feature, values("document-features", "serde"))' -C metadata=afbae439064e8dcf -C extra-filename=-afbae439064e8dcf --out-dir /home/runner/work/gitoxide/gitoxide/target/debug/deps -L dependency=/home/runner/work/gitoxide/gitoxide/target/debug/deps --extern faster_hex=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libfaster_hex-3224d66e2cae46b5.rmeta --extern serde=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libserde-fc1536d129a55ca3.rmeta --extern thiserror=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libthiserror-0af4056d81b4b441.rmeta` (signal: 7, SIGBUS: access to undefined memory)
error: rustc interrupted by SIGSEGV, printing backtrace

/home/runner/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-bedc4a794a543ce8.so(+0xbf78ec)[0xff5a2c7f78ec]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xff5a35c3f7e0]

note: we would appreciate a report at https://github.com/rust-lang/rust
help: you can increase rustc's stack size by setting RUST_MIN_STACK=16777216
error: could not compile `gix-utils` (lib)

Caused by:
  process didn't exit successfully: `/home/runner/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/rustc --crate-name gix_utils --edition=2021 gix-utils/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 '--warn=clippy::pedantic' '--allow=clippy::wildcard_imports' '--allow=clippy::used_underscore_binding' '--allow=clippy::unused_self' '--allow=clippy::unreadable_literal' '--allow=clippy::unnecessary_wraps' '--allow=clippy::unnecessary_join' '--allow=clippy::trivially_copy_pass_by_ref' '--allow=clippy::transmute_ptr_to_ptr' '--allow=clippy::too_many_lines' '--allow=clippy::too_long_first_doc_paragraph' '--allow=clippy::struct_field_names' '--allow=clippy::struct_excessive_bools' '--allow=clippy::stable_sort_primitive' '--allow=clippy::single_match_else' '--allow=clippy::similar_names' '--allow=clippy::should_panic_without_expect' '--allow=clippy::return_self_not_must_use' '--allow=clippy::redundant_else' '--allow=clippy::range_plus_one' '--allow=clippy::option_option' '--allow=clippy::no_effect_underscore_binding' '--allow=clippy::needless_raw_string_hashes' '--allow=clippy::needless_pass_by_value' '--allow=clippy::needless_for_each' '--allow=clippy::naive_bytecount' '--allow=clippy::mut_mut' '--allow=clippy::must_use_candidate' '--allow=clippy::module_name_repetitions' '--allow=clippy::missing_panics_doc' '--allow=clippy::missing_errors_doc' '--allow=clippy::match_wildcard_for_single_variants' '--allow=clippy::match_wild_err_arm' '--allow=clippy::match_same_arms' '--allow=clippy::match_bool' '--allow=clippy::many_single_char_names' '--allow=clippy::manual_string_new' '--allow=clippy::manual_let_else' '--allow=clippy::manual_is_variant_and' '--allow=clippy::manual_assert' '--allow=clippy::large_stack_arrays' '--allow=clippy::iter_without_into_iter' '--allow=clippy::iter_not_returning_iterator' '--allow=clippy::items_after_statements' 
'--allow=clippy::inline_always' '--allow=clippy::inefficient_to_string' '--allow=clippy::inconsistent_struct_constructor' '--allow=clippy::implicit_clone' '--allow=clippy::ignored_unit_patterns' '--allow=clippy::if_not_else' '--allow=clippy::from_iter_instead_of_collect' '--allow=clippy::fn_params_excessive_bools' '--allow=clippy::filter_map_next' '--allow=clippy::explicit_iter_loop' '--allow=clippy::explicit_into_iter_loop' '--allow=clippy::explicit_deref_methods' '--allow=clippy::enum_glob_use' '--allow=clippy::empty_docs' '--allow=clippy::doc_markdown' '--allow=clippy::default_trait_access' '--allow=clippy::copy_iterator' '--allow=clippy::checked_conversions' '--allow=clippy::cast_sign_loss' '--allow=clippy::cast_precision_loss' '--allow=clippy::cast_possible_wrap' '--allow=clippy::cast_possible_truncation' '--allow=clippy::cast_lossless' '--allow=clippy::borrow_as_ptr' '--allow=clippy::bool_to_int_with_if' --cfg 'feature="bstr"' --check-cfg 'cfg(docsrs)' --check-cfg 'cfg(feature, values("bstr"))' -C metadata=78d5054a26f3e2ca -C extra-filename=-78d5054a26f3e2ca --out-dir /home/runner/work/gitoxide/gitoxide/target/debug/deps -L dependency=/home/runner/work/gitoxide/gitoxide/target/debug/deps --extern bstr=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libbstr-36fa1f333109ab1c.rmeta --extern fastrand=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libfastrand-e2bb9a6311991097.rmeta --extern unicode_normalization=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libunicode_normalization-6d4aa053d3ce5a81.rmeta` (signal: 11, SIGSEGV: invalid memory reference)
error: could not compile `gix-date` (lib)

Caused by:
  process didn't exit successfully: `/home/runner/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin/rustc --crate-name gix_date --edition=2021 gix-date/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 '--warn=clippy::pedantic' '--allow=clippy::wildcard_imports' '--allow=clippy::used_underscore_binding' '--allow=clippy::unused_self' '--allow=clippy::unreadable_literal' '--allow=clippy::unnecessary_wraps' '--allow=clippy::unnecessary_join' '--allow=clippy::trivially_copy_pass_by_ref' '--allow=clippy::transmute_ptr_to_ptr' '--allow=clippy::too_many_lines' '--allow=clippy::too_long_first_doc_paragraph' '--allow=clippy::struct_field_names' '--allow=clippy::struct_excessive_bools' '--allow=clippy::stable_sort_primitive' '--allow=clippy::single_match_else' '--allow=clippy::similar_names' '--allow=clippy::should_panic_without_expect' '--allow=clippy::return_self_not_must_use' '--allow=clippy::redundant_else' '--allow=clippy::range_plus_one' '--allow=clippy::option_option' '--allow=clippy::no_effect_underscore_binding' '--allow=clippy::needless_raw_string_hashes' '--allow=clippy::needless_pass_by_value' '--allow=clippy::needless_for_each' '--allow=clippy::naive_bytecount' '--allow=clippy::mut_mut' '--allow=clippy::must_use_candidate' '--allow=clippy::module_name_repetitions' '--allow=clippy::missing_panics_doc' '--allow=clippy::missing_errors_doc' '--allow=clippy::match_wildcard_for_single_variants' '--allow=clippy::match_wild_err_arm' '--allow=clippy::match_same_arms' '--allow=clippy::match_bool' '--allow=clippy::many_single_char_names' '--allow=clippy::manual_string_new' '--allow=clippy::manual_let_else' '--allow=clippy::manual_is_variant_and' '--allow=clippy::manual_assert' '--allow=clippy::large_stack_arrays' '--allow=clippy::iter_without_into_iter' '--allow=clippy::iter_not_returning_iterator' '--allow=clippy::items_after_statements' 
'--allow=clippy::inline_always' '--allow=clippy::inefficient_to_string' '--allow=clippy::inconsistent_struct_constructor' '--allow=clippy::implicit_clone' '--allow=clippy::ignored_unit_patterns' '--allow=clippy::if_not_else' '--allow=clippy::from_iter_instead_of_collect' '--allow=clippy::fn_params_excessive_bools' '--allow=clippy::filter_map_next' '--allow=clippy::explicit_iter_loop' '--allow=clippy::explicit_into_iter_loop' '--allow=clippy::explicit_deref_methods' '--allow=clippy::enum_glob_use' '--allow=clippy::empty_docs' '--allow=clippy::doc_markdown' '--allow=clippy::default_trait_access' '--allow=clippy::copy_iterator' '--allow=clippy::checked_conversions' '--allow=clippy::cast_sign_loss' '--allow=clippy::cast_precision_loss' '--allow=clippy::cast_possible_wrap' '--allow=clippy::cast_possible_truncation' '--allow=clippy::cast_lossless' '--allow=clippy::borrow_as_ptr' '--allow=clippy::bool_to_int_with_if' --cfg 'feature="serde"' --check-cfg 'cfg(docsrs)' --check-cfg 'cfg(feature, values("document-features", "serde"))' -C metadata=d3f2fa21d502802c -C extra-filename=-d3f2fa21d502802c --out-dir /home/runner/work/gitoxide/gitoxide/target/debug/deps -L dependency=/home/runner/work/gitoxide/gitoxide/target/debug/deps --extern bstr=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libbstr-36fa1f333109ab1c.rmeta --extern itoa=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libitoa-5181ef3238623c92.rmeta --extern jiff=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libjiff-4e2aca16ca2f1b49.rmeta --extern serde=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libserde-fc1536d129a55ca3.rmeta --extern thiserror=/home/runner/work/gitoxide/gitoxide/target/debug/deps/libthiserror-0af4056d81b4b441.rmeta` (signal: 11, SIGSEGV: invalid memory reference)
@Byron added the help wanted and acknowledged labels Jan 22, 2025
@Byron
Member

Byron commented Jan 22, 2025

Thanks for reporting!

I didn't notice this yet and hope that this will be a rare occurrence. If not, like proposed, it could be made 'non-blocking'.

@Byron
Member

Byron commented Jan 22, 2025

Actually, it just failed on main: https://github.com/GitoxideLabs/gitoxide/actions/runs/12903033259/job/35977612391

Maybe it's best to just make it non-blocking right away.

@EliahKagan
Member Author

EliahKagan commented Jan 22, 2025

The error messages suggest forcing a minimum stack size for rustc using an environment variable. The earlier messages included this suggestion as well. It appears in the same message as the recommendation to open a bug report, so if it works then it's a workaround and there's still a bug that should be reported for rustc, unless it's somehow due to a problem in the runner image. I'm not sure if that suggestion is given anytime rustc fails with SIGSEGV, anytime it fails with SIGSEGV where SIGBUS occurred, or only when there is an actual indication of insufficient stack.

I am about to look into whether setting that helps. Then I'll open a PR to improve the situation one way or another. (Making it non-blocking for PRs by splitting it out of test-fast and having it not be a dependency of a required check is something I hadn't thought to do, but it is a good idea.)

Rerunning the check may make it pass, since the failure is intermittent. But I am not recommending that as a substitute for a change that would make the check fail less often (or not at all) or that would change how we treat failures of that check.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 22, 2025
This is to investigate the problem on the `test-fast` job with the
new ARM64 runner described in GitoxideLabs#1790.

This experiment does not produce useful results yet, because it has
no way to distinguish happenstance from correlation. To do that, I
need either to rerun each job repeatedly, or further parameterize
the matrix to do that. I'll be doing the latter, but right now this
dimension has size 1 (i.e., the only value of `number` is `0`)
so I don't start a large number of jobs when something is broken
due to a mistake in the workflows.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 22, 2025
The previous experiment[1][2] didn't have enough memory-related
errors to clearly show which values of the variables have an
effect, though it *looked* like the memory-related errors in
`rustc` only happened in Ubuntu 24.04 (not 22.04) and only happened
on the stable channel (not beta). That's one reason to increase the
total number of jobs in the experiment.

Another reason is that the memory-related errors are more varied.
Not all were true memory errors involving SIGSEGV and SIGBUS
anymore. Some were, same as reported in [3]. But some others were
panics, looking like this (the index and slice vary but, in each,
the start index is much larger than the length):

    thread 'rustc' panicked at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/compiler/rustc_serialize/src/opaque.rs:269:45:
    range start index 159846347648097871 out of range for slice of length 39963722

Since the distribution of errors across jobs might also have been
related to the order and times in which jobs started, for example
if there are inadvertent differences between different hosts (the
ARM64 Linux runners are in preview, so this seems plausible, though
fairly unlikely), this now expresses the repetition with two
variables: a high-order one, listed first in the matrix, and a
low-order one, listed last in the matrix.

Besides allowing more reps with the same values of the meaningful
variables, the reason to stop testing with `RUST_MIN_STACK` is that
it didn't seem to make a difference other than to change the
message shown, which suggests setting it to an even higher value.

[1]: e71b0cf
[2]: https://github.com/EliahKagan/gitoxide/actions/runs/12903958398
[3]: GitoxideLabs#1790
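
The two repetition variables described above can be pictured with a small sketch. It assumes the matrix expands like a Cartesian product in which the first key varies slowest; the key names (`rep_high`, `rep_low`, `os`, `channel`) and the sizes are hypothetical and are not taken from the actual workflow.

```python
from itertools import product

# Hypothetical matrix: key names and sizes are assumptions for illustration,
# not the contents of the workflow used in the experiment.
matrix = {
    "rep_high": [0, 1],                             # high-order repetition, listed first
    "os": ["ubuntu-22.04-arm", "ubuntu-24.04-arm"],
    "channel": ["stable", "beta"],
    "rep_low": [0, 1],                              # low-order repetition, listed last
}


def expand(matrix):
    """Return one dict per job, in product order (first key varies slowest)."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*matrix.values())]


jobs = expand(matrix)
print(len(jobs))
for job in jobs[:4]:
    print(job)
```

Under that assumption, jobs that differ only in `rep_low` sit next to each other in the expansion, while jobs that differ in `rep_high` are far apart, so a failure pattern tied to start order would show up differently along the two dimensions.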
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 22, 2025
As suggested in:
GitoxideLabs#1790 (comment)

It likely won't have to be kept this way. But making it nonrequired
for now makes it so that investigating what triggers the SIGSEGV
(and SIGBUS) errors -- as well as other errors that were found
while investigating that (d9e7fdb, e71b0cf, 5a71963) -- doesn't
have to be rushed.
@EliahKagan
Member Author

> Maybe it's best to just make it non-blocking right away.

I've gone ahead and done this in #1792. I suspect it can be adjusted and made blocking again, but I'm not done with the research to figure out how, so I think it does make sense to make it non-blocking temporarily.

@EliahKagan
Member Author

EliahKagan commented Jan 22, 2025

In both e71b0cf (results) and 5a71963 (results), a pattern emerges: the memory errors on ARM64 Linux CI runners seem only ever to happen on the ubuntu-24.04-arm runner, not the ubuntu-22.04-arm runner, and seem only ever to happen with the stable channel, not the beta channel (and also probably not the nightly channel, but I didn't test as much with that--only in e71b0cf--and there are ordinary build errors that happen on it).

This remains the case even if we regard panics that suggest but do not prove memory errors, such as large range start indices that are much bigger than the range, to be memory errors. I observed these in 5a71963.

A third kind of error, which is a memory error even if narrowly defined, also happened only in 5a71963, and only once: in dtolnay/rust-toolchain, running rustup failed with free(): invalid next size (fast). This was also on 24.04 with the stable toolchain.
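
To make the three kinds of error concrete, here is a rough sketch of sorting log lines into them. The patterns are based on messages quoted in this thread; the category names and the `classify` helper are hypothetical, and real logs may need broader patterns.

```python
import re

# Patterns are drawn from messages quoted in this thread.
# The category names are hypothetical labels chosen for this sketch.
PATTERNS = [
    ("sigsegv", re.compile(r"SIGSEGV")),
    ("sigbus", re.compile(r"SIGBUS")),
    ("range-panic", re.compile(
        r"range start index \d+ out of range for slice of length \d+")),
    ("heap-corruption", re.compile(r"free\(\): invalid next size")),
]


def classify(line):
    """Return the first matching category name, or None if nothing matches."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return None


sample = [
    "error: rustc interrupted by SIGSEGV, printing backtrace",
    "(signal: 7, SIGBUS: access to undefined memory)",
    "range start index 159846347648097871 out of range for slice of length 39963722",
    "free(): invalid next size (fast)",
    "Compiling gix-date v0.9.3 (/home/runner/work/gitoxide/gitoxide/gix-date)",
]
for line in sample:
    print(classify(line), "<-", line)
```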

Another kind of error that I had originally assumed was unrelated was that actions/checkout failed a few times. This happened occasionally in both e71b0cf and 5a71963. This is unrelated to Rust. It was strange, because no error message was written to the log, suggesting something crashed. It also only happened on 24.04. But it didn't happen often enough that this is decisive; it's possible that Rust is essential to this problem, or even that there are two or more separate problems some of which are entirely in the Rust toolchain. I suspect something is going on with the 24.04 image, though.

For jobs where the actions/checkout error happened, I reran those jobs. (It did not recur.) I did not rerun any other jobs in either e71b0cf or 5a71963 (the repetition to distinguish correlation from happenstance is in the CI matrix definition itself, rather than requiring checks to be manually rerun).

So it looks like it would be sufficient to change ubuntu-24.04-arm to ubuntu-22.04-arm in the test-fast job. Alternatively, it would likely be sufficient to use the beta toolchain instead of the stable one (which could be special-cased even if the affected ARM64 jobs remain part of, or are put back into, the test-fast matrix).

Why do (22.04, stable), (22.04, beta), and (24.04, beta) all work, and it is only (24.04, stable) that does not? I think #1792 may still be the best first move, until I've investigated that question more. So far I've looked at issues related to the runner images, and the only issue that seems maybe related, indirectly, is actions/partner-runner-images#36. I have not yet looked at what has changed between the stable and beta channels of the toolchain the ARM64 jobs are using.
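
One way to tabulate results per (runner image, channel) pair is sketched below. The `failure_rates` helper and the record layout are hypothetical, and the sample records are placeholder values that only show the shape of the input; they are not the measured results of either experiment.

```python
from collections import Counter


def failure_rates(records):
    """Map (image, channel) to (failures, total) given (image, channel, failed) records."""
    totals, failures = Counter(), Counter()
    for image, channel, failed in records:
        totals[(image, channel)] += 1
        if failed:
            failures[(image, channel)] += 1
    # Counter returns 0 for combinations that never failed.
    return {key: (failures[key], totals[key]) for key in totals}


# Placeholder records, NOT real results from the experiments.
records = [
    ("ubuntu-24.04-arm", "stable", True),
    ("ubuntu-24.04-arm", "stable", False),
    ("ubuntu-24.04-arm", "beta", False),
    ("ubuntu-22.04-arm", "stable", False),
    ("ubuntu-22.04-arm", "beta", False),
]
for (image, channel), (bad, total) in sorted(failure_rates(records).items()):
    print(f"{image:18} {channel:7} {bad}/{total} failed")
```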

@EliahKagan
Member Author

The SIGSEGV and "range start index" errors are rust-lang/rust#135867.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
This makes two changes, with the intent of producing a usable test:

- Removes `nightly`, since a test is currently failing on it. It
  can be tested later in case it fixes the SIGSEGV bug, if other
  changes don't help.

- Have `number` take on 16 values instead of just one. This is to
  make it possible to figure something out about how often the
  failure happens with the other variables and whether the other
  variables make a difference. This is needed because the failures
  are nondeterministic, may not even usually happen, or may happen
  less often but still happen for some combination of the other
  variables.

(See GitoxideLabs#1790 for context.)
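
As a rough illustration of why repetition matters for a nondeterministic failure: if each run failed independently with some per-run probability, the chance of seeing at least one failure in 16 repetitions would be as computed below. The independence assumption and the example rates are hypothetical; the real failure rate is not known from this thread.

```python
def chance_of_seeing_failure(p, n):
    """Probability that at least one of n independent runs fails,
    given a per-run failure probability p."""
    return 1 - (1 - p) ** n


# The rates below are made-up examples, not measured values.
for p in (0.05, 0.10, 0.25):
    print(f"p={p:.2f}: {chance_of_seeing_failure(p, 16):.3f}")
```

With a single repetition, a configuration that fails only occasionally would usually look healthy, which is why one run per combination cannot distinguish happenstance from correlation.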
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
When using `dtolnay/rust-toolchain` with the `toolchain` key to
specify a channel, the action version should be given as `@master`.
But I accidentally kept it at `@stable`! This caused `beta` and
`nightly` to refer to the most recent beta and nightly builds
*prior* to the current stable version. That made the conclusions
about beta and nightly builds inaccurate. This rectifies that
error and repeats the experiment.

See e71b0cf (1f3f6b5), GitoxideLabs#1790, and rust-lang/rust#135867 for context.

(I made this mistake in both experiment 1 and experiment 2, having
wrongly thought I'd changed `@stable` to `@master` for experiment
1. This commit just repeats experiment 1, but experiment 2 should
also be repeated for the same reason.)
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
As noted in the preceding commit, when I ran experiments 1 and 2
the first time, I accidentally used `dtolnay/rust-toolchain@stable`
instead of `dtolnay/rust-toolchain@master`, even though the latter
is needed to use current values of the `toolchain` key rather than
the builds they referred to at the time the most recent stable
build was updated. The preceding commit redid experiment 1 with
that fixed.

This commit redoes experiment 2 with the same fix.

See 5a71963 (1b3e2cd), GitoxideLabs#1790, and rust-lang/rust#135867 for context.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
This is to investigate the problem on the `test-fast` job with the
new ARM64 runner described in GitoxideLabs#1790.

This experiment does not produce useful results yet, because it has
no way to distinguish happenstance from correlation. To do that, I
need either to rerun each job repeatedly, or further parameterize
the matrix to do that. I'll be doing the latter, but right now this
dimension has size 1 (i.e., the only value of `number` is `0`)
so I don't start a large number of jobs when something is broken
due to a mistake in the workflows.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
This makes two changes, with the intent of producing a usable test:

- Removes `nightly`, since a test is currently failing on it. It
  can be tested later in case it fixes the SIGSEGV bug, if other
  changes don't help.

- Have `number` take on 16 values instead of just one. This is to
  make it possible to figure something out about how often the
  failure happens with the other variables and whether the other
  variables make a difference. This is needed because the failures
  are nondeterministic, may not even usually happen, or may happen
  less often but still happen for some combination of the other
  variables.

(See GitoxideLabs#1790 for context.)
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
The previous experiment[1][2] didn't have enough memory-related
errors to clearly show which values of the variables have an
effect, though it *looked* like the memory-related errors in
`rustc` only happened in Ubuntu 24.04 (not 22.04) and only happened
on the stable channel (not beta). That's one reason to increase the
total number of jobs in the experiment.

Another reason is that the memory-related errors are more varied.
Not all were true memory errors involving SIGSEGV and SIGBUS
anymore. Some were, same as reported in [3]. But some others were
panics, looking like this (the index and slice vary but, in each,
the start index is much larger than the length):

    thread 'rustc' panicked at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/compiler/rustc_serialize/src/opaque.rs:269:45:
    range start index 159846347648097871 out of range for slice of length 39963722

Since the distribution of errors across jobs might also have
been related to the order and times in which jobs started, for example
if there are inadvertent differences between different hosts (the
ARM64 Linux runners are in preview, so this seems plausible, though
fairly unlikely), this now expresses the repetition with two
variables: a high-order one, listed first in the matrix, and a
low-order one, listed last in the matrix.

Besides allowing more reps with the same values of the meaningful
variables, the reason to stop testing with `RUST_MIN_STACK` is that
it didn't seem to make a difference other than to change the
message shown, which suggests setting it to an even higher value.

[1]: e71b0cf
[2]: https://github.com/EliahKagan/gitoxide/actions/runs/12903958398
[3]: GitoxideLabs#1790
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
When using `dtolnay/rust-toolchain` with the `toolchain` key to
specify a channel, the action version should be given as `@master`.
But I accidentally kept it at `@stable`! This caused `beta` and
`nightly` to refer to the most recent beta and nightly builds
*prior* to the current stable version. That made the conclusions
about beta and nightly builds inaccurate. This rectifies that
error and repeats the experiment.

See e71b0cf (1f3f6b5), GitoxideLabs#1790, and rust-lang/rust#135867 for context.

(I made this mistake in both experiment 1 and experiment 2, having
wrongly thought I'd changed `@stable` to `@master` for experiment
1. This commit just repeats experiment 1, but experiment 2 should
also be repeated for the same reason.)
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
In case the installation method makes a difference.

Also, this brings back testing of the unstable toolchain.

This has just one job for each meaningful combination, so mistakes
in the experiment workflow can be found before doing nine times
as much work. The experiment this prepares should hopefully shed
more light on GitoxideLabs#1790 (or increase confidence in the observations so
far), but this is just preparation: variation across runs will
likely be due to the bug being nondeterministic.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 23, 2025
This varies:

- `ubuntu-22.04-arm` vs. `ubuntu-24.04-arm` GHA runner.
- Installing Rust via the `rust-toolchain` action vs. with curl.sh.
- Installing the stable vs. beta Rust toolchain.
- Installing nextest via `install-action` vs. quickinstall/binstall.

*If* this also confirms that the only fully consistent factor in
whether errors happen is `ubuntu-22.04-arm` vs. `ubuntu-24.04-arm`,
then that will make it clearer that the problem is likely specific
to the `ubuntu-24.04-arm` runner.

See GitoxideLabs#1790 and rust-lang/rust#135867 for context.
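
The variables listed above could be expressed as a job matrix along the following lines. This is only a sketch under assumptions, not the experiment workflow itself: the job name, the matrix variable names (`rust-install`, `nextest-install`, `number`), and the repetition count are hypothetical, and the nextest installation steps are omitted.

```yaml
# Hypothetical sketch: names and the repetition count are assumptions.
jobs:
  experiment:
    strategy:
      fail-fast: false  # let every combination finish, even after a failure
      matrix:
        os: [ubuntu-22.04-arm, ubuntu-24.04-arm]
        rust-install: [action, curl]
        channel: [stable, beta]
        nextest-install: [install-action, binstall]
        number: [0, 1, 2]  # repetition only: 2 * 2 * 2 * 2 * 3 = 48 jobs
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4

      # Variant 1: install the toolchain with the action.
      - if: matrix.rust-install == 'action'
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ matrix.channel }}

      # Variant 2: install the toolchain with the rustup bootstrap script.
      - if: matrix.rust-install == 'curl'
        run: |
          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs |
            sh -s -- -y --profile minimal --default-toolchain ${{ matrix.channel }}
          echo "$HOME/.cargo/bin" >> "$GITHUB_PATH"
```

Keeping a pure repetition dimension such as `number` separate from the meaningful variables makes it possible to count how often a nondeterministic failure occurs for each combination.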
@EliahKagan
Member Author

EliahKagan commented Jan 23, 2025

I made a mistake in the testing described above (experiments 1 and 2) that slightly affected the results. It is surprising that the effect was only slight. The mistake was that I forgot to change @stable to @master as the version of the dtolnay/rust-toolchain action. This caused the beta from before the current stable version to be tested, instead of a beta after it.
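
To illustrate the distinction, here is a minimal sketch of the step in question. It is not copied from the experiment workflow: the `matrix.channel` variable name is hypothetical, and only the action ref differs between the two variants.

```yaml
steps:
  # The mistake: the action was left at its `stable` ref, so `beta` did not
  # resolve to the current beta build, as described above.
  # - uses: dtolnay/rust-toolchain@stable
  #   with:
  #     toolchain: ${{ matrix.channel }}

  # The fix: use the `master` ref together with the `toolchain` key.
  - uses: dtolnay/rust-toolchain@master
    with:
      toolchain: ${{ matrix.channel }}  # hypothetical matrix variable: stable or beta
```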

To rectify this, I redid those two experiments with the version of that action fixed in the workflow. Oddly, the beta version continued to have very few problems. It did fail occasionally, as detailed below, which may be new, but the difference between the stable and the beta remained stark, though not as much so as the difference between 24.04 and 22.04 (as detailed below, the 22.04 runner seems to have none of these problems, with no exceptions).

I found that to be very strange, so I decided to run an "experiment 3" to test if the problem might have something to do with the way the Rust toolchain was being installed: whether it was installed using an action, versus by running the bootstrap script shown in the installation command on the website. While I was at it, I also tested if it might have something to do with how nextest was installed, though in hindsight I probably didn't have to test that because not all failures have to do with nextest. In short, I found that these do not seem to matter. Experiment 3 also helps confirm and clarify some of the previous results, aside from that.

The three experiments (the first two redone, and the third) can be examined at:

As before, I reran jobs only when actions/checkout failed.

Experiment 3 seems to show:

  • No difference based on whether the Rust toolchain is installed with the rust-toolchain action vs. the curl ... | sh ... way suggested on the website.
  • No difference based on whether cargo-nextest is installed with the install-action action vs. installed with a quickinstall/binstall approach, except that when cargo-nextest is installed in a way that involves installing something else from source first (usually cargo-quickinstall) then the memory errors sometimes happen in rustc for that installation rather than that of cargo-nextest.

The new testing also reinforces previous findings:

  • Switching from ubuntu-24.04-arm to ubuntu-22.04-arm is likely to be a full workaround. All errors of all kinds happened on the 24.04 runner, with none on the 22.04 runner.
  • Other approaches (besides switching to ubuntu-22.04-arm or removing the job) are likely still to have rustc memory errors at least occasionally. Although failures when the Rust toolchain is installed from the beta channel were rare, they sometimes happened: one, two, three
  • Other approaches (besides switching to ubuntu-22.04-arm or removing the job) are likely to fail in actions/checkout occasionally. This doesn't involve Rust, yet it happens only on the 24.04 runner and not the 22.04 runner. This was observed when rerunning previous experiments as well as in the first run of this job in experiment 3.

I'll open a PR soon to change the runner from ubuntu-24.04-arm to ubuntu-22.04-arm. This should make the temporary changes in #1792 no longer needed, so I'll revert those at the same time.

Edit: I've opened #1802 for this.
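
A change of this kind could look roughly like the following in the workflow. This is a sketch under assumptions, not the actual diff in #1802: the job layout and the other matrix entries are illustrative.

```yaml
jobs:
  test-fast:
    strategy:
      matrix:
        os:
          - ubuntu-latest      # x86-64, unchanged
          - macos-latest       # Apple Silicon, unchanged
          - ubuntu-22.04-arm   # previously ubuntu-24.04-arm
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # ... remaining steps unchanged
```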

EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 24, 2025
In the AArch64/ARM64 (64-bit, non-containerized) test-fast job,
this uses the `ubuntu-22.04-arm` runner instead of the
`ubuntu-24.04-arm` runner. This is to avoid the errors described
in GitoxideLabs#1790, i.e., to work around rust-lang/rust#135867.

Such problems have not been observed on the 22.04 runner, including
in tests intended to find them, and switching to it seems to be a
complete workaround for the problem. In contrast, continuing to use
the 24.04 runner, but attempting to work around the problem by
switching from the stable to the beta channel, looks like it would
greatly decrease the frequency of the errors but not eliminate
them. A problem with `actions/checkout` failing is likewise
observed on the 24.04 runner only, so using 22.04 avoids that too.

Because that seems like a complete workaround, this also reverts
50da7cb (GitoxideLabs#1792). That is to say that the ARM64 test-fast job is
again in the `test-fast` matrix. It is capable of cancelling or
being cancelled by the other `test-fast` checks. Code duplication
in the workflow is somewhat decreased. The job will again block PR
auto-merge.

Similar errors do not seem to have occurred in the `test-32bit`
job that runs an arm32v7 Docker image in `ubuntu-24.04-arm`, and it
is not clear that changing the runner image would help with GitoxideLabs#1780,
nor even if that issue is still happening. Therefore, it is not
changed there at this time.

This affects only ARM Linux runners. The x86-64 runners continue to
use `ubuntu-latest`, which is currently resolved to `ubuntu-24.04`,
and that does not need to be changed. Likewise, the `macos-latest`
runners use ARM processors (Apple Silicon) and they are fine.

Various experiments were done in a separate workflow. This commit
also removes that workflow, because it is not actively needed
anymore, and because, if kept, it would have to be modified to
avoid running hundreds of extra checks on each and every push.