
Test "Lin Bytes test with Domain" is flaky #541


Closed
glondu opened this issue Mar 25, 2025 · 9 comments · Fixed by #546
Labels
test suite reliability: Issue concerns tests that should behave more predictably

Comments

glondu commented Mar 25, 2025

The test "Lin Bytes test with Domain" does not always behave the same, causing random failures in Debian:

The failure seldom happens, and I couldn't reproduce it myself (even using the same random seed).

glondu (Author) commented Mar 25, 2025

Reported in Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1101243

jmid (Collaborator) commented Mar 25, 2025

Thanks for sharing, acknowledged!
If I am reading the logs correctly, you've now observed this on both amd64 and s390x, right?

Are these run in some form of containerized environment?
Also are they running on native amd64/s390x hardware or via, e.g., qemu?

"Lin Bytes test with Domain" has been a pain point also in #520. The test was furthermore extended in #521.
As more bindings are targeted, the chance of hitting a bad combination decreases. It could be bettered by introducing weights. Alternatively, most other tests have been rewritten using STM. An alternative path could thus be to port the #521 extensions to the existing src/bytes/stm_tests.ml test.
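For context, a Lin specification for Bytes is essentially a list of typed bindings from which random command sequences are generated, along the lines of the following sketch (written against qcheck-lin's combinator API; the actual bindings, sizes, and counts in src/bytes/lin_tests.ml may differ):

    (* Minimal sketch of a qcheck-lin specification for Bytes; illustrative only. *)
    module BytesSpec = struct
      type t = Bytes.t                  (* the shared value under test *)
      let init () = Bytes.make 42 ' '   (* fresh state for each test instance *)
      let cleanup _ = ()

      open Lin
      let int = nat_small               (* small indices make overlapping accesses likely *)
      (* Each val_ entry declares one binding the command generator may pick. *)
      let api = [
        val_ "Bytes.length" Bytes.length (t @-> returning int);
        val_ "Bytes.get"    Bytes.get    (t @-> int @-> returning_or_exc char);
        val_ "Bytes.set"    Bytes.set    (t @-> int @-> char @-> returning_or_exc unit);
        val_ "Bytes.fill"   Bytes.fill   (t @-> int @-> int @-> char @-> returning_or_exc unit);
      ]
    end

    (* Instantiate the Domain-based test harness over the specification. *)
    module BytesDomain = Lin_domain.Make (BytesSpec)

The more entries in the api list, the less likely any particular pair of conflicting operations is generated in a short parallel scenario, which is why weights could help here.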

glondu (Author) commented Mar 25, 2025

If I am reading the logs correctly, you've now observed this on both amd64 and s390x, right?

Right.

Are these run in some form of containerized environment?

Yes. I don't know the exact setup for the amd64 occurrence, but the s390x one surely ultimately boils down to either chroot or unshare.

Also are they running on native amd64/s390x hardware or via, e.g., qemu?

I don't know the exact setup for the amd64 occurrence, but the s390x one occurred on this machine: https://db.debian.org/machines.cgi?host=zandonai (I would say "native", if that means anything for s390x).

sanvila commented Mar 25, 2025

Hello. I am the one who filed the original report in Debian.

While we are at it: I've also noticed that the tests fail 100% of the time on the single-CPU instances where I tried to build the package.

Am I right to think, given the package name (multicoretests), that it does not make sense to run the tests on single-CPU systems? (I guess this could be fixed easily in debian/rules).

Or maybe the tests themselves could detect that they are being executed on a single-CPU system and do nothing in that case?
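Something like the following sketch, perhaps, using OCaml 5's Domain.recommended_domain_count (I don't know whether the test suite already does anything of the sort):

    (* Hypothetical guard: skip parallelism-dependent tests on a single-core host. *)
    let () =
      if Domain.recommended_domain_count () > 1
      then print_endline "multiple cores available: running parallel tests"
      else print_endline "single core detected: skipping parallelism-dependent tests"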

(Note that I can build 99.9% of all packages in Debian using machines with 1 CPU, and it's still more cost-efficient, so I see no reason to stop supporting that, either in Debian or elsewhere).

Also: can I expect the tests to work flawlessly (i.e. not fail randomly) on machines with 2 CPUs? (i.e. does multicore just mean more than one?)

Thanks.

sanvila commented Mar 25, 2025

Also are they running on native amd64/s390x hardware or via, e.g., qemu?

The amd64 failure I reported happened on a VM from AWS of type r7a.large, which has 16 GB of RAM and 2 vCPUs. I forgot to say that the failure there happens approximately 75% of the time (i.e. not always, but often enough).

(So, the machine is virtual, but considering that the underlying hardware is probably amd64 as well, I think this would count more as "native" than "emulated").

Thanks.

jmid (Collaborator) commented Mar 25, 2025

Am I right to think, given the package name (multicoretests) that it does not make sense to run the tests on single-CPU systems?

Yes. The test suite has been developed to stress test OCaml5's multicore support. To do so, we have a collection of both positive and negative tests:

  • the positive ones confirm that some modules are safe to use in parallel (or concurrently)
  • the negative ones confirm that other modules are unsafe to use in parallel (or concurrently) by searching for a counterexample

The "Lin Bytes test with Domain" belongs to the latter category, and hence doesn't really make sense to run on a single core, as it will fail to find a counterexample.

Also: Can I expect the tests to work flawlessly (i.e. not randomly) on machines with 2 CPUs? (i.e. does multicore just mean more than 1?)

Running the test suite on 2 cores may be sufficient. The macOS M1 GitHub runner offers 3 cores and runs fine:
https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

(Note that I can build 99.9% of all packages in Debian using machines with 1 CPU, and it's still more cost-efficient, so I see no reason to stop supporting that, either in Debian or elsewhere).

We have published the other 3 opam packages from this repository (qcheck-lin, qcheck-stm, qcheck-multicoretests-util) on OCaml's opam-repository, as they should be of more general interest. However, we have not published the multicoretests.opam test suite package, because it is (a) resource-heavy to run and (b) primarily of interest to compiler devs and others interested in testing the well-being of an OCaml5 installation. I suspect you folks fall in the latter category (just FYI / making sure we are aligned). We are of course interested in getting a clean signal and eliminating flakiness 👍

jmid (Collaborator) commented Mar 25, 2025

Also are they running on native amd64/s390x hardware or via, e.g., qemu?

The amd64 failure I reported happened on a VM from AWS of type r7a.large, which has 16GB of RAM and 2 vCPUs. I forgot to say that the failure there happens approximately 75% of the time (i.e. not always but often enough).

Hm. Potentially the 2 (v)CPUs could be the reason then... Many of the tests run a parent thread (called a "Domain" in OCaml's multicore terminology) and then spawn two child threads (Domains) to wreak havoc by simultaneous parallel manipulation. If the 2 vCPUs mean that the AWS VM may take additional time to reschedule (pausing the parent thread and lending the second vCPU to the second child), this could ruin the chance of triggering sufficient parallelism to find a counterexample, I suppose... 🤔
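Roughly, the scenario has the following shape (a simplified sketch, not the actual harness code, which generates the command sequences with QCheck and checks the observed results for sequential consistency):

    (* Simplified sketch of the parent/two-children Domain pattern described above. *)
    let run_one_round () =
      let buf = Bytes.make 42 ' ' in
      Bytes.fill buf 0 42 'a';                                   (* sequential prefix in the parent *)
      (* two children manipulate the same value in parallel *)
      let child1 = Domain.spawn (fun () -> Bytes.set buf 1 'b'; Bytes.get buf 1) in
      let child2 = Domain.spawn (fun () -> Bytes.set buf 1 'c'; Bytes.get buf 1) in
      let r1 = Domain.join child1 and r2 = Domain.join child2 in
      (* a counterexample is a pair of observations explainable by no sequential interleaving *)
      (r1, r2)

    let () = ignore (run_one_round ())

If the parent and the two children are rarely actually running at the same time, the observations almost always look sequential and the negative test cannot find its counterexample.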

jmid (Collaborator) commented Mar 26, 2025

I had one last thought, which I just wanted to share: the test suite is meant to run its tests one at a time, precisely to avoid individual tests stepping on each other's toes by running in parallel and thereby unknowingly preventing the discovery of counterexamples.

I can see in https://people.debian.org/~sanvila/build-logs/202503/ocaml-multicoretests_0.7-1_amd64-20250317T133603.533Z that you are running the test suite with 2 simultaneous jobs:

	dune runtest -j 2 -p multicoretests,qcheck-stm,qcheck-lin,qcheck-multicoretests-util

This could also explain why it is harder to discover counterexamples, e.g., with only 2 vCPUs.
Here I would recommend changing -j 2 to -j 1.
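Concretely, something like:

	dune runtest -j 1 -p multicoretests,qcheck-stm,qcheck-lin,qcheck-multicoretests-util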

I admit that one has to read between the lines to understand the above recommendation, so we should definitely improve the documentation in this regard...

Finally, I can see your CI logs are lengthy due to the many QCheck messages printed.
This can be limited using the QCHECK_MSG_INTERVAL environment variable (another one that could use better documentation, sorry! 😬):
https://github.com/c-cube/qcheck/blob/a15d5de6b7a37b4130b53ef71ae23642ae2a301f/src/runner/QCheck_base_runner.ml#L64

For example, we use an interval of 60 seconds ourselves, which gives a reasonable amount of output compared to the default, which prints a message every 0.1 seconds(!):

QCHECK_MSG_INTERVAL: '60'
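(Outside of CI, the same can be set in the environment when invoking the test suite, e.g.:)

	QCHECK_MSG_INTERVAL=60 dune runtest -j 1 -p multicoretests,qcheck-stm,qcheck-lin,qcheck-multicoretests-util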

jmid (Collaborator) commented Mar 27, 2025

This just triggered on MSVC bytecode trunk on the merge to main of #543:
https://github.com/ocaml-multicore/multicoretests/actions/runs/14094125484/job/39477704614

random seed: 418922599
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 5000     0.0s Lin Bytes test with Domain
[ ]    0    0    0    0 / 5000     0.0s Lin Bytes test with Domain (generating)
[ ]  442    0    0  442 / 5000    60.1s Lin Bytes test with Domain
[ ]  753    0    0  753 / 5000   141.6s Lin Bytes test with Domain
[ ] 1230    0    0 1230 / 5000   201.6s Lin Bytes test with Domain
[ ] 1901    0    0 1901 / 5000   261.7s Lin Bytes test with Domain
[ ] 2781    0    0 2781 / 5000   321.8s Lin Bytes test with Domain
[ ] 2873    0    0 2873 / 5000   466.2s Lin Bytes test with Domain
[ ] 3756    0    0 3756 / 5000   526.2s Lin Bytes test with Domain
[ ] 4275    0    0 4275 / 5000   591.5s Lin Bytes test with Domain
[ ] 4831    0    0 4831 / 5000   653.5s Lin Bytes test with Domain
[✗] 5000    0    0 5000 / 5000   672.0s Lin Bytes test with Domain

[ ]    0    0    0    0 /  250     0.0s Lin Bytes test with Thread
[✓]  250    0    0  250 /  250     2.3s Lin Bytes test with Thread

[ ]    0    0    0    0 / 1000     0.0s Lin Bytes stress test with Domain
[✓] 1000    0    0 1000 / 1000    15.7s Lin Bytes stress test with Domain

--- Failure --------------------------------------------------------------------

Test Lin Bytes test with Domain failed:

Negative test Lin Bytes test with Domain succeeded but was expected to fail
================================================================================
failure (1 tests failed, 0 tests errored, ran 3 tests)
File "src/bytes/dune", line 12, characters 7-16:
12 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/bytes && .\lin_tests.exe --verbose)
Command exited with code 1.

With this latest occurrence, the failure has now been observed on s390x, Linux amd64, and MSVC bytecode.

jmid added the "test suite reliability" label on Mar 27, 2025
jmid linked a pull request on Mar 28, 2025 that will close this issue
jmid closed this as completed in #546 on Mar 28, 2025