
Test "Lin Bytes test with Domain" is flaky #541


Closed
glondu opened this issue Mar 25, 2025 · 9 comments · Fixed by #546
Labels
test suite reliability: Issue concerns tests that should behave more predictably

Comments

glondu commented Mar 25, 2025

The test "Lin Bytes test with Domain" does not always behave the same, causing random failures in Debian:

The failure seldom happens, and I couldn't reproduce it myself (even using the same random seed).

glondu (Author) commented Mar 25, 2025

Reported in Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1101243

jmid (Collaborator) commented Mar 25, 2025

Thanks for sharing, acknowledged!
If I am reading the logs correctly, you've now observed this on both amd64 and s390x, right?

Are these run in some form of containerized environment?
Also are they running on native amd64/s390x hardware or via, e.g., qemu?

"Lin Bytes test with Domain" has been a pain point also in #520. The test was furthermore extended in #521.
As more bindings are targeted, the chance of hitting a bad combination decreases. It could be bettered by introducing weights. Alternatively, most other tests have been rewritten using STM. An alternative path could thus be to port the #521 extensions to the existing src/bytes/stm_tests.ml test.
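For context, a Lin specification for Bytes is essentially a list of typed bindings from which random command sequences are generated, along the lines of the following sketch (written against qcheck-lin's combinator API; the actual bindings, sizes, and counts in src/bytes/lin_tests.ml may differ):

    (* Minimal sketch of a qcheck-lin specification for Bytes; illustrative only. *)
    module BytesSpec = struct
      type t = Bytes.t                  (* the shared value under test *)
      let init () = Bytes.make 42 ' '   (* fresh state for each test instance *)
      let cleanup _ = ()

      open Lin
      let int = nat_small               (* small indices make overlapping accesses likely *)
      (* Each val_ entry declares one binding the command generator may pick. *)
      let api = [
        val_ "Bytes.length" Bytes.length (t @-> returning int);
        val_ "Bytes.get"    Bytes.get    (t @-> int @-> returning_or_exc char);
        val_ "Bytes.set"    Bytes.set    (t @-> int @-> char @-> returning_or_exc unit);
        val_ "Bytes.fill"   Bytes.fill   (t @-> int @-> int @-> char @-> returning_or_exc unit);
      ]
    end

    (* Instantiate the Domain-based test harness over the specification. *)
    module BytesDomain = Lin_domain.Make (BytesSpec)

The more entries in the api list, the less likely any particular pair of conflicting operations is generated in a short parallel scenario, which is why weights could help here.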

glondu (Author) commented Mar 25, 2025

If I am reading the logs correctly, you've now observed this on both amd64 and s390x, right?

Right.

Are these run in some form of containerized environment?

Yes. I don't know the exact setup for the amd64 occurrence, but the s390x one surely ultimately boils down to either chroot or unshare.

Also are they running on native amd64/s390x hardware or via, e.g., qemu?

I don't know the exact setup for the amd64 occurrence, but the s390x one occurred on this machine: https://db.debian.org/machines.cgi?host=zandonai (I would say "native", if that means anything for s390x).

sanvila commented Mar 25, 2025

Hello. I am the one who filed the original report in Debian.

While we are at it: I've also noticed that the tests fail 100% of the time on the single-CPU instances where I tried to build the package.

Am I right to think, given the package name (multicoretests), that it does not make sense to run the tests on single-CPU systems? (I guess this could be fixed easily in debian/rules).

Or maybe the tests themselves could detect that they are being executed on a single-CPU system and do nothing in that case?
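Something like the following sketch, perhaps, using OCaml 5's Domain.recommended_domain_count (I don't know whether the test suite already does anything of the sort):

    (* Hypothetical guard: skip parallelism-dependent tests on a single-core host. *)
    let () =
      if Domain.recommended_domain_count () > 1
      then print_endline "multiple cores available: running parallel tests"
      else print_endline "single core detected: skipping parallelism-dependent tests"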

(Note that I can build 99.9% of all packages in Debian using machines with 1 CPU, and it's still more cost-efficient, so I see no reason to stop supporting that, either in Debian or elsewhere).

Also: can I expect the tests to work flawlessly (i.e. not fail randomly) on machines with 2 CPUs? (i.e. does multicore just mean more than one?)

Thanks.

sanvila commented Mar 25, 2025

Also are they running on native amd64/s390x hardware or via, e.g., qemu?

The amd64 failure I reported happened on a VM from AWS of type r7a.large, which has 16 GB of RAM and 2 vCPUs. I forgot to say that the failure there happens approximately 75% of the time (i.e. not always, but often enough).

(So, the machine is virtual, but considering that the underlying hardware is probably amd64 as well, I think this would count more as "native" than "emulated").

Thanks.

jmid (Collaborator) commented Mar 25, 2025

Am I right to think, given the package name (multicoretests) that it does not make sense to run the tests on single-CPU systems?

Yes. The test suite has been developed to stress test OCaml5's multicore support. To do so, we have a collection of both positive and negative tests:

  • the positive ones confirm that some modules are safe to use in parallel (or concurrently)
  • the negative ones confirm that other modules are unsafe to use in parallel (or concurrently) by searching for a counterexample

The "Lin Bytes test with Domain" belongs to the latter category, and hence doesn't really make sense to run on a single core, as it will fail to find a counterexample.

Also: Can I expect the tests to work flawlessly (i.e. not randomly) on machines with 2 CPUs? (i.e. does multicore just mean more than 1?)

Running the test suite on 2 cores may be sufficient. The macOS M1 GitHub runner offers 3 cores and runs fine:
https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

(Note that I can build 99.9% of all packages in Debian using machines with 1 CPU, and it's still more cost-efficient, so I see no reason to stop supporting that, either in Debian or elsewhere).

We have published the other 3 opam packages from this repository (qcheck-lin, qcheck-stm, qcheck-multicoretests-util) on OCaml's opam-repository, as they should be of more general interest. However, we have not published the multicoretests.opam test suite package, because it is (a) resource-heavy to run and (b) primarily of interest to compiler devs and others interested in testing the well-being of an OCaml5 installation. I suspect you folks fall in the latter category (just FYI / making sure we are aligned). We are of course interested in getting a clean signal and eliminating flakiness 👍

jmid (Collaborator) commented Mar 25, 2025

Also are they running on native amd64/s390x hardware or via, e.g., qemu?

The amd64 failure I reported happened on a VM from AWS of type r7a.large, which has 16GB of RAM and 2 vCPUs. I forgot to say that the failure there happens approximately 75% of the time (i.e. not always but often enough).

Hm. Potentially the 2 (v)CPUs could be the reason then... Many of the tests run a parent thread (called a "Domain" in OCaml's multicore terminology) and then spawn two child threads (Domains) to wreak havoc by simultaneous parallel manipulation. If the 2 vCPUs mean that the AWS VM may take additional time to reschedule (pausing the parent thread and lending the second vCPU to the second child), this could ruin the chance of triggering sufficient parallelism to find a counterexample, I suppose... 🤔
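Roughly, the scenario has the following shape (a simplified sketch, not the actual harness code, which generates the command sequences with QCheck and checks the observed results for sequential consistency):

    (* Simplified sketch of the parent/two-children Domain pattern described above. *)
    let run_one_round () =
      let buf = Bytes.make 42 ' ' in
      Bytes.fill buf 0 42 'a';                                   (* sequential prefix in the parent *)
      (* two children manipulate the same value in parallel *)
      let child1 = Domain.spawn (fun () -> Bytes.set buf 1 'b'; Bytes.get buf 1) in
      let child2 = Domain.spawn (fun () -> Bytes.set buf 1 'c'; Bytes.get buf 1) in
      let r1 = Domain.join child1 and r2 = Domain.join child2 in
      (* a counterexample is a pair of observations explainable by no sequential interleaving *)
      (r1, r2)

    let () = ignore (run_one_round ())

If the parent and the two children are rarely actually running at the same time, the observations almost always look sequential and the negative test cannot find its counterexample.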

jmid (Collaborator) commented Mar 26, 2025

I had one last thought, which I just wanted to share: the test suite is meant to run its tests one at a time, precisely to avoid individual tests stepping on each other's toes by running in parallel and thereby unknowingly preventing the discovery of counterexamples.

I can see in https://people.debian.org/~sanvila/build-logs/202503/ocaml-multicoretests_0.7-1_amd64-20250317T133603.533Z that you are running the test suite with 2 simultaneous jobs:

	dune runtest -j 2 -p multicoretests,qcheck-stm,qcheck-lin,qcheck-multicoretests-util

This could also explain why it is harder to discover counterexamples, e.g., with only 2 vCPUs.
Here I would recommend changing -j 2 to -j 1.
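Concretely, something like:

	dune runtest -j 1 -p multicoretests,qcheck-stm,qcheck-lin,qcheck-multicoretests-util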

I admit that one has to read between the lines to understand the above recommendation, so we should definitely improve the documentation in this regard...

Finally, I can see your CI logs are lengthy due to the many QCheck messages printed.
This can be limited using the QCHECK_MSG_INTERVAL environment variable (another one that could use better documentation, sorry! 😬):
https://github.com/c-cube/qcheck/blob/a15d5de6b7a37b4130b53ef71ae23642ae2a301f/src/runner/QCheck_base_runner.ml#L64

For example, we use an interval of 60 seconds ourselves, which gives a reasonable amount of output compared to the default, which prints a message every 0.1 seconds(!):

QCHECK_MSG_INTERVAL: '60'
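(Outside of CI, the same can be set in the environment when invoking the test suite, e.g.:)

	QCHECK_MSG_INTERVAL=60 dune runtest -j 1 -p multicoretests,qcheck-stm,qcheck-lin,qcheck-multicoretests-util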

jmid (Collaborator) commented Mar 27, 2025

This just triggered on MSVC bytecode trunk on the merge to main of #543:
https://github.com/ocaml-multicore/multicoretests/actions/runs/14094125484/job/39477704614

random seed: 418922599
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 5000     0.0s Lin Bytes test with Domain
[ ]    0    0    0    0 / 5000     0.0s Lin Bytes test with Domain (generating)
[ ]  442    0    0  442 / 5000    60.1s Lin Bytes test with Domain
[ ]  753    0    0  753 / 5000   141.6s Lin Bytes test with Domain
[ ] 1230    0    0 1230 / 5000   201.6s Lin Bytes test with Domain
[ ] 1901    0    0 1901 / 5000   261.7s Lin Bytes test with Domain
[ ] 2781    0    0 2781 / 5000   321.8s Lin Bytes test with Domain
[ ] 2873    0    0 2873 / 5000   466.2s Lin Bytes test with Domain
[ ] 3756    0    0 3756 / 5000   526.2s Lin Bytes test with Domain
[ ] 4275    0    0 4275 / 5000   591.5s Lin Bytes test with Domain
[ ] 4831    0    0 4831 / 5000   653.5s Lin Bytes test with Domain
[✗] 5000    0    0 5000 / 5000   672.0s Lin Bytes test with Domain

[ ]    0    0    0    0 /  250     0.0s Lin Bytes test with Thread
[✓]  250    0    0  250 /  250     2.3s Lin Bytes test with Thread

[ ]    0    0    0    0 / 1000     0.0s Lin Bytes stress test with Domain
[✓] 1000    0    0 1000 / 1000    15.7s Lin Bytes stress test with Domain

--- Failure --------------------------------------------------------------------

Test Lin Bytes test with Domain failed:

Negative test Lin Bytes test with Domain succeeded but was expected to fail
================================================================================
failure (1 tests failed, 0 tests errored, ran 3 tests)
File "src/bytes/dune", line 12, characters 7-16:
12 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/bytes && .\lin_tests.exe --verbose)
Command exited with code 1.

With this latest occurrence, the failure has now been observed on s390x, Linux amd64, and MSVC bytecode.

jmid added the "test suite reliability" label on Mar 27, 2025
jmid linked a pull request on Mar 28, 2025 that will close this issue
jmid closed this as completed in #546 on Mar 28, 2025