[ocaml5-issue] `dummy found!` in `Lin Dynarray stress test with Domain` on musl trunk #528
Comments
It shouldn’t be, and yet it seems that it is. I guess my next move, if I wanted to track this, would be to modify the test to return a Cc-ing @gasche who might be interested to know. |
It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test? Slightly more information: there is a dynamic check in the Dynarray implementation that should ensure that we fail with an exception instead of ever returning a dummy value. Here this check does not suffice to guarantee this property. Different dynarrays may have different dummy values (but only marshalling creates distinct dummies, and I suppose you don't exercise it?), and it may be the case that an implementation bug causes two different dummies to be mixed up in the same backing array, which would result in the dynamic check being incomplete. For example, the implementation of |
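To make the dynamic check described above concrete, here is a minimal sketch of the idea. It is not the stdlib code: the type `toy`, the `dummy` value, and the functions `create`, `add_last` and `get` are hypothetical names, and resizing of the backing array is left out. Empty slots hold a unique dummy block, and `get` compares the fetched value against it by physical equality before returning.

```ocaml
(* Toy model of a dummy-guarded dynamic array. All names are hypothetical
   and this is NOT the stdlib implementation. Intended for non-float
   elements only. *)
let dummy : Obj.t = Obj.repr (ref 0)  (* a fresh heap block, unique by address *)

type 'a toy = { mutable length : int; mutable arr : Obj.t array }

let create capacity : 'a toy = { length = 0; arr = Array.make capacity dummy }

let add_last (t : 'a toy) (x : 'a) =
  if t.length >= Array.length t.arr then failwith "toy: full (no resizing here)";
  t.arr.(t.length) <- Obj.repr x;
  t.length <- t.length + 1

let get (t : 'a toy) i : 'a =
  if i < 0 || i >= t.length then invalid_arg "toy: index out of bounds";
  let v = t.arr.(i) in
  (* The dynamic check: fail loudly rather than hand a dummy to the caller. *)
  if v == dummy then failwith "toy: dummy found in a live slot"
  else Obj.obj v
```

In this sketch the check only recognizes one particular dummy block. A slot holding a different dummy would pass the physical-equality test, which is the kind of incompleteness mentioned in the comment above.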
I suppose that the operations being exercised are the ones listed in multicoretests/src/dynarray/lin_tests.ml Lines 17 to 32 in 05a5ac9
I see in particular that |
Thanks for the feedback both! 🙏
The test just exercises the operations you list above. Indeed neither I'll see if I can create a reproducer. A few quick local experiments however indicate that this is a rare, hard-to-reproduce one, which also explains why it hasn't shown up before. Further note that it showed up on a |
I am also trying at the moment; working with a modified version of |
We can leave out the "multiple-t"-accepting functions from the Dynarray API, as only the ones tested by Lin were enough to trigger the bug. Perhaps also double check that the distributions (say for int, etc.) on the Lin and STM tests agree, to avoid one generating, e.g., 0-100 and the other ints uniformly. |
I am now trying to reproduce by running https://github.com/OlivierNicole/multicoretests/blob/8bd670363da181d8c81bed0a0dff93e3622f9d95/src/dynarray/stm_tests.ml, but no luck so far. |
Update: I have been running it in a loop for about 3 days now during the day, without being able to reproduce. The trunk commit being exercised is ocaml/ocaml@b5420186c7. Either the bug is extremely unlikely to happen, or there is an error in my test, or the problem is specific to an earlier version of the compiler or to the CI machine. |
Thanks for the update! Some thoughts:
Finally, we can't rule out that this is not due to a Dynarray error but to something else. |
Oh wow - it triggered 5/100 on 5.3 and 6/100 on |
I played a bit today with either increasing the
So it seems we are dealing with a very rare bug, but a bug nonetheless... |
Sorry for the lack of reactivity on this one.
This is very valuable! I took the liberty of pushing to that branch my changes that attempt to make the test fail after printing a triplet of commands. |
I've not had success with the printing myself to understand what combinations could trigger this 🤷 Instead I switched tactics and tried cutting down the generated commands by hand as long as the CI could still trigger the error. On this branch https://github.com/ocaml-multicore/multicoretests/tree/focus-lin-dynarray-musl-experiments-cont
  [ val_ "get_check" get_check (t @-> int @-> returning_or_exc elem);
    val_freq 3 "add_last" Dynarray.add_last (t @-> elem @-> returning_or_exc unit);
    val_ "append_seq" Dynarray.append_seq (t @-> seq_small elem @-> returning_or_exc unit);
    val_ "fit_capacity" Dynarray.fit_capacity (t @-> returning_or_exc unit);
    val_freq 2 "set_capacity" Dynarray.set_capacity (t @-> int @-> returning_or_exc unit);
  ]
and of these I believe the above is a step forward in trying to understand this issue - but not a final diagnosis... Looking at it some weeks ago I realized that the test would keep hammering a |
Very helpful, thanks! I will try to eyeball these functions again. |
I looked at the code again but have not made much progress. Here is my reasoning:
More guesswork, looking at the Array primitives used in the OCaml code of Dynarray
One way to rule out This hypothesis would also partly explain why we observe this under musl and not glibc: the functions in question are written in C in the runtime code, so it is plausible that they would have different behavior depending on the libc.
More guesswork, looking at the libc primitives used in the C code of runtime/array.c
Among these functions as implemented in array.c, my intuition for what is more likely to go wrong would be the following:
After having done this analysis, I am tempted to accuse |
@OlivierNicole I don't know how to run the multicoretests myself, and wouldn't know how to reproduce and experiment with these hypotheses. I am sort of assuming that you would be available to do this (you know the multicoretests enough to play with it, and the runtime enough to do experiments in the directions I suggested). If you would like us to look at this together eventually, or just to teach me how to do this stuff so that I don't try to push work towards you anymore, we can pick a date to synchronize in-person. |
I see that we have been looking at the same things today. I also suspected My next hypothesis is that atomic operations may be implemented incorrectly in musl. I’m reading here and there that it does not support
I’m happy to work more on this, but I’m not sure I will be able to do such experiments this week. |
I’m not claiming expertise (like others could) here, but I can share my understanding about Footnotes
|
I see. Probably not the reason, then. |
By working with @jmid's focusing method I was able to get a number of failures on the CI, with the sequences of operations that led to them! Here is one of the simplest ones (this is a very large code block that may look garbled in your mail reader, sorry):
To explain, the first sequence of commands is performed on a single domain; then, two domains are spawned to perform the left and right sequences on the diagram, respectively. Please ignore the The test is coded in such a way that I observed that this happens usually when |
(Have you tried rewriting the |
I have a job that runs the tests with a patched compiler that doesn’t use memcpy in Thinking about it more, I am now convinced that you were right and this use of memcpy is wrong. Previously I said that it was fine because it’s an initializing write, but I failed to consider the reading side (from the source array). If there are concurrent writes to the source array, |
I ran a version of the tests on @OlivierNicole’s branch asking for backtraces on segfaults. I see that on a
https://github.com/shym/multicoretests/actions/runs/14361893607/job/40265268082#step:16:2287 |
I cannot reproduce the original bug of this issue with the patched compiler, nor can I reproduce the failures of my modified STM test. I’ve restarted the job for extra confidence. @shym Thanks! The minor GC segfaulting is consistent with the presence of garbage values. The invalid address that causes the segfault does not look like it was created by collating the 4 low bytes from a pointer and the 4 high bytes from a small integer, as I would expect. Maybe it can be explained by a more complicated sequence of commands. |
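For illustration, the kind of torn word expected here can be computed by hand. The sketch below assumes a 64-bit machine and a copy that is interrupted exactly halfway through a word. The function `tear` and the sample values are hypothetical and are not taken from the CI logs.

```ocaml
(* Build the word a reader would see if its low 4 bytes came from
   [low_from] and its high 4 bytes from [high_from]. Hypothetical helper. *)
let low_mask = 0xFFFF_FFFFL
let high_mask = Int64.lognot low_mask

let tear ~low_from ~high_from =
  Int64.logor (Int64.logand low_from low_mask) (Int64.logand high_from high_mask)

(* OCaml represents the integer n as the machine word 2n+1. *)
let tagged_int n = Int64.add (Int64.mul 2L (Int64.of_int n)) 1L

let () =
  let pointer = 0x0000_7F3A_1234_5678L in  (* made-up heap address *)
  let small = tagged_int 3 in              (* machine word 0x7 *)
  Printf.printf "0x%Lx\n" (tear ~low_from:pointer ~high_from:small)
```

With these made-up inputs the torn word is `0x12345678`: its low bit is clear, so the runtime would classify it as a block pointer rather than an immediate integer, even though no such block exists.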
Surely the |
Oh! My eyes! It seems that there is indeed something terribly wrong! I just compiled let a = Array.sub [| 1; 2; 3; 4 |] 0 3 with OCaml 5.3.0 with musl and asked
IIUC In my
|
This is consistent with the fact that doing an explicit word-per-word copy removes the bug. |
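As a rough OCaml-level picture of what a word-per-word copy means (the change being tested is in the C runtime, which this does not reproduce), the hypothetical `sub_by_loop` below copies a slice one element at a time. Each `src.(i)` read is a single word-sized load, so a racing writer can make the copy see an old or a new element, but not a mix of bytes from both.

```ocaml
(* Hypothetical stand-in for Array.sub on int arrays: copies element by
   element instead of relying on a bulk memory copy. *)
let sub_by_loop (src : int array) pos len =
  if pos < 0 || len < 0 || pos > Array.length src - len then
    invalid_arg "sub_by_loop";
  let dst = Array.make len 0 in
  for i = 0 to len - 1 do
    (* one word-sized read and one word-sized write per element *)
    dst.(i) <- src.(pos + i)
  done;
  dst
```

For example, `sub_by_loop [| 1; 2; 3; 4 |] 0 3` yields `[| 1; 2; 3 |]`, the same result as the `Array.sub` call discussed earlier in the thread.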
A side-remark: if this is indeed the source of the bug, and it looks more and more likely, we should be able to observe it with tests on array -- we only need Another approach would be to have an element type that contains both pointers and immediates (as we have with integer dynarrays, thanks to the dummy), for example This might be easier to reproduce, and maybe more informative to keep around as a regression test in the multicoretests codebase. |
Another test result, to check where guilt lies: |
@dustanddreams: if you have time to answer that, would you have a suggestion to make sure that |
Note that this has likely been debated in the past during memory model discussions, since other places have been updated to use loops. See for example ocaml-multicore/ocaml-multicore#822 |
OK, thank you for the reference! |
Excellent digging everyone! 👏 🎉 I had the same thought as @gasche: We should be able to observe this issue directly on |
OK, my initial tests confirm the bug hypothesis about I've now played with adjusting the existing array test
Still I would locally only observe examples of sequential inconsistency! 😬 I therefore changed the test to relax the post-condition from "model agreement" to "sub and get returns powers of 2". Here's a counterexample:
This has a long "sequential prefix" which can be ignored and then only very few calls in parallel:
I've also kicked off a run targeting @OlivierNicole's fix branch and as suspected it is not triggering any counterexamples: Well done everyone! 😃 |
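The relaxed "powers of 2" post-condition can be sketched as a small standalone stress test, independent of the QCheck-based test suite. This is only an illustration under assumptions: the names `is_power_of_two` and `count_torn`, the array size, and the iteration count are made up here, it needs OCaml 5 on a 64-bit machine, and whether it ever reports a torn value depends on the compiler version, the libc, and luck.

```ocaml
(* A value is legitimate if it is a power of two; any other value read
   back must have been assembled from pieces of two different writes. *)
let is_power_of_two x = x > 0 && x land (x - 1) = 0

(* One domain keeps overwriting the array with powers of two while the
   current domain repeatedly takes Array.sub copies and inspects them.
   Returns the number of non-power-of-two elements observed. *)
let count_torn ~iterations =
  let a = Array.make 8 1 in
  let stop = Atomic.make false in
  let writer =
    Domain.spawn (fun () ->
        let k = ref 0 in
        while not (Atomic.get stop) do
          for i = 0 to Array.length a - 1 do
            a.(i) <- 1 lsl !k
          done;
          k := (!k + 1) mod 62  (* stay within positive 63-bit ints *)
        done)
  in
  let torn = ref 0 in
  for _i = 1 to iterations do
    let copy = Array.sub a 0 (Array.length a) in
    Array.iter (fun x -> if not (is_power_of_two x) then incr torn) copy
  done;
  Atomic.set stop true;
  Domain.join writer;
  !torn

let () =
  Printf.printf "torn values observed: %d\n" (count_torn ~iterations:100_000)
```

A count of zero proves nothing by itself, since the race window is small; a non-zero count would indicate that `Array.sub` returned a value that was never written.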
I'm not sure I have enough context from the discussion, and my answer below may be completely missing the point. [To answer your exact question, though, you can always pass What this really means is that If, because of this, byte-by-byte copies can cause OCaml object headers to be incorrectly copied into nonsensical values, then you need to introduce your own |
Do I understand correctly that the Part of me wonders if we shouldn’t have a |
Not quite so simple. See https://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf "3.7.7.1 Memcpy considerations".
See also table 3-3, for sizes <128 AVX memcpy can be faster, above that To achieve that throughput internally I think it'll use larger copies in microcode, so I think you can't rely on If you need precise semantics on the size of memory reads then I think you have to avoid Just be careful to benchmark on both AMD and Intel CPUs, on AMD depending on your glibc version memcpy can hit some very significant slowdowns, see https://issues.redhat.com/browse/RHEL-25530 for details. |
Yes, these should be fine as the runtime should never try to interpret such values as pointers.
I have introduced a |
This has been fixed in ocaml/ocaml#13950 included in the upcoming 5.4 release and a tearing test has been added in #551 🎉 |
On #521 the musl trunk workflow triggered a `dummy found!` error exiting the `Lin Dynarray` stress test: https://github.com/ocaml-multicore/multicoretests/actions/runs/12846368030/job/35821476446?pr=521
This is caused by these lines implementing a `get` instrumented with a tag check: multicoretests/src/dynarray/lin_tests.ml, lines 11 to 13 in a166b24
IIUC, this means the test is observing a non-int dummy value from the unboxed implementation from ocaml/ocaml#12885.
A code comment in stdlib/dynarray.ml reads:
At the same time, the module is documented to be unsafe for parallel usage:
Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔
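For readers without the test source at hand, the idea behind such a tag check can be sketched as follows. This is a reconstruction under assumptions, not the code from lin_tests.ml: the helper names are hypothetical, and the getter is passed as a parameter so that the sketch works equally with `Array.get` or `Dynarray.get`.

```ocaml
(* An [int] is always an immediate value at runtime. If a supposedly-int
   value turns out to be a heap block, something other than an int (for
   instance a dummy placeholder) has leaked out of the container. *)
let check_is_int (x : int) : int =
  if Obj.is_int (Obj.repr x) then x else failwith "dummy found!"

(* Wrap any getter with the check; [get] could be Array.get or Dynarray.get. *)
let get_check get container i = check_is_int (get container i)

let () =
  let a = [| 1; 2; 3 |] in
  assert (get_check Array.get a 1 = 2)
```

The check costs one tag test per read and never fires for a well-typed `int`, which is why a failure points at a type-safety violation rather than at a mere wrong answer.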