[ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk #528


Closed
jmid opened this issue Jan 19, 2025 · 39 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime


@jmid
Collaborator

jmid commented Jan 19, 2025

On #521 the musl trunk workflow triggered a "dummy found!" error, exiting the Lin Dynarray stress test:
https://github.com/ocaml-multicore/multicoretests/actions/runs/12846368030/job/35821476446?pr=521

random seed: 225104823
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray test with Domain
[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray test with Domain (generating)
[✓]    2    0    1    1 / 1000     2.1s Lin Dynarray test with Domain

dummy found!
File "_build/.dune/default/src/dynarray/dune", line 3, characters 7-16:
3 |  (name lin_tests)
           ^^^^^^^^^
(cd _build/default/src/dynarray && ./lin_tests.exe --verbose)
Command exited with code 1.
[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray stress test with Domain

This is caused by these lines implementing a get instrumented with a tag check:

let get_check a i =
  let v = Dynarray.get a i in
  if not (Obj.is_int (Obj.repr v))
  then (Printf.eprintf "dummy found!\n%!"; exit 1)
  else v

IIUC, this means the test is observing a non-int dummy value from the unboxed implementation introduced in ocaml/ocaml#12885.
A code comment in stdlib/dynarray.ml reads:

   [...] This dummy is *not* type-safe, it is not a valid value of type ['a], so
   we must be very careful never to return it to the user. After
   accessing any element of the array, we must check that it is not
   the dummy. In particular, this dummy must be distinct from any
   other value the user could provide -- we ensure this by using
   a dynamically-allocated mutable reference as our dummy.

At the same time, the module is documented to be unsafe for parallel usage:

Unsynchronized accesses

    Unsynchronized accesses to dynamic arrays are a programming error.

   Concurrent accesses to dynamic arrays must be synchronized
   (for instance with a {!Mutex.t}). Unsynchronized accesses to
   a dynamic array are a programming error that may lead to an invalid
   dynamic array state, on which some operations would fail with an
   [Invalid_argument] exception.

Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔

@OlivierNicole
Contributor

Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔

It shouldn’t be, and yet it seems that it is.

I guess my next move, if I wanted to track this down, would be to modify the test to return a result or something like that, so that qcheck-stm will show me the sequence of commands that led to the error rather than merely exiting, and to try to work it out from there.
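For instance, a hypothetical variant of the instrumented get could raise an exception instead of calling exit; this is only a sketch of the idea, not the actual test code:

```ocaml
(* Hypothetical variant of the instrumented get: raise instead of exiting,
   so the harness can report the failing command sequence as an Error
   result rather than aborting the whole process. *)
let get_check (a : int Dynarray.t) (i : int) : int =
  let v = Dynarray.get a i in
  if not (Obj.is_int (Obj.repr v)) then raise Stdlib.Exit else v
```

Since the exception flows through the usual exception-catching combinators, the generated command triple that provoked it can then be shrunk and printed.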

Cc-ing @gasche who might be interested to know.

@gasche

gasche commented Jan 20, 2025

It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test?

Slightly more information: there is a dynamic check in the Dynarray implementation that should ensure that we fail with an exception instead of ever returning a dummy value. Here this check does not suffice to guarantee this property. Different dynarrays may have different dummy values (but only marshalling creates distinct dummies, and I suppose you don't exercise it?), and it may be that an implementation bug causes two different dummies to be mixed up in the same backing array, which would make the dynamic check incomplete. For example, the implementation of blit looks suspiciously complex and could hide such a bug.
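As a simplified sketch of the dummy scheme being discussed (the actual stdlib code differs): a dynamically allocated reference serves as a sentinel, recognized by physical equality, and every read must reject it before handing the value back.

```ocaml
(* Simplified sketch of the dummy scheme; the real stdlib code differs.
   A freshly allocated ref is unforgeable, so physical equality
   identifies the dummy reliably. *)
let dummy : Obj.t = Obj.repr (ref 0)
let is_dummy (v : Obj.t) = v == dummy

(* After reading a slot, reject the dummy before returning it. *)
let checked_get (backing : Obj.t array) (i : int) : int =
  let v = backing.(i) in
  if is_dummy v then invalid_arg "Dynarray.get: empty slot" else Obj.obj v
```

The incompleteness described above corresponds to a slot holding a dummy from a *different* dynarray: is_dummy would return false and the bogus value would escape.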

@gasche

gasche commented Jan 20, 2025

I suppose that the operations being exercised are the ones listed in

[ val_ "get_check" get_check (t @-> int @-> returning_or_exc elem);
val_ "set" Dynarray.set (t @-> int @-> elem @-> returning_or_exc unit);
val_ "length" Dynarray.length (t @-> returning int);
val_freq 3 "add_last" Dynarray.add_last (t @-> elem @-> returning_or_exc unit);
val_ "append_seq" Dynarray.append_seq (t @-> seq_small elem @-> returning_or_exc unit);
val_ "get_last" Dynarray.get_last (t @-> returning_or_exc elem);
val_ "pop_last" Dynarray.pop_last (t @-> returning_or_exc elem);
val_freq 2 "remove_last" Dynarray.remove_last (t @-> returning_or_exc unit);
val_ "clear" Dynarray.clear (t @-> returning_or_exc unit);
val_ "truncate" Dynarray.truncate (t @-> int @-> returning_or_exc unit);
val_ "ensure_capacity" Dynarray.ensure_capacity (t @-> int @-> returning_or_exc unit);
val_ "fit_capacity" Dynarray.fit_capacity (t @-> returning_or_exc unit);
(*val_ "blit" Dynarray.blit (t @-> int_not_too_big @-> t @-> int_not_too_big @-> int_not_too_big @-> returning_or_exc unit);*)
val_freq 2 "set_capacity" Dynarray.set_capacity (t @-> int @-> returning_or_exc unit);
val_ "reset" Dynarray.reset (t @-> returning_or_exc unit);
]

I see in particular that blit is commented out, so it should not be the one causing this issue. (It's not obvious whether there is marshalling involved in the machinery around it; could a dynarray used in the test be produced by marshalling then unmarshalling?)

@jmid
Collaborator Author

jmid commented Jan 20, 2025

Thanks for the feedback both! 🙏

It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test?

The test just exercises the operations you list above. Indeed neither blit nor marshalling is involved.
Since the Lin test exits with the above message when encountering a dummy value, we regrettably don't know which sequence of operations caused it... 😬
As Olivier suggests, adjusting the STM test to express "unexpected value", e.g., as a result type, should allow us to print and understand what caused it.

I'll see if I can create a reproducer. A few quick local experiments, however, indicate that this is a rare, hard-to-reproduce one, which also explains why it hasn't shown up before. Note further that it showed up on a musl workflow, which tends to exhibit sufficiently different behaviour from glibc to surface things like these...

@jmid jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Jan 20, 2025
@jmid jmid changed the title dummy found! in Lin Dynarray stress test with Domain on musl trunk [ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk Jan 20, 2025
@OlivierNicole
Contributor

I am also trying at the moment, working with a modified version of src/dynarray/stm_tests that ignores every command result except when get returns a non-int value (on an int Dynarray.t). So far, working on a musl switch, I have been unable to reproduce it in many thousands of runs.

@jmid
Collaborator Author

jmid commented Jan 27, 2025

We can leave out the "multiple-t"-accepting functions from the Dynarray API, as the ones tested by Lin alone were enough to trigger the bug. Perhaps also double-check that the distributions (say for int, etc.) of the Lin and STM tests agree, to avoid one generating, e.g., ints in 0-100 and the other ints uniformly.

@OlivierNicole
Contributor

@OlivierNicole
Contributor

Update: I have been running it in a loop during the day for about 3 days now, without being able to reproduce it. The trunk commit being exercised is ocaml/ocaml@b5420186c7.

Either the bug is extremely unlikely to happen, or there is an error in my test, or the problem is specific to an earlier version of the compiler or to the CI machine.

@jmid
Collaborator Author

jmid commented Jan 31, 2025

Thanks for the update!

Some thoughts:

  • One option is to check the versions of gcc, musl, ... used in the problematic run (it could potentially be version-specific)
  • If the problem is not reproducible locally, perhaps we should go back to CI - and even to the Lin test?
  • Previously, we've used "focused tests" to help with rerunning a single test (mostly in CI). Here's an example I just kicked off: https://github.com/ocaml-multicore/multicoretests/tree/focus-lin-dynarray-musl
  • The seed may be important to trigger it - or it may not 🤷
  • As this may be GC-specific, you may have luck reproducing it more often by playing with GC parameters. I've used OCAMLRUNPARAM's s and o in the past.
  • If this is timing-specific: I've previously had luck inserting a strategic sleep to trigger something more often, after having eyeballed the relevant code and built a suspicion
  • Perhaps this triggers more often with more spawned domains? A custom Lin stress property with 3 (or 4) domains is a possibility

Finally, we can't rule out that this is due not to a Dynarray error but to something else.
This remark on another issue indicates that there may still be a rare GC bug lurking.

@jmid
Collaborator Author

jmid commented Jan 31, 2025

Oh wow - it triggered 5/100 on 5.3 and 6/100 on trunk apparently! 😮

@jmid
Collaborator Author

jmid commented Feb 19, 2025

I played a bit today with either increasing the count (PBT repetitions) or the number of iterations in the outer loop.
On one of these runs, just exercising Lin Dynarray stress test with Domain (count 500, repeated 250 times) on 5.3.0 with musl, one of the 250 repetitions crashed:
https://github.com/ocaml-multicore/multicoretests/actions/runs/13413928836/job/37470057862

Starting 44-th run

random seed: 349702188
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Lin Dynarray stress test with Domain
/usr/bin/bash: line 1: 2197433 Segmentation fault      (core dumped) ./focusedtest.exe -v
[ ]    0    0    0    0 /  500     0.0s Lin Dynarray stress test with Domain (generating)

So it seems we are dealing with a very rare bug, but a bug nonetheless...

@OlivierNicole
Contributor

Sorry for the lack of reactivity on this one.

Oh wow - it triggered 5/100 on 5.3 and 6/100 on trunk apparently! 😮

This is very valuable! I took the liberty of pushing to that branch my changes that attempt to make the test fail after printing a triplet of commands.

@jmid
Collaborator Author

jmid commented Apr 8, 2025

I've not had success with the printing myself to understand what combinations could trigger this 🤷

Instead I switched tactics and tried cutting down the generated commands by hand as long as the CI could still trigger the error. On this branch https://github.com/ocaml-multicore/multicoretests/tree/focus-lin-dynarray-musl-experiments-cont
I came down to 5 (the CI logs are still available):

    [ val_ "get_check" get_check (t @-> int @-> returning_or_exc elem);
      val_freq 3 "add_last" Dynarray.add_last (t @-> elem @-> returning_or_exc unit);
      val_ "append_seq" Dynarray.append_seq (t @-> seq_small elem @-> returning_or_exc unit);
      val_ "fit_capacity" Dynarray.fit_capacity (t @-> returning_or_exc unit);
      val_freq 2 "set_capacity" Dynarray.set_capacity (t @-> int @-> returning_or_exc unit);
]

and of these, append_seq just calls add_last repeatedly, so we are down to a manageable set: something "get-like", something "insert-like", and the capacity-changing operations.

I believe the above is a step forward in trying to understand this issue - but not a final diagnosis...
You are very welcome to pick this up from here! 🙏

Looking at it some weeks ago, I realized that the test would keep hammering a Dynarray that has already thrown an exception after noticing parallel changes to it (because of the exception handling in returning_or_exc). As such it seemed more like a "doesn't always leave the internal state consistent when failing" than an "OMG, this may crash when used in parallel" kind of error... 🙂

@gasche

gasche commented Apr 8, 2025

Very helpful, thanks! I will try to eyeball these functions again.

@gasche

gasche commented Apr 8, 2025

I looked at the code again but have not made much progress. Here is my reasoning:

  1. A priori the error suggests that two different dummy values get mixed in the same backing array, or at least that the Dummy.is_dummy check inside Dynarray.get fails to recognize that it is getting a dummy value.

  2. I don't see a way in the Dynarray code to start mixing two different dummies using only those operations. In fact, given that the dummies used in practice (whenever serialization is not used) are a single global value, even mixing parts of different dynarrays should not result in distinct dummies.

  3. One hypothesis that may be worth considering is that we are not seeing a dummy value, but uninitialized memory within the array. The Dynarray implementation uses Array.sub (in Dummy.Array.prefix) and Array.make (in Dummy.Array.extend) a lot. In concurrent usage, could it be the case that other domains observe these arrays before all array elements have been initialized by the runtime? We also use Array.blit in Dummy.Array.extend (on resize, we create a larger array and then blit the values from the smaller array), and this might also result in invalid values in weirder scenarios (maybe some parts of the blit implementation do byte-level writes instead of full-word writes, and we observe a race where the first few bytes of a value come from a dummy pointer, and the trailing bytes come from an integer, resulting in reading a distinct pointer).

More guesswork, looking at the Array primitives used in the OCaml code of Dynarray

One way to rule out Array.blit and Array.sub (or to give support to this theory) would be to rewrite their use in Dummy.Array into pure OCaml versions of the function. If we can reproduce the issue, then probably this was not the cause. If we cannot reproduce the issue, maybe we just disturbed the code enough that we cannot reproduce anyway, but maybe this was the root cause. Of course we cannot reimplement Array.make in pure OCaml, so we cannot use the approach for that function.
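A pure-OCaml stand-in for Array.sub along those lines might look like the following; this is only a sketch, sub_pure is a made-up name, and Array.make of course remains a C primitive:

```ocaml
(* Hypothetical pure-OCaml replacement for Array.sub, taking the C-level
   memcpy out of the picture: only word-sized OCaml reads and writes
   remain, which cannot tear on a 64-bit platform. *)
let sub_pure (a : 'a array) (pos : int) (len : int) : 'a array =
  if pos < 0 || len < 0 || pos > Array.length a - len then
    invalid_arg "sub_pure"
  else if len = 0 then [||]
  else begin
    let r = Array.make len a.(pos) in
    for i = 1 to len - 1 do
      r.(i) <- a.(pos + i)
    done;
    r
  end
```

If the failure disappears with such a rewrite in Dummy.Array, that points the finger at the C primitive rather than the Dynarray logic.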

This hypothesis would also partly explain why we observe this under musl and not glibc: the functions in question are written in C in the runtime code, so it is plausible that they would have different behavior depending on the libc.

More guesswork, looking at the libc primitives used in the C code of runtime/array.c

Among these functions as implemented in array.c, my intuition for what is more likely to go wrong would be the following:

  • caml_uniform_array_make does not use any weird C functions, so it is less likely to be the culprit
  • caml_uniform_array_gather, which is called by caml_array_sub, calls memcpy, which is suspicious; I would consider rewriting this function to not use the primitive, by using a plain C loop with caml_initialize instead (as is done a few lines below in another case).
  • caml_uniform_array_blit uses wo_memmove, which sounds super-suspicious but is in fact implemented to not use memmove when there are several domains, and uses a plain C loop with a proper memory barrier and atomic writes

After having done this analysis, I am tempted to accuse memcpy in Array.sub of being very suspicious here. If the problem goes away when that memcpy is replaced by a caml_initialize loop, then the next step is to ask memory-model experts (maybe @OlivierNicole himself) if there could be a synchronization issue here (in our own C code, or a bug in musl), and how to add more synchronization to make it go away.

@gasche

gasche commented Apr 8, 2025

@OlivierNicole I don't know how to run the multicoretests myself, and wouldn't know how to reproduce and experiment with these hypotheses. I am sort of assuming that you would be available to do this (you know multicoretests well enough to play with it, and the runtime well enough to do experiments in the directions I suggested). If you would like us to look at this together eventually, or just to teach me how to do this stuff so that I stop pushing work towards you, we can pick a date to synchronize in person.

@OlivierNicole
Contributor

OlivierNicole commented Apr 8, 2025

I see that we have been looking at the same things today. I also suspected memcpy in Array.sub. That being said, I didn’t find a plausible hypothesis for why memcpy would cause such issues. It’s used to perform initializing writes, and these don’t require any particular precautions, since the destination is freshly allocated and not yet visible to other domains.

My next hypothesis is that atomic operations may be implemented incorrectly in musl. I’m reading here and there that it does not support <stdatomic.h> correctly, which of course would be a deal-breaker for running parallel OCaml code correctly. I’m being told (by @fabbing and @Firobe) that @shym is the expert to ask on this topic.

@OlivierNicole I don't how to run the multicoretests myself, and wouldn't know how to reproduce and experiment with these hypotheses. I am sort of assuming that you would be available to do this (you know the multicoretests enough to play with it, and the runtime enough to do experiments in the directions I suggested). If you would like us to look at this together eventually, or just to teach me how to do this stuff so that I don't try to push work towards you anymore, we can pick a date to synchronize in-person.

I’m happy to work more on this, but I’m not sure I will be able to do such experiments this week.

@shym
Collaborator

shym commented Apr 9, 2025

My next hypothesis is that atomic operations may be implemented incorrectly in musl.

I’m not claiming expertise (like others could) here, but I can share my understanding about stdatomic.h.
One issue is that it mixes pure compiler builtins with features that require OS (library and/or kernel) support1, as it provides atomic accesses both to words and to arbitrarily large structures. My impression is that OCaml only ever uses the first kind, where library support is not involved. I’m pretty sure that, if it were necessary, linking would fail anyway, as the C compiler would generate a call to a function missing from the library.

Footnotes

  1. This mix entails that stdatomic.h is sometimes provided in the compiler builtin headers (such as /usr/lib/gcc/**/include), sometimes in the system headers (aka /usr/lib/include), depending on the OS.

@OlivierNicole
Contributor

I see. Probably not the reason, then.

@OlivierNicole
Contributor

By working with @jmid's focusing method I was able to get a number of failures on the CI, with the sequences of operations that led to them!

Here is one of the simplest ones (this is a very large code block that may look garbled in your mail reader, sorry):

Messages for test STM Dynarray test parallel (int):
  Results incompatible with linearized model
                                                                         |
                                                          set_capacity (a4, 2) : Ok (())
                                                            add_last (a1, 6) : Ok (())
                                              append_seq (a63, [ 6; 0; 0; 3; 9; 6; 6; 8 ]) : Ok (())
                                                            add_last (a2, 0) : Ok (())
                                                            add_last (a10, 4) : Ok (())
                                                             fit_capacity a0 : Ok (())
                                                          set_capacity (a6, 0) : Ok (())
                                      append_seq (a5, [ 4; 5; 9; 4; 1; 1; 6; 9; 99; 5; 6; 71; 3; 8; 0; 9; 8; 7; 39;
                                                        5; 8; 5; 7; 0; 6; 5; 57; 77; 2; 6; 2; 9; 3; 7; 7; 8; 1; 8;
                                                        9; 9; 97; 6; 79; 9; 3 ]) : Ok (())
                                                             fit_capacity a8 : Ok (())
                                                          set_capacity (a3, 5) : Ok (())
                                                            add_last (a10, 4) : Ok (())
                                                            add_last (a1, 6) : Ok (())
                                                          set_capacity (a7, 0) : Ok (())
                                                            add_last (a32, 4) : Ok (())
                                                                         |
                  .-------------------------------------------------------------------------------------------------------------------------.
                  |                                                                                                                         |
 append_seq (a8, [ 5; 4 ]) : Ok (())                                                                                         set_capacity (a3, 6) : Ok (())
                                                                                                                               add_last (a0, 0) : Ok (())
                                                                                                                                  get (a4, 0) : Ok (4)
                                                                                                                               add_last (a8, 67) : Ok (())
                                                                                                                                fit_capacity a7 : Ok (())
                                                                                                                               add_last (a5, 0) : Ok (())
                                                                                                                     append_seq (a65, [ 3; 4; 4; 94; 51 ]) : Ok (())
                                                                                                                               fit_capacity a64 : Ok (())
                                                                                                                           get (a22, 2) : Error (Stdlib.Exit)
================================================================================
failure (1 tests failed, 0 tests errored, ran 1 tests)

To explain: the first sequence of commands is performed on a single domain; then two domains are spawned to perform the left and right sequences of the diagram, respectively. Please ignore the a* indices (a10, a0, a6, etc.); all operations are performed on the same dynarray.

The test is coded in such a way that Error (Stdlib.Exit) means that Dynarray.get returned a non-integer-like value—even though Dynarray.get checks that it is not returning a dummy value.

I observed that this usually happens when get is called right after fit_capacity.

@gasche

gasche commented Apr 9, 2025

(Have you tried rewriting the Array.sub call in Dummy.Array.prefix into Array.make plus a loop, to see if you can still reproduce some failures?)
(I wonder the same about Array.blit.) I think it would be nice to rule out the underlying Array implementation as a cause of failure.

@OlivierNicole
Contributor

OlivierNicole commented Apr 9, 2025

I have a job that runs the tests with a patched compiler that doesn’t use memcpy in Array.sub.

Thinking about it more, I am now convinced that you were right and this use of memcpy is wrong. Previously I said that it was fine because it’s an initializing write, but I failed to consider the reading side (from the source array). If there are concurrent writes to the source array, memcpy may cause tearing. I think this is the cause of the garbage values we’ve been observing. It probably showed up with musl because musl performs memcpy 4 bytes at a time.
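To illustrate the failure mode with a made-up arithmetic simulation (not actual runtime behaviour): if the 64-bit source word changes between the copies of its two 32-bit halves, the destination receives a value that equals neither the old nor the new element.

```ocaml
(* Simulate a torn 64-bit copy: the low 32 bits come from one value and
   the high bits from another, as if the source word was overwritten
   partway through a sub-word memcpy. *)
let tear ~low ~high =
  (low land 0xFFFF_FFFF) lor (high land (lnot 0xFFFF_FFFF))

let () =
  let old_elem = 1 and new_elem = 1 lsl 40 in
  let torn = tear ~low:old_elem ~high:new_elem in
  (* The torn word is neither original value. *)
  assert (torn <> old_elem && torn <> new_elem)
```

If the resulting bit pattern happens to look like a pointer, the GC will later follow it, which matches the segfaults seen in this thread.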

@shym
Copy link
Collaborator

shym commented Apr 9, 2025

I ran a version of the tests on @OlivierNicole’s branch, asking for backtraces on segfaults. Here is what I see on a trunk run:

random seed: 358085500
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Dynarray test parallel (int)
/usr/bin/bash: line 1: 852856 Segmentation fault      (core dumped) src/dynarray/stm_tests.exe -v -s 358085500
[ ]    0    0    0    0 / 1000     0.0s STM Dynarray test parallel (int) (generating)           PID: 852856 (stm_tests.exe)
           UID: 1001 (runner)
           GID: 118 (docker)
        Signal: 11 (SEGV)
     Timestamp: Wed 2025-04-09 17:29:16 UTC (1s ago)
  Command Line: src/dynarray/stm_tests.exe -v -s 358085500
    Executable: /home/runner/work/multicoretests/multicoretests/multicoretests/_build/default/src/dynarray/stm_tests.exe
 Control Group: /system.slice/runner-provisioner.service
          Unit: runner-provisioner.service
         Slice: system.slice
       Boot ID: 7ddc18ad846845cc910dda51f57aa13f
    Machine ID: 86671f78e8e847d0b2b667109f8d6d1c
      Hostname: fv-az1988-103
       Storage: /var/lib/systemd/coredump/core.stm_tests\x2eexe.1001.7ddc18ad846845cc910dda51f57aa13f.852856.1744219756000000.zst (present)
  Size on Disk: 760.5K
       Message: Process 852856 (stm_tests.exe) of user 1001 dumped core.
                
                Module /home/runner/work/multicoretests/multicoretests/multicoretests/_build/default/src/dynarray/stm_tests.exe without build-id.
                Module /home/runner/work/multicoretests/multicoretests/multicoretests/_build/default/src/dynarray/stm_tests.exe
                Stack trace of thread 852860:
                #0  0x0000563c6a67d9b9 n/a (/home/runner/work/multicoretests/multicoretests/multicoretests/_build/default/src/dynarray/stm_tests.exe + 0x1a99b9)
                ELF object binary architecture: AMD x86-64

GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/runner/work/multicoretests/multicoretests/multicoretests/_build/default/src/dynarray/stm_tests.exe...
[New LWP 852860]
[New LWP 852856]
[New LWP 889271]
[New LWP 889272]
[New LWP 889274]
[New LWP 889273]

Core was generated by `src/dynarray/stm_tests.exe -v -s 358085500'.
warning: 177	runtime/minor_gc.c: No such file or directory
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_header_val (v=v@entry=139991388790504) at runtime/minor_gc.c:177
[Current thread is 1 (LWP 852860)]

Thread 6 (LWP 889273):
#0  0x00007f524949df08 in memset () from /lib/ld-musl-x86_64.so.1
#1  0x00007f524948ea53 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007f5238d3db74 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 5 (LWP 889274):
#0  0x00007f524948eab0 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x0000000000000000 in ?? ()

Thread 4 (LWP 889272):
#0  0x00007f524948eab0 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x0000000000000000 in ?? ()

Thread 3 (LWP 889271):
#0  0x00007f524949df08 in memset () from /lib/ld-musl-x86_64.so.1
#1  0x00007f524948ea53 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007f5238f50b74 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 2 (LWP 852856):
#0  0x00007f524948eab0 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x0000000000000000 in ?? ()

Thread 1 (LWP 852860):
#0  get_header_val (v=v@entry=139991388790504) at runtime/minor_gc.c:177
#1  0x0000563c6a67da8b in oldify_one (st_v=st_v@entry=0x7f523915ba00, v=139991388790504, p=p@entry=0x7f5238fe15d8) at runtime/minor_gc.c:257
#2  0x0000563c6a67dddc in oldify_mopup (st=st@entry=0x7f523915ba00, do_ephemerons=do_ephemerons@entry=0) at runtime/minor_gc.c:419
#3  0x0000563c6a67e32a in caml_empty_minor_heap_promote (participating=0x7f52493d7040, participating_count=3, domain=0x7f52494d8470) at runtime/minor_gc.c:664
#4  caml_stw_empty_minor_heap_no_major_slice (domain=0x7f52494d8470, participating_count=3, participating=0x7f52493d7040, unused=<optimized out>) at runtime/minor_gc.c:864
#5  0x0000563c6a66e39c in stw_handler (domain=0x7f52494d8470) at runtime/domain.c:1507
#6  handle_incoming (s=<optimized out>) at runtime/domain.c:360
#7  handle_incoming (s=<optimized out>) at runtime/domain.c:353
#8  0x0000563c6a66e44f in backup_thread_func (v=0x7f52493d8030) at runtime/domain.c:1069
#9  0x00007f524948fa4e in ?? () from /lib/ld-musl-x86_64.so.1
#10 0x0000000000000000 in ?? ()

https://github.com/shym/multicoretests/actions/runs/14361893607/job/40265268082#step:16:2287

@OlivierNicole
Contributor

I cannot reproduce the original bug of this issue with the patched compiler, nor can I reproduce the failures of my modified STM test. I’ve restarted the job for extra confidence.

@shym Thanks! The minor GC segfaulting is consistent with the presence of garbage values. The invalid address that causes the segfault does not look like it was created by collating the 4 low bytes of a pointer and the 4 high bytes of a small integer, as I would expect. Maybe it can be explained by a more complicated sequence of commands.

@shym
Collaborator

shym commented Apr 10, 2025

Surely the memcpy used should be the 64-bit word-by-word one: https://git.musl-libc.org/cgit/musl/tree/src/string/x86_64/memcpy.s, or else something is really wrong with the musl GCC there.

@shym
Collaborator

shym commented Apr 10, 2025

Oh! My eyes! It seems that there is indeed something terribly wrong!

I just compiled

let a = Array.sub [| 1; 2; 3; 4 |] 0 3

with OCaml 5.3.0 with musl and asked gdb to disassemble /m caml_uniform_array_gather:

...
525           /* [res] is freshly allocated, and no other domain has a reference to it.
526              Hence, a plain [memcpy] is sufficient. */
527           memcpy((value*)&Field(res, pos),
   0x00000000000424b5 <+167>:   mov    (%r14,%rdx,8),%rcx
   0x00000000000424b9 <+171>:   lea    (%rax,%r8,8),%r9
   0x00000000000424c1 <+179>:   shl    $0x3,%rcx
   0x00000000000424c9 <+187>:   mov    %r9,%rdi
   0x00000000000424cc <+190>:   rep movsb %ds:(%rsi),%es:(%rdi)
...

IIUC, rep movsb means it will copy byte by byte. (It seems that memcpy was inlined, as I thought it might be, but really poorly.)

In my glibc switch, I see a reassuring:

525           /* [res] is freshly allocated, and no other domain has a reference to it.
526              Hence, a plain [memcpy] is sufficient. */
527           memcpy((value*)&Field(res, pos),
   0x0000000000043ac0 <+416>:   mov    0x0(%rbp),%rax
   0x0000000000043ac7 <+423>:   lea    (%rbx,%r13,8),%rdi
   0x0000000000043ad7 <+439>:   lea    0x0(,%rax,8),%rdx
   0x0000000000043ae7 <+455>:   call   0x2a7a0 <memcpy@plt>

@OlivierNicole
Contributor

This is consistent with the fact that doing an explicit word-per-word copy removes the bug.

@gasche

gasche commented Apr 10, 2025

A side-remark: if this is indeed the source of the bug, and it looks more and more likely, we should be able to observe it with tests on array -- we only need get, set and sub to observe the bug. For example we could populate the arrays with integers that are powers of two (from 1 to 2^62), and check that this property remains true -- it would be invalidated by half-word read-write races.

Another approach would be to have an element type that contains both pointers and immediates (as we have with integer dynarrays, thanks to the dummy), for example unit option -- Some () could be a shared, global constant, and None is an immediate.

This might be easier to reproduce, and maybe more informative to keep around as a regression test in the multicoretests codebase.
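
The power-of-two invariant is cheap to check. As a hedged illustration (hypothetical helper, not multicoretests code): a value is a power of two iff exactly one bit is set, and a torn read either mixes in extra set bits or drops the set bit entirely, so both the resulting multi-bit values and 0 fail the check:

```c
#include <stdint.h>

/* Power-of-two test: exactly one bit set.  A torn read that combines
   bytes from two different powers of two yields either 0 or a value
   with two or more bits set; both are rejected here. */
static int is_pow2(uint64_t v)
{
  return v != 0 && (v & (v - 1)) == 0;
}
```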

@shym
Collaborator

shym commented Apr 10, 2025

Another test result, to check where the guilt lies: musl is not really involved, as building OCaml with glibc but setting CFLAGS=-Os gives the same behaviour, with this memcpy inlined byte per byte… A quite unwelcome “optimisation”.

@shym
Collaborator

shym commented Apr 10, 2025

@dustanddreams: if you have time to answer that, would you have a suggestion to make sure that gcc -Os doesn’t turn a memcpy of OCaml values into a rep movsb (which creates a data race on the source of the copy)? Having to go with a for loop is a bit disappointing.

@OlivierNicole
Contributor

Note that this has likely been debated in the past during memory model discussions, since other places have been updated to use loops. See for example ocaml-multicore/ocaml-multicore#822

@shym
Collaborator

shym commented Apr 10, 2025

OK, thank you for the reference!

@jmid
Collaborator Author

jmid commented Apr 10, 2025

Excellent digging everyone! 👏 🎉 I had the same thought as @gasche: We should be able to observe this issue directly on Array indeed. I'll try to cook up or adjust the existing test...

@jmid
Collaborator Author

jmid commented Apr 11, 2025

OK, my initial tests confirm the bug hypothesis about Array.sub tearing.

I've now played with adjusting the existing array test

  • changing from char elements to int to be able to observe weird byte ("sub-word") combinations,
  • setting all entries to 1 initially,
  • generating powers of 2, and
  • limiting the test to get, set, and sub

Still, locally I would only observe examples of sequential inconsistency! 😬

I therefore changed the test to relax the post-condition from "model agreement" to "sub and get return powers of 2".
On the CI this is sufficient to trigger tearing counterexamples with musl; I've not been able to trigger one locally though (also on a musl switch).

Here's a counterexample:
https://github.com/ocaml-multicore/multicoretests/actions/runs/14404492492/job/40397564071

Starting 55-th run

random seed: 444760359
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Array test tearing parallel
[ ]    0    0    0    0 / 1000     0.0s STM Array test tearing parallel (generating)
[✗]  328    0    1  327 / 1000    16.5s STM Array test tearing parallel
--- Failure --------------------------------------------------------------------
[...]
+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Messages for test STM Array test tearing parallel:
  Results incompatible with linearized model
                                                                     |                                
                                                               Get 2 : Ok (1)                         
                                                       Set (7, 4294967296) : Ok (())                  
                                                               Get 8 : Ok (1)                         
                                             Sub (2, 9) : Error (Invalid_argument("Array.sub"))       
                                                   Set (3, 144115188075855872) : Ok (())              
                                                               Get 9 : Ok (1)                         
                                                          Get 7 : Ok (4294967296)                     
                                                               Get 5 : Ok (1)                         
                                                   Set (3, 288230376151711744) : Ok (())              
                                                 Sub (6, 4) : Ok ([|1; 4294967296; 1; 1|])            
                                                         Set (0, 8388608) : Ok (())                   
                                             Sub (8, 4) : Error (Invalid_argument("Array.sub"))       
                                                       Set (9, 4294967296) : Ok (())                  
                                          Sub (4, 6) : Ok ([|1; 1; 1; 4294967296; 1; 4294967296|])    
                                                               Get 8 : Ok (1)                         
                                      Sub (0, 7) : Ok ([|8388608; 1; 1; 288230376151711744; 1; 1; 1|])
                                                           Sub (2, 0) : Ok ([||])                     
                                                               Get 8 : Ok (1)                         
                                                    Sub (8, 2) : Ok ([|1; 4294967296|])               
                                             Sub (5, 8) : Error (Invalid_argument("Array.sub"))       
                                                    Set (3, 18014398509481984) : Ok (())              
                                                               Get 6 : Ok (1)                         
                                                                     |                                
                                    .-----------------------------------------------------------------.
                                    |                                                                 |                                
                              Get 5 : Ok (1)                                                    Get 8 : Ok (1)                         
                     Set (6, 4398046511104) : Ok (())                          Sub (6, 2) : Ok ([|4398046511105; 4294967296|])         
                                                                              Sub (7, 6) : Error (Invalid_argument("Array.sub"))       
                                                                                         Set (0, 536870912) : Ok (())                  
                                                                                          Set (8, 4194304) : Ok (())                   
================================================================================
failure (1 tests failed, 0 tests errored, ran 1 tests)

This has a long "sequential prefix" which can be ignored and then only very few calls in parallel:

  • the left domain sets array entry 6 to 4398046511104 - which is 1 lsl 42 FTR
  • the right domain in parallel calls sub and observes 4398046511105 - with an additional lower 1 bit set - and hence an example of tearing. 🚬🔫
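
As a sanity check on those numbers (hypothetical code, little-endian x86_64 assumed): 4398046511104 is indeed 1 lsl 42, and combining the low byte of the old value 1 with the high bytes of the new value reproduces the observed 4398046511105 exactly:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulate a byte-level torn read on a little-endian machine: take
   the low `split` bytes from the old value and the remaining high
   bytes from the new value, as a byte-wise copy racing with a
   word-sized store could observe. */
static uint64_t torn_read(uint64_t old_v, uint64_t new_v, int split)
{
  uint64_t out;
  memcpy(&out, &old_v, (size_t)split);                 /* low bytes: old */
  memcpy((char *)&out + split, (char *)&new_v + split,
         (size_t)(8 - split));                         /* high bytes: new */
  return out;
}
```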

I've also kicked off a run targeting @OlivierNicole's fix branch and, as suspected, it does not trigger any counterexamples:
https://github.com/ocaml-multicore/multicoretests/tree/refs/heads/array-tearing-test-fix

Well done everyone! 😃

@dustanddreams

@dustanddreams: if you have time to answer that, would you have a suggestion to make sure that gcc -Os doesn’t turn a memcpy of OCaml values into a rep movsb (which creates a data race on the source of the copy)? Having to go with a for loop is a bit disappointing.

I'm not sure I have enough context from the discussion, and my answer below may be completely missing the point. [To answer your exact question, though, you can always pass -fno-builtin-memcpy and -fno-builtin-memmove to gcc-compatible compilers to force them to always emit calls to the memcpy and memmove functions from your libc. But these functions may be implemented as rep movsb and it's out of your control.]

What this really means is that memcpy (and memmove) provide no guarantee that, if thread B writes to the memory being copied by memcpy in thread A at the same time, the copy produced in thread A matches either the original state of the memory (before thread B's writes) or its final state (after them).

If, because of this, byte-by-byte copies can cause OCaml object headers to be incorrectly copied into nonsensical values, then you need to introduce your own memcpy wrapper, which will copy the header (tag, size, etc) in a single operation, and then invoke memcpy for the remainder of the memory. This is the only way you'll achieve some form of cross-thread memory consistency.
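
A rough sketch of that suggestion (hypothetical helper names, not OCaml runtime code): transfer the one-word header with a single word-sized store, then let memcpy handle the payload:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper along the lines suggested above: the header
   word is transferred in one 64-bit store so it can never be torn
   into a nonsensical value, and only the payload goes through
   memcpy (which may copy byte by byte). */
static void block_copy(uint64_t *dst, const uint64_t *src, size_t nwords)
{
  dst[0] = src[0]; /* header (tag, size, ...): single word-sized store */
  if (nwords > 1)
    memcpy(dst + 1, src + 1, (nwords - 1) * sizeof(uint64_t));
}
```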

@shym
Collaborator

shym commented Apr 17, 2025

Do I understand correctly that the memcpy in caml_floatarray_gather is not a problem, as those floats cannot be confused with pointers, so tearing there wouldn’t break OCaml’s memory model?

Part of me wonders if we shouldn’t have a valuecpy function (inlinable, maybe a macro), or multiple functions depending on the memory synchronization required on each end, at least if that’s not the only place where a memcpy is replaced by an explicit loop. Part of the reason would be to make it easier to override it with an assembler version if one’s willing to do so, at least for the native backends (I’m biased there by how short the x86_64 version would, or could, be).

@edwintorok
Contributor

rep movsb means it will copy it byte by byte

Not quite so simple. See https://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf "3.7.7.1 Memcpy considerations".

For processors supporting enhanced REP MOVSB/STOSB, implementing memcpy
with REP MOVSB will provide even more compact benefits in code size and better
throughput than using the combination of REP MOVSD+B.

See also table 3-3: for sizes < 128, an AVX memcpy can be faster; above that, rep movsb is the same or faster (at least on Intel; there are some perf bugs on AMD). So memcpy implemented as rep movsb is not as bad as you'd think on CPUs that support ERMS (support is advertised in CPUID; it is the same instruction, just with different performance characteristics).

To achieve that throughput I think it'll internally use larger copies in microcode, so you can't rely on rep movsb copying byte by byte, nor can you rely on it being optimized to copy in larger chunks, as the exact behaviour is CPU-microarchitecture dependent.

If you need precise semantics on the size of memory reads then I think you have to avoid memcpy/memmove and provide your own that guarantees a minimum read size (a memcpy version of wo_memmove?)

Just be careful to benchmark on both AMD and Intel CPUs, on AMD depending on your glibc version memcpy can hit some very significant slowdowns, see https://issues.redhat.com/browse/RHEL-25530 for details.
Also check that the compiler didn't "recognize" your implementation as a memcpy and replace it with a call to, or inline expansion of, memcpy anyway.

@OlivierNicole
Contributor

Do I understand correctly that the memcpy in caml_floatarray_gather is not a problem, as those float cannot be confused with pointers so tearing there wouldn’t break OCaml’s memory model?

Yes, these should be fine as the runtime should never try to interpret such values as pointers.

Part of me wonder if we shouldn’t have a valuecpy function (inlinable, maybe a macro), or multiple functions depending on the memory synchronization required on each end, at least if that’s not the only place where a memcpy is replaced by an explicit loop. Part of the reason would be to make it easier to override it with an assembler version if one’s willing to do so, at least for the native backends (I’m biased there by how short the x86_64 version would, or could, be).

I have introduced a wo_memcpy function in the fix PR: ocaml/ocaml#13950, which was merged yesterday.

@jmid
Collaborator Author

jmid commented May 1, 2025

This has been fixed in ocaml/ocaml#13950 included in the upcoming 5.4 release and a tearing test has been added in #551 🎉
