Improve context switching chance in `Lin_thread` and `STM_thread` modes #540

jmid · 2025-03-21T18:10:49Z

This fixes #338 - or at least gives it a good kick in the right direction.

Both Lin_thread and STM_thread are currently marked as experimental - and for good reason:
The chance of them triggering unfortunate, error-triggering Thread context switches is currently slim.
In hindsight, releasing them earlier - even as experimental - was probably premature.

Recall that a Thread context switch may happen

- at allocation points and
- at safepoints https://github.com/ocaml/ocaml/blob/trunk/asmcomp/polling.ml

IIUC:

- The latter is the case for the native code backend.
- The bytecode backend instead uses a CHECK_SIGNALS instruction inserted in while and for loops.
- Both backends utilize a "tick thread" to context switch every 50ms to avoid starvation https://github.com/ocaml/ocaml/blob/trunk/otherlibs/systhreads/st_stubs.c

This PR improves on the situation by utilizing a terrific hack by @edwintorok in the form of a Gc.Memprof callback:
Upon allocations, with a certain probability, Memprof will trigger a callback which can be utilized to trigger an explicit Thread.yield.
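A minimal sketch of the idea, not the PR's actual code: a Gc.Memprof tracker whose allocation callbacks run a user-supplied action. The names `with_alloc_callback` and `on_sample` are made up for illustration. Passing `Thread.yield` as `on_sample` (which requires linking the threads library) would give the yield-on-allocation behaviour described above. The `Failure` handlers are an assumption, meant to cover OCaml versions where Memprof is unavailable.

```ocaml
(* Hypothetical helper: run [f ()] while sampled allocations trigger [on_sample].
   Assumption: Gc.Memprof.start/stop may raise Failure where Memprof is
   unsupported, so both calls are guarded. *)
let with_alloc_callback ~sampling_rate ~on_sample f =
  let tracker =
    { Gc.Memprof.null_tracker with
      Gc.Memprof.alloc_minor = (fun _ -> on_sample (); None);
      alloc_major = (fun _ -> on_sample (); None) }
  in
  (* [ignore] because the return type of [start] differs between OCaml versions *)
  (try ignore (Gc.Memprof.start ~sampling_rate tracker) with Failure _ -> ());
  Fun.protect f ~finally:(fun () ->
      try Gc.Memprof.stop () with Failure _ -> ())
```

With the threads library linked, something like `with_alloc_callback ~sampling_rate:1e-1 ~on_sample:Thread.yield run_test` would request a yield for roughly one in ten allocated words, where `run_test` is a placeholder for the code under test.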

To get an idea of how well this works, I've run stats on our usual {int,int64} x {ref,CList} negative tests, repeating each one 10000 times and recording how many of these failed the "interface is Thread-safe" property. Since the Thread failures are still relatively rare, each of the 10000 test inputs is run under Util.repeat 25, to bring up the probability of a failure.
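To make the effect of repetition concrete, here is a small sketch. `repeat` below is a hypothetical stand-in written for this comment, not a copy of `Util.repeat`, and `failure_chance` assumes (optimistically) that runs fail independently with the same probability `p`. The value `p = 0.001` is made up.

```ocaml
(* Hypothetical stand-in for Util.repeat: [prop] must hold in all [n] runs *)
let rec repeat n prop input =
  n <= 0 || (prop input && repeat (n - 1) prop input)

(* Chance that at least one of [n] runs fails, assuming independent runs
   that each fail with probability [p] *)
let failure_chance ~p ~n = 1. -. ((1. -. p) ** float_of_int n)

let () =
  List.iter
    (fun n -> Printf.printf "n=%3d: %.4f\n" n (failure_chance ~p:0.001 ~n))
    [1; 3; 25; 100]
```

Under these assumptions a higher repetition count raises the chance of observing a failure, at the price of proportionally more executions per test input.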

For CList, the Domain-unsafe add_node does not do any allocations in the central "unsafe window", effectively making it Thread safe. I've therefore introduced an even worse add_node_thread by moving an allocation to this part, for the purpose of testing...
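Whether a piece of code allocates inside such a window can be checked with `Gc.minor_words`. The sketch below is an illustration only and does not use the actual CList or ref test code: `minor_words_during`, `incr_int` and `incr_int64` are hypothetical names. It contrasts a read-modify-write on an immediate `int` with one on a boxed `int64`, where the new value is heap-allocated between the read and the write.

```ocaml
(* Words allocated on the minor heap while running [f ()] *)
let minor_words_during f =
  let before = Gc.minor_words () in
  f ();
  Gc.minor_words () -. before

(* [Sys.opaque_identity] keeps the compiler from optimising the cells away *)
let int_cell = Sys.opaque_identity (ref 0)
let int64_cell = Sys.opaque_identity (ref 0L)

(* Immediate int: nothing needs to be heap-allocated between read and write *)
let incr_int () = let v = !int_cell in int_cell := v + 1

(* Boxed int64: the result of [Int64.add] is allocated before the write *)
let incr_int64 () = let v = !int64_cell in int64_cell := Int64.add v 1L

let loop f () = for _i = 1 to 1000 do f () done

let () =
  Printf.printf "int: %.0f words, int64: %.0f words\n"
    (minor_words_during (loop incr_int))
    (minor_words_during (loop incr_int64))
```

On a typical native build the `int` loop is expected to allocate nothing, but the exact numbers depend on the backend and compiler version.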

Each test is run on 5.3.0, 5.2.0, and 4.14.2 - with only 5.2.0 not having Gc.Memprof support: It is available in 4.14 and was restored in 5.3. On 5.2 the Memprof.{start,stop} calls thereby degrade to essentially no-ops, which enables an easy comparison with/without the improvement.

STM numbers with `Util.repeat 25`:

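|                        | 5.3.0 | 5.3.0-bc | 5.2.0 | 5.2.0-bc | 4.14.2 | 4.14.2-bc |
|------------------------|------:|---------:|------:|---------:|-------:|----------:|
| int ref STM Thread     |     0 |        0 |     0 |        0 |      0 |         0 |
| int64 ref STM Thread   |   925 |     0(!) |    34 |        0 |    899 |      0(!) |
| CList int STM.thread   |   260 |      260 |     0 |        0 |    275 |       268 |
| CList int64 STM.thread |   277 |      274 |     0 |        0 |    274 |       280 |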

Lin numbers with `Util.repeat 25`:

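|                        | 5.3.0 | 5.3.0-bc | 5.2.0 | 5.2.0-bc | 4.14.2 | 4.14.2-bc |
|------------------------|------:|---------:|------:|---------:|-------:|----------:|
| int ref Lin Thread     |     0 |        0 |     0 |        0 |      0 |         0 |
| int64 ref Lin Thread   |   796 |     0(!) |     1 |        0 |    804 |      0(!) |
| int CList Lin Thread   |   161 |      122 |     0 |        0 |    115 |       123 |
| int64 CList Lin Thread |   124 |      139 |     0 |        1 |    143 |       137 |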

From the above it is clear that 5.3 and 4.14 work remarkably better than the current version.
Even in bytecode mode this represents a good step forward. However, for some reason, in bytecode mode
both Lin_thread and STM_thread fail to trigger an int64 ref issue. I've not yet figured out why this is the case.

Overview of commits

- The first two commits make the above change to the two Lin and STM modes
- Because the fix works relatively well, the next two commits remove the previous experimental alert attributes
- OCaml 5.0-5.1 raise an exception upon calling Memprof.{start,stop}, so the 5th commit wraps them in a handler
- The next 2 commits improve documentation
- A bunch of commits adapt the src/neg_tests tests to the improved behaviour
- For anyone interested, commits ad1d5e4 and a4a635c contain the source for the above stats - I plan to remove these before merging.

jmid · 2025-03-25T11:45:42Z

FTR, the PR initially used the magic constant of 1e-3 samples per word, after earlier experimenting with higher frequencies of interrupts. Especially in bytecode mode, those seemed to result in too big of a slowdown.
Feedback from @edwintorok however made me realize that I hadn't explored the trade-off between the frequency of interrupts (more Thread.yields) and the Util.repeat count (raising the error rate by repetition). A second look at the code furthermore revealed that all but the stats were still using Util.repeat 100 - a good catch.

I have experimented a bit more by timing src/neg_tests/stm_tests_clist_thread_stats.exe under different sampling_rates and rep_counts. Turning up either of these two increases the error rate (and hence the bug-finding ability). However, doing so also increases the test running time (because of more callbacks and more repeated executions, respectively), which will be noticeable for positive tests that run, e.g., 1000 iterations. To keep the experiment manageable I've measured on native code, where the stat benchmark takes ~5min compared to ~30min under bytecode...

The initial ~sampling_rate:1e-3 rep_count:25

CList int STM.thread 313 / 10000
CList int64 STM.thread 294 / 10000
265.02user 38.18system 4:52.81elapsed 103%CPU (0avgtext+0avgdata 31680maxresident)k

Now, with ~sampling_rate:1e-1 rep_count:3 the wall clock time is just ~9s slower out of ~5m but with a ~5x better error rate:

CList int STM.thread 1564 / 10000
CList int64 STM.thread 1613 / 10000
294.49user 7.35system 5:01.87elapsed 99%CPU (0avgtext+0avgdata 31880maxresident)k

The above seems like a good compromise between not affecting time usage significantly, while giving the error rate a significant bump upwards 🙂

Both of the above were run with OCaml 5.3.0. For comparison, I've timed the same run on OCaml 5.2.0 (where Memprof is not supported) with rep_count:100 to mimic the state of affairs before this PR:

CList int STM.thread 2 / 10000
CList int64 STM.thread 2 / 10000
412.03user 181.22system 8:57.95elapsed 110%CPU (0avgtext+0avgdata 31732maxresident)k

Going from ~9m to ~5m wall clock time shows the Memprof solution is not only better error-rate-wise but also much faster (the many repetitions are only able to trigger issues in 2 out of 10000 test cases).

I also timed the run on OCaml 5.2.0 with rep_count:3 to assess the overhead of Memprof:

CList int STM.thread 0 / 10000
CList int64 STM.thread 0 / 10000
179.91user 5.55system 3:05.20elapsed 100%CPU (0avgtext+0avgdata 31676maxresident)k

so the Memprof callbacks add ~2m (~5m - ~3m) of wall clock time, a ~63% overhead on this particular test. Non-allocating tests will naturally incur less of an overhead, while heavily allocating tests may incur a higher one.
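For reference, the overhead can be recomputed from the exact `elapsed` figures of the two rep_count:3 runs rather than from rounded minutes, which gives roughly 117s or about 63%. A small sketch of that arithmetic follows; `seconds_of_elapsed` is a helper made up for this comment, and it assumes the `m:ss.cc` elapsed format shown in the timings above.

```ocaml
(* Convert a time(1)-style "m:ss.cc" elapsed string to seconds *)
let seconds_of_elapsed s =
  Scanf.sscanf s "%d:%f" (fun m sec -> float_of_int (60 * m) +. sec)

(* Elapsed times copied from the two rep_count:3 runs above *)
let with_memprof = seconds_of_elapsed "5:01.87"    (* OCaml 5.3.0 *)
let without_memprof = seconds_of_elapsed "3:05.20" (* OCaml 5.2.0 *)

let overhead = with_memprof -. without_memprof
let overhead_pct = 100. *. overhead /. without_memprof

let () =
  Printf.printf "overhead: %.1fs (%.0f%%)\n" overhead overhead_pct
```

Note that this compares two different compiler versions, so part of the difference may stem from factors other than Memprof.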

Finally, for completeness, the stats from the initial PR description are repeated below with the new settings.
These confirm that the error rate further improves across the board - except on 5.2 (and 5.0+5.1),
due to the reduced repetition count and no Memprof support.

Revised STM numbers with `sampling_rate:1e-1` and `Util.repeat 3`

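|                        | 5.3.0 | 5.3.0-bc | 5.2.0 | 5.2.0-bc | 4.14.2 | 4.14.2-bc |
|------------------------|------:|---------:|------:|---------:|-------:|----------:|
| int ref STM Thread     |     0 |        0 |     0 |        0 |      0 |         0 |
| int64 ref STM Thread   |  3826 |        0 |     7 |        0 |   3891 |         0 |
| CList int STM.thread   |  1561 |     1576 |     0 |        0 |   1639 |      1650 |
| CList int64 STM.thread |  1558 |     1572 |     0 |        0 |   1579 |      1662 |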

Revised Lin numbers with `sampling_rate:1e-1` and `Util.repeat 3`

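|                        | 5.3.0 | 5.3.0-bc | 5.2.0 | 5.2.0-bc | 4.14.2 | 4.14.2-bc |
|------------------------|------:|---------:|------:|---------:|-------:|----------:|
| int ref Lin Thread     |     0 |        0 |     0 |        0 |      0 |         0 |
| int64 ref Lin Thread   |  3407 |        0 |     0 |        0 |   3395 |         0 |
| int CList Lin Thread   |   702 |      710 |     0 |        0 |    700 |       725 |
| int64 CList Lin Thread |   721 |      665 |     0 |        0 |    677 |       686 |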

jmid · 2025-03-25T16:25:58Z

CI summary for ca10290: All 36 workflows passed.

CI summary for 292a3e0:

Cygwin trunk timed out in Lin Bytes test with Thread: "Deadlock in Lin Bytes test with Thread on Cygwin" #526

Out of 36 workflows, 1 failed with a genuine (supposedly Cygwin) error

jmid · 2025-03-25T20:58:32Z

CI summary for 0ef30bb: all 36 workflows passed.
I'll merge.

jmid · 2025-03-26T06:17:27Z

CI summary for merge to main: all 37 workflows passed

jmid added 11 commits March 12, 2025 15:05

Patch Lin_thread with Gc.Memprof hack to increase interleavings

dafe9f9

Patch STM_thread with Gc.Memprof hack to increase interleavings

c663f51

Remove experimental alert from {STM,lin}_thread interfaces

ba16710

Remove attributes to silence experimental alerts

5c4348c

Wrap Gc.Memprof.{start,stop} with exception handlers

822843e

Add {STM,Lin}_thread warnings about OCaml 5.0-5.2

127baba

Capitalize functor in documentation

5a48c47

Refactor CList STM test into separate spec and test drivers

f7b4777

Add a CList STM thread test - and make it negative

3a4bf29

Make Lin_internal CList test negative

8c87e0a

Make Lin CList test negative

d70c950

jmid added 6 commits March 25, 2025 17:58

Reenable Lin.thread test, disable the Lin_internal.thread test

56fe8dd

Toggle {Lin,STM}.thread int64 ref difference under bytecode

c79372b

Add a comment to explain negative CList {Lin,STM}.thread test

de58de0

Disable src/negtest thread tests on 5.2 and earlier from OCaml

0828d3e

Add a CHANGES entry

3a128e9

Use sampling_rate:1e-1, rep_count:3, and retries:25 in {Lin,STM}_thread

0ef30bb

jmid force-pushed the improve-thread-mode branch from 292a3e0 to 0ef30bb Compare March 25, 2025 17:04

jmid merged commit 895f187 into main Mar 25, 2025
36 checks passed

jmid deleted the improve-thread-mode branch March 25, 2025 20:58

This was referenced Mar 26, 2025

Chore: Rename src/neg_tests files to share common prefixes #542

Merged

Chore: Collect magic constants #543

Merged

"Lin Bytes test with Thread" may fail because it is Thread unsafe #544

Closed
