condition_variable related crash on macOS #33
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue on Nov 22, 2022:

> This has been flaky on macOS for a while ([hud](https://hud.pytorch.org/failure/RuntimeError%3A%20test_ops_fwd_gradients%20failed)) and I can reproduce this locally. The issue was raised by #66033 and it seems to point to macos itself graphia-app/graphia#33. So switching to single thread when running `test_ops_fwd_gradients` on macOS as a mitigation for the flaky tests.
>
> ### Testing
>
> `pytest test_ops_fwd_gradients.py -k test_fn_fwgrad_bwgrad -vv --flake-finder` to run all `test_fn_fwgrad_bwgrad` tests 50 times to make sure they all pass (no flaky anymore).
>
> https://hud.pytorch.org/tests shows that `test_ops_fwd_gradients` on macOS takes about 15m to finish, or 8 minutes if using 2 shards like in the test. There is no obvious difference in the test duration:
>
> ```
> 2022-11-21T21:34:18.6078080Z Running test_ops_fwd_gradients ... [2022-11-21 21:34:18.600663]
> 2022-11-21T21:34:21.6805770Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=0', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.680156]
> 2022-11-21T21:34:21.6806380Z Ignoring disabled issues: []
> 2022-11-21T21:34:21.6815250Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=1', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.681174]
> 2022-11-21T21:42:21.6815830Z Ignoring disabled issues: []
> .....
> 2022-11-21T21:40:42.2422700Z =============================== warnings summary ===============================
> .....
> 2022-11-21T21:40:42.2424670Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-47b619449ea7db1f.xml -
> 2022-11-21T21:40:42.2424850Z = 831 passed, 596 skipped, 5 deselected, 17 xfailed, 1 warning in 374.54s (0:06:14) =
> .....
> 2022-11-21T21:42:00.1923310Z =============================== warnings summary ===============================
> .....
> 2022-11-21T21:42:00.1925370Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-d24ee6419a602a6e.xml -
> 2022-11-21T21:42:00.1925540Z = 828 passed, 603 skipped, 7 deselected, 20 xfailed, 1 warning in 452.94s (0:07:32) =
> ....
> 2022-11-21T21:42:09.9035670Z FINISHED PRINTING LOG FILE of test_ops_fwd_gradients (/Users/runner/work/pytorch/pytorch/test/test-reports/test_ops_fwd_gradients_ha_3rfhb)
> ```
>
> Pull Request resolved: #89410
> Approved by: https://github.com/soulitzer
The `ThreadPool` class uses a `condition_variable` to pause worker threads when there is no work, and to wake them up again when there is. On macOS there is a (relatively) rare crash that occurs when `condition_variable::wait()` is called.

On macOS, `condition_variable::wait()` is basically a wrapper for `pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)`, which it turns out returns `EINVAL` in the crash case, triggering an exception which causes the application to terminate; incidentally, an error message to that effect is printed to stdout. The "official" documented reasons for `EINVAL` don't apply here.
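
For context, the wait/wake pattern in question is the usual worker-loop idiom. Below is a minimal sketch of that idiom (illustrative only; `MiniPool` and its members are made-up names, not graphia's actual `ThreadPool`):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Minimal worker-loop sketch: threads sleep on a condition_variable when the
// queue is empty and are woken when work arrives or when shutting down.
class MiniPool
{
public:
    explicit MiniPool(std::size_t numThreads)
    {
        for(std::size_t i = 0; i < numThreads; i++)
            _threads.emplace_back([this] { workerLoop(); });
    }

    ~MiniPool()
    {
        {
            std::unique_lock<std::mutex> lock(_mutex);
            _stop = true;
        }
        _cv.notify_all();
        for(auto& thread : _threads)
            thread.join();
    }

    void enqueue(std::function<void()> task)
    {
        {
            std::unique_lock<std::mutex> lock(_mutex);
            _tasks.push_back(std::move(task));
        }
        _cv.notify_one();
    }

private:
    void workerLoop()
    {
        while(true)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(_mutex);

                // wait() releases the mutex and, on macOS, ends up calling
                // pthread_cond_wait() under the hood; this is the call that
                // sporadically fails as described above
                _cv.wait(lock, [this] { return _stop || !_tasks.empty(); });

                if(_stop && _tasks.empty())
                    return;

                task = std::move(_tasks.front());
                _tasks.pop_front();
            }
            task();
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _tasks;
    std::vector<std::thread> _threads;
    bool _stop = false;
};
```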
After spending many, many hours single-stepping the debugger through the disassembly of macOS's pthread code, I've come to the conclusion that the source of this `EINVAL` is from here: https://github.com/apple/darwin-libpthread/blob/2b46cbcc56ba33791296cd9714b2c90dae185ec7/src/pthread_cond.c#L647

The `__psynch_cvwait` syscall returns -1 and sets errno to `EBUSY`. This makes no sense at all, as the worker thread obviously already owns the mutex at this point, so it can't possibly be busy. FYI, errno ends up in the `%esi` register (this is `/usr/lib/system/libsystem_pthread.dylib` on darwin 19.6.0/macOS 10.15.7).

The source of the `EBUSY` is potentially the following, but that's difficult to verify as I obviously can't go stepping into kernel code: https://github.com/apple/darwin-libpthread/blob/2b46cbcc56ba33791296cd9714b2c90dae185ec7/kern/kern_synch.c#L1199
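
One thing that can help when chasing this is to log the error code carried by the `std::system_error` that libc++ throws when the underlying `pthread_cond_wait()` call fails, at the point of failure, before rethrowing. A minimal sketch of that idea (the `loggingWait()` helper is hypothetical, not existing graphia code):

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <system_error>

// Wrap condition_variable::wait() so the std::system_error thrown when the
// underlying pthread_cond_wait() fails is logged with its error code, rather
// than only appearing in the uncaught-exception terminate message.
template<typename Predicate>
void loggingWait(std::condition_variable& cv,
                 std::unique_lock<std::mutex>& lock,
                 Predicate predicate)
{
    try
    {
        cv.wait(lock, predicate);
    }
    catch(const std::system_error& e)
    {
        // e.code().value() corresponds to the error returned by
        // pthread_cond_wait() (EINVAL in the crash described above)
        std::fprintf(stderr, "condition_variable::wait failed: %s (code %d)\n",
                     e.what(), e.code().value());
        throw;
    }
}
```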
Possibly this is indicative of some kind of starvation state or hitting a buffer limit? I'm not sure. In any case at this point I'm leaning towards it actually being some kind of bug in macOS; I can't obviously see any misuse of threading primitives on my part.
Reproducing the problem basically involves running layout continuously until it happens, using a script to reopen a file periodically to restart the layout. It usually takes about 30-60 minutes to happen.
As a direct response to this issue I've broadened the scope of the task queuing lock, so now it only locks and releases once per task set rather than for each individual task (745b0b7); a rough sketch of the idea is included below. It's early days, but this appears to have made it harder to reproduce the problem at least (as in, I haven't been able to yet). Nevertheless, this change hasn't actually fixed anything that was verified broken in the first place, hence this essay, which I'm writing down in case I need to come back to it and refresh my memory. Hope that's helpful, future Tim.
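
For reference, here is a rough sketch of that batched-queuing idea, written as an extra member function of the hypothetical `MiniPool` sketch from earlier (illustrative only, not the actual change in 745b0b7):

```cpp
    // Enqueue an entire set of tasks under a single lock acquisition and wake
    // the workers once at the end, instead of locking and notifying for each
    // individual task.
    void enqueueTaskSet(std::vector<std::function<void()>> tasks)
    {
        {
            std::unique_lock<std::mutex> lock(_mutex);
            for(auto& task : tasks)
                _tasks.push_back(std::move(task));
        }
        _cv.notify_all();
    }
```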