condition_variable related crash on macOS #33

Open
timangus opened this issue Nov 3, 2021 · 1 comment

timangus commented Nov 3, 2021

The ThreadPool class uses a condition_variable to pause worker threads when there is no work, and wake them up again when there is. On macOS there is a (relatively) rare crash that occurs when condition_variable::wait() is called:

```
[ 00 ] __pthread_kill
[ 01 ] pthread_kill
[ 02 ] 0x7fff6ac8a808
[ 03 ] abort_message
[ 04 ] demangling_terminate_handler()
[ 05 ] initializeNonMetaClass
[ 06 ] std::__terminate(void (*)())
[ 07 ] std::terminate()
[ 08 ] __thread_proxy<…> /Users/runner/work/graphia/graphia/source/shared/utils/threadpool.cpp:56
[ 09 ] _pthread_start
[ 10 ] thread_start
[ 11 ] 0x10148f070
```
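
For reference, the pattern in play here is the standard predicate wait in a worker loop. The following is a minimal sketch of that shape only; it is not the actual threadpool.cpp code, and the class and member names are invented for illustration:

```cpp
// A minimal sketch, assuming a conventional task-queue design; this is
// NOT the actual graphia ThreadPool, and all names here are invented.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPoolSketch
{
public:
    explicit ThreadPoolSketch(size_t numThreads)
    {
        for(size_t i = 0u; i < numThreads; i++)
        {
            _threads.emplace_back([this]
            {
                while(true)
                {
                    std::function<void()> task;

                    {
                        std::unique_lock<std::mutex> lock(_mutex);

                        // Workers park here when there is no work; this is
                        // the kind of wait() call the crash occurs in
                        _waitForWork.wait(lock, [this] { return _stop || !_tasks.empty(); });

                        if(_stop && _tasks.empty())
                            return;

                        task = std::move(_tasks.front());
                        _tasks.pop();
                    }

                    task();
                }
            });
        }
    }

    void enqueue(std::function<void()> task)
    {
        {
            std::unique_lock<std::mutex> lock(_mutex);
            _tasks.push(std::move(task));
        }

        // Wake one parked worker
        _waitForWork.notify_one();
    }

    ~ThreadPoolSketch()
    {
        {
            std::unique_lock<std::mutex> lock(_mutex);
            _stop = true;
        }

        _waitForWork.notify_all();

        for(auto& thread : _threads)
            thread.join();
    }

private:
    std::vector<std::thread> _threads;
    std::queue<std::function<void()>> _tasks;
    std::mutex _mutex;
    std::condition_variable _waitForWork;
    bool _stop = false;
};
```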

On macOS condition_variable::wait() is basically a wrapper around pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex). It turns out that in the crash case pthread_cond_wait() returns EINVAL, which libc++ converts into a std::system_error exception; since the exception escapes the worker thread's function, the application terminates. Incidentally, this is printed to the console:

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: condition_variable wait failed: Invalid argument
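
For reference, libc++'s wrapper is roughly along these lines (paraphrased from memory, not the verbatim libc++ source), which is where the "condition_variable wait failed" text in that message comes from:

```cpp
// Rough paraphrase of libc++'s condition_variable::wait(), from memory,
// not the verbatim source: any non-zero return from pthread_cond_wait()
// becomes a thrown std::system_error. Thrown from a worker thread with
// no handler above it, that exception ends in std::terminate().
void condition_variable::wait(unique_lock<mutex>& lk)
{
    int ec = pthread_cond_wait(&__cv_, lk.mutex()->native_handle());
    if(ec != 0)
        __throw_system_error(ec, "condition_variable wait failed");
}
```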

The "official" reasons for EINVAL are as follows:

[EINVAL]
    The value specified by cond, mutex, or abstime is invalid.
[EINVAL]
    Different mutexes were supplied for concurrent pthread_cond_wait() or pthread_cond_timedwait() operations on the same condition variable.
[EINVAL]
    The mutex was not owned by the current thread at the time of the call.

None of these apply.

After spending many, many hours single-stepping the debugger through the disassembly of macOS's pthread code, I've come to the conclusion that the source of this EINVAL is here:

https://github.com/apple/darwin-libpthread/blob/2b46cbcc56ba33791296cd9714b2c90dae185ec7/src/pthread_cond.c#L647

```c
	// SUSv3 requires pthread_cond_wait to be a cancellation point
	if (conforming == PTHREAD_CONFORM_UNIX03_CANCELABLE) {
		pthread_cleanup_push(_pthread_psynch_cond_cleanup, (void *)cond);
		updateval = __psynch_cvwait(cond, cvlsgen, ucntval, (pthread_mutex_t *)npmtx, mugen, flags, (int64_t)(then->tv_sec), (int32_t)(then->tv_nsec));
		pthread_testcancel();
		pthread_cleanup_pop(0);
	} else {
		updateval = __psynch_cvwait(cond, cvlsgen, ucntval, (pthread_mutex_t *)npmtx, mugen, flags, (int64_t)(then->tv_sec), (int32_t)(then->tv_nsec));
	}

	if (updateval == (uint32_t)-1) {
		int err = errno;
		switch (err & 0xff) {
		case ETIMEDOUT:
			res = ETIMEDOUT;
			break;
		case EINTR:
			// spurious wakeup (unless canceled)
			res = 0;
			break;
		default:
-->			res = EINVAL;
			break;
		}
```

The __psynch_cvwait syscall returns -1 and sets errno to EBUSY. This makes no sense at all, as the worker thread obviously already owns the mutex at this point, so it can't possibly be busy. FYI, errno ends up in the %esi register (this is /usr/lib/system/libsystem_pthread.dylib on Darwin 19.6.0/macOS 10.15.7):

```
    6420:       e8 99 28 00 00  callq   0x8cbe ## symbol stub for: ___psynch_cvwait
    6425:       48 83 c4 10     addq    $0x10, %rsp
    6429:       41 89 c5        movl    %eax, %r13d
    642c:       89 df   movl    %ebx, %edi
    642e:       e8 ef fc ff ff  callq   __pthread_testcancel
    6433:       48 8d 45 88     leaq    -0x78(%rbp), %rax
    6437:       48 8b 40 10     movq    0x10(%rax), %rax
    643b:       49 89 47 08     movq    %rax, 0x8(%r15)
    643f:       eb 1d   jmp     0x645e
    6441:       48 8b 4d c8     movq    -0x38(%rbp), %rcx
    6445:       4c 89 e7        movq    %r12, %rdi
    6448:       4c 89 f6        movq    %r14, %rsi
    644b:       89 da   movl    %ebx, %edx
    644d:       41 53   pushq   %r11
    644f:       ff 75 a8        pushq   -0x58(%rbp)
    6452:       e8 67 28 00 00  callq   0x8cbe ## symbol stub for: ___psynch_cvwait
    6457:       48 83 c4 10     addq    $0x10, %rsp
    645b:       41 89 c5        movl    %eax, %r13d
    645e:       45 85 ed        testl   %r13d, %r13d
    6461:       48 8b 7d c0     movq    -0x40(%rbp), %rdi
    6465:       74 4d   je      0x64b4
    6467:       41 83 fd ff     cmpl    $-0x1, %r13d
    646b:       75 33   jne     0x64a0
    646d:       e8 22 28 00 00  callq   0x8c94 ## symbol stub for: ___error
--> 6472:       8b 30   movl    (%rax), %esi
    6474:       31 c0   xorl    %eax, %eax
    6476:       40 80 fe 04     cmpb    $0x4, %sil
    647a:       b9 16 00 00 00  movl    $0x16, %ecx
    647f:       0f 44 c8        cmovel  %eax, %ecx
    6482:       40 80 fe 3c     cmpb    $0x3c, %sil
    6486:       41 bd 3c 00 00 00       movl    $0x3c, %r13d
    648c:       44 0f 45 e9     cmovnel %ecx, %r13d
    6490:       4c 89 e7        movq    %r12, %rdi
    6493:       31 d2   xorl    %edx, %edx
    6495:       e8 02 d2 ff ff  callq   __pthread_cond_updateval
```

(In the disassembly above, the cmpb $0x4 and cmpb $0x3c comparisons test for EINTR (4) and ETIMEDOUT (60), and movl $0x16 is EINVAL (22), matching the switch in the C source; an EBUSY (16) from the syscall therefore falls through to the default case.) The source of the EBUSY is potentially the following, but that's difficult to verify as I obviously can't go stepping into kernel code:

https://github.com/apple/darwin-libpthread/blob/2b46cbcc56ba33791296cd9714b2c90dae185ec7/kern/kern_synch.c#L1199

Possibly this is indicative of some kind of starvation state, or of hitting a buffer limit? I'm not sure. In any case, at this point I'm leaning towards it being some kind of bug in macOS itself; I can't see any obvious misuse of threading primitives on my part.

Reproducing the problem basically involves running the layout continuously until it happens, using a script that periodically reopens a file to restart the layout. It usually takes about 30-60 minutes.

As a direct response to this issue I've broadened the scope of the task queuing lock, so it now locks and releases once per task set rather than once for each individual task (745b0b7; a rough sketch of the idea follows below). It's early days, but this appears to have at least made the problem harder to reproduce (as in, I haven't been able to yet). Nevertheless, this change hasn't actually fixed anything that was verified to be broken in the first place, hence this essay, which I'm writing down in case I need to come back to it and refresh my memory. Hope that's helpful, future Tim.
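
To illustrate the shape of that change (this is not the actual 745b0b7 diff, and it reuses the invented ThreadPoolSketch names from the sketch above), a coarser-grained enqueue for a whole task set looks something like:

```cpp
// Hypothetical member function for the ThreadPoolSketch above, showing
// the coarser lock granularity: one lock/unlock cycle and one wake-up
// pass per task *set*, rather than one per individual task.
void enqueueAll(std::vector<std::function<void()>> tasks)
{
    {
        std::unique_lock<std::mutex> lock(_mutex);

        for(auto& task : tasks)
            _tasks.push(std::move(task));
    }

    // Wake all parked workers once, after the whole batch is queued
    _waitForWork.notify_all();
}
```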

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Nov 22, 2022
This has been flaky on macOS for a while ([hud](https://hud.pytorch.org/failure/RuntimeError%3A%20test_ops_fwd_gradients%20failed)) and I can reproduce this locally. The issue was raised by #66033 and it seems to point to macOS itself (graphia-app/graphia#33). So switching to a single thread when running `test_ops_fwd_gradients` on macOS, as a mitigation for the flaky tests.

### Testing

`pytest test_ops_fwd_gradients.py -k test_fn_fwgrad_bwgrad -vv --flake-finder` runs all `test_fn_fwgrad_bwgrad` tests 50 times to make sure they all pass (no flakiness anymore).

https://hud.pytorch.org/tests shows that `test_ops_fwd_gradients` on macOS takes about 15 minutes to finish, or 8 minutes when using 2 shards as in the test. There is no obvious difference in the test duration:

```
2022-11-21T21:34:18.6078080Z Running test_ops_fwd_gradients ... [2022-11-21 21:34:18.600663]
2022-11-21T21:34:21.6805770Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=0', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.680156]
2022-11-21T21:34:21.6806380Z Ignoring disabled issues:  []
2022-11-21T21:34:21.6815250Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=1', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.681174]
2022-11-21T21:34:21.6815830Z Ignoring disabled issues:  []
.....
2022-11-21T21:40:42.2422700Z =============================== warnings summary ===============================
.....
2022-11-21T21:40:42.2424670Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-47b619449ea7db1f.xml -
2022-11-21T21:40:42.2424850Z = 831 passed, 596 skipped, 5 deselected, 17 xfailed, 1 warning in 374.54s (0:06:14) =
.....
2022-11-21T21:42:00.1923310Z =============================== warnings summary ===============================
.....
2022-11-21T21:42:00.1925370Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-d24ee6419a602a6e.xml -
2022-11-21T21:42:00.1925540Z = 828 passed, 603 skipped, 7 deselected, 20 xfailed, 1 warning in 452.94s (0:07:32) =
....
2022-11-21T21:42:09.9035670Z FINISHED PRINTING LOG FILE of test_ops_fwd_gradients (/Users/runner/work/pytorch/pytorch/test/test-reports/test_ops_fwd_gradients_ha_3rfhb)
```

Pull Request resolved: #89410
Approved by: https://github.com/soulitzer
kulinseth pushed a commit to kulinseth/pytorch that referenced this issue Dec 10, 2022