Removing workers times out on nightly #6

Open
JamesWrigley opened this issue Nov 14, 2024 · 0 comments
Labels
bug Something isn't working

Making this to track an issue first seen in #4: some of the tests call rmprocs(), and after changing CI to run with JULIA_NUM_THREADS=4, the workers can hang until rmprocs() times out and sends SIGQUIT.

Example backtrace:

Backtrace
      From worker 21:	
      From worker 21:	[2110] signal 3: Quit          # Timeout, rmprocs() sends SIGQUIT
      From worker 21:	in expression starting at none:1
      From worker 21:	unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 21:	uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
      From worker 21:	jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
      From worker 21:	unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	unknown function (ip: (nil)) at (unknown file)
      From worker 21:	unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 21:	uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
      From worker 21:	jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
      From worker 21:	unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	unknown function (ip: (nil)) at (unknown file)
      From worker 21:	unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 21:	uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
      From worker 21:	jl_parallel_gc_threadfun at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:3550
      From worker 21:	unknown function (ip: 0x7ff13a094ac2) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	unknown function (ip: 0x7ff13a12684f) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	unknown function (ip: (nil)) at (unknown file)
      From worker 21:	unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 21:	wait at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_locks.h:130 [inlined]
      From worker 21:	operator() at /cache/build/builder-amdci4-4/julialang/julia-master/src/engine.cpp:97 [inlined]
      From worker 21:	jl_engine_reserve at /cache/build/builder-amdci4-4/julialang/julia-master/src/engine.cpp:100
      From worker 21:	engine_reserve at ./compiler/types.jl:408 [inlined]
      From worker 21:	engine_reserve at ./compiler/types.jl:407 [inlined]
      From worker 21:	typeinf_ext at ./compiler/typeinfer.jl:1080
      From worker 21:	typeinf_ext_toplevel at ./compiler/typeinfer.jl:1176 [inlined]
      From worker 21:	typeinf_ext_toplevel at ./compiler/typeinfer.jl:1174     # Start compilation and get stuck in the GC
      From worker 21:	jfptr_typeinf_ext_toplevel_48134.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
      From worker 21:	jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
      From worker 21:	jl_type_infer at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:394
      From worker 21:	jl_compile_method_internal at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:2820
      From worker 21:	_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:3299 [inlined]
      From worker 21:	ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-master/src/gf.c:3495
      From worker 21:	show_exception_stack at ./errorshow.jl:1015    # Something in an errormonitor fails and we try to print the exception
      From worker 21:	display_error at ./client.jl:117
      From worker 21:	#errormonitor##0 at ./task.jl:734
      From worker 21:	jfptr_YY.errormonitorYY.YY.0_74460.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
      From worker 21:	jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
      From worker 21:	start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1263   # Switches to one of the remaining tasks
      From worker 21:	unknown function (ip: (nil)) at (unknown file)
      From worker 21:	unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 21:	uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
      From worker 21:	jl_safepoint_wait_thread_resume at /cache/build/builder-amdci4-4/julialang/julia-master/src/safepoint.c:271
      From worker 21:	segv_handler at /cache/build/builder-amdci4-4/julialang/julia-master/src/signals-unix.c:395 [inlined]
      From worker 21:	segv_handler at /cache/build/builder-amdci4-4/julialang/julia-master/src/signals-unix.c:381
      From worker 21:	unknown function (ip: 0x7ff13a04251f) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	jl_gc_state_set at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_threads.h:275 [inlined]
      From worker 21:	maybe_collect at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia_threads.h:268 [inlined]
      From worker 21:	jl_gc_small_alloc_inner at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:737 [inlined]
      From worker 21:	jl_gc_small_alloc_noinline at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:795 [inlined]
      From worker 21:	jl_gc_alloc_ at /cache/build/builder-amdci4-4/julialang/julia-master/src/gc-stock.c:809
      From worker 21:	unknown function (ip: (nil)) at (unknown file)
      From worker 21:	unknown function (ip: 0x7ff13a091115) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	pthread_cond_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 21:	uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:822
      From worker 21:	ijl_task_get_next at /cache/build/builder-amdci4-4/julialang/julia-master/src/scheduler.c:520
      From worker 21:	poptask at ./task.jl:1158
      From worker 21:	wait at ./task.jl:1167
      From worker 21:	task_done_hook at ./task.jl:839
      From worker 21:	jfptr_task_done_hook_74488.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
      From worker 21:	jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
      From worker 21:	jl_finish_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:338
      From worker 21:	start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1274
      From worker 21:	unknown function (ip: (nil)) at (unknown file)
      From worker 21:	pthread_cond_destroy at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 21:	__cxa_finalize at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)     # Running finalizers and atexit() handlers?
      From worker 21:	__do_global_dtors_aux at /opt/hostedtoolcache/julia/nightly/x64/bin/../lib/julia/libjulia-internal.so.1.12 (unknown line)
      From worker 21:	_fini at /opt/hostedtoolcache/julia/nightly/x64/bin/../lib/julia/libjulia-internal.so.1.12 (unknown line)
      From worker 21:	unknown function (ip: 0x7ff13a045494) at /lib/x86_64-linux-gnu/libc.so.6
      From worker 21:	exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
      From worker 21:	ijl_exit at /cache/build/builder-amdci4-4/julialang/julia-master/src/init.c:199
      From worker 21:	jlplt_ijl_exit_77448.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
      From worker 21:	exit at ./initdefs.jl:28
      From worker 21:	exit at ./initdefs.jl:29           # exit() is called
      From worker 21:	jfptr_exit_77443.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
      From worker 21:	jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
      From worker 21:	jl_f__call_latest at /cache/build/builder-amdci4-4/julialang/julia-master/src/builtins.c:883
      From worker 21:	#invokelatest#1 at ./essentials.jl:1049 [inlined]
      From worker 21:	invokelatest at ./essentials.jl:1046
      From worker 21:	jfptr_invokelatest_62384.1 at /opt/hostedtoolcache/julia/nightly/x64/lib/julia/sys.so (unknown line)
      From worker 21:	jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
      From worker 21:	do_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/builtins.c:839
      From worker 21:	#handle_msg##12 at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:312   # Worker gets call to `exit()` from the master
      From worker 21:	run_work_thunk at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:72
      From worker 21:	#handle_msg##10 at /home/runner/work/DistributedNext.jl/DistributedNext.jl/src/process_messages.jl:312
      From worker 21:	unknown function (ip: 0x7ff0fb7455bf) at (unknown file)
      From worker 21:	jl_apply at /cache/build/builder-amdci4-4/julialang/julia-master/src/julia.h:2243 [inlined]
      From worker 21:	start_task at /cache/build/builder-amdci4-4/julialang/julia-master/src/task.c:1263
      From worker 21:	unknown function (ip: (nil)) at (unknown file)
      From worker 21:	Allocations: 9179557 (Pool: 9179436; Big: 121); GC: 8

I've only observed this on nightly, almost always on Ubuntu/OSX, almost never on Windows. A couple of times the workers have segfaulted somewhere in LLVM, but I don't have a backtrace for that.

It doesn't happen every time rmprocs() is called. The most reliable trigger is the topology.jl tests, though once or twice I've seen other tests failing.
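
A rough way to try reproducing this outside the test suite might be a loop that repeatedly starts threaded workers and removes them again. This is only a sketch under assumptions: it uses the `Distributed` stdlib as a stand-in (assuming DistributedNext exposes the same `addprocs`/`rmprocs` API, in which case `using DistributedNext` could be swapped in), and `stress_rmprocs` and its keyword arguments are hypothetical names, not part of either package. Since the hang is intermittent, a clean run here would not prove anything.

```julia
using Distributed  # assumption: DistributedNext could be substituted here

# Hypothetical stress loop: start workers with several threads each, make them
# do a little remote work, then remove them and record how long rmprocs() took.
function stress_rmprocs(; iterations=20, nworkers=2, nthreads=4)
    timings = Float64[]
    for i in 1:iterations
        # Pass the thread count on the worker command line; this is meant to
        # mimic running CI with JULIA_NUM_THREADS=4.
        pids = addprocs(nworkers; exeflags="--threads=$nthreads")

        # Check that each worker really started with the requested thread count.
        for p in pids
            @assert remotecall_fetch(() -> Threads.nthreads(), p) == nthreads
        end

        # With the default `waitfor`, rmprocs() blocks until the workers are gone.
        elapsed = @elapsed rmprocs(pids)
        push!(timings, elapsed)
        @info "rmprocs finished" iteration=i elapsed
    end
    return timings
end
```

Calling `stress_rmprocs()` on a nightly build and looking for iterations where `elapsed` is unusually large would be one way to spot the timeout, assuming the hang is not specific to what the topology.jl tests do.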
