Greetings, @yatisht ! This is Johannes from Gill's lab. usher-sampled is overloading one of our servers (hgwdev) at UCSC unfortunately. See Claude's report below. Hope you're doing well!
Summary
usher-sampled's -T/--threads flag only caps Intel TBB's worker pool. The four tf::Executor instances introduced with the Taskflow integration are default-constructed, so they each spawn std::thread::hardware_concurrency() workers regardless of -T. On a 192-core host this produces ~9,000 threads per usher-sampled invocation, which dominates the system load average and inflates kernel CPU time.
Affected lines
All four sites default-construct tf::Executor (no constructor argument). Line numbers are against master at commit 9d7ecf7a:
| File | Line |
| --- | --- |
| src/usher-sampled/sampler.cpp | 91 |
| src/usher-sampled/main_mapper.cpp | 492 |
| src/usher-sampled/place_sample.cpp | 703 |
| src/usher-sampled/place_sample_follower.cpp | 212 |
From Taskflow's header (taskflow/core/executor.hpp):
    explicit Executor(
        size_t N = std::thread::hardware_concurrency(),
        std::shared_ptr<WorkerInterface> wix = nullptr
    );
Observed behavior on a 192-core host (UCSC hgwdev)
Two concurrent usher-sampled runs invoked with -T 16:
$ ps -o pid,user,nlwp,pcpu,cmd -C usher-sampled
PID USER NLWP %CPU CMD
1254718 angie 9824 2103 .../usher-sampled -T 16 -A -e 5 -t emptyTree.nwk -v ... -o ... --optimization_radius 0 --batch_size_per_process 100
3734113 angie 9679 2285 .../usher-sampled -T 16 -A -e 5 -t emptyTree.nwk -v ... -o ... --optimization_radius 0 --batch_size_per_process 100
Thread state breakdown per process. Combined across both processes this is only ~95 R and ~200 D against ~14k S, so most threads are pool workers parked in futex waits rather than doing useful work:
PID 1254718: 49 R, 94 D, 7488 S
PID 3734113: 46 R, 103 D, 6733 S
Systemwide, the load average was ~420 on 192 cores, with %idle ~50%, %iowait 0%, and %sys ~22% (kernel time approaching user time, indicative of scheduler pressure). Context switches ran at ~2M/s systemwide, or ~10–11k per core per second, which is anomalous for what is supposed to be a CPU-bound tree-placement workload.
Cause
-T is correctly honored by TBB at src/usher-sampled/driver/main.cpp:506:
tbb::global_control global_limit(tbb::global_control::max_allowed_parallelism, num_threads);
…but Taskflow has its own thread pool that is not subject to this limit, and the Executor instances are default-constructed.
Suggested fix
num_threads is a global declared in src/usher-sampled/driver/main.cpp:42. Pass it to each Executor constructor:
- tf::Executor executor;
+ tf::Executor executor(num_threads);
at all four sites listed above.
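One detail worth guarding when passing the value through: std::thread::hardware_concurrency() is allowed to return 0, and Taskflow appears to reject an executor with zero workers. The sketch below is a hypothetical helper (the name executor_worker_count is not part of usher-sampled, and the integer type of num_threads is an assumption) that always yields at least one worker:

```cpp
#include <cstddef>
#include <thread>

// Hypothetical helper, not part of usher-sampled: maps the value parsed
// from -T to a worker count that is always at least 1.
inline std::size_t executor_worker_count(std::size_t requested) {
    if (requested > 0) {
        return requested;  // honor an explicit -T value as given
    }
    // hardware_concurrency() may return 0 when the count is not computable.
    const unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? hw : 1;
}

// Intended use at each of the four sites (requires Taskflow):
//   tf::Executor executor(executor_worker_count(num_threads));
```

If num_threads is already guaranteed to be positive by the argument parser, passing it directly as in the diff above is sufficient.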
A cleaner alternative would be a single global Taskflow executor configured once at startup (mirroring how TBB is configured), but the in-place fix is the minimal change to restore -T semantics.
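A minimal sketch of the single-executor alternative, assuming a function-local static is acceptable in this codebase. The names shared_executor and StubExecutor are hypothetical; StubExecutor exists only so the sketch compiles without Taskflow, and in usher-sampled the template argument would be tf::Executor:

```cpp
#include <cstddef>

// Returns one process-wide executor of type Exec, constructed on first use
// with `workers` threads. Later calls ignore `workers` and return the same
// object. Initialization of a function-local static is thread-safe since C++11.
template <typename Exec>
Exec& shared_executor(std::size_t workers) {
    static Exec instance(workers);
    return instance;
}

// Hypothetical stand-in for tf::Executor: it only records the requested
// worker count and spawns no threads.
struct StubExecutor {
    explicit StubExecutor(std::size_t n) : workers(n) {}
    std::size_t workers;
};

// With Taskflow available, each of the four sites would use:
//   tf::Executor& executor = shared_executor<tf::Executor>(num_threads);
```

Whether the four sites can safely share one pool (for instance, if one site is reached from inside a task submitted by another) is not something this report verified.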
Repro
Any host where std::thread::hardware_concurrency() is much larger than the value passed to -T. The discrepancy is visible immediately via:
ps -o pid,nlwp,cmd -C usher-sampled
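The NLWP column scales with hardware_concurrency() rather than with T: every live tf::Executor contributes hardware_concurrency() threads, and the runs above reached ~9,700–9,800 threads on 192 cores with -T 16.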
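The same number can be read from inside the process, which may be handy for a regression check after the fix. The sketch below is a hypothetical helper (current_thread_count is not part of usher-sampled) and assumes Linux with /proc mounted. It reads the Threads: field of /proc/self/status, which is the value ps reports as NLWP:

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Returns the number of threads in the calling process, or -1 if
// /proc/self/status is unavailable or has no parsable "Threads:" line.
inline long current_thread_count() {
    std::ifstream status("/proc/self/status");
    std::string line;
    const std::string key = "Threads:";
    while (std::getline(status, line)) {
        if (line.compare(0, key.size(), key) == 0) {
            // strtol skips the tab that separates the key from the value.
            const long n = std::strtol(line.c_str() + key.size(), nullptr, 10);
            return n > 0 ? n : -1;
        }
    }
    return -1;
}
```

Logging this value once placement is underway should show a count near T after the fix, instead of a multiple of hardware_concurrency().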
Environment
- usher-sampled master at 9d7ecf7a (2026-04-30). The four affected sites were introduced in commits 5c32e0aa ("Adding taskflow"), 782937c2 ("Adding taskflow: place_sample_follower"), and 915325e3 ("Updating files to be consistent with latest usher codebase").
- Linux 5.14, 192 logical CPUs.
- Invocation:
  usher-sampled -T 16 -A -e 5 -t emptyTree.nwk -v <vcf.gz> -o <output.pb> --optimization_radius 0 --batch_size_per_process 100