usher-sampled: -T flag not respected by Taskflow executors → ~9000 threads on multi-core hosts #433

@Colossus

Greetings, @yatisht ! This is Johannes from Gill's lab. usher-sampled is overloading one of our servers (hgwdev) at UCSC unfortunately. See Claude's report below. Hope you're doing well!

Summary

usher-sampled's -T/--threads flag only caps Intel TBB's worker pool. The four tf::Executor instances introduced with the Taskflow integration are default-constructed, so they each spawn std::thread::hardware_concurrency() workers regardless of -T. On a 192-core host this produces ~9,000 threads per usher-sampled invocation, which dominates the system load average and inflates kernel CPU time.

Affected lines

Each of the following four sites default-constructs a tf::Executor (line numbers are against master at commit 9d7ecf7a):

File                                          Line
src/usher-sampled/sampler.cpp                 91
src/usher-sampled/main_mapper.cpp             492
src/usher-sampled/place_sample.cpp            703
src/usher-sampled/place_sample_follower.cpp   212

From Taskflow's header (taskflow/core/executor.hpp):

explicit Executor(
    size_t N = std::thread::hardware_concurrency(),
    std::shared_ptr<WorkerInterface> wix = nullptr
);
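To make the effect of that default argument concrete, here is a minimal stand-in. It is not Taskflow: `PoolSketch` and `default_workers` are hypothetical names, and the class only copies the defaulted-N shape of the constructor above and parks N worker threads until destruction.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thread-pool executor; NOT Taskflow's implementation.
class PoolSketch {
public:
    // Same shape as the signature above: N defaults to the hardware thread count.
    explicit PoolSketch(std::size_t n = default_workers()) {
        workers_.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            // Each worker sleeps until the pool is destroyed.
            workers_.emplace_back([this] {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_; });
            });
        }
    }

    ~PoolSketch() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    PoolSketch(const PoolSketch&) = delete;
    PoolSketch& operator=(const PoolSketch&) = delete;

    std::size_t num_workers() const { return workers_.size(); }

    static std::size_t default_workers() {
        // hardware_concurrency() may return 0 when the value is unknown.
        unsigned hc = std::thread::hardware_concurrency();
        return hc == 0 ? 1 : hc;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::vector<std::thread> workers_;  // declared last so workers see initialized members
};
```

Under these assumptions, `PoolSketch pool;` starts one thread per logical CPU no matter what was passed on the command line, while `PoolSketch pool(16);` starts exactly 16. That is the difference between the current call sites and the proposed ones.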

Observed behavior on a 192-core host (UCSC hgwdev)

Two concurrent usher-sampled runs invoked with -T 16:

$ ps -o pid,user,nlwp,pcpu,cmd -C usher-sampled
    PID USER     NLWP %CPU CMD
1254718 angie    9824  2103 .../usher-sampled -T 16 -A -e 5 -t emptyTree.nwk -v ... -o ... --optimization_radius 0 --batch_size_per_process 100
3734113 angie    9679  2285 .../usher-sampled -T 16 -A -e 5 -t emptyTree.nwk -v ... -o ... --optimization_radius 0 --batch_size_per_process 100

Thread states per process (about 95 R, 200 D, and 14k S across both; most threads are pool workers parked in futex waits, doing no useful work):

PID 1254718: 49 R, 94 D, 7488 S
PID 3734113: 46 R, 103 D, 6733 S

Systemwide load avg ~420 with 192 cores, %idle ~50%, %iowait 0%, %sys ~22% (kernel time approaching user time, indicative of scheduler pressure). Context switches ~2M/s systemwide, ~10–11k per core per second — anomalous for what is supposed to be a CPU-bound tree-placement workload.

Cause

-T is correctly honored by TBB at src/usher-sampled/driver/main.cpp:506:

tbb::global_control global_limit(tbb::global_control::max_allowed_parallelism, num_threads);

…but Taskflow has its own thread pool that is not subject to this limit, and the Executor instances are default-constructed.

Suggested fix

num_threads is a global declared in src/usher-sampled/driver/main.cpp:42. Pass it to each Executor constructor:

- tf::Executor executor;
+ tf::Executor executor(num_threads);

at all four sites listed above.

A cleaner alternative would be a single global Taskflow executor configured once at startup (mirroring how TBB is configured), but the in-place fix is the minimal change to restore -T semantics.
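One possible shape for that single-executor alternative is sketched below. `SharedExecutor` and `StubExecutor` are hypothetical names, not existing usher-sampled or Taskflow code. The stub is there only so the sketch compiles without Taskflow; the intent is that `Exec` would be `tf::Executor`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>

// Holds one process-wide executor of type Exec, sized exactly once at startup.
// Not synchronized: configure() is meant to be called from main() after
// argument parsing, before any other thread asks for the executor.
template <class Exec>
class SharedExecutor {
public:
    static void configure(std::size_t num_threads) {
        if (instance()) throw std::logic_error("executor already configured");
        instance() = std::make_unique<Exec>(num_threads);
    }

    static Exec& get() {
        if (!instance()) throw std::logic_error("executor not configured");
        return *instance();
    }

private:
    // unique_ptr because the executor type need not be copyable or movable.
    static std::unique_ptr<Exec>& instance() {
        static std::unique_ptr<Exec> ptr;
        return ptr;
    }
};

// Hypothetical stand-in so the sketch builds without Taskflow.
struct StubExecutor {
    explicit StubExecutor(std::size_t n) : workers(n) {}
    std::size_t workers;
};
```

With something like this, main() would call `SharedExecutor<tf::Executor>::configure(num_threads)` next to the tbb::global_control setup, and each of the four sites would use `SharedExecutor<tf::Executor>::get()` instead of a local executor. Whether the four sites can safely share one pool has not been checked here.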

Repro

Any host where std::thread::hardware_concurrency() is much larger than the value passed to -T. The discrepancy is visible immediately via:

ps -o pid,nlwp,cmd -C usher-sampled

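The NLWP column will scale with hardware_concurrency() rather than with T. Four live executors alone would account for about 4 × hardware_concurrency() (768 on 192 cores). The ~9,700–9,800 observed above is roughly 50 × hardware_concurrency(), which suggests that executors are constructed many times over during a run rather than once per site.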
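The same check can be scripted, for example as a regression test that the thread count stays close to T. The sketch below is Linux-only and reads the `Threads:` field of /proc/<pid>/status, which should report the same per-process thread count that ps shows as NLWP. `parse_thread_count` and `thread_count_of` are hypothetical helper names.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Extracts the "Threads:" value from text in /proc/<pid>/status format.
// Returns -1 if the field is missing or malformed.
long parse_thread_count(std::istream& status) {
    const std::string key = "Threads:";
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, key.size(), key) == 0) {
            std::istringstream value(line.substr(key.size()));
            long n = -1;
            if (value >> n) return n;  // operator>> skips the separating tab
            return -1;
        }
    }
    return -1;
}

// Linux only: thread count of the process with the given PID,
// or -1 if its status file cannot be read.
long thread_count_of(long pid) {
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    if (!status) return -1;
    return parse_thread_count(status);
}
```

A test harness could start usher-sampled with a small -T, call `thread_count_of` on its PID, and fail if the result is far above T. The acceptable margin is a judgment call, since TBB and the main thread add a few threads of their own.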

Environment

  • usher-sampled master at 9d7ecf7a (2026-04-30). The four affected sites were introduced in commits 5c32e0aa ("Adding taskflow"), 782937c2 ("Adding taskflow: place_sample_follower"), and 915325e3 ("Updating files to be consistent with latest usher codebase").
  • Linux 5.14, 192 logical CPUs.
  • Invocation: usher-sampled -T 16 -A -e 5 -t emptyTree.nwk -v <vcf.gz> -o <output.pb> --optimization_radius 0 --batch_size_per_process 100.
