## Summary
While tracing GPU benchmark and solve paths, I noticed step_collect() dispatches work into one DP slot and then immediately blocks on reading that same slot. This means the next batch is not queued until readback and decode finish, so the current double-buffer setup does not actually overlap GPU and CPU work.
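In schematic terms, the serialized flow looks like the model below. All struct, field, and method names here are illustrative stand-ins (not the actual solver API); the point is only the ordering: each call blocks on the slot it just submitted before the next batch can be queued.

```rust
// Minimal model of the serialized step_collect() flow; names are
// hypothetical stand-ins used only to demonstrate operation ordering.
struct Solver {
    current_slot: usize,
    log: Vec<String>, // records the order of operations for inspection
}

impl Solver {
    fn step_collect(&mut self) {
        let slot = self.current_slot;
        self.log.push(format!("dispatch slot {slot}"));
        // The SAME slot is read back immediately, stalling the CPU:
        self.log.push(format!("blocking read slot {slot}"));
        self.log.push(format!("reset slot {slot}"));
        self.current_slot = 1 - slot; // the flip happens only after the stall
    }
}

fn run(batches: usize) -> Vec<String> {
    let mut s = Solver { current_slot: 0, log: Vec::new() };
    for _ in 0..batches {
        s.step_collect();
    }
    s.log
}

fn main() {
    for line in run(2) {
        println!("{line}");
    }
}
```

Note that "dispatch slot 1" can only appear after "blocking read slot 0" completes, which is the serialization this issue describes.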
## Why this matters
- Every GPU batch pays a synchronous readback stall on the hot path
- Benchmarks and normal solves report lower throughput than the architecture suggests
- Multi-GPU workers hit the same per-batch serialization in each solver thread
## Evidence

In `src/solver.rs:515`, `step_collect()` dispatches and copies for `slot = self.current_slot`, submits the encoder, then immediately:

- calls `self.read_slot_dps(slot)?` at `src/solver.rs:561`
- calls `self.reset_dp_count(slot)?` at `src/solver.rs:562`
- only then flips the slot at `src/solver.rs:564`
In `src/solver.rs:605`, `read_slot_dps()` maps the staging buffer and blocks via `device.poll(wgpu::PollType::wait_indefinitely())` at `src/solver.rs:618`.

This contradicts the comment at `src/gpu/buffers.rs:3`, which states that CPU readback of the previous dispatch overlaps with the next GPU dispatch.
## Repro

- Run `kangaroo --benchmark`.
- Follow `src/benchmark.rs:82` into `solver.step()` and then `step_collect()`.
- Confirm that each call reads and waits on the same slot that was just submitted.
- Confirm the next slot is selected only after readback completes.
## Suggested fix
Implement real producer-consumer slot pipelining:
- Keep separate write and read slots
- First call: dispatch only, no readback yet
- Steady state: dispatch new work to write slot, then read and reset previous slot
- Swap roles after dispatch is queued
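The steps above can be sketched as a plain-Rust state machine (names are hypothetical; real code would queue wgpu work where the log entries are recorded). The key property is that the previous slot's readback happens only after the next batch has already been submitted:

```rust
// Sketch of producer-consumer slot pipelining, assuming two DP slots.
// All names are illustrative stand-ins, not the actual solver API.
struct PipelinedSolver {
    write_slot: usize,
    read_slot: Option<usize>, // slot with in-flight results, if any
    log: Vec<String>,
}

impl PipelinedSolver {
    fn step_collect(&mut self) {
        let write = self.write_slot;
        // 1. Queue new GPU work first, so it runs while we read back.
        self.log.push(format!("dispatch slot {write}"));
        // 2. Only now read and reset the PREVIOUS slot (skipped on the
        //    first call, when nothing is in flight yet).
        if let Some(prev) = self.read_slot {
            self.log.push(format!("read slot {prev}"));
            self.log.push(format!("reset slot {prev}"));
        }
        // 3. Swap roles after the new dispatch is queued.
        self.read_slot = Some(write);
        self.write_slot = 1 - write;
    }
}

fn run_pipelined(batches: usize) -> Vec<String> {
    let mut s = PipelinedSolver { write_slot: 0, read_slot: None, log: Vec::new() };
    for _ in 0..batches {
        s.step_collect();
    }
    s.log
}

fn main() {
    for line in run_pipelined(3) {
        println!("{line}");
    }
}
```

A real implementation would additionally drain the last in-flight slot at shutdown (and on solution found), since every batch's readback is deferred by one call.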
Optionally, consider `CommandEncoder::map_buffer_on_submit` and non-blocking polling patterns from current wgpu guidance to reduce explicit wait points.
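As a rough model of the non-blocking polling pattern (no wgpu types are used here): `map_async` fires a callback once the buffer is mappable, and the app checks readiness between units of useful CPU work instead of waiting indefinitely. A background thread stands in for the GPU completing the staging copy:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Model of non-blocking readback: a completion callback sets a flag,
// and the caller polls instead of blocking on an indefinite wait.
fn map_then_poll() -> bool {
    let ready = Arc::new(AtomicBool::new(false));

    // Stand-in for the map_async completion callback.
    let cb = {
        let ready = Arc::clone(&ready);
        move || ready.store(true, Ordering::Release)
    };

    // Stand-in for the GPU finishing the staging copy asynchronously.
    let gpu = thread::spawn(move || {
        thread::sleep(Duration::from_millis(5));
        cb();
    });

    // Instead of one indefinite wait, check readiness between units of
    // CPU work (e.g. decoding the previous batch of DPs).
    while !ready.load(Ordering::Acquire) {
        thread::yield_now(); // real code: a non-blocking device poll + work
    }
    gpu.join().unwrap();
    ready.load(Ordering::Acquire)
}

fn main() {
    println!("mapped: {}", map_then_poll());
}
```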
## References

- `src/solver.rs:515`
- `src/solver.rs:561`
- `src/solver.rs:605`
- `src/solver.rs:618`
- `src/gpu/buffers.rs:3`
- `src/benchmark.rs:82`