GPU DP readback blocks each batch, so double buffering does not overlap work #78

@oritwoen

Summary

While tracing GPU benchmark and solve paths, I noticed step_collect() dispatches work into one DP slot and then immediately blocks on reading that same slot. This means the next batch is not queued until readback and decode finish, so the current double-buffer setup does not actually overlap GPU and CPU work.

Why this matters

  • Every GPU batch pays a synchronous readback stall on the hot path
  • Benchmarks and normal solves report lower throughput than the architecture suggests
  • Multi-GPU workers hit the same per-batch serialization in each solver thread
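To make the cost of the stall concrete, here is a rough back-of-the-envelope model, not a measurement. The `BatchCost` type and both functions are hypothetical names made up for this illustration, and the pipelined formula assumes an idealized two-slot overlap in which readback of one batch runs entirely while the GPU processes the next.

```rust
/// Per-batch costs in microseconds. Values are supplied by the caller;
/// nothing here is measured from the project.
#[derive(Clone, Copy)]
struct BatchCost {
    /// GPU compute plus copy into the staging buffer.
    gpu_us: u64,
    /// CPU-side map, read, and decode of one slot.
    readback_us: u64,
}

/// Behaviour described in this issue: each batch waits for its own
/// readback before the next dispatch is queued.
fn serialized_total_us(cost: BatchCost, batches: u64) -> u64 {
    batches * (cost.gpu_us + cost.readback_us)
}

/// Idealized two-slot pipeline: the first dispatch runs alone, every later
/// dispatch overlaps the previous batch's readback, and the final readback
/// has nothing left to overlap with.
fn pipelined_total_us(cost: BatchCost, batches: u64) -> u64 {
    if batches == 0 {
        return 0;
    }
    cost.gpu_us + (batches - 1) * cost.gpu_us.max(cost.readback_us) + cost.readback_us
}
```

With made-up costs of 800 µs GPU and 200 µs readback over 10 batches, the serialized model gives 10,000 µs and the pipelined model gives 8,200 µs. Real gains depend on the actual ratio of GPU time to readback time.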

Evidence

In src/solver.rs:515, step_collect() dispatches and copies for slot = self.current_slot, submits the encoder, and then immediately:

  • calls self.read_slot_dps(slot)? at src/solver.rs:561
  • calls self.reset_dp_count(slot)? at src/solver.rs:562
  • only then flips the slot at src/solver.rs:564

In src/solver.rs:605, read_slot_dps() maps staging and blocks via device.poll(wgpu::PollType::wait_indefinitely()) at src/solver.rs:618.

This is inconsistent with src/gpu/buffers.rs:3, which states that CPU readback of the previous dispatch overlaps with the next GPU dispatch.

Repro

  1. Run kangaroo --benchmark.
  2. Follow src/benchmark.rs:82 into solver.step() and then step_collect().
  3. Confirm that each call reads and waits on the same slot that was just submitted.
  4. Confirm that the next slot is selected only after readback completes.

Suggested fix

Implement real producer-consumer slot pipelining:

  • Keep separate write and read slots
  • First call: dispatch only, no readback yet
  • Steady state: dispatch new work to write slot, then read and reset previous slot
  • Swap roles after dispatch is queued
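The slot bookkeeping above could look roughly like the following std-only sketch. It is not the solver's actual code: `Backend`, `SlotPipeline`, `RecordingBackend`, and `drain` are hypothetical names introduced here, and the `Vec<u32>` return type is only a placeholder for decoded DPs. In the real solver, `dispatch` would record and submit the wgpu commands, and `read_and_reset` would correspond to read_slot_dps() followed by reset_dp_count().

```rust
/// Hypothetical stand-in for the GPU side of the solver.
trait Backend {
    /// Queue compute and copy-to-staging for `slot` without waiting.
    fn dispatch(&mut self, slot: usize);
    /// Wait for `slot`'s staging data, decode it, and reset its DP count.
    fn read_and_reset(&mut self, slot: usize) -> Vec<u32>;
}

/// Tracks which of the two slots receives new work and which is in flight.
struct SlotPipeline {
    write_slot: usize,
    in_flight: Option<usize>,
}

impl SlotPipeline {
    fn new() -> Self {
        Self { write_slot: 0, in_flight: None }
    }

    /// Dispatch to the write slot first, then read the previously
    /// dispatched slot (if any), then swap roles.
    fn step<B: Backend>(&mut self, backend: &mut B) -> Vec<u32> {
        let slot = self.write_slot;
        backend.dispatch(slot);
        // The first call has no previous slot, so it returns no results.
        let results = match self.in_flight.replace(slot) {
            Some(prev) => backend.read_and_reset(prev),
            None => Vec::new(),
        };
        self.write_slot ^= 1;
        results
    }

    /// Read the last in-flight slot once no more work will be dispatched.
    fn drain<B: Backend>(&mut self, backend: &mut B) -> Vec<u32> {
        match self.in_flight.take() {
            Some(prev) => backend.read_and_reset(prev),
            None => Vec::new(),
        }
    }
}

/// Test double that only records the order of operations.
#[derive(Default)]
struct RecordingBackend {
    log: Vec<String>,
}

impl Backend for RecordingBackend {
    fn dispatch(&mut self, slot: usize) {
        self.log.push(format!("dispatch {slot}"));
    }

    fn read_and_reset(&mut self, slot: usize) -> Vec<u32> {
        self.log.push(format!("read {slot}"));
        Vec::new()
    }
}
```

One consequence of this design is that results arrive one step late, so callers that stop stepping (for example at the end of a benchmark run or after a solve) need a final `drain` to collect the last batch.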

Optionally, consider CommandEncoder::map_buffer_on_submit and non-blocking polling patterns from current wgpu guidance to reduce explicit wait points.

References

  • src/solver.rs:515
  • src/solver.rs:561
  • src/solver.rs:605
  • src/solver.rs:618
  • src/gpu/buffers.rs:3
  • src/benchmark.rs:82


Labels: enhancement, performance
