## Summary
While tracing GPU benchmark and solve paths, I noticed step_collect() dispatches work into one DP slot and then immediately blocks on reading that same slot. This means the next batch is not queued until readback and decode finish, so the current double-buffer setup does not actually overlap GPU and CPU work.
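In schematic terms, the serialized flow looks like the model below. All struct, field, and method names here are illustrative stand-ins (not the actual solver API); the point is only the ordering: each call blocks on the slot it just submitted before the next batch can be queued.

```rust
// Minimal model of the serialized step_collect() flow; names are
// hypothetical stand-ins used only to demonstrate operation ordering.
struct Solver {
    current_slot: usize,
    log: Vec<String>, // records the order of operations for inspection
}

impl Solver {
    fn step_collect(&mut self) {
        let slot = self.current_slot;
        self.log.push(format!("dispatch slot {slot}"));
        // The SAME slot is read back immediately, stalling the CPU:
        self.log.push(format!("blocking read slot {slot}"));
        self.log.push(format!("reset slot {slot}"));
        self.current_slot = 1 - slot; // the flip happens only after the stall
    }
}

fn run(batches: usize) -> Vec<String> {
    let mut s = Solver { current_slot: 0, log: Vec::new() };
    for _ in 0..batches {
        s.step_collect();
    }
    s.log
}

fn main() {
    for line in run(2) {
        println!("{line}");
    }
}
```

Note that "dispatch slot 1" can only appear after "blocking read slot 0" completes, which is the serialization this issue describes.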
## Why this matters
- Every GPU batch pays a synchronous readback stall on the hot path
- Benchmarks and normal solves report lower throughput than the architecture suggests
- Multi-GPU workers hit the same per-batch serialization in each solver thread
## Evidence

In `src/solver.rs:515`, `step_collect()` dispatches and copies for `slot = self.current_slot`, submits the encoder, then immediately:

- calls `self.read_slot_dps(slot)?` at `src/solver.rs:561`
- calls `self.reset_dp_count(slot)?` at `src/solver.rs:562`
- only then flips the slot at `src/solver.rs:564`
In `src/solver.rs:605`, `read_slot_dps()` maps the staging buffer and blocks via `device.poll(wgpu::PollType::wait_indefinitely())` at `src/solver.rs:618`.

This contradicts the comment at `src/gpu/buffers.rs:3`, which states that CPU readback of the previous dispatch overlaps with the next GPU dispatch.
## Repro

- Run `kangaroo --benchmark`.
- Follow `src/benchmark.rs:82` into `solver.step()` and then `step_collect()`.
- Confirm that each call reads and waits on the same slot that was just submitted.
- Confirm the next slot is selected only after readback completes.
## Suggested fix
Implement real producer-consumer slot pipelining:
- Keep separate write and read slots
- First call: dispatch only, no readback yet
- Steady state: dispatch new work to write slot, then read and reset previous slot
- Swap roles after dispatch is queued
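The steps above can be sketched as a plain-Rust state machine (names are hypothetical; real code would queue wgpu work where the log entries are recorded). The key property is that the previous slot's readback happens only after the next batch has already been submitted:

```rust
// Sketch of producer-consumer slot pipelining, assuming two DP slots.
// All names are illustrative stand-ins, not the actual solver API.
struct PipelinedSolver {
    write_slot: usize,
    read_slot: Option<usize>, // slot with in-flight results, if any
    log: Vec<String>,
}

impl PipelinedSolver {
    fn step_collect(&mut self) {
        let write = self.write_slot;
        // 1. Queue new GPU work first, so it runs while we read back.
        self.log.push(format!("dispatch slot {write}"));
        // 2. Only now read and reset the PREVIOUS slot (skipped on the
        //    first call, when nothing is in flight yet).
        if let Some(prev) = self.read_slot {
            self.log.push(format!("read slot {prev}"));
            self.log.push(format!("reset slot {prev}"));
        }
        // 3. Swap roles after the new dispatch is queued.
        self.read_slot = Some(write);
        self.write_slot = 1 - write;
    }
}

fn run_pipelined(batches: usize) -> Vec<String> {
    let mut s = PipelinedSolver { write_slot: 0, read_slot: None, log: Vec::new() };
    for _ in 0..batches {
        s.step_collect();
    }
    s.log
}

fn main() {
    for line in run_pipelined(3) {
        println!("{line}");
    }
}
```

A real implementation would additionally drain the last in-flight slot at shutdown (and on solution found), since every batch's readback is deferred by one call.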
Optionally, consider `CommandEncoder::map_buffer_on_submit` and non-blocking polling patterns from current wgpu guidance to reduce explicit wait points.
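As a rough model of the non-blocking polling pattern (no wgpu types are used here): `map_async` fires a callback once the buffer is mappable, and the app checks readiness between units of useful CPU work instead of waiting indefinitely. A background thread stands in for the GPU completing the staging copy:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Model of non-blocking readback: a completion callback sets a flag,
// and the caller polls instead of blocking on an indefinite wait.
fn map_then_poll() -> bool {
    let ready = Arc::new(AtomicBool::new(false));

    // Stand-in for the map_async completion callback.
    let cb = {
        let ready = Arc::clone(&ready);
        move || ready.store(true, Ordering::Release)
    };

    // Stand-in for the GPU finishing the staging copy asynchronously.
    let gpu = thread::spawn(move || {
        thread::sleep(Duration::from_millis(5));
        cb();
    });

    // Instead of one indefinite wait, check readiness between units of
    // CPU work (e.g. decoding the previous batch of DPs).
    while !ready.load(Ordering::Acquire) {
        thread::yield_now(); // real code: a non-blocking device poll + work
    }
    gpu.join().unwrap();
    ready.load(Ordering::Acquire)
}

fn main() {
    println!("mapped: {}", map_then_poll());
}
```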
## References

- `src/solver.rs:515`
- `src/solver.rs:561`
- `src/solver.rs:605`
- `src/solver.rs:618`
- `src/gpu/buffers.rs:3`
- `src/benchmark.rs:82`