Fix windows again#72
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 6324c19. Configure here.
| let device_is_lost = DEVICE_LOST.with(|lost| *lost.borrow()); | ||
| if device_is_lost { | ||
| // Device was lost in a previous call - don't attempt any GPU operations | ||
| return EngineStatus::Cancelled { hash_count: 0 }; |
There was a problem hiding this comment.
Dead GPU worker spins forever
High Severity
After device loss, DEVICE_LOST makes search_range return EngineStatus::Cancelled with zero hashes, but miner-service GPU workers treat Cancelled like a normal job change and keep taking work. The thread never exits, so that GPU worker contributes 0 H/s indefinitely while still consuming jobs.
Additional Locations (1)
Triggered by learned rule: Fail-early on persistent GPU device loss
Reviewed by Cursor Bugbot for commit 6324c19. Configure here.
| } | ||
| true | ||
| }); | ||
| } |
There was a problem hiding this comment.
Integrated GPU skipped when discrete init fails
Medium Severity
When a discrete GPU is enumerated, integrated adapters are removed from select_adapters before initialization. If the discrete request_device times out or errors and is skipped, init never tries the integrated GPU, so try_new can fail with no usable GPU even though a working integrated adapter was available.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 6324c19. Configure here.
n13
left a comment
There was a problem hiding this comment.
Verdict: Request changes (close — two correctness gaps)
Solid, well-tested fix for the iGPU+dGPU device-loss crash on Windows, and preferring discrete GPUs by default is reasonable. Requesting changes for two issues (both flagged by Bugbot and independently confirmed below), plus a scope/description mismatch.
What this PR actually changes (diff vs main)
- Threads
allow_integratedthrough CLI ->ServiceConfig->resolve_gpu_configuration->GpuEngine::try_new->select_adapters. select_adaptersdropsIntegratedGpuwhen anyDiscreteGpuis present (unlessallow_integrated).DEVICE_LOSTthread-local: afterBatchResult::DeviceLost, mark the worker dead, clearWORKER_RESOURCES, and short-circuitsearch_rangetoCancelled { 0 }.- Test updates + new tests.
Heads up: the description also lists "Async Runtime" (tokio/block_in_place/30s timeout), "GPU Tier Detection" (Adreno/RDNA regex, once_cell -> LazyLock), and the solution emoji. None of those are in this diff — they're already on main (gpu_tiers.rs already uses std::sync::LazyLock, Cargo.toml already has tokio time + futures, miner-service already logs the emoji). Please trim the description to what this PR actually changes.
1) Dead GPU worker becomes a silent no-op (High) — engine-gpu/src/lib.rs + miner-service/src/lib.rs
After device loss, search_range returns Cancelled { hash_count: 0 } on every subsequent call. In worker_loop, Cancelled is treated as a normal "new block" (info log), so the worker blocks on job_rx.recv(), wakes on each job, returns 0, and sleeps again — indefinitely. Not a busy-spin, but the thread never exits and contributes 0 H/s permanently. Nothing downstream notices: handle_connection only sums hash_count and there is no all-workers-dead health check, so a single-GPU rig silently mines at 0 H/s while still looking "alive".
This conflicts with the repo's fail-early rule ("never have a silent failure; all failures must produce error logging"). It is also an observability regression vs. the prior behavior, where the worker thread died on the panic and made start_job's try_send fail loudly every block. Now it is one error log, then silence.
Suggestion: keep surfacing the lost device (error log each time a job is handed to a lost worker), and for the single-GPU / all-GPU-workers-lost case, exit the process so a supervisor (systemd/docker) restarts it instead of running idle.
2) Integrated GPU excluded before init, so discrete-init failure leaves no GPU (Medium) — engine-gpu/src/lib.rs
select_adapters removes integrated adapters whenever a discrete adapter is enumerated, before init runs. In init, if the discrete request_device times out (the 30s path) or errors, that adapter is continue-skipped; if it was the only selected adapter, contexts.is_empty() and try_new fails with "No GPU adapters could be initialized" — even though a working integrated GPU was present. That is exactly the flaky-Windows-driver case this PR targets.
Suggestion: only exclude integrated adapters after at least one discrete device initializes successfully (or fall back to integrated if none do).
Minor
- Default behavior change: iGPU+dGPU users lose the iGPU unless they pass
--allow-integrated. Documented and reasonable — just calling it out. - Resource clearing on device loss looks correct: the
DeviceLostpaths inrun_single_batchnever unmap (buffer was not mapped), so droppingWORKER_RESOURCESavoids the double-map / "buffer already mapped" panic. Good.
What's good
- Clean parameter threading and config wiring.
- Excellent test coverage, including the exact RX 560X + Vega 8 repro and the
--allow-integratedopt-in path. - The device-loss fix genuinely removes the panic.
Happy to approve once (1) is addressed (it is the one that violates fail-early); (2) is worth fixing in the same pass since it serves the same Windows-stability goal.
n13
left a comment
There was a problem hiding this comment.
Verdict: Approve
Both issues from the previous review are properly fixed. Verified locally on 90a274ba: cargo clippy --workspace --all-targets -- -D warnings is clean, and the 7 adapter_selection_tests pass.
Issue 1 (dead GPU worker silent no-op) — resolved
The new EngineStatus::DeviceLost variant cleanly separates device loss from normal cancellation. search_range now returns DeviceLost, and worker_loop logs an error and breaks out of the loop so the worker thread exits permanently (then runs the existing GPU resource cleanup). run_benchmark also breaks on DeviceLost. This satisfies fail-early — the failure is loud and the dead worker stops consuming jobs. All exhaustive match sites were updated for the new variant; the only un-updated EngineStatus match (engine-cpu test, line 242) uses a wildcard arm, so it still compiles.
Issue 2 (integrated excluded before init) — resolved
select_adapters no longer filters integrated GPUs; it returns all non-CPU adapters on the best backend (discrete first). The integrated-vs-discrete decision now happens in init() after initialization, based on what actually succeeded: integrated GPUs are dropped only if at least one discrete GPU initialized (unless --allow-integrated). A discrete-init failure now correctly falls back to a working integrated GPU. Logic confirmed by the updated select_adapters tests.
Minor follow-ups (non-blocking)
- The post-init filtering in
init()isn't unit-tested (it lives in the async hardware path). Consider extracting it into a small pure helper (input: device types that initialized; output: which to keep) so both "discrete present -> drop integrated" and "discrete failed -> keep integrated" get direct coverage. - After a GPU worker exits,
WorkerPool::start_jobkeepstry_send-ing to the now-disconnected channel each block and logs "Failed to send job ... Channel full - worker may be stuck". It is loud (good) but the message is misleading for a dead/disconnected worker, andmetrics::set_gpu_devicesstays at the original count. For a GPU-only rig the process keeps running idle after the worker exits; a future enhancement could detect "all workers dead" and exit so a supervisor restarts it. - The PR description still lists changes not in this diff (Async Runtime /
block_in_place, GPU Tier Detection,once_cell->LazyLock) — those are already onmain. Worth trimming so the changelog matches the diff, but not blocking.
Nice work — DeviceLost and post-init filtering are the right shapes for both fixes.


Overview
Fixes GPU device selection and stability issues on Windows systems with both discrete and integrated GPUs.
What changed
GPU Adapter Selection (
engine-gpu)--allow-integratedflag - Opt-in to use integrated GPUs alongside discrete GPUs if desiredDEVICE_LOSTthread-local flag - Marks GPU workers as permanently dead after device loss to prevent "Buffer is already mapped" panics on retryAsync Runtime (
engine-gpu)futures::executor::block_onwith tokio - Proper async timeout support for GPU initializationrequest_device()- Prevents infinite hangs if driver is unresponsiveblock_in_placewhen called from within existing tokio runtimeGPU Tier Detection (
engine-gpu)\b5[56]00\bandrx 5\d{3}into single tieronce_cellwithstd::sync::LazyLock- Reduced dependency surface (requires Rust 1.80+)CLI (
miner-cli)--allow-integrated/MINER_ALLOW_INTEGRATED- Available on bothserveandbenchmarkcommandsValidation
cargo test -p engine-gpu- 14 tests pass including:windows_amd_discrete_plus_apu_uses_only_discrete- Exact repro of the failing scenarioallow_integrated_flag_keeps_both_gpu_types- Verifies opt-in workscargo clippy --workspace --all-targets -- -D warnings- Cleancargo build --workspace- Builds successfullyTest Scenario
The failing Windows system had:
Before: Both GPUs initialized and used → Vega 8 hit device loss → "Buffer is already mapped" panic
After:
--allow-integrated: Both used (at user's risk)Risks and Mitigations
--allow-integratedflag provides opt-inDEVICE_LOSTflag prevents panic on retryFollow-ups
--gpu-init-timeout)Note
Medium Risk
Changes which GPUs are selected by default (behavioral change for iGPU+dGPU rigs). Device-loss handling alters worker lifecycle but reduces crash risk on failed GPUs.
Overview
Improves Windows hybrid-GPU stability by changing default adapter selection: when a discrete GPU is present, integrated/APU adapters are no longer used unless the user opts in.
select_adaptersgains anallow_integratedflag, andGpuEngine::try_new/resolve_gpu_configurationthread that through init.--allow-integrated(andMINER_ALLOW_INTEGRATED) is added onserveandbenchmark, wired viaServiceConfiginto GPU engine setup. Benchmarks, examples, and tests calltry_new(..., false)by default.After device loss, a per-worker
DEVICE_LOSTflag makes further searches returnCancelledimmediately, and thread-local GPU buffers are cleared to avoid retry panics (e.g. “buffer already mapped”). Adapter-selection unit tests are updated for the new default (single discrete on best backend) and expanded for RX 560X + Vega 8 and the opt-in path.Reviewed by Cursor Bugbot for commit 6324c19. Configure here.