Skip to content

Fix windows again#72

Merged
illuzen merged 6 commits into
mainfrom
illuzen/fix-windows-again
Jun 30, 2026
Merged

Fix windows again#72
illuzen merged 6 commits into
mainfrom
illuzen/fix-windows-again

Conversation

@illuzen

@illuzen illuzen commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Overview

Fixes GPU device selection and stability issues on Windows systems with both discrete and integrated GPUs.

What changed

GPU Adapter Selection (engine-gpu)

  • Skip integrated GPUs when discrete GPUs are available - Prevents resource contention and driver instability from mining on both simultaneously (root cause of device loss crashes)
  • Added --allow-integrated flag - Opt-in to use integrated GPUs alongside discrete GPUs if desired
  • Added DEVICE_LOST thread-local flag - Marks GPU workers as permanently dead after device loss to prevent "Buffer is already mapped" panics on retry
  • Clear corrupted resources on device loss - Drops buffer state when device fails

Async Runtime (engine-gpu)

  • Replaced futures::executor::block_on with tokio - Proper async timeout support for GPU initialization
  • Added 30-second timeout on request_device() - Prevents infinite hangs if driver is unresponsive
  • Handle nested runtime correctly - Uses block_in_place when called from within existing tokio runtime

GPU Tier Detection (engine-gpu)

  • Fixed Qualcomm Adreno regex - "Adreno 627" now correctly matches 600 series, not 700 series
  • Merged duplicate AMD RDNA 1 patterns - Consolidated \b5[56]00\b and rx 5\d{3} into single tier
  • Replaced once_cell with std::sync::LazyLock - Reduced dependency surface (requires Rust 1.80+)

CLI (miner-cli)

  • Added --allow-integrated / MINER_ALLOW_INTEGRATED - Available on both serve and benchmark commands
  • Added 🎯 emoji to solution found logs - Easier to spot in log output

Validation

  • cargo test -p engine-gpu - 14 tests pass including:
    • windows_amd_discrete_plus_apu_uses_only_discrete - Exact repro of the failing scenario
    • allow_integrated_flag_keeps_both_gpu_types - Verifies opt-in works
  • cargo clippy --workspace --all-targets -- -D warnings - Clean
  • cargo build --workspace - Builds successfully
  • Manual benchmark test passes on macOS

Test Scenario

The failing Windows system had:

  • GPU 0: Radeon RX 560X (DiscreteGpu, Vulkan)
  • GPU 1: AMD Radeon Vega 8 (IntegratedGpu, Vulkan)

Before: Both GPUs initialized and used → Vega 8 hit device loss → "Buffer is already mapped" panic

After:

  • Default: Only RX 560X used, Vega 8 skipped with log message
  • With --allow-integrated: Both used (at user's risk)

Risks and Mitigations

Risk Mitigation
Users may want to use integrated GPU --allow-integrated flag provides opt-in
Device loss could still occur on discrete GPU DEVICE_LOST flag prevents panic on retry
Timeout may be too short/long for some drivers 30s is conservative; can be made configurable later

Follow-ups

  • Consider making init timeout configurable (--gpu-init-timeout)
  • Monitor if any users report issues with the new default behavior

Note

Medium Risk
Changes which GPUs are selected by default (behavioral change for iGPU+dGPU rigs). Device-loss handling alters worker lifecycle but reduces crash risk on failed GPUs.

Overview
Improves Windows hybrid-GPU stability by changing default adapter selection: when a discrete GPU is present, integrated/APU adapters are no longer used unless the user opts in. select_adapters gains an allow_integrated flag, and GpuEngine::try_new / resolve_gpu_configuration thread that through init.

--allow-integrated (and MINER_ALLOW_INTEGRATED) is added on serve and benchmark, wired via ServiceConfig into GPU engine setup. Benchmarks, examples, and tests call try_new(..., false) by default.

After device loss, a per-worker DEVICE_LOST flag makes further searches return Cancelled immediately, and thread-local GPU buffers are cleared to avoid retry panics (e.g. “buffer already mapped”). Adapter-selection unit tests are updated for the new default (single discrete on best backend) and expanded for RX 560X + Vega 8 and the opt-in path.

Reviewed by Cursor Bugbot for commit 6324c19. Configure here.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 6324c19. Configure here.

Comment thread crates/engine-gpu/src/lib.rs Outdated
let device_is_lost = DEVICE_LOST.with(|lost| *lost.borrow());
if device_is_lost {
// Device was lost in a previous call - don't attempt any GPU operations
return EngineStatus::Cancelled { hash_count: 0 };

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead GPU worker spins forever

High Severity

After device loss, DEVICE_LOST makes search_range return EngineStatus::Cancelled with zero hashes, but miner-service GPU workers treat Cancelled like a normal job change and keep taking work. The thread never exits, so that GPU worker contributes 0 H/s indefinitely while still consuming jobs.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by learned rule: Fail-early on persistent GPU device loss

Reviewed by Cursor Bugbot for commit 6324c19. Configure here.

Comment thread crates/engine-gpu/src/lib.rs Outdated
}
true
});
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integrated GPU skipped when discrete init fails

Medium Severity

When a discrete GPU is enumerated, integrated adapters are removed from select_adapters before initialization. If the discrete request_device times out or errors and is skipped, init never tries the integrated GPU, so try_new can fail with no usable GPU even though a working integrated adapter was available.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6324c19. Configure here.

@n13 n13 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: Request changes (close — two correctness gaps)

Solid, well-tested fix for the iGPU+dGPU device-loss crash on Windows, and preferring discrete GPUs by default is reasonable. Requesting changes for two issues (both flagged by Bugbot and independently confirmed below), plus a scope/description mismatch.

What this PR actually changes (diff vs main)

  • Threads allow_integrated through CLI -> ServiceConfig -> resolve_gpu_configuration -> GpuEngine::try_new -> select_adapters.
  • select_adapters drops IntegratedGpu when any DiscreteGpu is present (unless allow_integrated).
  • DEVICE_LOST thread-local: after BatchResult::DeviceLost, mark the worker dead, clear WORKER_RESOURCES, and short-circuit search_range to Cancelled { 0 }.
  • Test updates + new tests.

Heads up: the description also lists "Async Runtime" (tokio/block_in_place/30s timeout), "GPU Tier Detection" (Adreno/RDNA regex, once_cell -> LazyLock), and the solution emoji. None of those are in this diff — they're already on main (gpu_tiers.rs already uses std::sync::LazyLock, Cargo.toml already has tokio time + futures, miner-service already logs the emoji). Please trim the description to what this PR actually changes.

1) Dead GPU worker becomes a silent no-op (High) — engine-gpu/src/lib.rs + miner-service/src/lib.rs

After device loss, search_range returns Cancelled { hash_count: 0 } on every subsequent call. In worker_loop, Cancelled is treated as a normal "new block" (info log), so the worker blocks on job_rx.recv(), wakes on each job, returns 0, and sleeps again — indefinitely. Not a busy-spin, but the thread never exits and contributes 0 H/s permanently. Nothing downstream notices: handle_connection only sums hash_count and there is no all-workers-dead health check, so a single-GPU rig silently mines at 0 H/s while still looking "alive".

This conflicts with the repo's fail-early rule ("never have a silent failure; all failures must produce error logging"). It is also an observability regression vs. the prior behavior, where the worker thread died on the panic and made start_job's try_send fail loudly every block. Now it is one error log, then silence.

Suggestion: keep surfacing the lost device (error log each time a job is handed to a lost worker), and for the single-GPU / all-GPU-workers-lost case, exit the process so a supervisor (systemd/docker) restarts it instead of running idle.

2) Integrated GPU excluded before init, so discrete-init failure leaves no GPU (Medium) — engine-gpu/src/lib.rs

select_adapters removes integrated adapters whenever a discrete adapter is enumerated, before init runs. In init, if the discrete request_device times out (the 30s path) or errors, that adapter is continue-skipped; if it was the only selected adapter, contexts.is_empty() and try_new fails with "No GPU adapters could be initialized" — even though a working integrated GPU was present. That is exactly the flaky-Windows-driver case this PR targets.

Suggestion: only exclude integrated adapters after at least one discrete device initializes successfully (or fall back to integrated if none do).

Minor

  • Default behavior change: iGPU+dGPU users lose the iGPU unless they pass --allow-integrated. Documented and reasonable — just calling it out.
  • Resource clearing on device loss looks correct: the DeviceLost paths in run_single_batch never unmap (buffer was not mapped), so dropping WORKER_RESOURCES avoids the double-map / "buffer already mapped" panic. Good.

What's good

  • Clean parameter threading and config wiring.
  • Excellent test coverage, including the exact RX 560X + Vega 8 repro and the --allow-integrated opt-in path.
  • The device-loss fix genuinely removes the panic.

Happy to approve once (1) is addressed (it is the one that violates fail-early); (2) is worth fixing in the same pass since it serves the same Windows-stability goal.

@n13 n13 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: Approve

Both issues from the previous review are properly fixed. Verified locally on 90a274ba: cargo clippy --workspace --all-targets -- -D warnings is clean, and the 7 adapter_selection_tests pass.

Issue 1 (dead GPU worker silent no-op) — resolved

The new EngineStatus::DeviceLost variant cleanly separates device loss from normal cancellation. search_range now returns DeviceLost, and worker_loop logs an error and breaks out of the loop so the worker thread exits permanently (then runs the existing GPU resource cleanup). run_benchmark also breaks on DeviceLost. This satisfies fail-early — the failure is loud and the dead worker stops consuming jobs. All exhaustive match sites were updated for the new variant; the only un-updated EngineStatus match (engine-cpu test, line 242) uses a wildcard arm, so it still compiles.

Issue 2 (integrated excluded before init) — resolved

select_adapters no longer filters integrated GPUs; it returns all non-CPU adapters on the best backend (discrete first). The integrated-vs-discrete decision now happens in init() after initialization, based on what actually succeeded: integrated GPUs are dropped only if at least one discrete GPU initialized (unless --allow-integrated). A discrete-init failure now correctly falls back to a working integrated GPU. Logic confirmed by the updated select_adapters tests.

Minor follow-ups (non-blocking)

  • The post-init filtering in init() isn't unit-tested (it lives in the async hardware path). Consider extracting it into a small pure helper (input: device types that initialized; output: which to keep) so both "discrete present -> drop integrated" and "discrete failed -> keep integrated" get direct coverage.
  • After a GPU worker exits, WorkerPool::start_job keeps try_send-ing to the now-disconnected channel each block and logs "Failed to send job ... Channel full - worker may be stuck". It is loud (good) but the message is misleading for a dead/disconnected worker, and metrics::set_gpu_devices stays at the original count. For a GPU-only rig the process keeps running idle after the worker exits; a future enhancement could detect "all workers dead" and exit so a supervisor restarts it.
  • The PR description still lists changes not in this diff (Async Runtime / block_in_place, GPU Tier Detection, once_cell -> LazyLock) — those are already on main. Worth trimming so the changelog matches the diff, but not blocking.

Nice work — DeviceLost and post-init filtering are the right shapes for both fixes.

@illuzen illuzen merged commit a47db3d into main Jun 30, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants