Fix windows again by illuzen · Pull Request #72 · Quantus-Network/quantus-miner

illuzen · 2026-06-30T07:41:28Z

Overview

Fixes GPU device selection and stability issues on Windows systems with both discrete and integrated GPUs.

What changed

GPU Adapter Selection (`engine-gpu`)

Skip integrated GPUs when discrete GPUs are available - Prevents resource contention and driver instability from mining on both simultaneously (root cause of device loss crashes)
Added --allow-integrated flag - Opt in to using integrated GPUs alongside discrete GPUs if desired
Added DEVICE_LOST thread-local flag - Marks GPU workers as permanently dead after device loss to prevent "Buffer is already mapped" panics on retry
Clear corrupted resources on device loss - Drops buffer state when device fails

Async Runtime (`engine-gpu`)

Replaced futures::executor::block_on with tokio - Proper async timeout support for GPU initialization
Added 30-second timeout on request_device() - Prevents infinite hangs if driver is unresponsive
Handle nested runtime correctly - Uses block_in_place when called from within existing tokio runtime

GPU Tier Detection (`engine-gpu`)

Fixed Qualcomm Adreno regex - "Adreno 627" now correctly matches 600 series, not 700 series
Merged duplicate AMD RDNA 1 patterns - Consolidated \b5[56]00\b and rx 5\d{3} into single tier
Replaced once_cell with std::sync::LazyLock - Reduced dependency surface (requires Rust 1.80+)

CLI (`miner-cli`)

Added --allow-integrated / MINER_ALLOW_INTEGRATED - Available on both serve and benchmark commands
Added 🎯 emoji to solution found logs - Easier to spot in log output

Validation

cargo test -p engine-gpu - 14 tests pass including:
- windows_amd_discrete_plus_apu_uses_only_discrete - Exact repro of the failing scenario
- allow_integrated_flag_keeps_both_gpu_types - Verifies opt-in works
cargo clippy --workspace --all-targets -- -D warnings - Clean
cargo build --workspace - Builds successfully
Manual benchmark test passes on macOS

Test Scenario

The failing Windows system had:

GPU 0: Radeon RX 560X (DiscreteGpu, Vulkan)
GPU 1: AMD Radeon Vega 8 (IntegratedGpu, Vulkan)

Before: Both GPUs initialized and used → Vega 8 hit device loss → "Buffer is already mapped" panic

After:

Default: Only RX 560X used, Vega 8 skipped with log message
With --allow-integrated: Both used (at user's risk)
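
To make the before/after behavior concrete, here is a minimal sketch of the selection rule applied to this rig. Only the `select_adapters` name, the `allow_integrated` flag, and the two device types come from the PR; `AdapterInfo`, the local `DeviceType` enum, and the log text are illustrative stand-ins, not the actual `engine-gpu` types.

```rust
/// Hypothetical stand-in for the device types reported by the GPU backend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeviceType {
    DiscreteGpu,
    IntegratedGpu,
}

/// Hypothetical adapter record; the real code works on backend adapter handles.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AdapterInfo {
    name: &'static str,
    device_type: DeviceType,
}

/// Keep every adapter, unless a discrete GPU is present and integrated GPUs
/// were not explicitly allowed; in that case the integrated ones are dropped.
fn select_adapters(adapters: Vec<AdapterInfo>, allow_integrated: bool) -> Vec<AdapterInfo> {
    let has_discrete = adapters
        .iter()
        .any(|a| a.device_type == DeviceType::DiscreteGpu);
    if allow_integrated || !has_discrete {
        return adapters;
    }
    adapters
        .into_iter()
        .filter(|a| {
            let keep = a.device_type != DeviceType::IntegratedGpu;
            if !keep {
                // Assumed log wording; the PR only says a message is logged.
                println!("Skipping integrated GPU: {}", a.name);
            }
            keep
        })
        .collect()
}

fn main() {
    let rig = vec![
        AdapterInfo { name: "Radeon RX 560X", device_type: DeviceType::DiscreteGpu },
        AdapterInfo { name: "AMD Radeon Vega 8", device_type: DeviceType::IntegratedGpu },
    ];
    let default = select_adapters(rig.clone(), false);
    let opt_in = select_adapters(rig, true);
    println!("default: {} adapter(s), opt-in: {} adapter(s)", default.len(), opt_in.len());
}
```

A rig with only an integrated GPU is left untouched by this rule, since the filter applies only when a discrete adapter is present.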

Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| Users may want to use integrated GPU | `--allow-integrated` flag provides opt-in |
| Device loss could still occur on discrete GPU | `DEVICE_LOST` flag prevents panic on retry |
| Timeout may be too short/long for some drivers | 30s is conservative; can be made configurable later |

Follow-ups

Consider making init timeout configurable (--gpu-init-timeout)
Monitor if any users report issues with the new default behavior

Note

Medium Risk
Changes which GPUs are selected by default (behavioral change for iGPU+dGPU rigs). Device-loss handling alters worker lifecycle but reduces crash risk on failed GPUs.

Overview
Improves Windows hybrid-GPU stability by changing default adapter selection: when a discrete GPU is present, integrated/APU adapters are no longer used unless the user opts in. select_adapters gains an allow_integrated flag, and GpuEngine::try_new / resolve_gpu_configuration thread that through init.

--allow-integrated (and MINER_ALLOW_INTEGRATED) is added on serve and benchmark, wired via ServiceConfig into GPU engine setup. Benchmarks, examples, and tests call try_new(..., false) by default.

After device loss, a per-worker DEVICE_LOST flag makes further searches return Cancelled immediately, and thread-local GPU buffers are cleared to avoid retry panics (e.g. “buffer already mapped”). Adapter-selection unit tests are updated for the new default (single discrete on best backend) and expanded for RX 560X + Vega 8 and the opt-in path.
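
The per-worker flag described above can be sketched roughly as follows. `DEVICE_LOST`, `WORKER_RESOURCES`, `BatchResult::DeviceLost`, `EngineStatus::Cancelled`, and `search_range` are names from the PR; the buffer type, the `Exhausted` and `Done` variants, and the closure-based signature are assumptions made to keep the example self-contained.

```rust
use std::cell::RefCell;

thread_local! {
    /// Set once this worker thread's device is lost; never cleared.
    static DEVICE_LOST: RefCell<bool> = RefCell::new(false);
    /// Hypothetical stand-in for the per-thread GPU buffers.
    static WORKER_RESOURCES: RefCell<Option<Vec<u8>>> = RefCell::new(None);
}

#[derive(Debug, PartialEq, Eq)]
enum EngineStatus {
    /// Assumed "batch finished normally" variant, for illustration only.
    Exhausted { hash_count: u64 },
    Cancelled { hash_count: u64 },
}

/// Hypothetical outcome of a single GPU batch.
enum BatchResult {
    Done { hash_count: u64 },
    DeviceLost,
}

fn search_range(run_batch: impl FnOnce(&mut Vec<u8>) -> BatchResult) -> EngineStatus {
    // Short-circuit: a lost device is never touched again on this thread.
    if DEVICE_LOST.with(|lost| *lost.borrow()) {
        return EngineStatus::Cancelled { hash_count: 0 };
    }
    let result = WORKER_RESOURCES.with(|res| {
        let mut res = res.borrow_mut();
        // Lazily create the per-thread buffers on first use.
        let buffers = res.get_or_insert_with(|| vec![0u8; 64]);
        run_batch(buffers)
    });
    match result {
        BatchResult::Done { hash_count } => EngineStatus::Exhausted { hash_count },
        BatchResult::DeviceLost => {
            // Mark the worker dead and drop the possibly corrupted buffers.
            DEVICE_LOST.with(|lost| *lost.borrow_mut() = true);
            WORKER_RESOURCES.with(|res| *res.borrow_mut() = None);
            EngineStatus::Cancelled { hash_count: 0 }
        }
    }
}

fn main() {
    let first = search_range(|_| BatchResult::Done { hash_count: 1024 });
    let lost = search_range(|_| BatchResult::DeviceLost);
    // After the loss, the batch closure is never invoked.
    let after = search_range(|_| unreachable!("lost device must not be used"));
    println!("{first:?} / {lost:?} / {after:?}");
}
```

Because the flag is thread-local, losing one device does not affect workers running on other threads.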

Reviewed by Cursor Bugbot for commit 6324c19.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.


cursor · 2026-06-30T07:42:33Z

+        let device_is_lost = DEVICE_LOST.with(|lost| *lost.borrow());
+        if device_is_lost {
+            // Device was lost in a previous call - don't attempt any GPU operations
+            return EngineStatus::Cancelled { hash_count: 0 };


Dead GPU worker spins forever

High Severity

After device loss, DEVICE_LOST makes search_range return EngineStatus::Cancelled with zero hashes, but miner-service GPU workers treat Cancelled like a normal job change and keep taking work. The thread never exits, so that GPU worker contributes 0 H/s indefinitely while still consuming jobs.

Additional Locations (1)

crates/engine-gpu/src/lib.rs#L610-L625

Triggered by learned rule: Fail-early on persistent GPU device loss


cursor · 2026-06-30T07:42:33Z

+            }
+            true
+        });
+    }


Integrated GPU skipped when discrete init fails

Medium Severity

When a discrete GPU is enumerated, integrated adapters are removed from select_adapters before initialization. If the discrete request_device times out or errors and is skipped, init never tries the integrated GPU, so try_new can fail with no usable GPU even though a working integrated adapter was available.

Additional Locations (1)

crates/engine-gpu/src/lib.rs#L324-L334


n13

Verdict: Request changes (close — two correctness gaps)

Solid, well-tested fix for the iGPU+dGPU device-loss crash on Windows, and preferring discrete GPUs by default is reasonable. Requesting changes for two issues (both flagged by Bugbot and independently confirmed below), plus a scope/description mismatch.

What this PR actually changes (diff vs `main`)

Threads allow_integrated through CLI -> ServiceConfig -> resolve_gpu_configuration -> GpuEngine::try_new -> select_adapters.
select_adapters drops IntegratedGpu when any DiscreteGpu is present (unless allow_integrated).
DEVICE_LOST thread-local: after BatchResult::DeviceLost, mark the worker dead, clear WORKER_RESOURCES, and short-circuit search_range to Cancelled { 0 }.
Test updates + new tests.

Heads up: the description also lists "Async Runtime" (tokio/block_in_place/30s timeout), "GPU Tier Detection" (Adreno/RDNA regex, once_cell -> LazyLock), and the solution emoji. None of those are in this diff — they're already on main (gpu_tiers.rs already uses std::sync::LazyLock, Cargo.toml already has tokio time + futures, miner-service already logs the emoji). Please trim the description to what this PR actually changes.

1) Dead GPU worker becomes a silent no-op (High) — `engine-gpu/src/lib.rs` + `miner-service/src/lib.rs`

After device loss, search_range returns Cancelled { hash_count: 0 } on every subsequent call. In worker_loop, Cancelled is treated as a normal "new block" (info log), so the worker blocks on job_rx.recv(), wakes on each job, returns 0, and sleeps again — indefinitely. Not a busy-spin, but the thread never exits and contributes 0 H/s permanently. Nothing downstream notices: handle_connection only sums hash_count and there is no all-workers-dead health check, so a single-GPU rig silently mines at 0 H/s while still looking "alive".

This conflicts with the repo's fail-early rule ("never have a silent failure; all failures must produce error logging"). It is also an observability regression vs. the prior behavior, where the worker thread died on the panic and made start_job's try_send fail loudly every block. Now it is one error log, then silence.

Suggestion: keep surfacing the lost device (error log each time a job is handed to a lost worker), and for the single-GPU / all-GPU-workers-lost case, exit the process so a supervisor (systemd/docker) restarts it instead of running idle.

2) Integrated GPU excluded before init, so discrete-init failure leaves no GPU (Medium) — `engine-gpu/src/lib.rs`

select_adapters removes integrated adapters whenever a discrete adapter is enumerated, before init runs. In init, if the discrete request_device times out (the 30s path) or errors, that adapter is continue-skipped; if it was the only selected adapter, contexts.is_empty() and try_new fails with "No GPU adapters could be initialized" — even though a working integrated GPU was present. That is exactly the flaky-Windows-driver case this PR targets.

Suggestion: only exclude integrated adapters after at least one discrete device initializes successfully (or fall back to integrated if none do).

Minor

Default behavior change: iGPU+dGPU users lose the iGPU unless they pass --allow-integrated. Documented and reasonable — just calling it out.
Resource clearing on device loss looks correct: the DeviceLost paths in run_single_batch never unmap (buffer was not mapped), so dropping WORKER_RESOURCES avoids the double-map / "buffer already mapped" panic. Good.

What's good

Clean parameter threading and config wiring.
Excellent test coverage, including the exact RX 560X + Vega 8 repro and the --allow-integrated opt-in path.
The device-loss fix genuinely removes the panic.

Happy to approve once (1) is addressed (it is the one that violates fail-early); (2) is worth fixing in the same pass since it serves the same Windows-stability goal.

n13

Verdict: Approve

Both issues from the previous review are properly fixed. Verified locally on 90a274ba: cargo clippy --workspace --all-targets -- -D warnings is clean, and the 7 adapter_selection_tests pass.

Issue 1 (dead GPU worker silent no-op) — resolved

The new EngineStatus::DeviceLost variant cleanly separates device loss from normal cancellation. search_range now returns DeviceLost, and worker_loop logs an error and breaks out of the loop so the worker thread exits permanently (then runs the existing GPU resource cleanup). run_benchmark also breaks on DeviceLost. This satisfies fail-early — the failure is loud and the dead worker stops consuming jobs. All exhaustive match sites were updated for the new variant; the only un-updated EngineStatus match (engine-cpu test, line 242) uses a wildcard arm, so it still compiles.
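
A minimal sketch of that control flow, assuming a plain `std::sync::mpsc` job channel: `EngineStatus::DeviceLost`, `worker_loop`, and `job_rx` are names from the PR, while the `u64` job type, the `search` closure standing in for `search_range`, and the log wording are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EngineStatus {
    Cancelled { hash_count: u64 },
    DeviceLost,
}

/// Hypothetical worker loop; `search` stands in for the engine's `search_range`.
/// Returns how many jobs were taken before the loop ended.
fn worker_loop(job_rx: mpsc::Receiver<u64>, mut search: impl FnMut(u64) -> EngineStatus) -> usize {
    let mut processed = 0;
    while let Ok(job) = job_rx.recv() {
        processed += 1;
        match search(job) {
            // Normal job change: keep waiting for the next job.
            EngineStatus::Cancelled { hash_count } => {
                println!("job {job} cancelled after {hash_count} hashes");
            }
            // Device loss is terminal: log loudly and stop taking work.
            EngineStatus::DeviceLost => {
                eprintln!("GPU device lost on job {job}; worker exiting");
                break;
            }
        }
    }
    // Dropping `job_rx` on return disconnects the channel for the sender.
    processed
}

fn main() {
    let (job_tx, job_rx) = mpsc::channel::<u64>();
    let worker = thread::spawn(move || {
        worker_loop(job_rx, |job| {
            if job < 2 {
                EngineStatus::Cancelled { hash_count: 100 }
            } else {
                EngineStatus::DeviceLost
            }
        })
    });
    for job in 0..5 {
        // Sends start failing once the worker has dropped its receiver.
        if job_tx.send(job).is_err() {
            eprintln!("worker is gone; job {job} was not delivered");
        }
    }
    let processed = worker.join().expect("worker thread panicked");
    println!("worker handled {processed} jobs before exiting");
}
```

Separating `DeviceLost` from `Cancelled` lets the loop tell a terminal failure apart from a routine job change, which a zero-hash `Cancelled` could not express.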

Issue 2 (integrated excluded before init) — resolved

select_adapters no longer filters integrated GPUs; it returns all non-CPU adapters on the best backend (discrete first). The integrated-vs-discrete decision now happens in init() after initialization, based on what actually succeeded: integrated GPUs are dropped only if at least one discrete GPU initialized (unless --allow-integrated). A discrete-init failure now correctly falls back to a working integrated GPU. Logic confirmed by the updated select_adapters tests.
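
As a rough sketch, the post-init decision described above can be written as a pure function over the device types that initialized successfully. `DeviceType` here is a local stand-in and `keep_after_init` is a hypothetical name; neither is taken from the diff, where the logic lives in `init()`.

```rust
/// Hypothetical stand-in for the device types reported by the GPU backend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeviceType {
    DiscreteGpu,
    IntegratedGpu,
}

/// Given the device types that initialized successfully (in order), return a
/// keep/drop mask. Integrated GPUs are dropped only when at least one discrete
/// GPU initialized and the user did not opt in.
fn keep_after_init(initialized: &[DeviceType], allow_integrated: bool) -> Vec<bool> {
    let discrete_ok = initialized.contains(&DeviceType::DiscreteGpu);
    initialized
        .iter()
        .map(|ty| match ty {
            DeviceType::DiscreteGpu => true,
            DeviceType::IntegratedGpu => allow_integrated || !discrete_ok,
        })
        .collect()
}

fn main() {
    use DeviceType::{DiscreteGpu, IntegratedGpu};

    // Discrete initialized: the integrated GPU is dropped by default.
    assert_eq!(keep_after_init(&[DiscreteGpu, IntegratedGpu], false), vec![true, false]);
    // Discrete failed to initialize, so only the integrated GPU is in the list
    // and it is kept as a fallback.
    assert_eq!(keep_after_init(&[IntegratedGpu], false), vec![true]);
    // Opt-in keeps both.
    assert_eq!(keep_after_init(&[DiscreteGpu, IntegratedGpu], true), vec![true, true]);
    println!("post-init filtering rules hold");
}
```

Keeping the rule free of device handles means it can be unit-tested without GPU hardware or an async runtime.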

Minor follow-ups (non-blocking)

The post-init filtering in init() isn't unit-tested (it lives in the async hardware path). Consider extracting it into a small pure helper (input: device types that initialized; output: which to keep) so both "discrete present -> drop integrated" and "discrete failed -> keep integrated" get direct coverage.
After a GPU worker exits, WorkerPool::start_job keeps try_send-ing to the now-disconnected channel each block and logs "Failed to send job ... Channel full - worker may be stuck". It is loud (good) but the message is misleading for a dead/disconnected worker, and metrics::set_gpu_devices stays at the original count. For a GPU-only rig the process keeps running idle after the worker exits; a future enhancement could detect "all workers dead" and exit so a supervisor restarts it.
The PR description still lists changes not in this diff (Async Runtime / block_in_place, GPU Tier Detection, once_cell -> LazyLock) — those are already on main. Worth trimming so the changelog matches the diff, but not blocking.

Nice work — DeviceLost and post-init filtering are the right shapes for both fixes.

illuzen added 3 commits June 30, 2026 15:18

- handle lost device correctly (929a61a)
- Update lib.rs (e7dd03e)
- allow-integrated flag otherwise exclude (6324c19)

cursor Bot reviewed Jun 30, 2026

- DeviceLost is a status (cbfab54)

n13 requested changes Jun 30, 2026

- Integrated GPU skipped when discrete init fails (90a274b)

n13 approved these changes Jun 30, 2026

- nits (ce250e1)

illuzen merged commit a47db3d into main Jun 30, 2026
4 checks passed

illuzen merged 6 commits into main from illuzen/fix-windows-again