Select GPU adapters by backend to fix multi-backend duplication and OOM#67

Merged
illuzen merged 3 commits into main from fix/gpu-adapter-selection
Jun 30, 2026

n13 commented Jun 12, 2026

Contributor

Overview

Clean reimplementation of the fix attempted in #62, addressing the issues raised in its review. Closes #61.

On Windows, wgpu enumerates each physical GPU once per backend (Vulkan + DX12) plus a CPU-emulated software fallback (Microsoft Basic Render Driver). GpuEngine::init builds a mining context for every entry, so a hybrid laptop yields 5 contexts competing for the same VRAM and the process OOMs during benchmark/serve startup.

What changed (engine-gpu only)

Rather than deduplicating adapters by (vendor, device) PCI IDs — which collapses rigs with multiple identical cards into one context, including on Linux where no cross-backend duplicates exist — adapter selection works by backend:

  1. Drop DeviceType::Cpu adapters (software fallbacks are never useful for PoW, and this also filters lavapipe/llvmpipe on Linux).
  2. Keep all adapters from the highest-ranked backend present (Vulkan/Metal, then DX12, then others). Within a single backend each physical GPU appears exactly once, so identical cards survive by construction — no physical-ID matching needed, and backends that report vendor/device = 0 (e.g. Metal) can't cause false merges.
  3. Order the selection discrete-first, so --gpu-devices 1 on a hybrid laptop picks the discrete card instead of the first enumerated iGPU (also raised in #61).
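
The three steps above can be sketched as a pure function over adapter metadata. This is a minimal illustration, not the code merged in this PR: the `Backend`, `DeviceType`, and `AdapterInfo` types below are simplified stand-ins for the wgpu types of the same names, the rank helpers and the sample enumeration (names and order) are assumptions, and only the function name `select_adapters` comes from the description.

```rust
// Simplified stand-ins for wgpu's types; the real ones have more variants and fields.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend { Vulkan, Metal, Dx12, Gl }

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DeviceType { DiscreteGpu, IntegratedGpu, Cpu, Other }

#[derive(Clone, Debug)]
struct AdapterInfo { name: &'static str, device_type: DeviceType, backend: Backend }

// Assumed ranking: Vulkan/Metal, then DX12, then others. Lower is preferred.
fn backend_rank(b: Backend) -> u8 {
    match b {
        Backend::Vulkan | Backend::Metal => 0,
        Backend::Dx12 => 1,
        Backend::Gl => 2,
    }
}

// Assumed ordering: discrete first, then integrated, then anything else.
fn type_rank(t: DeviceType) -> u8 {
    match t {
        DeviceType::DiscreteGpu => 0,
        DeviceType::IntegratedGpu => 1,
        _ => 2,
    }
}

/// Returns indices into `infos`, in the order contexts would be built.
fn select_adapters(infos: &[AdapterInfo]) -> Vec<usize> {
    // Step 1: drop CPU-emulated (software) adapters.
    let usable: Vec<usize> = (0..infos.len())
        .filter(|&i| infos[i].device_type != DeviceType::Cpu)
        .collect();
    // Step 2: keep only adapters on the highest-ranked backend present.
    let Some(best) = usable.iter().map(|&i| backend_rank(infos[i].backend)).min() else {
        return Vec::new();
    };
    let mut selected: Vec<usize> = usable
        .into_iter()
        .filter(|&i| backend_rank(infos[i].backend) == best)
        .collect();
    // Step 3: stable sort, so enumeration order is kept within each device type.
    selected.sort_by_key(|&i| type_rank(infos[i].device_type));
    selected
}

// Illustrative enumeration shaped like the hybrid-laptop case: 5 entries.
fn hybrid_laptop_enumeration() -> Vec<AdapterInfo> {
    use Backend::*;
    use DeviceType::*;
    vec![
        AdapterInfo { name: "AMD iGPU", device_type: IntegratedGpu, backend: Vulkan },
        AdapterInfo { name: "NVIDIA dGPU", device_type: DiscreteGpu, backend: Vulkan },
        AdapterInfo { name: "AMD iGPU", device_type: IntegratedGpu, backend: Dx12 },
        AdapterInfo { name: "NVIDIA dGPU", device_type: DiscreteGpu, backend: Dx12 },
        AdapterInfo { name: "Microsoft Basic Render Driver", device_type: Cpu, backend: Dx12 },
    ]
}

fn main() {
    let infos = hybrid_laptop_enumeration();
    let picked = select_adapters(&infos);
    // Two contexts remain, with the discrete card first.
    assert_eq!(picked, vec![1, 0]);
    for (slot, &i) in picked.iter().enumerate() {
        println!("GPU device {slot}: {} ({:?}, {:?})", infos[i].name, infos[i].device_type, infos[i].backend);
    }
}
```

Returning indices rather than adapters keeps the function free of GPU handles, which is what makes it testable without hardware.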

Each skipped adapter is logged at info level, each selected device gets an info log with name/type/backend, and init fails with an explicit error (pointing at --gpu-devices 0) if nothing usable remains. The selection logic is a pure index-based function (select_adapters) over AdapterInfo, unit-tested without GPU hardware. Also removes the dead adapter_infos local from init.

No public API changes; try_new and device_count() signatures are untouched.

Validation

  • cargo fmt --all -- --check, cargo clippy --workspace --all-targets -- -D warnings: clean
  • cargo test --workspace --locked: all pass, including 5 new unit tests covering the exact #61 enumeration (5 entries → 2 contexts, discrete first), identical multi-GPU rigs (all cards kept), DX12-only machines, software-only environments, and empty enumeration
  • Runtime sanity on macOS (Apple M5 Pro / Metal): single context selected, benchmark mines normally

Not validated on multi-GPU Windows hardware — @adamtpang, if you can run your #61 repro on this branch, that would confirm the fix end-to-end (expect 2 contexts: GPU device 0: NVIDIA ... (DiscreteGpu, Vulkan), GPU device 1: AMD ... (IntegratedGpu, Vulkan)).

Risks and mitigations

  • A GPU exposed only through a lower-ranked backend (e.g. discrete card with broken Vulkan driver, iGPU with working one) would be skipped. Rare, explicitly logged, and recoverable by fixing the driver; a --gpu-adapter selector remains a possible follow-up.
  • Setups that intentionally mine on a software adapter now get an init error instead — the CPU engine is the right tool there, and --gpu-devices 0 silences GPU probing.
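
To make the first risk concrete, the sketch below reduces each adapter to a label and a backend rank and shows which entries a backend-level cut would skip. The function `kept_and_skipped`, the labels, and the numeric ranks (0 for Vulkan/Metal, 1 for DX12) are hypothetical and exist only for illustration.

```rust
/// Splits adapters into (kept, skipped) by keeping only the lowest backend rank present.
/// Ranks are assumed: 0 = Vulkan/Metal, 1 = DX12.
fn kept_and_skipped<'a>(adapters: &[(&'a str, u8)]) -> (Vec<&'a str>, Vec<&'a str>) {
    let best = adapters.iter().map(|&(_, rank)| rank).min();
    let mut kept = Vec::new();
    let mut skipped = Vec::new();
    for &(name, rank) in adapters {
        if Some(rank) == best {
            kept.push(name);
        } else {
            skipped.push(name);
        }
    }
    (kept, skipped)
}

fn main() {
    // The discrete card's Vulkan driver is broken, so it only shows up under DX12.
    let adapters = [("iGPU (Vulkan)", 0u8), ("iGPU (DX12)", 1), ("dGPU (DX12)", 1)];
    let (kept, skipped) = kept_and_skipped(&adapters);
    assert_eq!(kept, vec!["iGPU (Vulkan)"]);
    // The discrete card is skipped along with the duplicate iGPU entry.
    assert_eq!(skipped, vec!["iGPU (DX12)", "dGPU (DX12)"]);
}
```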

Follow-ups


Note

Medium Risk
Changes which GPUs get mining contexts at startup (behavioral fix on Windows); a GPU only exposed via a lower-ranked backend could be skipped, though that is logged.

Overview
Fixes Windows multi-backend enumeration where wgpu listed the same GPUs on Vulkan and DX12 plus CPU fallbacks, causing multiple mining contexts per card and VRAM OOM (#61).

select_adapters replaces “init every enumerated adapter + skip by name” with: drop DeviceType::Cpu, keep only adapters on the highest-priority backend (Vulkan/Metal, then DX12), and order indices discrete GPU first so limited --gpu-devices picks the dGPU. Init now builds contexts only for those indices, logs skips at info, and fails with a clearer no usable adapters message (hinting --gpu-devices 0). Five unit tests cover the #61 case, multi-identical-GPU rigs, DX12-only, and software-only setups.

miner-service: adds a 🎯 prefix to the CPU/GPU “found solution” log line only.

Reviewed by Cursor Bugbot for commit 2225161.

On Windows wgpu enumerates each physical GPU once per backend (Vulkan +
DX12) plus a CPU-emulated fallback ("Microsoft Basic Render Driver").
Building a mining context for every entry causes VRAM contention and
OOMs the process during benchmark/serve startup (#61).

Instead of deduplicating by (vendor, device) PCI IDs - which would
collapse rigs with multiple identical cards into a single context - drop
CPU-emulated adapters and keep all adapters from the highest-ranked
backend present (Vulkan/Metal, then DX12). Within a single backend each
physical GPU appears exactly once, so identical cards are preserved by
construction and no physical-ID matching is needed.
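
A small sketch of the rejected approach shows why it loses cards. The helper `dedup_by_pci_id` is hypothetical (it is not code from this repository), and the device ID used is illustrative.

```rust
use std::collections::HashSet;

/// Hypothetical dedup keyed on PCI (vendor, device) IDs:
/// keeps the first index seen for each pair.
fn dedup_by_pci_id(ids: &[(u32, u32)]) -> Vec<usize> {
    let mut seen = HashSet::new();
    ids.iter()
        .enumerate()
        .filter(|(_, id)| seen.insert(**id)) // insert returns false for a repeated pair
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Two identical cards in one rig, each enumerated once on a single backend.
    let rig = [(0x10de_u32, 0x2684_u32), (0x10de, 0x2684)];
    // Both share the same ID pair, so only one context would survive.
    assert_eq!(dedup_by_pci_id(&rig), vec![0]);

    // A backend that reports vendor/device = 0 merges unrelated GPUs as well.
    let zeros = [(0u32, 0u32), (0, 0)];
    assert_eq!(dedup_by_pci_id(&zeros), vec![0]);
}
```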

Selected adapters are ordered discrete-first so `--gpu-devices 1` picks
the discrete card on hybrid laptops. Skipped adapters are logged at info
level; if nothing usable remains, init fails with an explicit error.

Selection logic is a pure index-based function unit-tested against the
exact enumeration reported in #61, identical multi-GPU rigs, DX12-only
machines, and software-only environments.
illuzen merged commit d99be29 into main Jun 30, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

benchmark/serve OOM under wgpu when auto-detect picks all adapters on multi-GPU Windows