Comfy-Org · Kosinkadink · Apr 9, 2026 · Apr 10, 2026
diff --git a/MULTIGPU_PLAN.md b/MULTIGPU_PLAN.md
@@ -0,0 +1,139 @@
+# comfy-aimdo Multi-GPU Support Plan
+
+## Problem
+aimdo's VRAM management assumes a single GPU. With 2+ GPUs, `budget_deficit()` uses global `total_vram_usage` (sum of ALL GPUs) against one GPU's `vram_capacity`, creating a phantom deficit that triggers constant unnecessary eviction. Result: 2-GPU is **slower** than 1-GPU (79s vs 56s), while no-aimdo 2-GPU runs at 37s.
+
+## Benchmark Baseline (2×RTX 4090, Qwen-Image 38GB, CFG=7, 20 steps)
+| Config | Time | vs 1-GPU | Status |
+|--------|------|----------|--------|
+| 1-GPU + aimdo | 56.39s | 1.00× | ✅ stable |
+| 2-GPU + aimdo (original) | — | — | 💥 segfault |
+| 2-GPU + aimdo (mutex only) | 95.39s | 0.59× | ✅ stable |
+| 2-GPU + aimdo (dev-aware eviction) | 78.85s | 0.72× | ✅ stable |
+| 2-GPU no aimdo | 37.32s | 1.51× | ❌ CUDA errors after ~2 runs |
+| **2-GPU + aimdo (Phases 1–5)** | **41.29s** | **1.37×** | ✅ stable (7/7 runs) |
+
+## Branch
+`multigpu-thread-safety` — contains mutex + device-aware eviction (current stable baseline).
+
+---
+
+## Phase 1: Per-Device State Object
+**Status:** ✅ Done
+
+Create `AimdoDeviceState` struct with per-device fields currently stored as globals:
+```c
+typedef struct {
+    bool inited;
+    uint64_t vram_capacity;
+    CUcontext ctx;
+    uint64_t usage_last_check;
+    ssize_t deficit_sync;
+    uint64_t last_check_tick;
+    const char *prevailing_method;
+} AimdoDeviceState;
+
+extern AimdoDeviceState g_dev[AIMDO_MAX_DEVICES];
+```
+
+Changes:
+- **`control.c`**: Define `g_dev[]`. Split `init()` into global-once + per-device init. Add `ensure_device_init(device)` for lazy init.
+- **`plat.h`**: Declare the struct and extern. Keep `total_vram_usage` for diagnostics only.
+- **`model-vbar.c`**: Call `ensure_device_init(mv->device)` in `vbar_allocate()`.
+- **`vbar_allocate()`**: Use `g_dev[device].vram_capacity` instead of global `vram_capacity` for VBAR sizing.
+
+Notes:
+- ComfyUI currently calls `init_device(device_0_index)` once. We don't need to change the Python API — lazy init handles other devices.
+- Windows: store WDDM adapter/node per device (future — only if testing on multi-GPU Windows).
+
+---
+
+## Phase 2: Context Safety
+**Status:** ✅ Done
+
+Add save/restore context helpers:
+```c
+bool with_device_ctx(int device, CUcontext *prev);
+void restore_ctx(CUcontext prev);
+```
+
+Wrap all context-sensitive CUDA calls:
+- `cuMemGetInfo` in `poll_budget_deficit` / `cuda_budget_deficit`
+- `cuCtxSynchronize` in `vbars_free_locked_dev`, `vbars_free_for_vbar`, `vbar_fault`, `vbar_unpin`, `vbar_free`, `vbar_free_memory`
+- Fix Linux `ensure_ctx()` so that it does not override PyTorch's context when that context is already set for device 1
+
+Key principle: **never leave a different context active than what the caller had**.
+
+---
+
+## Phase 3: Per-Device Hybrid Budget
+**Status:** ✅ Done
+
+Replace global `budget_deficit()` in `vbar_fault` with per-device version using `AimdoDeviceState`:
+```c
+size_t budget_deficit_dev(size_t size, int device) {
+    AimdoDeviceState *s = &g_dev[device];
+    uint64_t usage = dev_vram_usage[device];
+    poll_budget_deficit_dev(device);  // cuMemGetInfo with correct ctx
+    ssize_t simple = (ssize_t)(usage + HEADROOM + size) - (ssize_t)s->vram_capacity;
+    ssize_t delta = s->deficit_sync + (ssize_t)usage - (ssize_t)s->usage_last_check + (ssize_t)size;
+    return (size_t)MAX(MAX(simple, delta), 0);
+}
+```
+
+**Critical**: Must keep the `cuMemGetInfo` backstop (hybrid approach). Pure accounting OOM'd in testing because `cudaFreeAsync` decrements counters before memory is actually reusable.
+
+Also add `poll_budget_deficit_dev(device)`, which calls `cuMemGetInfo` with the correct per-device context (depends on Phase 2).
+
+---
+
+## Phase 4: Allocator Hooks Device-Aware
+**Status:** ✅ Done
+
+In `aimdo_cuda_malloc` / `aimdo_cuda_malloc_async`:
+- Determine device via `current_cuda_device()`
+- Call `ensure_device_init(dev)`
+- Use `vbars_free_dev(budget_deficit_dev(size, dev), dev)` instead of global
+- OOM retry path: also evict from same device only
+
+Expose `vbars_free_dev()` as the device-filtered wrapper (we already have `vbars_free_locked_dev` internally).
+
+---
+
+## Phase 5: Counter Thread Safety
+**Status:** ✅ Done
+
+`dev_vram_add()` / `dev_vram_sub()` are called under different locks, or with no lock at all, so their counter updates can race. Fix with atomics:
+```c
+static inline void dev_vram_add(int device, size_t size) {
+    __atomic_add_fetch(&total_vram_usage, size, __ATOMIC_RELAXED);
+    if (device >= 0 && device < AIMDO_MAX_DEVICES)
+        __atomic_add_fetch(&dev_vram_usage[device], size, __ATOMIC_RELAXED);
+}
+```
+Windows equivalent: `InterlockedExchangeAdd64`.
+
+---
+
+## Phase 6 (Future): Per-Device Locking
+**Status:** ⬜ Not started. Only pursue if runtime is still more than 1.3× the no-aimdo time after Phases 1–5.
+
+The global `vbar_lock` serializes both GPUs. `cuCtxSynchronize()` is called while holding it, blocking the other GPU's `vbar_fault`.
+
+Options:
+- Per-device lock for device-local operations
+- Global lock only for list mutations (insert/remove)
+- Avoid `cuCtxSynchronize` on hot path if possible
+
+---
+
+## Implementation Order
+Phase 1 → 2 → 3 → 4 → 5 → **benchmark** → decide Phase 6
+
+## Expected Outcome
+Each GPU sees its own ~20GB usage vs 24GB capacity. Phantom deficit eliminated. Should approach the no-aimdo 37s target while remaining crash-free.
+
+## Key Learnings from Earlier Attempts
+1. **Can't remove `budget_deficit` pre-check entirely** — aimdo consuming too much VRAM causes `cudaErrorLaunchFailure` crash in pytorch's async allocator.
+2. **Pure per-device accounting (without cuMemGetInfo backstop) causes OOM** — `cudaFreeAsync` decrements counters before memory is actually freed, so accounting under-reports real usage.
+3. **No-aimdo 2-GPU is unstable** — `cudaErrorInvalidValue` after ~2 runs, even with async offload disabled. Aimdo provides stability that pytorch's default offloading doesn't.
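Note on the phantom deficit described in the plan's Problem section: the arithmetic can be checked in isolation. The sketch below is illustrative only and is not code from this PR. The `sketch_*` names, the 1 GiB headroom, and the 1 GiB request size are assumptions, chosen to match the "~20GB usage vs 24GB capacity" figures quoted in the plan.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_GIB (1024ull * 1024ull * 1024ull)

/* Bytes that must be evicted so usage + headroom + request fits in capacity. */
static uint64_t sketch_deficit(uint64_t usage, uint64_t headroom,
                               uint64_t request, uint64_t capacity) {
    uint64_t need = usage + headroom + request;
    return need > capacity ? need - capacity : 0;
}

/* Pre-fix behaviour as described in the plan: usage summed over all GPUs
 * is compared against a single GPU's capacity. */
static uint64_t sketch_global_deficit(const uint64_t *usage, int ndev,
                                      uint64_t headroom, uint64_t request,
                                      uint64_t capacity) {
    uint64_t total = 0;
    for (int i = 0; i < ndev; i++)
        total += usage[i];
    return sketch_deficit(total, headroom, request, capacity);
}

/* Usage example: two 24 GiB GPUs, each holding 20 GiB (assumed figures). */
void sketch_phantom_deficit_demo(void) {
    const uint64_t usage[2] = { 20 * SKETCH_GIB, 20 * SKETCH_GIB };
    const uint64_t cap = 24 * SKETCH_GIB;
    const uint64_t headroom = SKETCH_GIB, req = SKETCH_GIB;

    /* Global accounting: 40 + 1 + 1 - 24 = 18 GiB of eviction requested. */
    assert(sketch_global_deficit(usage, 2, headroom, req, cap) == 18 * SKETCH_GIB);
    /* Per-device accounting: 20 + 1 + 1 <= 24, so nothing needs evicting. */
    assert(sketch_deficit(usage[0], headroom, req, cap) == 0);
}
```

Under these assumed numbers the global comparison asks for 18 GiB of eviction on every check, while neither GPU is actually over budget, which is consistent with the constant unnecessary eviction the plan reports.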
diff --git a/build-linux-docker b/build-linux-docker
@@ -5,6 +5,7 @@ SRCS="src/*.c src-posix/*.c"
 docker build -t manylinux-cuda -f docker/cuda-on-manylinux.Dockerfile .
 docker run --rm -v $(pwd):/project -w /project manylinux-cuda \
     gcc -shared -o comfy_aimdo/aimdo.so -fPIC -Werror \
+    -Isrc \
     -I/usr/local/cuda/include \
     -L/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/  \
     ${SRCS} -lcuda
diff --git a/src/control.c b/src/control.c
@@ -4,10 +4,14 @@
 uint64_t vram_capacity;
 uint64_t total_vram_usage;
 uint64_t total_vram_last_check;
+uint64_t dev_vram_usage[AIMDO_MAX_DEVICES];
 ssize_t deficit_sync;
 const char *prevailing_deficit_method;
 CUcontext aimdo_cuda_ctx;
 
+/* Phase 1: per-device state */
+AimdoDeviceState g_dev[AIMDO_MAX_DEVICES];
+
 bool cuda_budget_deficit() {
     uint64_t now = GET_TICK();
     static uint64_t last_check = 0;
@@ -27,6 +31,117 @@ bool cuda_budget_deficit() {
     return true;
 }
 
+/* Phase 3: per-device cuMemGetInfo with context switching (Phase 2) */
+bool poll_budget_deficit_dev(int device) {
+    if (device < 0 || device >= AIMDO_MAX_DEVICES || !g_dev[device].inited)
+        return false;
+
+    AimdoDeviceState *s = &g_dev[device];
+    uint64_t now = GET_TICK();
+
+    if (now - s->last_check_tick < 2000) {
+        return true;
+    }
+
+    CUcontext prev;
+    if (!with_device_ctx(device, &prev))
+        return false;
+
+    size_t free_vram = 0, total_vram = 0;
+    bool ok = CHECK_CU(cuMemGetInfo(&free_vram, &total_vram));
+
+    restore_ctx(prev);
+
+    if (!ok)
+        return false;
+
+    s->last_check_tick = now;
+    s->usage_last_check = dev_vram_load(device);
+    s->deficit_sync = (ssize_t)VRAM_HEADROOM - (ssize_t)free_vram;
+    s->prevailing_method = "cuMemGetInfo (per-dev)";
+    return true;
+}
+
+/* Phase 1: lazy per-device init, thread-safe via dev_init_lock */
+#if defined(_WIN32) || defined(_WIN64)
+#include <windows.h>
+static CRITICAL_SECTION dev_init_lock;
+static volatile LONG dev_init_lock_ready;
+
+static inline void dev_init_lock_acquire(void) {
+    if (!InterlockedCompareExchange(&dev_init_lock_ready, 1, 0)) {
+        InitializeCriticalSection(&dev_init_lock);
+        InterlockedExchange(&dev_init_lock_ready, 2);
+    }
+    while (dev_init_lock_ready != 2) { /* spin until init done */ }
+    EnterCriticalSection(&dev_init_lock);
+}
+static inline void dev_init_lock_release(void) { LeaveCriticalSection(&dev_init_lock); }
+#else
+#include <pthread.h>
+static pthread_mutex_t dev_init_lock = PTHREAD_MUTEX_INITIALIZER;
+static inline void dev_init_lock_acquire(void) { pthread_mutex_lock(&dev_init_lock); }
+static inline void dev_init_lock_release(void) { pthread_mutex_unlock(&dev_init_lock); }
+#endif
+
+void ensure_device_init(int device) {
+    if (device < 0 || device >= AIMDO_MAX_DEVICES)
+        return;
+
+    AimdoDeviceState *s = &g_dev[device];
+    if (s->inited)
+        return;
+
+    dev_init_lock_acquire();
+
+    /* Double-check after acquiring lock */
+    if (s->inited) {
+        dev_init_lock_release();
+        return;
+    }
+
+    CUdevice dev;
+    if (!CHECK_CU(cuDeviceGet(&dev, device))) {
+        dev_init_lock_release();
+        return;
+    }
+
+    uint64_t cap = 0;
+    if (!CHECK_CU(cuDeviceTotalMem(&cap, dev))) {
+        dev_init_lock_release();
+        return;
+    }
+
+    CUcontext ctx = NULL;
+    if (!CHECK_CU(cuDevicePrimaryCtxRetain(&ctx, dev))) {
+        dev_init_lock_release();
+        return;
+    }
+
+    s->vram_capacity = cap;
+    s->ctx = ctx;
+    s->prevailing_method = "none";
+
+    /* Write inited last with a store barrier so other threads see
+     * fully initialized fields before they see inited == true.
+     */
+#if defined(_WIN32) || defined(_WIN64)
+    MemoryBarrier();
+    s->inited = true;
+#else
+    __atomic_store_n(&s->inited, true, __ATOMIC_RELEASE);
+#endif
+
+    dev_init_lock_release();
+
+    char dev_name[256];
+    if (!CHECK_CU(cuDeviceGetName(dev_name, sizeof(dev_name), dev)))
+        sprintf(dev_name, "<unknown>");
+
+    log(INFO, "comfy-aimdo device %d init: %s (VRAM: %zu MB)\n",
+        device, dev_name, (size_t)(cap / (1024 * 1024)));
+}
+
 SHARED_EXPORT
 void aimdo_analyze() {
     size_t free_bytes = 0, total_bytes = 0;
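Note on how the snapshot written by `poll_budget_deficit_dev()` is meant to be consumed: the plan's Phase 3 combines it with the tracked counters and takes the worse of the two estimates. The following is a minimal, CUDA-free sketch of that combination under stated assumptions. `SketchBudget` is a hypothetical stand-in for the budget fields of `AimdoDeviceState`, all values are signed 64-bit to keep the arithmetic free of mixed-sign conversions, and the sizes used are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_GIB_S (1024ll * 1024ll * 1024ll)

/* Hypothetical stand-in for the budget-related fields of AimdoDeviceState. */
typedef struct {
    int64_t capacity;          /* total device memory, bytes */
    int64_t deficit_sync;      /* headroom minus free bytes at the last poll */
    int64_t usage_last_check;  /* tracked usage at the last poll */
} SketchBudget;

/* Bytes to evict before allocating `size`, given current tracked `usage`.
 * Returns the worse of the accounting estimate and the snapshot estimate. */
static int64_t sketch_hybrid_deficit(const SketchBudget *s, int64_t usage,
                                     int64_t headroom, int64_t size) {
    /* Pure accounting: only sees memory the tracker itself has recorded. */
    int64_t simple = usage + headroom + size - s->capacity;
    /* Snapshot: deficit measured at the last poll, advanced by the change
     * in tracked usage since then, plus the new request. */
    int64_t delta = s->deficit_sync + (usage - s->usage_last_check) + size;
    int64_t worst = simple > delta ? simple : delta;
    return worst > 0 ? worst : 0;
}

void sketch_hybrid_deficit_demo(void) {
    /* Last poll saw 2 GiB free with 10 GiB tracked: deficit_sync = 1 - 2. */
    SketchBudget s = { 24 * SKETCH_GIB_S, -1 * SKETCH_GIB_S, 10 * SKETCH_GIB_S };
    /* Tracked usage is now 11 GiB and 1 GiB is requested.
     * simple = 11 + 1 + 1 - 24 = -11 GiB; delta = -1 + 1 + 1 = 1 GiB. */
    assert(sketch_hybrid_deficit(&s, 11 * SKETCH_GIB_S, SKETCH_GIB_S, SKETCH_GIB_S)
           == 1 * SKETCH_GIB_S);
}
```

In this example the counters alone report 11 GiB of slack, but the last measurement shows the device is nearly full, so the snapshot term wins. That is the case the plan's "Critical" note describes, where memory is in use that the counters do not reflect.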
@@ -37,6 +152,13 @@ void aimdo_analyze() {
     log(DEBUG, "  Aimdo Recorded Usage:  %7zu MB\n", total_vram_usage / M);
     log(DEBUG, "  Cuda:  %7zu MB / %7zu MB Free\n", free_bytes / M, total_bytes / M);
 
+    for (int i = 0; i < AIMDO_MAX_DEVICES; i++) {
+        if (dev_vram_usage[i])
+            log(DEBUG, "  Device %d Usage:  %7zu MB (cap %zu MB)\n",
+                i, (size_t)(dev_vram_usage[i] / M),
+                g_dev[i].inited ? (size_t)(g_dev[i].vram_capacity / M) : 0);
+    }
+
     vbars_analyze(true);
     allocations_analyze();
 }
@@ -65,6 +187,9 @@ bool init(int cuda_device_id) {
         sprintf(dev_name, "<unknown>");
     }
 
+    /* Also populate g_dev for the primary device */
+    ensure_device_init(cuda_device_id);
+
     log(INFO, "comfy-aimdo inited for GPU: %s (VRAM: %zu MB)\n",
         dev_name, (size_t)(vram_capacity / (1024 * 1024)));
     return true;
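Note on the per-device counters: `poll_budget_deficit_dev()` reads usage through `dev_vram_load()`, and the plan's Phase 5 shows `dev_vram_add()`, but the subtract and load helpers do not appear in the hunks above. A plausible shape for the full set is sketched below. This is an assumption rather than the PR's code: it uses the GCC/Clang `__atomic` builtins that the plan itself uses, and every `sketch_*` name and the device limit of 8 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DEVICES 8  /* hypothetical; stands in for AIMDO_MAX_DEVICES */

static uint64_t sketch_total_usage;
static uint64_t sketch_dev_usage[SKETCH_MAX_DEVICES];

/* Record an allocation: bump the global total and, if valid, the device slot. */
static inline void sketch_vram_add(int device, size_t size) {
    __atomic_add_fetch(&sketch_total_usage, (uint64_t)size, __ATOMIC_RELAXED);
    if (device >= 0 && device < SKETCH_MAX_DEVICES)
        __atomic_add_fetch(&sketch_dev_usage[device], (uint64_t)size, __ATOMIC_RELAXED);
}

/* Record a free: mirror image of sketch_vram_add(). */
static inline void sketch_vram_sub(int device, size_t size) {
    __atomic_sub_fetch(&sketch_total_usage, (uint64_t)size, __ATOMIC_RELAXED);
    if (device >= 0 && device < SKETCH_MAX_DEVICES)
        __atomic_sub_fetch(&sketch_dev_usage[device], (uint64_t)size, __ATOMIC_RELAXED);
}

/* Read one device's tracked usage; out-of-range devices report zero. */
static inline uint64_t sketch_vram_load(int device) {
    if (device < 0 || device >= SKETCH_MAX_DEVICES)
        return 0;
    return __atomic_load_n(&sketch_dev_usage[device], __ATOMIC_RELAXED);
}

void sketch_vram_counters_demo(void) {
    uint64_t before = sketch_vram_load(1);
    sketch_vram_add(1, 4096);
    assert(sketch_vram_load(1) == before + 4096);
    sketch_vram_sub(1, 4096);
    assert(sketch_vram_load(1) == before);
    assert(sketch_vram_load(-1) == 0);
}
```

Relaxed ordering is enough here only because each counter is an independent statistic. It guarantees that no update is lost, but not that a reader sees the total and the per-device value at the same instant, which matches the plan's intent of keeping `total_vram_usage` for diagnostics only.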