diff --git a/MULTIGPU_PLAN.md b/MULTIGPU_PLAN.md
new file mode 100644
index 0000000..7ba763c
--- /dev/null
+++ b/MULTIGPU_PLAN.md
@@ -0,0 +1,139 @@
+# comfy-aimdo Multi-GPU Support Plan
+
+## Problem
+aimdo's VRAM management assumes a single GPU. With 2+ GPUs, `budget_deficit()` uses global `total_vram_usage` (sum of ALL GPUs) against one GPU's `vram_capacity`, creating a phantom deficit that triggers constant unnecessary eviction. Result: 2-GPU is **slower** than 1-GPU (79s vs 56s), while no-aimdo 2-GPU runs at 37s.
+
+## Benchmark Baseline (2×RTX 4090, Qwen-Image 38GB, CFG=7, 20 steps)
+| Config | Time | vs 1-GPU | Status |
+|--------|------|----------|--------|
+| 1-GPU + aimdo | 56.39s | 1.00× | ✅ stable |
+| 2-GPU + aimdo (original) | — | — | 💥 segfault |
+| 2-GPU + aimdo (mutex only) | 95.39s | 0.59× | ✅ stable |
+| 2-GPU + aimdo (dev-aware eviction) | 78.85s | 0.72× | ✅ stable |
+| 2-GPU no aimdo | 37.32s | 1.51× | ❌ CUDA errors after ~2 runs |
+| **2-GPU + aimdo (Phases 1–5)** | **41.29s** | **1.37×** | ✅ stable (7/7 runs) |
+
+## Branch
+`multigpu-thread-safety` — contains mutex + device-aware eviction (current stable baseline).
+
+---
+
+## Phase 1: Per-Device State Object
+**Status:** ✅ Done
+
+Create `AimdoDeviceState` struct with per-device fields currently stored as globals:
+```c
+typedef struct {
+    bool inited;
+    uint64_t vram_capacity;
+    CUcontext ctx;
+    uint64_t usage_last_check;
+    ssize_t deficit_sync;
+    uint64_t last_check_tick;
+    const char *prevailing_method;
+} AimdoDeviceState;
+
+extern AimdoDeviceState g_dev[AIMDO_MAX_DEVICES];
+```
+
+Changes:
+- **`control.c`**: Define `g_dev[]`. Split `init()` into global-once + per-device init. Add `ensure_device_init(device)` for lazy init.
+- **`plat.h`**: Declare the struct and extern. Keep `total_vram_usage` for diagnostics only.
+- **`model-vbar.c`**: Call `ensure_device_init(mv->device)` in `vbar_allocate()`.
+- **`vbar_allocate()`**: Use `g_dev[device].vram_capacity` instead of global `vram_capacity` for VBAR sizing.
+
+Notes:
+- ComfyUI currently calls `init_device(device_0_index)` once. We don't need to change the Python API — lazy init handles other devices.
+- Windows: store WDDM adapter/node per device (future — only if testing on multi-GPU Windows).
+
+---
+
+## Phase 2: Context Safety
+**Status:** ✅ Done
+
+Add save/restore context helpers:
+```c
+bool with_device_ctx(int device, CUcontext *prev);
+void restore_ctx(CUcontext prev);
+```
+
+Wrap all context-sensitive CUDA calls:
+- `cuMemGetInfo` in `poll_budget_deficit` / `cuda_budget_deficit`
+- `cuCtxSynchronize` in `vbars_free_locked_dev`, `vbars_free_for_vbar`, `vbar_fault`, `vbar_unpin`, `vbar_free`, `vbar_free_memory`
+- Fix Linux `ensure_ctx()` to not override pytorch's context when it's already set for device 1
+
+Key principle: **never leave a different context active than what the caller had**.
+
+---
+
+## Phase 3: Per-Device Hybrid Budget
+**Status:** ✅ Done
+
+Replace global `budget_deficit()` in `vbar_fault` with per-device version using `AimdoDeviceState`:
+```c
+size_t budget_deficit_dev(size_t size, int device) {
+    AimdoDeviceState *s = &g_dev[device];
+    uint64_t usage = dev_vram_usage[device];
+    poll_budget_deficit_dev(device);  // cuMemGetInfo with correct ctx
+    ssize_t simple = (ssize_t)(usage + HEADROOM + size) - (ssize_t)s->vram_capacity;
+    ssize_t delta = s->deficit_sync + (ssize_t)usage - (ssize_t)s->usage_last_check + size;
+    return (size_t)MAX(MAX(simple, delta), 0);
+}
+```
+
+**Critical**: Must keep the `cuMemGetInfo` backstop (hybrid approach). Pure accounting OOM'd in testing because `cudaFreeAsync` decrements counters before memory is actually reusable.
+
+Also make `poll_budget_deficit_dev(device)` — calls `cuMemGetInfo` with the correct per-device context (depends on Phase 2).
+
+---
+
+## Phase 4: Allocator Hooks Device-Aware
+**Status:** ✅ Done
+
+In `aimdo_cuda_malloc` / `aimdo_cuda_malloc_async`:
+- Determine device via `current_cuda_device()`
+- Call `ensure_device_init(dev)`
+- Use `vbars_free_dev(budget_deficit_dev(size, dev), dev)` instead of global
+- OOM retry path: also evict from same device only
+
+Expose `vbars_free_dev()` as the device-filtered wrapper (we already have `vbars_free_locked_dev` internally).
+
+---
+
+## Phase 5: Counter Thread Safety
+**Status:** ✅ Done
+
+`dev_vram_add/sub` run under different locks or no lock. Fix with atomics:
+```c
+static inline void dev_vram_add(int device, size_t size) {
+    __atomic_add_fetch(&total_vram_usage, size, __ATOMIC_RELAXED);
+    if (device >= 0 && device < AIMDO_MAX_DEVICES)
+        __atomic_add_fetch(&dev_vram_usage[device], size, __ATOMIC_RELAXED);
+}
+```
+Windows equivalent: `InterlockedExchangeAdd64`.
+
+---
+
+## Phase 6 (Future): Per-Device Locking
+**Status:** ⬜ Not started — only pursue if performance still >1.3× worse than no-aimdo after Phases 1–5
+
+The global `vbar_lock` serializes both GPUs. `cuCtxSynchronize()` is called while holding it, blocking the other GPU's `vbar_fault`.
+
+Options:
+- Per-device lock for device-local operations
+- Global lock only for list mutations (insert/remove)
+- Avoid `cuCtxSynchronize` on hot path if possible
+
+---
+
+## Implementation Order
+Phase 1 → 2 → 3 → 4 → 5 → **benchmark** → decide Phase 6
+
+## Expected Outcome
+Each GPU sees its own ~20GB usage vs 24GB capacity. Phantom deficit eliminated. Should approach the no-aimdo 37s target while remaining crash-free.
+
+## Key Learnings from Earlier Attempts
+1. **Can't remove `budget_deficit` pre-check entirely** — aimdo consuming too much VRAM causes `cudaErrorLaunchFailure` crash in pytorch's async allocator.
+2. **Pure per-device accounting (without cuMemGetInfo backstop) causes OOM** — `cudaFreeAsync` decrements counters before memory is actually freed, so accounting under-reports real usage.
+3. **No-aimdo 2-GPU is unstable** — `cudaErrorInvalidValue` after ~2 runs, even with async offload disabled. Aimdo provides stability that pytorch's default offloading doesn't.
diff --git a/build-linux-docker b/build-linux-docker
index 0ffd8e4..f02eae8 100644
--- a/build-linux-docker
+++ b/build-linux-docker
@@ -5,6 +5,7 @@ SRCS="src/*.c src-posix/*.c"
 docker build -t manylinux-cuda -f docker/cuda-on-manylinux.Dockerfile .
 docker run --rm -v $(pwd):/project -w /project manylinux-cuda \
     gcc -shared -o comfy_aimdo/aimdo.so -fPIC -Werror \
+    -Isrc \
     -I/usr/local/cuda/include \
     -L/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/  \
     ${SRCS} -lcuda
diff --git a/src/control.c b/src/control.c
index 1ed6c54..c3075f1 100644
--- a/src/control.c
+++ b/src/control.c
@@ -4,10 +4,14 @@
 uint64_t vram_capacity;
 uint64_t total_vram_usage;
 uint64_t total_vram_last_check;
+uint64_t dev_vram_usage[AIMDO_MAX_DEVICES];
 ssize_t deficit_sync;
 const char *prevailing_deficit_method;
 CUcontext aimdo_cuda_ctx;
 
+/* Phase 1: per-device state */
+AimdoDeviceState g_dev[AIMDO_MAX_DEVICES];
+
 bool cuda_budget_deficit() {
     uint64_t now = GET_TICK();
     static uint64_t last_check = 0;
@@ -27,6 +31,117 @@ bool cuda_budget_deficit() {
     return true;
 }
 
+/* Phase 3: per-device cuMemGetInfo with context switching (Phase 2) */
+bool poll_budget_deficit_dev(int device) {
+    if (device < 0 || device >= AIMDO_MAX_DEVICES || !g_dev[device].inited)
+        return false;
+
+    AimdoDeviceState *s = &g_dev[device];
+    uint64_t now = GET_TICK();
+
+    if (now - s->last_check_tick < 2000) {
+        return true;
+    }
+
+    CUcontext prev;
+    if (!with_device_ctx(device, &prev))
+        return false;
+
+    size_t free_vram = 0, total_vram = 0;
+    bool ok = CHECK_CU(cuMemGetInfo(&free_vram, &total_vram));
+
+    restore_ctx(prev);
+
+    if (!ok)
+        return false;
+
+    s->last_check_tick = now;
+    s->usage_last_check = dev_vram_load(device);
+    s->deficit_sync = (ssize_t)VRAM_HEADROOM - (ssize_t)free_vram;
+    s->prevailing_method = "cuMemGetInfo (per-dev)";
+    return true;
+}
+
+/* Phase 1: lazy per-device init — thread-safe via init_lock */
+#if defined(_WIN32) || defined(_WIN64)
+#include <windows.h>
+static CRITICAL_SECTION dev_init_lock;
+static volatile LONG dev_init_lock_ready;
+
+static inline void dev_init_lock_acquire(void) {
+    if (!InterlockedCompareExchange(&dev_init_lock_ready, 1, 0)) {
+        InitializeCriticalSection(&dev_init_lock);
+        InterlockedExchange(&dev_init_lock_ready, 2);
+    }
+    while (dev_init_lock_ready != 2) { /* spin until init done */ }
+    EnterCriticalSection(&dev_init_lock);
+}
+static inline void dev_init_lock_release(void) { LeaveCriticalSection(&dev_init_lock); }
+#else
+#include <pthread.h>
+static pthread_mutex_t dev_init_lock = PTHREAD_MUTEX_INITIALIZER;
+static inline void dev_init_lock_acquire(void) { pthread_mutex_lock(&dev_init_lock); }
+static inline void dev_init_lock_release(void) { pthread_mutex_unlock(&dev_init_lock); }
+#endif
+
+void ensure_device_init(int device) {
+    if (device < 0 || device >= AIMDO_MAX_DEVICES)
+        return;
+
+    AimdoDeviceState *s = &g_dev[device];
+    if (s->inited)
+        return;
+
+    dev_init_lock_acquire();
+
+    /* Double-check after acquiring lock */
+    if (s->inited) {
+        dev_init_lock_release();
+        return;
+    }
+
+    CUdevice dev;
+    if (!CHECK_CU(cuDeviceGet(&dev, device))) {
+        dev_init_lock_release();
+        return;
+    }
+
+    uint64_t cap = 0;
+    if (!CHECK_CU(cuDeviceTotalMem(&cap, dev))) {
+        dev_init_lock_release();
+        return;
+    }
+
+    CUcontext ctx = NULL;
+    if (!CHECK_CU(cuDevicePrimaryCtxRetain(&ctx, dev))) {
+        dev_init_lock_release();
+        return;
+    }
+
+    s->vram_capacity = cap;
+    s->ctx = ctx;
+    s->prevailing_method = "none";
+
+    /* Write inited last with a store barrier so other threads see
+     * fully initialized fields before they see inited == true.
+     */
+#if defined(_WIN32) || defined(_WIN64)
+    MemoryBarrier();
+    s->inited = true;
+#else
+    __atomic_store_n(&s->inited, true, __ATOMIC_RELEASE);
+#endif
+
+    dev_init_lock_release();
+
+    char dev_name[256];
+    if (!CHECK_CU(cuDeviceGetName(dev_name, sizeof(dev_name), dev)))
+        sprintf(dev_name, "<unknown>");
+
+    log(INFO, "comfy-aimdo device %d init: %s (VRAM: %zu MB)\n",
+        device, dev_name, (size_t)(cap / (1024 * 1024)));
+}
+
 SHARED_EXPORT
 void aimdo_analyze() {
     size_t free_bytes = 0, total_bytes = 0;
@@ -37,6 +152,13 @@ void aimdo_analyze() {
     log(DEBUG, "  Aimdo Recorded Usage:  %7zu MB\n", total_vram_usage / M);
     log(DEBUG, "  Cuda:  %7zu MB / %7zu MB Free\n", free_bytes / M, total_bytes / M);
 
+    for (int i = 0; i < AIMDO_MAX_DEVICES; i++) {
+        if (dev_vram_usage[i])
+            log(DEBUG, "  Device %d Usage:  %7zu MB (cap %zu MB)\n",
+                i, (size_t)(dev_vram_usage[i] / M),
+                g_dev[i].inited ? (size_t)(g_dev[i].vram_capacity / M) : 0);
+    }
+
     vbars_analyze(true);
     allocations_analyze();
 }
@@ -65,6 +187,9 @@ bool init(int cuda_device_id) {
         sprintf(dev_name, "<unknown>");
     }
 
+    /* Also populate g_dev for the primary device */
+    ensure_device_init(cuda_device_id);
+
     log(INFO, "comfy-aimdo inited for GPU: %s (VRAM: %zu MB)\n",
         dev_name, (size_t)(vram_capacity / (1024 * 1024)));
     return true;
diff --git a/src/model-vbar.c b/src/model-vbar.c
index 26db1c4..d24cd29 100644
--- a/src/model-vbar.c
+++ b/src/model-vbar.c
@@ -5,6 +5,28 @@
 #define VBAR_GET_PAGE_NR(x) ((x) / VBAR_PAGE_SIZE)
 #define VBAR_GET_PAGE_NR_UP(x) VBAR_GET_PAGE_NR((x) + VBAR_PAGE_SIZE - 1)
 
+/* ---- global lock for the priority linked list and vbars_dirty ---- */
+#if defined(_WIN32) || defined(_WIN64)
+#include <windows.h>
+static CRITICAL_SECTION vbar_lock;
+static volatile LONG vbar_lock_init;
+
+static inline void vbar_list_lock(void) {
+    if (!InterlockedCompareExchange(&vbar_lock_init, 1, 0)) {
+        InitializeCriticalSection(&vbar_lock);
+        InterlockedExchange(&vbar_lock_init, 2);
+    }
+    while (vbar_lock_init != 2) { /* spin until init done */ }
+    EnterCriticalSection(&vbar_lock);
+}
+static inline void vbar_list_unlock(void) { LeaveCriticalSection(&vbar_lock); }
+#else
+#include <pthread.h>
+static pthread_mutex_t vbar_lock = PTHREAD_MUTEX_INITIALIZER;
+static inline void vbar_list_lock(void) { pthread_mutex_lock(&vbar_lock); }
+static inline void vbar_list_unlock(void) { pthread_mutex_unlock(&vbar_lock); }
+#endif
+
 typedef struct ResidentPage {
     CUmemGenericAllocationHandle handle;
     bool pinned;
@@ -44,8 +66,10 @@ SHARED_EXPORT
 uint64_t vbars_analyze(bool only_dirty) {
     size_t calculated_total_vram = 0;
 
+    vbar_list_lock();
     one_time_setup();
     if (only_dirty && !vbars_dirty) {
+        vbar_list_unlock();
         return 0;
     }
     vbars_dirty = false;
@@ -83,6 +107,7 @@ uint64_t vbars_analyze(bool only_dirty) {
     }
 
     log(DEBUG, "Total VRAM for VBARs: %zu MB\n", calculated_total_vram / M);
+    vbar_list_unlock();
     return (uint64_t)calculated_total_vram;
 }
 
@@ -94,7 +119,7 @@ static inline bool mod1(ModelVBAR *mv, size_t page_nr, bool do_free, bool do_unp
     if (do_free) {
         CHECK_CU(cuMemUnmap(vaddr, VBAR_PAGE_SIZE));
         CHECK_CU(cuMemRelease(rp->handle));
-        total_vram_usage -= VBAR_PAGE_SIZE;
+        dev_vram_sub(mv->device, VBAR_PAGE_SIZE);
         rp->handle = 0;
         mv->resident_count--;
     }
@@ -104,9 +129,12 @@ static inline bool mod1(ModelVBAR *mv, size_t page_nr, bool do_free, bool do_unp
     return do_free;
 }
 
-size_t vbars_free(size_t size) {
+/* Must be called with vbar_list_lock held.
+ * device_filter: -1 = evict from any device, >= 0 = only evict from that device.
+ */
+static size_t vbars_free_locked_dev(size_t size, int device_filter) {
     size_t pages_needed = VBAR_GET_PAGE_NR_UP(size);
-    bool dirty = false;
+    bool synced[AIMDO_MAX_DEVICES] = {false};
 
     one_time_setup();
     vbars_dirty = true;
@@ -117,10 +145,16 @@ size_t vbars_free(size_t size) {
 
     for (ModelVBAR *i = lowest_priority.higher; pages_needed && i != &highest_priority;
          i = i->higher) {
+        if (device_filter >= 0 && i->device != device_filter)
+            continue;
         for (;pages_needed && i->watermark > i->watermark_limit; i->watermark--) {
-            if (!dirty) {
+            int dev = i->device;
+            if (dev >= 0 && dev < AIMDO_MAX_DEVICES && !synced[dev]) {
+                CUcontext prev;
+                with_device_ctx(dev, &prev);
                 CHECK_CU(cuCtxSynchronize());
-                dirty = true;
+                restore_ctx(prev);
+                synced[dev] = true;
             }
             if (mod1(i, i->watermark - 1, true, false)) {
                 pages_needed--;
@@ -131,6 +165,24 @@ size_t vbars_free(size_t size) {
     return pages_needed;
 }
 
+size_t vbars_free(size_t size) {
+    size_t ret;
+
+    vbar_list_lock();
+    ret = vbars_free_locked_dev(size, -1);
+    vbar_list_unlock();
+    return ret;
+}
+
+size_t vbars_free_dev(size_t size, int device) {
+    size_t ret;
+
+    vbar_list_lock();
+    ret = vbars_free_locked_dev(size, device);
+    vbar_list_unlock();
+    return ret;
+}
+
 static inline size_t move_cursor_to_absent(ModelVBAR *mv, size_t cursor) {
     while (cursor < mv->watermark && mv->residency_map[cursor].handle) {
         cursor++;
@@ -140,12 +192,18 @@ static inline size_t move_cursor_to_absent(ModelVBAR *mv, size_t cursor) {
 
 static void vbars_free_for_vbar(ModelVBAR *mv, size_t target) {
     size_t cursor = move_cursor_to_absent(mv, 0);
+    int device_filter = mv->device;
 
+    CUcontext prev;
+    with_device_ctx(device_filter, &prev);
     CHECK_CU(cuCtxSynchronize());
+    restore_ctx(prev);
 
     for (ModelVBAR *i = lowest_priority.higher;
          cursor < target && cursor < mv->watermark && i != &highest_priority;
          i = i->higher) {
+        if (i->device != device_filter)
+            continue;
         for (; cursor < target && cursor < mv->watermark && i->watermark > i->watermark_limit;
              i->watermark--) {
             if (mod1(i, i->watermark - 1, true, false)) {
@@ -178,13 +236,18 @@ SHARED_EXPORT
 void *vbar_allocate(uint64_t size, int device) {
     ModelVBAR *mv;
 
-    one_time_setup();
     log_reset_shots();
     log(DEBUG, "%s (start): size=%zuM, device=%d\n", __func__, size / M, device);
-    vbars_dirty = true;
+
+    /* Phase 1: lazy init for this device */
+    ensure_device_init(device);
+
+    /* Use per-device capacity if available, else global */
+    uint64_t cap = (device >= 0 && device < AIMDO_MAX_DEVICES && g_dev[device].inited)
+                   ? g_dev[device].vram_capacity : vram_capacity;
 
     size_t nr_pages = VBAR_GET_PAGE_NR_UP(size);
-    size_t nr_pages_max = VBAR_GET_PAGE_NR(vram_capacity);
+    size_t nr_pages_max = VBAR_GET_PAGE_NR(cap);
     if (nr_pages_max < nr_pages) {
         nr_pages = nr_pages_max;
     }
@@ -204,8 +267,12 @@ void *vbar_allocate(uint64_t size, int device) {
 
     mv->device = device;
     mv->nr_pages = mv->watermark = nr_pages;
-    
+
+    vbar_list_lock();
+    one_time_setup();
+    vbars_dirty = true;
     insert_vbar(mv);
+    vbar_list_unlock();
 
     log(DEBUG, "%s (return): vbar=%p\n", __func__, (void *)mv);
     return mv;
@@ -216,17 +283,21 @@ void vbar_set_watermark_limit(void *vbar, uint64_t size) {
     ModelVBAR *mv = (ModelVBAR *)vbar;
 
     log(DEBUG, "%s: size=%zu\n", __func__, size);
+    vbar_list_lock();
     mv->watermark_limit = VBAR_GET_PAGE_NR_UP(size);
+    vbar_list_unlock();
 }
 
 SHARED_EXPORT
 void vbars_reset_watermark_limits() {
+    vbar_list_lock();
     one_time_setup();
     log(VERBOSE, "%s\n", __func__);
 
     for (ModelVBAR *i = lowest_priority.higher; i && i != &highest_priority; i = i->higher) {
         i->watermark_limit = 0;
     }
+    vbar_list_unlock();
 }
 
 SHARED_EXPORT
@@ -234,14 +305,15 @@ void vbar_prioritize(void *vbar) {
     ModelVBAR *mv = (ModelVBAR *)vbar;
 
     log(DEBUG, "%s vbar=%p\n", __func__, vbar);
-    vbars_dirty = true;
 
     log_reset_shots();
 
+    vbar_list_lock();
+    vbars_dirty = true;
     remove_vbar(mv);
     insert_vbar(mv);
-
     mv->watermark = mv->nr_pages;
+    vbar_list_unlock();
 }
 
 SHARED_EXPORT
@@ -249,12 +321,14 @@ void vbar_deprioritize(void *vbar) {
     ModelVBAR *mv = (ModelVBAR *)vbar;
 
     log(DEBUG, "%s vbar=%p\n", __func__, vbar);
-    vbars_dirty = true;
 
     log_reset_shots();
 
+    vbar_list_lock();
+    vbars_dirty = true;
     remove_vbar(mv);
     insert_vbar_last(mv);
+    vbar_list_unlock();
 }
 
 SHARED_EXPORT
@@ -276,16 +350,18 @@ int vbar_fault(void *vbar, uint64_t offset, uint64_t size, uint32_t *signature)
     size_t page_end = VBAR_GET_PAGE_NR_UP(offset + size);
 
     log(VVERBOSE, "%s (start): offset=%lldk, size=%lldk\n", __func__, (ull)(offset / K), (ull)(size / K));
+
+    vbar_list_lock();
     vbars_dirty = true;
 
-    /* Stopgap. If the we get a bad shared memory spike, collect it here on the next layer
-     * as the allocator is unreliable as it may not actually be called reliably when you
-     * really need to know you have spilled.
+    /* Stopgap: use per-device budget to avoid phantom cross-GPU deficit.
+     * Only evict pages from the same device to avoid cross-GPU thrashing.
      */
-    vbars_free(budget_deficit(0));
+    vbars_free_locked_dev(budget_deficit_dev(0, mv->device), mv->device);
 
     if (page_end > mv->watermark) {
         log(VVERBOSE, "VBAR Allocation is above watermark\n");
+        vbar_list_unlock();
         return VBAR_FAULT_OOM;
     }
 
@@ -301,20 +377,23 @@ int vbar_fault(void *vbar, uint64_t offset, uint64_t size, uint32_t *signature)
 
         log(VERBOSE, "VBAR needs to allocate VRAM for page %d\n", (int)page_nr);
 
-        if (budget_deficit(VBAR_PAGE_SIZE) ||
+        if (budget_deficit_dev(VBAR_PAGE_SIZE, mv->device) ||
             (err = three_stooges(vaddr, VBAR_PAGE_SIZE, mv->device, &rp->handle)) != CUDA_SUCCESS) {
             if (err != CUDA_ERROR_OUT_OF_MEMORY) {
                 log(ERROR, "VRAM Allocation failed (non OOM)\n");
+                vbar_list_unlock();
                 return VBAR_FAULT_ERROR;
             }
             log(DEBUG, "VBAR allocator attempt exceeds available VRAM ...\n");
             vbars_free_for_vbar(mv, page_end);
             if (page_nr >= mv->watermark) {
                 log(DEBUG, "VBAR allocation cancelled due to watermark reduction\n");
+                vbar_list_unlock();
                 return VBAR_FAULT_OOM;
             }
             if ((err = three_stooges(vaddr, VBAR_PAGE_SIZE, mv->device, &rp->handle)) != CUDA_SUCCESS) {
                 log(ERROR, "VRAM Allocation failed\n");
+                vbar_list_unlock();
                 return VBAR_FAULT_ERROR;
             }
         }
@@ -330,6 +409,7 @@ int vbar_fault(void *vbar, uint64_t offset, uint64_t size, uint32_t *signature)
         rp->pinned = true;
     }
 
+    vbar_list_unlock();
     log(VVERBOSE, "%s (return) %d\n", __func__, ret);
     return ret;
 }
@@ -339,16 +419,22 @@ void vbar_unpin(void *vbar, uint64_t offset, uint64_t size) {
     ModelVBAR *mv = (ModelVBAR *)vbar;
 
     log(VVERBOSE, "%s (start): offset=%lldk, size=%lldk\n", __func__, (ull)(offset / K), (ull)(size / K));
+
+    vbar_list_lock();
     vbars_dirty = true;
     size_t page_end = VBAR_GET_PAGE_NR_UP(offset + size);
 
     if (page_end > mv->watermark) {
+        CUcontext prev;
+        with_device_ctx(mv->device, &prev);
         CHECK_CU(cuCtxSynchronize());
+        restore_ctx(prev);
     }
 
     for (uint64_t page_nr = VBAR_GET_PAGE_NR(offset); page_nr < page_end && page_nr < mv->nr_pages; page_nr++) {
         mod1(mv, page_nr, page_nr >= mv->watermark, true);
     }
+    vbar_list_unlock();
 }
 
 SHARED_EXPORT
@@ -356,14 +442,23 @@ void vbar_free(void *vbar) {
     ModelVBAR *mv = (ModelVBAR *)vbar;
 
     log(DEBUG, "%s: vbar=%p\n", __func__, vbar);
+
+    vbar_list_lock();
     vbars_dirty = true;
 
-    CHECK_CU(cuCtxSynchronize());
+    {
+        CUcontext prev;
+        with_device_ctx(mv->device, &prev);
+        CHECK_CU(cuCtxSynchronize());
+        restore_ctx(prev);
+    }
 
     for (uint64_t page_nr = 0; page_nr < mv->nr_pages; page_nr++) {
         mod1(mv, page_nr, true, true);
     }
     remove_vbar(mv);
+    vbar_list_unlock();
+
     CHECK_CU(cuMemAddressFree(mv->vbar, (size_t)mv->nr_pages * VBAR_PAGE_SIZE));
     free(mv);
 }
@@ -405,9 +500,16 @@ uint64_t vbar_free_memory(void *vbar, uint64_t size) {
     size_t pages_freed = 0;
 
     log(DEBUG, "%s (start): size=%lldk\n", __func__, (ull)size);
+
+    vbar_list_lock();
     vbars_dirty = true;
 
-    CHECK_CU(cuCtxSynchronize());
+    {
+        CUcontext prev;
+        with_device_ctx(mv->device, &prev);
+        CHECK_CU(cuCtxSynchronize());
+        restore_ctx(prev);
+    }
 
     for (;pages_to_free && mv->watermark > mv->watermark_limit; mv->watermark--) {
         /* In theory we should never have pins here, but
@@ -419,5 +521,6 @@ uint64_t vbar_free_memory(void *vbar, uint64_t size) {
         }
     }
 
+    vbar_list_unlock();
     return (uint64_t)pages_freed * VBAR_PAGE_SIZE;
 }
diff --git a/src/plat.h b/src/plat.h
index 2339ebc..0b1df82 100644
--- a/src/plat.h
+++ b/src/plat.h
@@ -25,6 +25,23 @@ bool cuda_budget_deficit();
 #define SHARED_EXPORT __declspec(dllexport)
 
 #include <BaseTsd.h>
+#include <intrin.h>
+
+/* MSVC C mode: the non-prefixed InterlockedXxx64 names are macros
+ * defined in <winnt.h> (via <windows.h>), but we can't include
+ * <windows.h> here because it #defines ERROR/DEBUG which clash
+ * with our DebugLevels enum.  Use the underscore-prefixed intrinsics
+ * directly and provide our own macro aliases.
+ */
+#ifndef InterlockedExchangeAdd64
+#define InterlockedExchangeAdd64 _InterlockedExchangeAdd64
+#endif
+#ifndef InterlockedOr64
+#define InterlockedOr64 _InterlockedOr64
+#endif
+#ifndef InterlockedCompareExchange
+#define InterlockedCompareExchange _InterlockedCompareExchange
+#endif
 
 typedef SSIZE_T ssize_t;
 
@@ -109,6 +126,21 @@ void log_reset_shots();
 /* The default VRAM headroom. Different deficit methods with BYO headroom */
 #define VRAM_HEADROOM (256 * 1024 * 1024)
 
+#define AIMDO_MAX_DEVICES 16
+
+/* ---- Per-device state (Phase 1) ---- */
+typedef struct {
+    bool inited;
+    uint64_t vram_capacity;
+    CUcontext ctx;
+    uint64_t usage_last_check;
+    ssize_t deficit_sync;
+    uint64_t last_check_tick;
+    const char *prevailing_method;
+} AimdoDeviceState;
+
+extern AimdoDeviceState g_dev[AIMDO_MAX_DEVICES];
+
 /* control.c */
 extern uint64_t vram_capacity;
 extern uint64_t total_vram_usage;
@@ -116,6 +148,73 @@ extern uint64_t total_vram_last_check;
 extern ssize_t deficit_sync;
 extern const char *prevailing_deficit_method;
 
+void ensure_device_init(int device);
+
+/* Per-device VRAM accounting (control.c) — Phase 5: atomic counters */
+extern uint64_t dev_vram_usage[AIMDO_MAX_DEVICES];
+
+#if defined(_WIN32) || defined(_WIN64)
+
+static inline uint64_t dev_vram_load(int device) {
+    return (uint64_t)InterlockedOr64((volatile LONG64 *)&dev_vram_usage[device], 0);
+}
+
+static inline void dev_vram_add(int device, size_t size) {
+    InterlockedExchangeAdd64((volatile LONG64 *)&total_vram_usage, (LONG64)size);
+    if (device >= 0 && device < AIMDO_MAX_DEVICES)
+        InterlockedExchangeAdd64((volatile LONG64 *)&dev_vram_usage[device], (LONG64)size);
+}
+
+static inline void dev_vram_sub(int device, size_t size) {
+    InterlockedExchangeAdd64((volatile LONG64 *)&total_vram_usage, -(LONG64)size);
+    if (device >= 0 && device < AIMDO_MAX_DEVICES)
+        InterlockedExchangeAdd64((volatile LONG64 *)&dev_vram_usage[device], -(LONG64)size);
+}
+
+#else
+
+static inline uint64_t dev_vram_load(int device) {
+    return __atomic_load_n(&dev_vram_usage[device], __ATOMIC_RELAXED);
+}
+
+static inline void dev_vram_add(int device, size_t size) {
+    __atomic_add_fetch(&total_vram_usage, size, __ATOMIC_RELAXED);
+    if (device >= 0 && device < AIMDO_MAX_DEVICES)
+        __atomic_add_fetch(&dev_vram_usage[device], size, __ATOMIC_RELAXED);
+}
+
+static inline void dev_vram_sub(int device, size_t size) {
+    __atomic_sub_fetch(&total_vram_usage, size, __ATOMIC_RELAXED);
+    if (device >= 0 && device < AIMDO_MAX_DEVICES)
+        __atomic_sub_fetch(&dev_vram_usage[device], size, __ATOMIC_RELAXED);
+}
+
+#endif
+
+/* ---- Context save/restore helpers (Phase 2) ---- */
+static inline bool with_device_ctx(int device, CUcontext *prev) {
+    CUcontext cur = NULL;
+    cuCtxGetCurrent(&cur);
+    *prev = cur;
+    if (device >= 0 && device < AIMDO_MAX_DEVICES && g_dev[device].inited) {
+        CUcontext target = g_dev[device].ctx;
+        if (cur != target) {
+            cuCtxSetCurrent(target);
+        }
+        return true;
+    }
+    return (cur != NULL);
+}
+
+static inline void restore_ctx(CUcontext prev) {
+    cuCtxSetCurrent(prev);
+}
+
+/* ---- Per-device hybrid budget (Phase 3) ---- */
+
+/* Poll cuMemGetInfo for a specific device using its context */
+bool poll_budget_deficit_dev(int device);
+
 static inline size_t budget_deficit(size_t size) {
     ssize_t deficit_simple, deficit_delta;
     size_t deficit;
@@ -132,6 +231,27 @@ static inline size_t budget_deficit(size_t size) {
     return deficit;
 }
 
+/* Per-device budget deficit with cuMemGetInfo hybrid backstop (Phase 3). */
+static inline size_t budget_deficit_dev(size_t size, int device) {
+    if (device < 0 || device >= AIMDO_MAX_DEVICES || !g_dev[device].inited)
+        return budget_deficit(size);
+
+    AimdoDeviceState *s = &g_dev[device];
+    uint64_t usage = dev_vram_load(device);
+
+    poll_budget_deficit_dev(device);
+
+    ssize_t deficit_simple = (ssize_t)(usage + VRAM_HEADROOM + size) - (ssize_t)s->vram_capacity;
+    ssize_t deficit_delta = s->deficit_sync + (ssize_t)usage - (ssize_t)s->usage_last_check + size;
+    size_t deficit = (size_t)MAX(MAX(deficit_simple, deficit_delta), (ssize_t)0);
+    if (deficit) {
+        log(DEBUG, "%s: dev=%d %s deficit=%zuM usage=%zuM cap=%zuM size=%zuM\n", __func__,
+            device, deficit_simple > deficit_delta ? "simple" : s->prevailing_method,
+            deficit / M, usage / M, s->vram_capacity / M, size / M);
+    }
+    return deficit;
+}
+
 static inline int check_cu_impl(CUresult res, const char *label) {
     if (res != CUDA_SUCCESS && res != CUDA_ERROR_OUT_OF_MEMORY) {
         const char* desc;
@@ -171,7 +291,7 @@ static inline CUresult three_stooges(CUdeviceptr vaddr, size_t size, int device,
     if (!CHECK_CU(err = cuMemSetAccess(vaddr, size, &accessDesc, 1))) {
         goto fail_access;
     }
-    total_vram_usage += size;
+    dev_vram_add(device, size);
 
     *handle = h;
     return CUDA_SUCCESS;
@@ -186,6 +306,7 @@ static inline CUresult three_stooges(CUdeviceptr vaddr, size_t size, int device,
 
 /* model_vbar.c */
 size_t vbars_free(size_t size);
+size_t vbars_free_dev(size_t size, int device);
 SHARED_EXPORT
 uint64_t vbars_analyze(bool only_dirty);
 
diff --git a/src/pyt-cu-plug-alloc-async.c b/src/pyt-cu-plug-alloc-async.c
index 817144e..f5cb899 100644
--- a/src/pyt-cu-plug-alloc-async.c
+++ b/src/pyt-cu-plug-alloc-async.c
@@ -9,9 +9,19 @@
 typedef struct SizeEntry {
     CUdeviceptr ptr;
     size_t size;
+    int device;
     struct SizeEntry *next;
 } SizeEntry;
 
+static inline int current_cuda_device(void) {
+    CUdevice dev = 0;
+    CUcontext ctx = NULL;
+    if (cuCtxGetCurrent(&ctx) == CUDA_SUCCESS && ctx) {
+        cuCtxGetDevice(&dev);
+    }
+    return (int)dev;
+}
+
 static SizeEntry *size_table[SIZE_HASH_SIZE];
 
 static inline unsigned int size_hash(CUdeviceptr ptr) {
@@ -42,14 +52,16 @@ static inline void st_unlock(void) { pthread_mutex_unlock(&size_table_lock); }
 static inline void account_alloc(CUdeviceptr ptr, size_t size) {
     unsigned int h = size_hash(ptr);
     SizeEntry *entry;
+    int dev = current_cuda_device();
 
     st_lock();
-    total_vram_usage += CUDA_ALIGN_UP(size);
+    dev_vram_add(dev, CUDA_ALIGN_UP(size));
 
     entry = (SizeEntry *)malloc(sizeof(*entry));
     if (entry) {
         entry->ptr = ptr;
         entry->size = size;
+        entry->device = dev;
         entry->next = size_table[h];
         size_table[h] = entry;
     }
@@ -70,7 +82,7 @@ static inline void account_free(CUdeviceptr ptr, CUstream hStream) {
             *prev = entry->next;
 
             log(VVERBOSE, "Freed: ptr=0x%llx, size=%zuk, stream=%p\n", ptr, entry->size / K, hStream);
-            total_vram_usage -= CUDA_ALIGN_UP(entry->size);
+            dev_vram_sub(entry->device, CUDA_ALIGN_UP(entry->size));
 
             st_unlock();
             free(entry);
@@ -88,12 +100,14 @@ int aimdo_cuda_malloc(CUdeviceptr *devPtr, size_t size,
                       CUresult (*true_cuMemAlloc_v2)(CUdeviceptr*, size_t)) {
     CUdeviceptr dptr;
     CUresult status = 0;
+    int dev = current_cuda_device();
 
     if (!devPtr || !true_cuMemAlloc_v2) {
         return 1;
     }
 
-    vbars_free(budget_deficit(size + CUDA_MALLOC_HEADROOM));
+    ensure_device_init(dev);
+    vbars_free_dev(budget_deficit_dev(size + CUDA_MALLOC_HEADROOM, dev), dev);
 
     if (CHECK_CU(true_cuMemAlloc_v2(&dptr, size))) {
         *devPtr = dptr;
@@ -101,7 +115,7 @@ int aimdo_cuda_malloc(CUdeviceptr *devPtr, size_t size,
         return 0;
     }
 
-    vbars_free(size + CUDA_MALLOC_HEADROOM);
+    vbars_free_dev(size + CUDA_MALLOC_HEADROOM, dev);
     status = true_cuMemAlloc_v2(&dptr, size);
     if (CHECK_CU(status)) {
         *devPtr = dptr;
@@ -137,20 +151,22 @@ int aimdo_cuda_malloc_async(CUdeviceptr *devPtr, size_t size, CUstream hStream,
                             CUresult (*true_cuMemAllocAsync)(CUdeviceptr*, size_t, CUstream)) {
     CUdeviceptr dptr;
     CUresult status = 0;
+    int dev = current_cuda_device();
 
-    log(VVERBOSE, "%s (start) size=%zuk stream=%p\n", __func__, size / K, hStream);
+    log(VVERBOSE, "%s (start) size=%zuk stream=%p dev=%d\n", __func__, size / K, hStream, dev);
 
     if (!devPtr) {
         return 1;
     }
 
-    vbars_free(budget_deficit(size));
+    ensure_device_init(dev);
+    vbars_free_dev(budget_deficit_dev(size, dev), dev);
 
     if (CHECK_CU(true_cuMemAllocAsync(&dptr, size, hStream))) {
         *devPtr = dptr;
         goto success;
     }
-    vbars_free(size);
+    vbars_free_dev(size, dev);
     status = true_cuMemAllocAsync(&dptr, size, hStream);
     if (CHECK_CU(status)) {
         *devPtr = dptr;
@@ -192,7 +208,14 @@ static inline void ensure_ctx(void) {
     CUcontext ctx = NULL;
 
     if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS || !ctx) {
-        cuCtxSetCurrent(aimdo_cuda_ctx);
+        /* No context set — try per-device init for device 0, then use its ctx */
+        int dev = 0;
+        ensure_device_init(dev);
+        if (g_dev[dev].inited) {
+            cuCtxSetCurrent(g_dev[dev].ctx);
+        } else {
+            cuCtxSetCurrent(aimdo_cuda_ctx);
+        }
     }
 }
 
diff --git a/src/vrambuf.c b/src/vrambuf.c
index 75deea0..41e33e9 100644
--- a/src/vrambuf.c
+++ b/src/vrambuf.c
@@ -46,7 +46,7 @@ bool vrambuf_grow(void *arg, size_t required_size) {
         grow_to = buf->max_size;
     }
 
-    vbars_free(budget_deficit(grow_to - buf->allocated));
+    vbars_free_dev(budget_deficit_dev(grow_to - buf->allocated, buf->device), buf->device);
     while (buf->allocated < grow_to) {
         size_t to_allocate = grow_to - buf->allocated;
         if (to_allocate > VRAM_CHUNK_SIZE) {
@@ -58,7 +58,7 @@ bool vrambuf_grow(void *arg, size_t required_size) {
                 return false;
             }
             log(DEBUG, "Pytorch allocator attempt exceeds available VRAM ...\n");
-            vbars_free(VRAM_CHUNK_SIZE);
+            vbars_free_dev(VRAM_CHUNK_SIZE, buf->device);
             if ((err = three_stooges(buf->base_ptr + buf->allocated, to_allocate, buf->device, &handle)) != CUDA_SUCCESS) {
                 bool is_oom = err == CUDA_ERROR_OUT_OF_MEMORY;
                 log(is_oom ? INFO : ERROR, "VRAM Allocation failed (%s)\n", is_oom ? "OOM" : "error");
@@ -101,6 +101,6 @@ void vrambuf_destroy(void *arg) {
     }
 
     CHECK_CU(cuMemAddressFree(buf->base_ptr, buf->max_size));
-    total_vram_usage -= buf->allocated;
+    dev_vram_sub(buf->device, buf->allocated);
     free(buf);
 }