diff --git a/MULTIGPU_PLAN.md b/MULTIGPU_PLAN.md new file mode 100644 index 0000000..7ba763c --- /dev/null +++ b/MULTIGPU_PLAN.md @@ -0,0 +1,139 @@ +# comfy-aimdo Multi-GPU Support Plan + +## Problem +aimdo's VRAM management assumes a single GPU. With 2+ GPUs, `budget_deficit()` uses global `total_vram_usage` (sum of ALL GPUs) against one GPU's `vram_capacity`, creating a phantom deficit that triggers constant unnecessary eviction. Result: 2-GPU is **slower** than 1-GPU (79s vs 56s), while no-aimdo 2-GPU runs at 37s. + +## Benchmark Baseline (2×RTX 4090, Qwen-Image 38GB, CFG=7, 20 steps) +| Config | Time | vs 1-GPU | Status | +|--------|------|----------|--------| +| 1-GPU + aimdo | 56.39s | 1.00× | ✅ stable | +| 2-GPU + aimdo (original) | — | — | 💥 segfault | +| 2-GPU + aimdo (mutex only) | 95.39s | 0.59× | ✅ stable | +| 2-GPU + aimdo (dev-aware eviction) | 78.85s | 0.72× | ✅ stable | +| 2-GPU no aimdo | 37.32s | 1.51× | ❌ CUDA errors after ~2 runs | +| **2-GPU + aimdo (Phases 1–5)** | **41.29s** | **1.37×** | ✅ stable (7/7 runs) | + +## Branch +`multigpu-thread-safety` — contains mutex + device-aware eviction (current stable baseline). + +--- + +## Phase 1: Per-Device State Object +**Status:** ✅ Done + +Create `AimdoDeviceState` struct with per-device fields currently stored as globals: +```c +typedef struct { + bool inited; + uint64_t vram_capacity; + CUcontext ctx; + uint64_t usage_last_check; + ssize_t deficit_sync; + uint64_t last_check_tick; + const char *prevailing_method; +} AimdoDeviceState; + +extern AimdoDeviceState g_dev[AIMDO_MAX_DEVICES]; +``` + +Changes: +- **`control.c`**: Define `g_dev[]`. Split `init()` into global-once + per-device init. Add `ensure_device_init(device)` for lazy init. +- **`plat.h`**: Declare the struct and extern. Keep `total_vram_usage` for diagnostics only. +- **`model-vbar.c`**: Call `ensure_device_init(mv->device)` in `vbar_allocate()`. +- **`vbar_allocate()`**: Use `g_dev[device].vram_capacity` instead of global `vram_capacity` for VBAR sizing. + +Notes: +- ComfyUI currently calls `init_device(device_0_index)` once. We don't need to change the Python API — lazy init handles other devices. +- Windows: store WDDM adapter/node per device (future — only if testing on multi-GPU Windows). + +--- + +## Phase 2: Context Safety +**Status:** ✅ Done + +Add save/restore context helpers: +```c +bool with_device_ctx(int device, CUcontext *prev); +void restore_ctx(CUcontext prev); +``` + +Wrap all context-sensitive CUDA calls: +- `cuMemGetInfo` in `poll_budget_deficit` / `cuda_budget_deficit` +- `cuCtxSynchronize` in `vbars_free_locked_dev`, `vbars_free_for_vbar`, `vbar_fault`, `vbar_unpin`, `vbar_free`, `vbar_free_memory` +- Fix Linux `ensure_ctx()` to not override pytorch's context when it's already set for device 1 + +Key principle: **never leave a different context active than what the caller had**. + +--- + +## Phase 3: Per-Device Hybrid Budget +**Status:** ✅ Done + +Replace global `budget_deficit()` in `vbar_fault` with per-device version using `AimdoDeviceState`: +```c +size_t budget_deficit_dev(size_t size, int device) { + AimdoDeviceState *s = &g_dev[device]; + uint64_t usage = dev_vram_usage[device]; + poll_budget_deficit_dev(device); // cuMemGetInfo with correct ctx + ssize_t simple = (ssize_t)(usage + HEADROOM + size) - (ssize_t)s->vram_capacity; + ssize_t delta = s->deficit_sync + (ssize_t)usage - (ssize_t)s->usage_last_check + size; + return (size_t)MAX(MAX(simple, delta), 0); +} +``` + +**Critical**: Must keep the `cuMemGetInfo` backstop (hybrid approach). Pure accounting OOM'd in testing because `cudaFreeAsync` decrements counters before memory is actually reusable. + +Also make `poll_budget_deficit_dev(device)` — calls `cuMemGetInfo` with the correct per-device context (depends on Phase 2). + +--- + +## Phase 4: Allocator Hooks Device-Aware +**Status:** ✅ Done + +In `aimdo_cuda_malloc` / `aimdo_cuda_malloc_async`: +- Determine device via `current_cuda_device()` +- Call `ensure_device_init(dev)` +- Use `vbars_free_dev(budget_deficit_dev(size, dev), dev)` instead of global +- OOM retry path: also evict from same device only + +Expose `vbars_free_dev()` as the device-filtered wrapper (we already have `vbars_free_locked_dev` internally). + +--- + +## Phase 5: Counter Thread Safety +**Status:** ✅ Done + +`dev_vram_add/sub` run under different locks or no lock. Fix with atomics: +```c +static inline void dev_vram_add(int device, size_t size) { + __atomic_add_fetch(&total_vram_usage, size, __ATOMIC_RELAXED); + if (device >= 0 && device < AIMDO_MAX_DEVICES) + __atomic_add_fetch(&dev_vram_usage[device], size, __ATOMIC_RELAXED); +} +``` +Windows equivalent: `InterlockedExchangeAdd64`. + +--- + +## Phase 6 (Future): Per-Device Locking +**Status:** ⬜ Not started — only pursue if performance still >1.3× worse than no-aimdo after Phases 1–5 + +The global `vbar_lock` serializes both GPUs. `cuCtxSynchronize()` is called while holding it, blocking the other GPU's `vbar_fault`. + +Options: +- Per-device lock for device-local operations +- Global lock only for list mutations (insert/remove) +- Avoid `cuCtxSynchronize` on hot path if possible + +--- + +## Implementation Order +Phase 1 → 2 → 3 → 4 → 5 → **benchmark** → decide Phase 6 + +## Expected Outcome +Each GPU sees its own ~20GB usage vs 24GB capacity. Phantom deficit eliminated. Should approach the no-aimdo 37s target while remaining crash-free. + +## Key Learnings from Earlier Attempts +1. **Can't remove `budget_deficit` pre-check entirely** — aimdo consuming too much VRAM causes `cudaErrorLaunchFailure` crash in pytorch's async allocator. +2. **Pure per-device accounting (without cuMemGetInfo backstop) causes OOM** — `cudaFreeAsync` decrements counters before memory is actually freed, so accounting under-reports real usage. +3. **No-aimdo 2-GPU is unstable** — `cudaErrorInvalidValue` after ~2 runs, even with async offload disabled. Aimdo provides stability that pytorch's default offloading doesn't. diff --git a/build-linux-docker b/build-linux-docker index 0ffd8e4..f02eae8 100644 --- a/build-linux-docker +++ b/build-linux-docker @@ -5,6 +5,7 @@ SRCS="src/*.c src-posix/*.c" docker build -t manylinux-cuda -f docker/cuda-on-manylinux.Dockerfile . docker run --rm -v $(pwd):/project -w /project manylinux-cuda \ gcc -shared -o comfy_aimdo/aimdo.so -fPIC -Werror \ + -Isrc \ -I/usr/local/cuda/include \ -L/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/ \ ${SRCS} -lcuda diff --git a/src/control.c b/src/control.c index 1ed6c54..c3075f1 100644 --- a/src/control.c +++ b/src/control.c @@ -4,10 +4,14 @@ uint64_t vram_capacity; uint64_t total_vram_usage; uint64_t total_vram_last_check; +uint64_t dev_vram_usage[AIMDO_MAX_DEVICES]; ssize_t deficit_sync; const char *prevailing_deficit_method; CUcontext aimdo_cuda_ctx; +/* Phase 1: per-device state */ +AimdoDeviceState g_dev[AIMDO_MAX_DEVICES]; + bool cuda_budget_deficit() { uint64_t now = GET_TICK(); static uint64_t last_check = 0; @@ -27,6 +31,117 @@ bool cuda_budget_deficit() { return true; } +/* Phase 3: per-device cuMemGetInfo with context switching (Phase 2) */ +bool poll_budget_deficit_dev(int device) { + if (device < 0 || device >= AIMDO_MAX_DEVICES || !g_dev[device].inited) + return false; + + AimdoDeviceState *s = &g_dev[device]; + uint64_t now = GET_TICK(); + + if (now - s->last_check_tick < 2000) { + return true; + } + + CUcontext prev; + if (!with_device_ctx(device, &prev)) + return false; + + size_t free_vram = 0, total_vram = 0; + bool ok = CHECK_CU(cuMemGetInfo(&free_vram, &total_vram)); + + restore_ctx(prev); + + if (!ok) + return false; + + s->last_check_tick = now; + s->usage_last_check = dev_vram_load(device); + s->deficit_sync = (ssize_t)VRAM_HEADROOM - (ssize_t)free_vram; + s->prevailing_method = "cuMemGetInfo (per-dev)"; + return true; +} + +/* Phase 1: lazy per-device init — thread-safe via init_lock */ +#if defined(_WIN32) || defined(_WIN64) +#include +static CRITICAL_SECTION dev_init_lock; +static volatile LONG dev_init_lock_ready; + +static inline void dev_init_lock_acquire(void) { + if (!InterlockedCompareExchange(&dev_init_lock_ready, 1, 0)) { + InitializeCriticalSection(&dev_init_lock); + InterlockedExchange(&dev_init_lock_ready, 2); + } + while (dev_init_lock_ready != 2) { /* spin until init done */ } + EnterCriticalSection(&dev_init_lock); +} +static inline void dev_init_lock_release(void) { LeaveCriticalSection(&dev_init_lock); } +#else +#include +static pthread_mutex_t dev_init_lock = PTHREAD_MUTEX_INITIALIZER; +static inline void dev_init_lock_acquire(void) { pthread_mutex_lock(&dev_init_lock); } +static inline void dev_init_lock_release(void) { pthread_mutex_unlock(&dev_init_lock); } +#endif + +void ensure_device_init(int device) { + if (device < 0 || device >= AIMDO_MAX_DEVICES) + return; + + AimdoDeviceState *s = &g_dev[device]; + if (s->inited) + return; + + dev_init_lock_acquire(); + + /* Double-check after acquiring lock */ + if (s->inited) { + dev_init_lock_release(); + return; + } + + CUdevice dev; + if (!CHECK_CU(cuDeviceGet(&dev, device))) { + dev_init_lock_release(); + return; + } + + uint64_t cap = 0; + if (!CHECK_CU(cuDeviceTotalMem(&cap, dev))) { + dev_init_lock_release(); + return; + } + + CUcontext ctx = NULL; + if (!CHECK_CU(cuDevicePrimaryCtxRetain(&ctx, dev))) { + dev_init_lock_release(); + return; + } + + s->vram_capacity = cap; + s->ctx = ctx; + s->prevailing_method = "none"; + + /* Write inited last with a store barrier so other threads see + * fully initialized fields before they see inited == true. + */ +#if defined(_WIN32) || defined(_WIN64) + MemoryBarrier(); + s->inited = true; +#else + __atomic_store_n(&s->inited, true, __ATOMIC_RELEASE); +#endif + + dev_init_lock_release(); + + char dev_name[256]; + if (!CHECK_CU(cuDeviceGetName(dev_name, sizeof(dev_name), dev))) + sprintf(dev_name, ""); + + log(INFO, "comfy-aimdo device %d init: %s (VRAM: %zu MB)\n", + device, dev_name, (size_t)(cap / (1024 * 1024))); +} + SHARED_EXPORT void aimdo_analyze() { size_t free_bytes = 0, total_bytes = 0; @@ -37,6 +152,13 @@ void aimdo_analyze() { log(DEBUG, " Aimdo Recorded Usage: %7zu MB\n", total_vram_usage / M); log(DEBUG, " Cuda: %7zu MB / %7zu MB Free\n", free_bytes / M, total_bytes / M); + for (int i = 0; i < AIMDO_MAX_DEVICES; i++) { + if (dev_vram_usage[i]) + log(DEBUG, " Device %d Usage: %7zu MB (cap %zu MB)\n", + i, (size_t)(dev_vram_usage[i] / M), + g_dev[i].inited ? (size_t)(g_dev[i].vram_capacity / M) : 0); + } + vbars_analyze(true); allocations_analyze(); } @@ -65,6 +187,9 @@ bool init(int cuda_device_id) { sprintf(dev_name, ""); } + /* Also populate g_dev for the primary device */ + ensure_device_init(cuda_device_id); + log(INFO, "comfy-aimdo inited for GPU: %s (VRAM: %zu MB)\n", dev_name, (size_t)(vram_capacity / (1024 * 1024))); return true; diff --git a/src/model-vbar.c b/src/model-vbar.c index 26db1c4..d24cd29 100644 --- a/src/model-vbar.c +++ b/src/model-vbar.c @@ -5,6 +5,28 @@ #define VBAR_GET_PAGE_NR(x) ((x) / VBAR_PAGE_SIZE) #define VBAR_GET_PAGE_NR_UP(x) VBAR_GET_PAGE_NR((x) + VBAR_PAGE_SIZE - 1) +/* ---- global lock for the priority linked list and vbars_dirty ---- */ +#if defined(_WIN32) || defined(_WIN64) +#include +static CRITICAL_SECTION vbar_lock; +static volatile LONG vbar_lock_init; + +static inline void vbar_list_lock(void) { + if (!InterlockedCompareExchange(&vbar_lock_init, 1, 0)) { + InitializeCriticalSection(&vbar_lock); + InterlockedExchange(&vbar_lock_init, 2); + } + while (vbar_lock_init != 2) { /* spin until init done */ } + EnterCriticalSection(&vbar_lock); +} +static inline void vbar_list_unlock(void) { LeaveCriticalSection(&vbar_lock); } +#else +#include +static pthread_mutex_t vbar_lock = PTHREAD_MUTEX_INITIALIZER; +static inline void vbar_list_lock(void) { pthread_mutex_lock(&vbar_lock); } +static inline void vbar_list_unlock(void) { pthread_mutex_unlock(&vbar_lock); } +#endif + typedef struct ResidentPage { CUmemGenericAllocationHandle handle; bool pinned; @@ -44,8 +66,10 @@ SHARED_EXPORT uint64_t vbars_analyze(bool only_dirty) { size_t calculated_total_vram = 0; + vbar_list_lock(); one_time_setup(); if (only_dirty && !vbars_dirty) { + vbar_list_unlock(); return 0; } vbars_dirty = false; @@ -83,6 +107,7 @@ uint64_t vbars_analyze(bool only_dirty) { } log(DEBUG, "Total VRAM for VBARs: %zu MB\n", calculated_total_vram / M); + vbar_list_unlock(); return (uint64_t)calculated_total_vram; } @@ -94,7 +119,7 @@ static inline bool mod1(ModelVBAR *mv, size_t page_nr, bool do_free, bool do_unp if (do_free) { CHECK_CU(cuMemUnmap(vaddr, VBAR_PAGE_SIZE)); CHECK_CU(cuMemRelease(rp->handle)); - total_vram_usage -= VBAR_PAGE_SIZE; + dev_vram_sub(mv->device, VBAR_PAGE_SIZE); rp->handle = 0; mv->resident_count--; } @@ -104,9 +129,12 @@ static inline bool mod1(ModelVBAR *mv, size_t page_nr, bool do_free, bool do_unp return do_free; } -size_t vbars_free(size_t size) { +/* Must be called with vbar_list_lock held. + * device_filter: -1 = evict from any device, >= 0 = only evict from that device. + */ +static size_t vbars_free_locked_dev(size_t size, int device_filter) { size_t pages_needed = VBAR_GET_PAGE_NR_UP(size); - bool dirty = false; + bool synced[AIMDO_MAX_DEVICES] = {false}; one_time_setup(); vbars_dirty = true; @@ -117,10 +145,16 @@ size_t vbars_free(size_t size) { for (ModelVBAR *i = lowest_priority.higher; pages_needed && i != &highest_priority; i = i->higher) { + if (device_filter >= 0 && i->device != device_filter) + continue; for (;pages_needed && i->watermark > i->watermark_limit; i->watermark--) { - if (!dirty) { + int dev = i->device; + if (dev >= 0 && dev < AIMDO_MAX_DEVICES && !synced[dev]) { + CUcontext prev; + with_device_ctx(dev, &prev); CHECK_CU(cuCtxSynchronize()); - dirty = true; + restore_ctx(prev); + synced[dev] = true; } if (mod1(i, i->watermark - 1, true, false)) { pages_needed--; @@ -131,6 +165,24 @@ size_t vbars_free(size_t size) { return pages_needed; } +size_t vbars_free(size_t size) { + size_t ret; + + vbar_list_lock(); + ret = vbars_free_locked_dev(size, -1); + vbar_list_unlock(); + return ret; +} + +size_t vbars_free_dev(size_t size, int device) { + size_t ret; + + vbar_list_lock(); + ret = vbars_free_locked_dev(size, device); + vbar_list_unlock(); + return ret; +} + static inline size_t move_cursor_to_absent(ModelVBAR *mv, size_t cursor) { while (cursor < mv->watermark && mv->residency_map[cursor].handle) { cursor++; @@ -140,12 +192,18 @@ static inline size_t move_cursor_to_absent(ModelVBAR *mv, size_t cursor) { static void vbars_free_for_vbar(ModelVBAR *mv, size_t target) { size_t cursor = move_cursor_to_absent(mv, 0); + int device_filter = mv->device; + CUcontext prev; + with_device_ctx(device_filter, &prev); CHECK_CU(cuCtxSynchronize()); + restore_ctx(prev); for (ModelVBAR *i = lowest_priority.higher; cursor < target && cursor < mv->watermark && i != &highest_priority; i = i->higher) { + if (i->device != device_filter) + continue; for (; cursor < target && cursor < mv->watermark && i->watermark > i->watermark_limit; i->watermark--) { if (mod1(i, i->watermark - 1, true, false)) { @@ -178,13 +236,18 @@ SHARED_EXPORT void *vbar_allocate(uint64_t size, int device) { ModelVBAR *mv; - one_time_setup(); log_reset_shots(); log(DEBUG, "%s (start): size=%zuM, device=%d\n", __func__, size / M, device); - vbars_dirty = true; + + /* Phase 1: lazy init for this device */ + ensure_device_init(device); + + /* Use per-device capacity if available, else global */ + uint64_t cap = (device >= 0 && device < AIMDO_MAX_DEVICES && g_dev[device].inited) + ? g_dev[device].vram_capacity : vram_capacity; size_t nr_pages = VBAR_GET_PAGE_NR_UP(size); - size_t nr_pages_max = VBAR_GET_PAGE_NR(vram_capacity); + size_t nr_pages_max = VBAR_GET_PAGE_NR(cap); if (nr_pages_max < nr_pages) { nr_pages = nr_pages_max; } @@ -204,8 +267,12 @@ void *vbar_allocate(uint64_t size, int device) { mv->device = device; mv->nr_pages = mv->watermark = nr_pages; - + + vbar_list_lock(); + one_time_setup(); + vbars_dirty = true; insert_vbar(mv); + vbar_list_unlock(); log(DEBUG, "%s (return): vbar=%p\n", __func__, (void *)mv); return mv; @@ -216,17 +283,21 @@ void vbar_set_watermark_limit(void *vbar, uint64_t size) { ModelVBAR *mv = (ModelVBAR *)vbar; log(DEBUG, "%s: size=%zu\n", __func__, size); + vbar_list_lock(); mv->watermark_limit = VBAR_GET_PAGE_NR_UP(size); + vbar_list_unlock(); } SHARED_EXPORT void vbars_reset_watermark_limits() { + vbar_list_lock(); one_time_setup(); log(VERBOSE, "%s\n", __func__); for (ModelVBAR *i = lowest_priority.higher; i && i != &highest_priority; i = i->higher) { i->watermark_limit = 0; } + vbar_list_unlock(); } SHARED_EXPORT @@ -234,14 +305,15 @@ void vbar_prioritize(void *vbar) { ModelVBAR *mv = (ModelVBAR *)vbar; log(DEBUG, "%s vbar=%p\n", __func__, vbar); - vbars_dirty = true; log_reset_shots(); + vbar_list_lock(); + vbars_dirty = true; remove_vbar(mv); insert_vbar(mv); - mv->watermark = mv->nr_pages; + vbar_list_unlock(); } SHARED_EXPORT @@ -249,12 +321,14 @@ void vbar_deprioritize(void *vbar) { ModelVBAR *mv = (ModelVBAR *)vbar; log(DEBUG, "%s vbar=%p\n", __func__, vbar); - vbars_dirty = true; log_reset_shots(); + vbar_list_lock(); + vbars_dirty = true; remove_vbar(mv); insert_vbar_last(mv); + vbar_list_unlock(); } SHARED_EXPORT @@ -276,16 +350,18 @@ int vbar_fault(void *vbar, uint64_t offset, uint64_t size, uint32_t *signature) size_t page_end = VBAR_GET_PAGE_NR_UP(offset + size); log(VVERBOSE, "%s (start): offset=%lldk, size=%lldk\n", __func__, (ull)(offset / K), (ull)(size / K)); + + vbar_list_lock(); vbars_dirty = true; - /* Stopgap. If the we get a bad shared memory spike, collect it here on the next layer - * as the allocator is unreliable as it may not actually be called reliably when you - * really need to know you have spilled. + /* Stopgap: use per-device budget to avoid phantom cross-GPU deficit. + * Only evict pages from the same device to avoid cross-GPU thrashing. */ - vbars_free(budget_deficit(0)); + vbars_free_locked_dev(budget_deficit_dev(0, mv->device), mv->device); if (page_end > mv->watermark) { log(VVERBOSE, "VBAR Allocation is above watermark\n"); + vbar_list_unlock(); return VBAR_FAULT_OOM; } @@ -301,20 +377,23 @@ int vbar_fault(void *vbar, uint64_t offset, uint64_t size, uint32_t *signature) log(VERBOSE, "VBAR needs to allocate VRAM for page %d\n", (int)page_nr); - if (budget_deficit(VBAR_PAGE_SIZE) || + if (budget_deficit_dev(VBAR_PAGE_SIZE, mv->device) || (err = three_stooges(vaddr, VBAR_PAGE_SIZE, mv->device, &rp->handle)) != CUDA_SUCCESS) { if (err != CUDA_ERROR_OUT_OF_MEMORY) { log(ERROR, "VRAM Allocation failed (non OOM)\n"); + vbar_list_unlock(); return VBAR_FAULT_ERROR; } log(DEBUG, "VBAR allocator attempt exceeds available VRAM ...\n"); vbars_free_for_vbar(mv, page_end); if (page_nr >= mv->watermark) { log(DEBUG, "VBAR allocation cancelled due to watermark reduction\n"); + vbar_list_unlock(); return VBAR_FAULT_OOM; } if ((err = three_stooges(vaddr, VBAR_PAGE_SIZE, mv->device, &rp->handle)) != CUDA_SUCCESS) { log(ERROR, "VRAM Allocation failed\n"); + vbar_list_unlock(); return VBAR_FAULT_ERROR; } } @@ -330,6 +409,7 @@ int vbar_fault(void *vbar, uint64_t offset, uint64_t size, uint32_t *signature) rp->pinned = true; } + vbar_list_unlock(); log(VVERBOSE, "%s (return) %d\n", __func__, ret); return ret; } @@ -339,16 +419,22 @@ void vbar_unpin(void *vbar, uint64_t offset, uint64_t size) { ModelVBAR *mv = (ModelVBAR *)vbar; log(VVERBOSE, "%s (start): offset=%lldk, size=%lldk\n", __func__, (ull)(offset / K), (ull)(size / K)); + + vbar_list_lock(); vbars_dirty = true; size_t page_end = VBAR_GET_PAGE_NR_UP(offset + size); if (page_end > mv->watermark) { + CUcontext prev; + with_device_ctx(mv->device, &prev); CHECK_CU(cuCtxSynchronize()); + restore_ctx(prev); } for (uint64_t page_nr = VBAR_GET_PAGE_NR(offset); page_nr < page_end && page_nr < mv->nr_pages; page_nr++) { mod1(mv, page_nr, page_nr >= mv->watermark, true); } + vbar_list_unlock(); } SHARED_EXPORT @@ -356,14 +442,23 @@ void vbar_free(void *vbar) { ModelVBAR *mv = (ModelVBAR *)vbar; log(DEBUG, "%s: vbar=%p\n", __func__, vbar); + + vbar_list_lock(); vbars_dirty = true; - CHECK_CU(cuCtxSynchronize()); + { + CUcontext prev; + with_device_ctx(mv->device, &prev); + CHECK_CU(cuCtxSynchronize()); + restore_ctx(prev); + } for (uint64_t page_nr = 0; page_nr < mv->nr_pages; page_nr++) { mod1(mv, page_nr, true, true); } remove_vbar(mv); + vbar_list_unlock(); + CHECK_CU(cuMemAddressFree(mv->vbar, (size_t)mv->nr_pages * VBAR_PAGE_SIZE)); free(mv); } @@ -405,9 +500,16 @@ uint64_t vbar_free_memory(void *vbar, uint64_t size) { size_t pages_freed = 0; log(DEBUG, "%s (start): size=%lldk\n", __func__, (ull)size); + + vbar_list_lock(); vbars_dirty = true; - CHECK_CU(cuCtxSynchronize()); + { + CUcontext prev; + with_device_ctx(mv->device, &prev); + CHECK_CU(cuCtxSynchronize()); + restore_ctx(prev); + } for (;pages_to_free && mv->watermark > mv->watermark_limit; mv->watermark--) { /* In theory we should never have pins here, but @@ -419,5 +521,6 @@ uint64_t vbar_free_memory(void *vbar, uint64_t size) { } } + vbar_list_unlock(); return (uint64_t)pages_freed * VBAR_PAGE_SIZE; } diff --git a/src/plat.h b/src/plat.h index 2339ebc..0b1df82 100644 --- a/src/plat.h +++ b/src/plat.h @@ -25,6 +25,23 @@ bool cuda_budget_deficit(); #define SHARED_EXPORT __declspec(dllexport) #include +#include + +/* MSVC C mode: the non-prefixed InterlockedXxx64 names are macros + * defined in (via ), but we can't include + * here because it #defines ERROR/DEBUG which clash + * with our DebugLevels enum. Use the underscore-prefixed intrinsics + * directly and provide our own macro aliases. + */ +#ifndef InterlockedExchangeAdd64 +#define InterlockedExchangeAdd64 _InterlockedExchangeAdd64 +#endif +#ifndef InterlockedOr64 +#define InterlockedOr64 _InterlockedOr64 +#endif +#ifndef InterlockedCompareExchange +#define InterlockedCompareExchange _InterlockedCompareExchange +#endif typedef SSIZE_T ssize_t; @@ -109,6 +126,21 @@ void log_reset_shots(); /* The default VRAM headroom. Different deficit methods with BYO headroom */ #define VRAM_HEADROOM (256 * 1024 * 1024) +#define AIMDO_MAX_DEVICES 16 + +/* ---- Per-device state (Phase 1) ---- */ +typedef struct { + bool inited; + uint64_t vram_capacity; + CUcontext ctx; + uint64_t usage_last_check; + ssize_t deficit_sync; + uint64_t last_check_tick; + const char *prevailing_method; +} AimdoDeviceState; + +extern AimdoDeviceState g_dev[AIMDO_MAX_DEVICES]; + /* control.c */ extern uint64_t vram_capacity; extern uint64_t total_vram_usage; @@ -116,6 +148,73 @@ extern uint64_t total_vram_last_check; extern ssize_t deficit_sync; extern const char *prevailing_deficit_method; +void ensure_device_init(int device); + +/* Per-device VRAM accounting (control.c) — Phase 5: atomic counters */ +extern uint64_t dev_vram_usage[AIMDO_MAX_DEVICES]; + +#if defined(_WIN32) || defined(_WIN64) + +static inline uint64_t dev_vram_load(int device) { + return (uint64_t)InterlockedOr64((volatile LONG64 *)&dev_vram_usage[device], 0); +} + +static inline void dev_vram_add(int device, size_t size) { + InterlockedExchangeAdd64((volatile LONG64 *)&total_vram_usage, (LONG64)size); + if (device >= 0 && device < AIMDO_MAX_DEVICES) + InterlockedExchangeAdd64((volatile LONG64 *)&dev_vram_usage[device], (LONG64)size); +} + +static inline void dev_vram_sub(int device, size_t size) { + InterlockedExchangeAdd64((volatile LONG64 *)&total_vram_usage, -(LONG64)size); + if (device >= 0 && device < AIMDO_MAX_DEVICES) + InterlockedExchangeAdd64((volatile LONG64 *)&dev_vram_usage[device], -(LONG64)size); +} + +#else + +static inline uint64_t dev_vram_load(int device) { + return __atomic_load_n(&dev_vram_usage[device], __ATOMIC_RELAXED); +} + +static inline void dev_vram_add(int device, size_t size) { + __atomic_add_fetch(&total_vram_usage, size, __ATOMIC_RELAXED); + if (device >= 0 && device < AIMDO_MAX_DEVICES) + __atomic_add_fetch(&dev_vram_usage[device], size, __ATOMIC_RELAXED); +} + +static inline void dev_vram_sub(int device, size_t size) { + __atomic_sub_fetch(&total_vram_usage, size, __ATOMIC_RELAXED); + if (device >= 0 && device < AIMDO_MAX_DEVICES) + __atomic_sub_fetch(&dev_vram_usage[device], size, __ATOMIC_RELAXED); +} + +#endif + +/* ---- Context save/restore helpers (Phase 2) ---- */ +static inline bool with_device_ctx(int device, CUcontext *prev) { + CUcontext cur = NULL; + cuCtxGetCurrent(&cur); + *prev = cur; + if (device >= 0 && device < AIMDO_MAX_DEVICES && g_dev[device].inited) { + CUcontext target = g_dev[device].ctx; + if (cur != target) { + cuCtxSetCurrent(target); + } + return true; + } + return (cur != NULL); +} + +static inline void restore_ctx(CUcontext prev) { + cuCtxSetCurrent(prev); +} + +/* ---- Per-device hybrid budget (Phase 3) ---- */ + +/* Poll cuMemGetInfo for a specific device using its context */ +bool poll_budget_deficit_dev(int device); + static inline size_t budget_deficit(size_t size) { ssize_t deficit_simple, deficit_delta; size_t deficit; @@ -132,6 +231,27 @@ static inline size_t budget_deficit(size_t size) { return deficit; } +/* Per-device budget deficit with cuMemGetInfo hybrid backstop (Phase 3). */ +static inline size_t budget_deficit_dev(size_t size, int device) { + if (device < 0 || device >= AIMDO_MAX_DEVICES || !g_dev[device].inited) + return budget_deficit(size); + + AimdoDeviceState *s = &g_dev[device]; + uint64_t usage = dev_vram_load(device); + + poll_budget_deficit_dev(device); + + ssize_t deficit_simple = (ssize_t)(usage + VRAM_HEADROOM + size) - (ssize_t)s->vram_capacity; + ssize_t deficit_delta = s->deficit_sync + (ssize_t)usage - (ssize_t)s->usage_last_check + size; + size_t deficit = (size_t)MAX(MAX(deficit_simple, deficit_delta), (ssize_t)0); + if (deficit) { + log(DEBUG, "%s: dev=%d %s deficit=%zuM usage=%zuM cap=%zuM size=%zuM\n", __func__, + device, deficit_simple > deficit_delta ? "simple" : s->prevailing_method, + deficit / M, usage / M, s->vram_capacity / M, size / M); + } + return deficit; +} + static inline int check_cu_impl(CUresult res, const char *label) { if (res != CUDA_SUCCESS && res != CUDA_ERROR_OUT_OF_MEMORY) { const char* desc; @@ -171,7 +291,7 @@ static inline CUresult three_stooges(CUdeviceptr vaddr, size_t size, int device, if (!CHECK_CU(err = cuMemSetAccess(vaddr, size, &accessDesc, 1))) { goto fail_access; } - total_vram_usage += size; + dev_vram_add(device, size); *handle = h; return CUDA_SUCCESS; @@ -186,6 +306,7 @@ static inline CUresult three_stooges(CUdeviceptr vaddr, size_t size, int device, /* model_vbar.c */ size_t vbars_free(size_t size); +size_t vbars_free_dev(size_t size, int device); SHARED_EXPORT uint64_t vbars_analyze(bool only_dirty); diff --git a/src/pyt-cu-plug-alloc-async.c b/src/pyt-cu-plug-alloc-async.c index 817144e..f5cb899 100644 --- a/src/pyt-cu-plug-alloc-async.c +++ b/src/pyt-cu-plug-alloc-async.c @@ -9,9 +9,19 @@ typedef struct SizeEntry { CUdeviceptr ptr; size_t size; + int device; struct SizeEntry *next; } SizeEntry; +static inline int current_cuda_device(void) { + CUdevice dev = 0; + CUcontext ctx = NULL; + if (cuCtxGetCurrent(&ctx) == CUDA_SUCCESS && ctx) { + cuCtxGetDevice(&dev); + } + return (int)dev; +} + static SizeEntry *size_table[SIZE_HASH_SIZE]; static inline unsigned int size_hash(CUdeviceptr ptr) { @@ -42,14 +52,16 @@ static inline void st_unlock(void) { pthread_mutex_unlock(&size_table_lock); } static inline void account_alloc(CUdeviceptr ptr, size_t size) { unsigned int h = size_hash(ptr); SizeEntry *entry; + int dev = current_cuda_device(); st_lock(); - total_vram_usage += CUDA_ALIGN_UP(size); + dev_vram_add(dev, CUDA_ALIGN_UP(size)); entry = (SizeEntry *)malloc(sizeof(*entry)); if (entry) { entry->ptr = ptr; entry->size = size; + entry->device = dev; entry->next = size_table[h]; size_table[h] = entry; } @@ -70,7 +82,7 @@ static inline void account_free(CUdeviceptr ptr, CUstream hStream) { *prev = entry->next; log(VVERBOSE, "Freed: ptr=0x%llx, size=%zuk, stream=%p\n", ptr, entry->size / K, hStream); - total_vram_usage -= CUDA_ALIGN_UP(entry->size); + dev_vram_sub(entry->device, CUDA_ALIGN_UP(entry->size)); st_unlock(); free(entry); @@ -88,12 +100,14 @@ int aimdo_cuda_malloc(CUdeviceptr *devPtr, size_t size, CUresult (*true_cuMemAlloc_v2)(CUdeviceptr*, size_t)) { CUdeviceptr dptr; CUresult status = 0; + int dev = current_cuda_device(); if (!devPtr || !true_cuMemAlloc_v2) { return 1; } - vbars_free(budget_deficit(size + CUDA_MALLOC_HEADROOM)); + ensure_device_init(dev); + vbars_free_dev(budget_deficit_dev(size + CUDA_MALLOC_HEADROOM, dev), dev); if (CHECK_CU(true_cuMemAlloc_v2(&dptr, size))) { *devPtr = dptr; @@ -101,7 +115,7 @@ int aimdo_cuda_malloc(CUdeviceptr *devPtr, size_t size, return 0; } - vbars_free(size + CUDA_MALLOC_HEADROOM); + vbars_free_dev(size + CUDA_MALLOC_HEADROOM, dev); status = true_cuMemAlloc_v2(&dptr, size); if (CHECK_CU(status)) { *devPtr = dptr; @@ -137,20 +151,22 @@ int aimdo_cuda_malloc_async(CUdeviceptr *devPtr, size_t size, CUstream hStream, CUresult (*true_cuMemAllocAsync)(CUdeviceptr*, size_t, CUstream)) { CUdeviceptr dptr; CUresult status = 0; + int dev = current_cuda_device(); - log(VVERBOSE, "%s (start) size=%zuk stream=%p\n", __func__, size / K, hStream); + log(VVERBOSE, "%s (start) size=%zuk stream=%p dev=%d\n", __func__, size / K, hStream, dev); if (!devPtr) { return 1; } - vbars_free(budget_deficit(size)); + ensure_device_init(dev); + vbars_free_dev(budget_deficit_dev(size, dev), dev); if (CHECK_CU(true_cuMemAllocAsync(&dptr, size, hStream))) { *devPtr = dptr; goto success; } - vbars_free(size); + vbars_free_dev(size, dev); status = true_cuMemAllocAsync(&dptr, size, hStream); if (CHECK_CU(status)) { *devPtr = dptr; @@ -192,7 +208,14 @@ static inline void ensure_ctx(void) { CUcontext ctx = NULL; if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS || !ctx) { - cuCtxSetCurrent(aimdo_cuda_ctx); + /* No context set — try per-device init for device 0, then use its ctx */ + int dev = 0; + ensure_device_init(dev); + if (g_dev[dev].inited) { + cuCtxSetCurrent(g_dev[dev].ctx); + } else { + cuCtxSetCurrent(aimdo_cuda_ctx); + } } } diff --git a/src/vrambuf.c b/src/vrambuf.c index 75deea0..41e33e9 100644 --- a/src/vrambuf.c +++ b/src/vrambuf.c @@ -46,7 +46,7 @@ bool vrambuf_grow(void *arg, size_t required_size) { grow_to = buf->max_size; } - vbars_free(budget_deficit(grow_to - buf->allocated)); + vbars_free_dev(budget_deficit_dev(grow_to - buf->allocated, buf->device), buf->device); while (buf->allocated < grow_to) { size_t to_allocate = grow_to - buf->allocated; if (to_allocate > VRAM_CHUNK_SIZE) { @@ -58,7 +58,7 @@ bool vrambuf_grow(void *arg, size_t required_size) { return false; } log(DEBUG, "Pytorch allocator attempt exceeds available VRAM ...\n"); - vbars_free(VRAM_CHUNK_SIZE); + vbars_free_dev(VRAM_CHUNK_SIZE, buf->device); if ((err = three_stooges(buf->base_ptr + buf->allocated, to_allocate, buf->device, &handle)) != CUDA_SUCCESS) { bool is_oom = err == CUDA_ERROR_OUT_OF_MEMORY; log(is_oom ? INFO : ERROR, "VRAM Allocation failed (%s)\n", is_oom ? "OOM" : "error"); @@ -101,6 +101,6 @@ void vrambuf_destroy(void *arg) { } CHECK_CU(cuMemAddressFree(buf->base_ptr, buf->max_size)); - total_vram_usage -= buf->allocated; + dev_vram_sub(buf->device, buf->allocated); free(buf); }