139 changes: 139 additions & 0 deletions MULTIGPU_PLAN.md
@@ -0,0 +1,139 @@
# comfy-aimdo Multi-GPU Support Plan

## Problem
aimdo's VRAM management assumes a single GPU. With 2+ GPUs, `budget_deficit()` uses global `total_vram_usage` (sum of ALL GPUs) against one GPU's `vram_capacity`, creating a phantom deficit that triggers constant unnecessary eviction. Result: 2-GPU is **slower** than 1-GPU (79s vs 56s), while no-aimdo 2-GPU runs at 37s.
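
To make the phantom deficit concrete, here is a minimal sketch of the arithmetic. It uses the round figures quoted in the Expected Outcome section (about 20 GB used per GPU, 24 GB capacity per GPU). The helper name `deficit_bytes` and the omission of headroom are assumptions for illustration, not aimdo's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)

/* Hypothetical helper: bytes that must be evicted so that
 * usage + request fits in capacity. Returns 0 when it already fits. */
uint64_t deficit_bytes(uint64_t usage, uint64_t request, uint64_t capacity) {
    uint64_t needed = usage + request;
    return needed > capacity ? needed - capacity : 0;
}

/* Single-GPU assumption: usage summed over both GPUs vs. one GPU's capacity. */
uint64_t global_view_deficit(void) {
    uint64_t total_usage = 20 * GIB + 20 * GIB;
    return deficit_bytes(total_usage, 0, 24 * GIB);
}

/* Per-device view: each GPU's own usage vs. its own capacity. */
uint64_t per_device_deficit(void) {
    return deficit_bytes(20 * GIB, 0, 24 * GIB);
}
```

Under these assumptions the global view reports a 16 GiB deficit even though neither GPU is over capacity, while the per-device view reports none.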

## Benchmark Baseline (2×RTX 4090, Qwen-Image 38GB, CFG=7, 20 steps)
| Config | Time | vs 1-GPU | Status |
|--------|------|----------|--------|
| 1-GPU + aimdo | 56.39s | 1.00× | ✅ stable |
| 2-GPU + aimdo (original) | — | — | 💥 segfault |
| 2-GPU + aimdo (mutex only) | 95.39s | 0.59× | ✅ stable |
| 2-GPU + aimdo (dev-aware eviction) | 78.85s | 0.72× | ✅ stable |
| 2-GPU no aimdo | 37.32s | 1.51× | ❌ CUDA errors after ~2 runs |
| **2-GPU + aimdo (Phases 1–5)** | **41.29s** | **1.37×** | ✅ stable (7/7 runs) |

## Branch
`multigpu-thread-safety` — contains mutex + device-aware eviction (current stable baseline).

---

## Phase 1: Per-Device State Object
**Status:** ✅ Done

Create `AimdoDeviceState` struct with per-device fields currently stored as globals:
```c
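typedef struct {
    bool        inited;             /* written last, once the fields below are valid */
    uint64_t    vram_capacity;      /* total VRAM on this device, in bytes */
    CUcontext   ctx;                /* primary context retained for this device */
    uint64_t    usage_last_check;   /* usage counter at the last cuMemGetInfo poll */
    ssize_t     deficit_sync;       /* deficit measured at the last poll */
    uint64_t    last_check_tick;    /* tick of the last poll */
    const char *prevailing_method;  /* deficit method last used (diagnostics) */
} AimdoDeviceState;

extern AimdoDeviceState g_dev[AIMDO_MAX_DEVICES];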
```

Changes:
- **`control.c`**: Define `g_dev[]`. Split `init()` into global-once + per-device init. Add `ensure_device_init(device)` for lazy init.
- **`plat.h`**: Declare the struct and extern. Keep `total_vram_usage` for diagnostics only.
- **`model-vbar.c`**: Call `ensure_device_init(mv->device)` in `vbar_allocate()`.
- **`vbar_allocate()`**: Use `g_dev[device].vram_capacity` instead of global `vram_capacity` for VBAR sizing.

Notes:
- ComfyUI currently calls `init_device(device_0_index)` once. We don't need to change the Python API — lazy init handles other devices.
- Windows: store WDDM adapter/node per device (future — only if testing on multi-GPU Windows).
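
The lazy-init idea can be sketched without CUDA as a double-checked lock. This is a simplified illustration, not the code in `control.c`: `DeviceState`, `query_capacity`, `ensure_init`, and the value of `MAX_DEVICES` are hypothetical stand-ins, and the capacity query is faked so the block compiles on its own. It assumes GCC or Clang for the `__atomic` builtins.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_DEVICES 8   /* stand-in for AIMDO_MAX_DEVICES (value assumed) */

typedef struct {
    bool     inited;
    uint64_t vram_capacity;
} DeviceState;          /* trimmed-down stand-in for AimdoDeviceState */

DeviceState dev_state[MAX_DEVICES];
pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
int query_calls;        /* counts how often the fake "driver" was queried */

/* Hypothetical stand-in for cuDeviceTotalMem(): every GPU reports 24 GiB. */
uint64_t query_capacity(int device) {
    (void)device;
    query_calls++;
    return 24ULL << 30;
}

/* Returns true if the device is initialized when the call returns. */
bool ensure_init(int device) {
    if (device < 0 || device >= MAX_DEVICES)
        return false;
    DeviceState *s = &dev_state[device];

    /* Fast path: the acquire-load pairs with the release-store below. */
    if (__atomic_load_n(&s->inited, __ATOMIC_ACQUIRE))
        return true;

    pthread_mutex_lock(&init_lock);
    if (!s->inited) {                       /* re-check under the lock */
        s->vram_capacity = query_capacity(device);
        __atomic_store_n(&s->inited, true, __ATOMIC_RELEASE);
    }
    pthread_mutex_unlock(&init_lock);
    return true;
}
```

Calling `ensure_init(1)` twice queries the fake driver once; other devices stay uninitialized until something first touches them, which is why the Python API does not need to change.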

---

## Phase 2: Context Safety
**Status:** ✅ Done

Add save/restore context helpers:
```c
bool with_device_ctx(int device, CUcontext *prev);
void restore_ctx(CUcontext prev);
```

Wrap all context-sensitive CUDA calls:
- `cuMemGetInfo` in `poll_budget_deficit` / `cuda_budget_deficit`
- `cuCtxSynchronize` in `vbars_free_locked_dev`, `vbars_free_for_vbar`, `vbar_fault`, `vbar_unpin`, `vbar_free`, `vbar_free_memory`
- Fix Linux `ensure_ctx()` so it does not override PyTorch's context when that context is already set for device 1

Key principle: **never leave a different context active than what the caller had**.
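
A minimal sketch of the save/restore pattern follows. It assumes the real helpers are built on `cuCtxGetCurrent` and `cuCtxSetCurrent`. To keep the block compilable without CUDA, those calls are replaced by the hypothetical stand-ins `ctx_get_current` and `ctx_set_current`, and `FakeCtx` stands in for `CUcontext`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct FakeCtx { int device; } FakeCtx;   /* stand-in for CUcontext */

#define MAX_DEVICES 2
FakeCtx  device_ctx[MAX_DEVICES] = { {0}, {1} };  /* one context per device */
FakeCtx *current_ctx = NULL;                      /* the thread's "current" context */

/* Stand-ins for cuCtxGetCurrent() / cuCtxSetCurrent(). */
FakeCtx *ctx_get_current(void)         { return current_ctx; }
void     ctx_set_current(FakeCtx *ctx) { current_ctx = ctx; }

/* Switch to `device`'s context and report the caller's context via *prev. */
bool with_device_ctx(int device, FakeCtx **prev) {
    if (device < 0 || device >= MAX_DEVICES)
        return false;
    *prev = ctx_get_current();
    ctx_set_current(&device_ctx[device]);
    return true;
}

/* Put back whatever the caller had, including "no context" (NULL). */
void restore_ctx(FakeCtx *prev) {
    ctx_set_current(prev);
}

/* Example of a wrapped call: runs on `device` without disturbing the caller. */
int query_on_device(int device) {
    FakeCtx *prev;
    if (!with_device_ctx(device, &prev))
        return -1;
    int seen = ctx_get_current()->device;   /* the context-sensitive work */
    restore_ctx(prev);
    return seen;
}
```

If the caller has device 0's context current and calls `query_on_device(1)`, the work runs against device 1 and the caller still sees device 0's context afterwards.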

---

## Phase 3: Per-Device Hybrid Budget
**Status:** ✅ Done

Replace global `budget_deficit()` in `vbar_fault` with per-device version using `AimdoDeviceState`:
```c
size_t budget_deficit_dev(size_t size, int device) {
    AimdoDeviceState *s = &g_dev[device];
    uint64_t usage = dev_vram_usage[device];
    poll_budget_deficit_dev(device);  // cuMemGetInfo with correct ctx
    // Accounting-only estimate: counters plus headroom vs. this device's capacity.
    ssize_t simple = (ssize_t)(usage + VRAM_HEADROOM + size) - (ssize_t)s->vram_capacity;
    // Driver-anchored estimate: deficit at the last poll plus usage growth since then.
    ssize_t delta = s->deficit_sync + (ssize_t)usage - (ssize_t)s->usage_last_check + (ssize_t)size;
    return (size_t)MAX(MAX(simple, delta), 0);
}
```

**Critical**: Must keep the `cuMemGetInfo` backstop (hybrid approach). Pure accounting OOM'd in testing because `cudaFreeAsync` decrements counters before memory is actually reusable.

Also add `poll_budget_deficit_dev(device)`, which calls `cuMemGetInfo` with the correct per-device context (depends on Phase 2).
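
A worked case shows why the backstop matters. The sketch below reimplements the two estimates from `budget_deficit_dev` as a pure function over a hypothetical `BudgetInputs` snapshot. All numbers, including the 512 MiB headroom, are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024LL * 1024LL)

/* Hypothetical snapshot of the per-device inputs, all in bytes. */
typedef struct {
    int64_t capacity;          /* total VRAM on the device */
    int64_t usage;             /* aimdo's current counter */
    int64_t usage_last_check;  /* counter value at the last cuMemGetInfo poll */
    int64_t deficit_sync;      /* headroom minus free VRAM at the last poll */
    int64_t headroom;          /* safety margin (value assumed) */
} BudgetInputs;

int64_t max_i64(int64_t a, int64_t b) { return a > b ? a : b; }

/* Larger of the accounting-only estimate and the driver-anchored estimate. */
int64_t hybrid_deficit(const BudgetInputs *in, int64_t size) {
    int64_t simple = in->usage + in->headroom + size - in->capacity;
    int64_t delta  = in->deficit_sync + (in->usage - in->usage_last_check) + size;
    return max_i64(max_i64(simple, delta), 0);
}

/* Worked case: the counters say there is room, the last poll says otherwise. */
int64_t stale_accounting_example(void) {
    BudgetInputs in = {
        .capacity         = 24576 * MIB,
        .usage            = 20000 * MIB,
        .usage_last_check = 19500 * MIB,
        .deficit_sync     = 256 * MIB,   /* 512 MiB headroom - 256 MiB free */
        .headroom         = 512 * MIB,
    };
    /* simple = 20000 + 512 + 1024 - 24576 = -3040 MiB (no deficit)
     * delta  = 256 + (20000 - 19500) + 1024 = 1780 MiB */
    return hybrid_deficit(&in, 1024 * MIB);
}
```

Here the counters alone would allow the 1 GiB request, but the driver reported only 256 MiB free at the last poll, so the hybrid result asks for 1780 MiB of eviction first.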

---

## Phase 4: Allocator Hooks Device-Aware
**Status:** ✅ Done

In `aimdo_cuda_malloc` / `aimdo_cuda_malloc_async`:
- Determine device via `current_cuda_device()`
- Call `ensure_device_init(dev)`
- Use `vbars_free_dev(budget_deficit_dev(size, dev), dev)` instead of global
- OOM retry path: also evict from same device only

Expose `vbars_free_dev()` as the device-filtered wrapper (we already have `vbars_free_locked_dev` internally).
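
The device filter can be illustrated with a toy eviction loop. The `Block` record and `free_on_device` below are hypothetical; aimdo's real VBAR bookkeeping and eviction order are not shown in this plan and will differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resident-block record. */
typedef struct {
    int    device;
    size_t size;
    bool   resident;
} Block;

Block demo_blocks[4];

/* Evict resident blocks on `device` until at least `deficit` bytes are freed.
 * Blocks on other devices are never touched. Returns the bytes freed. */
size_t free_on_device(Block *blocks, size_t count, size_t deficit, int device) {
    size_t freed = 0;
    for (size_t i = 0; i < count && freed < deficit; i++) {
        if (blocks[i].device != device || !blocks[i].resident)
            continue;
        blocks[i].resident = false;
        freed += blocks[i].size;
    }
    return freed;
}

/* Demo: a 150-byte deficit on device 1 with blocks spread over two GPUs. */
size_t demo_evict_device1(void) {
    demo_blocks[0] = (Block){ 0, 100, true };
    demo_blocks[1] = (Block){ 1, 100, true };
    demo_blocks[2] = (Block){ 0, 100, true };
    demo_blocks[3] = (Block){ 1, 100, true };
    return free_on_device(demo_blocks, 4, 150, 1);
}
```

In the demo both device-1 blocks are evicted to cover the deficit and both device-0 blocks stay resident, which is the behavior the allocator hooks and the OOM retry path rely on.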

---

## Phase 5: Counter Thread Safety
**Status:** ✅ Done

`dev_vram_add/sub` run under different locks or no lock. Fix with atomics:
```c
static inline void dev_vram_add(int device, size_t size) {
    __atomic_add_fetch(&total_vram_usage, size, __ATOMIC_RELAXED);
    if (device >= 0 && device < AIMDO_MAX_DEVICES)
        __atomic_add_fetch(&dev_vram_usage[device], size, __ATOMIC_RELAXED);
}
```
Windows equivalent: `InterlockedExchangeAdd64`.
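
For completeness, a sketch of the matching subtract and load helpers. The plan only shows the add side, so `usage_sub` and `usage_load` are assumptions about how `dev_vram_sub` and `dev_vram_load` might look, written with the same GCC/Clang builtins and renamed to make clear they are not the real functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVICES 8   /* stand-in for AIMDO_MAX_DEVICES (value assumed) */

uint64_t total_usage;
uint64_t dev_usage[MAX_DEVICES];

/* Add to the global counter and, for a valid index, the per-device counter. */
void usage_add(int device, size_t size) {
    __atomic_add_fetch(&total_usage, size, __ATOMIC_RELAXED);
    if (device >= 0 && device < MAX_DEVICES)
        __atomic_add_fetch(&dev_usage[device], size, __ATOMIC_RELAXED);
}

/* Mirror of usage_add for frees. */
void usage_sub(int device, size_t size) {
    __atomic_sub_fetch(&total_usage, size, __ATOMIC_RELAXED);
    if (device >= 0 && device < MAX_DEVICES)
        __atomic_sub_fetch(&dev_usage[device], size, __ATOMIC_RELAXED);
}

/* Atomic read of one device's counter; out-of-range devices read as 0. */
uint64_t usage_load(int device) {
    if (device < 0 || device >= MAX_DEVICES)
        return 0;
    return __atomic_load_n(&dev_usage[device], __ATOMIC_RELAXED);
}
```

Relaxed ordering is enough here under the assumption that the counters are only statistics: each update is indivisible, but no other memory is published through them.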

---

## Phase 6 (Future): Per-Device Locking
**Status:** ⬜ Not started (only pursue if performance is still more than 1.3× slower than no-aimdo after Phases 1–5)

The global `vbar_lock` serializes both GPUs. `cuCtxSynchronize()` is called while holding it, blocking the other GPU's `vbar_fault`.

Options:
- Per-device lock for device-local operations
- Global lock only for list mutations (insert/remove)
- Avoid `cuCtxSynchronize` on hot path if possible
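
One possible shape for the first two options, sketched with POSIX mutexes. Nothing here exists in aimdo yet: `dev_lock`, `list_lock`, and the helper names are hypothetical, and lock ordering between the two levels would still need to be designed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_DEVICES 8   /* stand-in for AIMDO_MAX_DEVICES (value assumed) */

/* Global lock: held only while inserting into or removing from shared lists. */
pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* One lock per device for device-local work such as faulting and eviction. */
pthread_mutex_t dev_lock[MAX_DEVICES];

/* Call once, before any other thread uses the locks. */
void locks_init(void) {
    for (int i = 0; i < MAX_DEVICES; i++)
        pthread_mutex_init(&dev_lock[i], NULL);
}

void dev_lock_acquire(int device) { pthread_mutex_lock(&dev_lock[device]); }
void dev_lock_release(int device) { pthread_mutex_unlock(&dev_lock[device]); }

/* Non-blocking attempt; returns true if the lock was taken. */
bool dev_lock_try(int device) {
    return pthread_mutex_trylock(&dev_lock[device]) == 0;
}
```

With this split, a thread holding device 0's lock (for example while synchronizing that device) does not stop another thread from taking device 1's lock, which is the serialization the global `vbar_lock` currently imposes.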

---

## Implementation Order
Phase 1 → 2 → 3 → 4 → 5 → **benchmark** → decide Phase 6

## Expected Outcome
Each GPU sees its own ~20GB usage vs 24GB capacity. Phantom deficit eliminated. Should approach the no-aimdo 37s target while remaining crash-free.

## Key Learnings from Earlier Attempts
1. **Can't remove `budget_deficit` pre-check entirely** — aimdo consuming too much VRAM causes `cudaErrorLaunchFailure` crash in pytorch's async allocator.
2. **Pure per-device accounting (without cuMemGetInfo backstop) causes OOM** — `cudaFreeAsync` decrements counters before memory is actually freed, so accounting under-reports real usage.
3. **No-aimdo 2-GPU is unstable** — `cudaErrorInvalidValue` after ~2 runs, even with async offload disabled. Aimdo provides stability that pytorch's default offloading doesn't.
1 change: 1 addition & 0 deletions build-linux-docker
@@ -5,6 +5,7 @@ SRCS="src/*.c src-posix/*.c"
docker build -t manylinux-cuda -f docker/cuda-on-manylinux.Dockerfile .
docker run --rm -v $(pwd):/project -w /project manylinux-cuda \
    gcc -shared -o comfy_aimdo/aimdo.so -fPIC -Werror \
    -Isrc \
    -I/usr/local/cuda/include \
    -L/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/ \
    ${SRCS} -lcuda
125 changes: 125 additions & 0 deletions src/control.c
@@ -4,10 +4,14 @@
uint64_t vram_capacity;
uint64_t total_vram_usage;
uint64_t total_vram_last_check;
uint64_t dev_vram_usage[AIMDO_MAX_DEVICES];
ssize_t deficit_sync;
const char *prevailing_deficit_method;
CUcontext aimdo_cuda_ctx;

/* Phase 1: per-device state */
AimdoDeviceState g_dev[AIMDO_MAX_DEVICES];

bool cuda_budget_deficit() {
    uint64_t now = GET_TICK();
    static uint64_t last_check = 0;
@@ -27,6 +31,117 @@ bool cuda_budget_deficit() {
    return true;
}

/* Phase 3: per-device cuMemGetInfo with context switching (Phase 2) */
bool poll_budget_deficit_dev(int device) {
    if (device < 0 || device >= AIMDO_MAX_DEVICES || !g_dev[device].inited)
        return false;

    AimdoDeviceState *s = &g_dev[device];
    uint64_t now = GET_TICK();

    if (now - s->last_check_tick < 2000) {
        return true;
    }

    CUcontext prev;
    if (!with_device_ctx(device, &prev))
        return false;

    size_t free_vram = 0, total_vram = 0;
    bool ok = CHECK_CU(cuMemGetInfo(&free_vram, &total_vram));

    restore_ctx(prev);

    if (!ok)
        return false;

    s->last_check_tick = now;
    s->usage_last_check = dev_vram_load(device);
    s->deficit_sync = (ssize_t)VRAM_HEADROOM - (ssize_t)free_vram;
    s->prevailing_method = "cuMemGetInfo (per-dev)";
    return true;
}

/* Phase 1: lazy per-device init — thread-safe via init_lock */
#if defined(_WIN32) || defined(_WIN64)
#include <windows.h>
static CRITICAL_SECTION dev_init_lock;
static volatile LONG dev_init_lock_ready;

static inline void dev_init_lock_acquire(void) {
    if (!InterlockedCompareExchange(&dev_init_lock_ready, 1, 0)) {
        InitializeCriticalSection(&dev_init_lock);
        InterlockedExchange(&dev_init_lock_ready, 2);
    }
    while (dev_init_lock_ready != 2) { /* spin until init done */ }
    EnterCriticalSection(&dev_init_lock);
}
static inline void dev_init_lock_release(void) { LeaveCriticalSection(&dev_init_lock); }
#else
#include <pthread.h>
static pthread_mutex_t dev_init_lock = PTHREAD_MUTEX_INITIALIZER;
static inline void dev_init_lock_acquire(void) { pthread_mutex_lock(&dev_init_lock); }
static inline void dev_init_lock_release(void) { pthread_mutex_unlock(&dev_init_lock); }
Review comment (Collaborator): `src/pyt-cu-plug-alloc-async.c` has the same code. It should be generalized if we still need this lock (more to come).

#endif

void ensure_device_init(int device) {
    if (device < 0 || device >= AIMDO_MAX_DEVICES)
        return;

    AimdoDeviceState *s = &g_dev[device];
    if (s->inited)
        return;

    dev_init_lock_acquire();

    /* Double-check after acquiring lock */
    if (s->inited) {
        dev_init_lock_release();
        return;
    }

    CUdevice dev;
    if (!CHECK_CU(cuDeviceGet(&dev, device))) {
        dev_init_lock_release();
        return;
    }

    uint64_t cap = 0;
    if (!CHECK_CU(cuDeviceTotalMem(&cap, dev))) {
        dev_init_lock_release();
        return;
    }

    CUcontext ctx = NULL;
    if (!CHECK_CU(cuDevicePrimaryCtxRetain(&ctx, dev))) {
        dev_init_lock_release();
        return;
    }

    s->vram_capacity = cap;
    s->ctx = ctx;
    s->prevailing_method = "none";

    /* Write inited last with a store barrier so other threads see
     * fully initialized fields before they see inited == true.
     */
#if defined(_WIN32) || defined(_WIN64)
    MemoryBarrier();
    s->inited = true;
#else
    __atomic_store_n(&s->inited, true, __ATOMIC_RELEASE);
#endif

    dev_init_lock_release();

    char dev_name[256];
    if (!CHECK_CU(cuDeviceGetName(dev_name, sizeof(dev_name), dev)))
        sprintf(dev_name, "<unknown>");

    log(INFO, "comfy-aimdo device %d init: %s (VRAM: %zu MB)\n",
        device, dev_name, (size_t)(cap / (1024 * 1024)));
}

SHARED_EXPORT
void aimdo_analyze() {
    size_t free_bytes = 0, total_bytes = 0;
@@ -37,6 +152,13 @@ void aimdo_analyze() {
    log(DEBUG, " Aimdo Recorded Usage: %7zu MB\n", total_vram_usage / M);
    log(DEBUG, " Cuda: %7zu MB / %7zu MB Free\n", free_bytes / M, total_bytes / M);

    for (int i = 0; i < AIMDO_MAX_DEVICES; i++) {
        if (dev_vram_usage[i])
            log(DEBUG, " Device %d Usage: %7zu MB (cap %zu MB)\n",
                i, (size_t)(dev_vram_usage[i] / M),
                g_dev[i].inited ? (size_t)(g_dev[i].vram_capacity / M) : 0);
    }

    vbars_analyze(true);
    allocations_analyze();
}
@@ -65,6 +187,9 @@ bool init(int cuda_device_id) {
        sprintf(dev_name, "<unknown>");
    }

    /* Also populate g_dev for the primary device */
    ensure_device_init(cuda_device_id);

    log(INFO, "comfy-aimdo inited for GPU: %s (VRAM: %zu MB)\n",
        dev_name, (size_t)(vram_capacity / (1024 * 1024)));
    return true;