
[Issue]: moe_ck2stages Mixtral TP8 fails #257

Open
arakowsk-amd opened this issue Mar 31, 2025 · 3 comments
Assignees
Labels
bug Something isn't working

Comments


arakowsk-amd commented Mar 31, 2025

Problem Description

Running Mixtral 8x7B/8x22B with TP8 fails when using AITER. Disabling 2Stage MoE works, as does running TP1 first and then TP8. The failure occurs at start build [module_moe_ck2stages] under /usr/local/lib/python3.12/dist-packages/aiter/jit/build/module_moe_ck2stages

Error:

start build [module_moe_ck2stages] under /usr/local/lib/python3.12/dist-packages/aiter/jit/build/module_moe_ck2stages
(VllmWorkerProcess pid=290) failed build jit [module_moe_ck2stages]
(VllmWorkerProcess pid=290) -->[History]: Traceback (most recent call last):
(VllmWorkerProcess pid=290) -->  File "/usr/local/lib/python3.12/dist-packages/aiter/jit/core.py", line 322, in wrapper
(VllmWorkerProcess pid=290)     module = get_module(custom_build_args.get('md_name',
(VllmWorkerProcess pid=290)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=290) -->  File "/usr/local/lib/python3.12/dist-packages/aiter/jit/core.py", line 130, in get_module
(VllmWorkerProcess pid=290)     return importlib.import_module(f'{__package__}.{md_name}')
(VllmWorkerProcess pid=290)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=290) -->  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
(VllmWorkerProcess pid=290)     return _bootstrap._gcd_import(name[level:], package, level)
(VllmWorkerProcess pid=290)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=290) -->  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
(VllmWorkerProcess pid=290) -->  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
(VllmWorkerProcess pid=290) -->  File "<frozen importlib._bootstrap>", line 1324, in _find_and_load_unlocked
(VllmWorkerProcess pid=290) -->ModuleNotFoundError: No module named 'aiter.jit.module_moe_ck2stages'
(VllmWorkerProcess pid=290) -->
(VllmWorkerProcess pid=290) During handling of the above exception, another exception occurred:
(VllmWorkerProcess pid=290)
(VllmWorkerProcess pid=290) -->Traceback (most recent call last):
(VllmWorkerProcess pid=290) -->  File "/usr/local/lib/python3.12/dist-packages/aiter/jit/core.py", line 228, in build_module
(VllmWorkerProcess pid=290)     shutil.copy(f'{opbd_dir}/{md_name}.so', f'{this_dir}')
(VllmWorkerProcess pid=290) -->  File "/usr/lib/python3.12/shutil.py", line 436, in copy
(VllmWorkerProcess pid=290)     copymode(src, dst, follow_symlinks=follow_symlinks)
(VllmWorkerProcess pid=290) -->  File "/usr/lib/python3.12/shutil.py", line 317, in copymode
(VllmWorkerProcess pid=290)     chmod_func(dst, stat.S_IMODE(st.st_mode))
(VllmWorkerProcess pid=290) -->FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.12/dist-packages/aiter/jit/module_moe_ck2stages.so'
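One plausible way this traceback arises (a minimal sketch with hypothetical paths, not a claim about what aiter's build actually does): `shutil.copy` is `copyfile()` followed by `copymode()`, so it is not atomic. If another worker process removes or replaces the destination `.so` between those two steps, the `chmod` inside `copymode()` raises `FileNotFoundError` even though the data was copied.

```python
# Sketch: shutil.copy is copyfile() + copymode(); if the destination
# disappears between the two steps (e.g. another process cleaning up),
# copymode() raises FileNotFoundError on the destination path.
import os
import shutil
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "module.so")
dst = os.path.join(tmp, "install", "module.so")
os.makedirs(os.path.dirname(dst))
with open(src, "wb") as f:
    f.write(b"\x7fELF")  # dummy shared-object contents

# Step 1 of shutil.copy: the file contents land at dst.
shutil.copyfile(src, dst)

# Simulate a second worker process removing the file here.
os.remove(dst)

# Step 2 of shutil.copy: copymode() now fails, mirroring the traceback.
try:
    shutil.copymode(src, dst)
    raced = False
except FileNotFoundError:
    raced = True
print(raced)  # True
```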

Example Commands:

# FAILS
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mem \
    --group-add render \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v /data:/data \
    -e HF_HOME=/data/huggingface-cache \
    -e VLLM_USE_TRITON_FLASH_ATTN=0 \
    -e VLLM_USE_AITER=1 \
    rocm/vllm-dev:nightly_aiter_integration_final_20250325
 
python /app/vllm/benchmarks/profiling/benchmark_latency.py \
--model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
--dtype auto \
--gpu-memory-utilization 0.92 \
--num-scheduler-steps 1 \
--max-model-len 8192 \
--distributed-executor-backend mp \
--tensor-parallel-size 8 \
--input-len 128 \
--output-len 128


# Disabling 2Stage MoE works
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mem \
    --group-add render \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v /data:/data \
    -e HF_HOME=/data/huggingface-cache \
    -e VLLM_USE_TRITON_FLASH_ATTN=0 \
    -e VLLM_USE_AITER=1 \
    rocm/vllm-dev:nightly_aiter_integration_final_20250325
 
VLLM_USE_AITER_2STAGE_MOE=0  python /app/vllm/benchmarks/profiling/benchmark_latency.py \
--model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
--dtype auto \
--gpu-memory-utilization 0.92 \
--num-scheduler-steps 1 \
--max-model-len 8192 \
--distributed-executor-backend mp \
--tensor-parallel-size 8 \
--input-len 128 \
--output-len 128



# Works when running TP1 first and then TP8 in the same container
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mem \
    --group-add render \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v /data:/data \
    -e HF_HOME=/data/huggingface-cache \
    -e VLLM_USE_TRITON_FLASH_ATTN=0 \
    -e VLLM_USE_AITER=1 \
    rocm/vllm-dev:nightly_aiter_integration_final_20250325
 
python /app/vllm/benchmarks/profiling/benchmark_latency.py \
--model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
--dtype auto \
--gpu-memory-utilization 0.92 \
--num-scheduler-steps 1 \
--max-model-len 8192 \
--distributed-executor-backend mp \
--tensor-parallel-size 1 \
--input-len 128 \
--output-len 128
# start build [module_moe_ck2stages] under /usr/local/lib/python3.12/dist-packages/aiter/jit/build/module_moe_ck2stages
# Completes without error
 
# now TP8 Works
python /app/vllm/benchmarks/profiling/benchmark_latency.py \
--model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
--dtype auto \
--gpu-memory-utilization 0.92 \
--num-scheduler-steps 1 \
--max-model-len 8192 \
--distributed-executor-backend mp \
--tensor-parallel-size 8 \
--input-len 128 \
--output-len 128

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9575F 64-Core Processor

GPU

MI300X

ROCm Version

ROCm 6.3.1

ROCm Component

No response

Steps to Reproduce

FAILS

docker run -it --ipc=host --network=host --privileged --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri --device=/dev/mem --group-add render --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v /data:/data -e HF_HOME=/data/huggingface-cache -e VLLM_USE_TRITON_FLASH_ATTN=0 -e VLLM_USE_AITER=1 rocm/vllm-dev:nightly_aiter_integration_final_20250325

python /app/vllm/benchmarks/profiling/benchmark_latency.py \
    --model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --dtype auto \
    --gpu-memory-utilization 0.92 \
    --num-scheduler-steps 1 \
    --max-model-len 8192 \
    --distributed-executor-backend mp \
    --tensor-parallel-size 8 \
    --input-len 128 \
    --output-len 128

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@arakowsk-amd arakowsk-amd added the bug Something isn't working label Mar 31, 2025
@junhaha666
Contributor

I'm sorry, I can't reproduce your problem. It looks like the CK 2Stage MoE JIT build failed. Maybe you can try running this test to check whether CK 2Stage MoE works. If it works, you can run your command to launch the vLLM benchmark again.
test: https://github.com/ROCm/aiter/blob/main/op_tests/test_moe_2stage.py

@arakowsk-amd
Author

That test seems to pass, and running TP8 afterwards in the same container works fine. However, I get this error when I start a new container and run TP8 as the first command. I'm seeing this on multiple systems and with multiple different input/output sizes. What is the exit status of the command below for you?

docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mem \
    --group-add render \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v /data:/data \
    -e HF_HOME=/data/huggingface-cache \
    -e HF_TOKEN=<TOKEN> \
    -e VLLM_USE_TRITON_FLASH_ATTN=0 \
    -e VLLM_USE_AITER=1 \
    rocm/vllm-dev:nightly_aiter_integration_final_20250325 \
    python /app/vllm/benchmarks/profiling/benchmark_latency.py \
    --model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --dtype auto \
    --gpu-memory-utilization 0.92 \
    --num-scheduler-steps 10 \
    --max-model-len 8192 \
    --distributed-executor-backend mp \
    --tensor-parallel-size 8 \
    --input-len 128 \
    --output-len 128
echo $?
135
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --device=/dev/mem \
    --group-add render \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v /data:/data \
    -e HF_HOME=/data/huggingface-cache \
    -e HF_TOKEN=<TOKEN> \
    -e VLLM_USE_TRITON_FLASH_ATTN=0 \
    -e VLLM_USE_AITER=0 \
    rocm/vllm-dev:nightly_aiter_integration_final_20250325 \
    python /app/vllm/benchmarks/profiling/benchmark_latency.py \
    --model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --dtype auto \
    --gpu-memory-utilization 0.92 \
    --num-scheduler-steps 10 \
    --max-model-len 8192 \
    --distributed-executor-backend mp \
    --tensor-parallel-size 8 \
    --input-len 128 \
    --output-len 128
echo $?
0

@junhaha666
Copy link
Contributor

This problem can be caused by multiple processes triggering the JIT compilation at the same time. To fix this, we created a new branch (https://github.com/ROCm/aiter/tree/jit_update). You can replace AITER in the container with this branch and run your command again.
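The usual remedy for this kind of race is to serialize the build behind an advisory file lock so only one worker compiles while the others wait and then reuse the artifact. Below is a hedged sketch of that idea using `fcntl.flock`; the names (`build_once`, `fake_build`) are hypothetical and the actual jit_update branch may implement it differently.

```python
# Sketch: serialize a per-module JIT build across worker processes with
# an advisory file lock, so only the first process actually builds.
import fcntl
import os
import tempfile

def build_once(build_dir: str, md_name: str, build_fn) -> str:
    """Run build_fn at most once per artifact, even with many processes."""
    os.makedirs(build_dir, exist_ok=True)
    artifact = os.path.join(build_dir, f"{md_name}.so")
    lock_path = os.path.join(build_dir, f"{md_name}.lock")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            if not os.path.exists(artifact):
                build_fn(artifact)        # only one process builds
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return artifact

# Usage: a stand-in "build" that records each invocation.
tmp = tempfile.mkdtemp()
calls = []

def fake_build(path):
    calls.append(path)
    with open(path, "wb") as f:
        f.write(b"\x7fELF")  # dummy shared-object contents

build_once(tmp, "module_moe_ck2stages", fake_build)
build_once(tmp, "module_moe_ck2stages", fake_build)
print(len(calls))  # 1: the second call sees the artifact and skips the build
```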
