I hit the CUDA error "an illegal memory access was encountered" when calling flashmoe.run_moe with n_processes=2:
(ziming) ubuntu@ip-10-0-0-122:~/ziming/FlashMoE$ python -c "import flashmoe; flashmoe.run_moe(n_processes=2)"
Launching FlashMoE with: /opt/amazon/openmpi/bin/mpirun -np 2 --map-by ppr:2:node -x NVSHMEM_HOME -x LD_LIBRARY_PATH -x CUDA_HOME -x PATH /home/ubuntu/miniconda3/envs/ziming/bin/python /home/ubuntu/ziming/FlashMoE/flashmoe/worker.py /home/ubuntu/ziming/FlashMoE/csrc/kleos_config.json
============================================================
ERROR: Command failed with exit code 1
============================================================
STDOUT:
Process 0/2 using GPU 0
Process 0: Creating 64 local experts (total 64)
Process 0: Calling moe_forward...
Process 1/2 using GPU 1
Process 1: Creating 64 local experts (total 64)
Process 1: Calling moe_forward...
Process 0: FlashMoE forward pass took 5.31 ms
Process 0: Completed! Output: torch.Size([1, 8192, 2048])
STDERR:
</home/ubuntu/ziming/FlashMoE/csrc/include/flashmoe/moe/moe.cuh:179> cudaStreamSynchronize(flashmoeStream):
cudaErrorIllegalAddress: an illegal memory access was encountered
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[36782,1],1]
Exit code: 1
--------------------------------------------------------------------------
============================================================
Traceback (most recent call last):
File "<string>", line 1, in <module>
import flashmoe; flashmoe.run_moe(n_processes=2)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
File "/home/ubuntu/ziming/FlashMoE/flashmoe/ops.py", line 54, in run_moe
nvshmrun_launcher(
~~~~~~~~~~~~~~~~~^
config_path=config_path,
^^^^^^^^^^^^^^^^^^^^^^^^
...<2 lines>...
hostfile=hostfile
^^^^^^^^^^^^^^^^^
)
^
File "/home/ubuntu/ziming/FlashMoE/flashmoe/launcher.py", line 89, in nvshmrun_launcher
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
File "/home/ubuntu/miniconda3/envs/ziming/lib/python3.13/subprocess.py", line 577, in run
raise CalledProcessError(retcode, process.args,
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/opt/amazon/openmpi/bin/mpirun', '-np', '2', '--map-by', 'ppr:2:node', '-x', 'NVSHMEM_HOME', '-x', 'LD_LIBRARY_PATH', '-x', 'CUDA_HOME', '-x', 'PATH', '/home/ubuntu/miniconda3/envs/ziming/bin/python', '/home/ubuntu/ziming/FlashMoE/flashmoe/worker.py', '/home/ubuntu/ziming/FlashMoE/csrc/kleos_config.json']' returned non-zero exit status 1.
However, the same call works with n_processes=1:
(ziming) ubuntu@ip-10-0-0-122:~/ziming/FlashMoE$ python -c "import flashmoe; flashmoe.run_moe(n_processes=1)"
Launching FlashMoE with: /opt/amazon/openmpi/bin/mpirun -np 1 --map-by ppr:1:node -x NVSHMEM_HOME -x LD_LIBRARY_PATH -x CUDA_HOME -x PATH /home/ubuntu/miniconda3/envs/ziming/bin/python /home/ubuntu/ziming/FlashMoE/flashmoe/worker.py /home/ubuntu/ziming/FlashMoE/csrc/kleos_config.json
Process 0/1 using GPU 0
Process 0: Creating 64 local experts (total 64)
Process 0: Calling moe_forward...
Process 0: FlashMoE forward pass took 5.21 ms
Process 0: Completed! Output: torch.Size([1, 8192, 2048])
(ziming) ubuntu@ip-10-0-0-122:~/ziming/FlashMoE$
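One detail in both logs that may be worth checking: with n_processes=2, each rank still reports "Creating 64 local experts (total 64)", the same as the single-process run. Whether that is intended or related to the crash is not known from this report. The sketch below only parses the log lines quoted above and checks whether the per-rank local counts add up to the reported total; the helper names are hypothetical and not part of FlashMoE.

```python
import re

# Lines copied from the STDOUT of the failing n_processes=2 run above.
LOG_TWO_RANKS = """\
Process 0/2 using GPU 0
Process 0: Creating 64 local experts (total 64)
Process 1/2 using GPU 1
Process 1: Creating 64 local experts (total 64)
"""

# Lines copied from the working n_processes=1 run above.
LOG_ONE_RANK = """\
Process 0/1 using GPU 0
Process 0: Creating 64 local experts (total 64)
"""

EXPERT_LINE = re.compile(
    r"Process (\d+): Creating (\d+) local experts \(total (\d+)\)"
)


def expert_counts(log: str) -> dict[int, tuple[int, int]]:
    """Map rank -> (local experts, total experts) as printed in the log."""
    return {
        int(rank): (int(local), int(total))
        for rank, local, total in EXPERT_LINE.findall(log)
    }


def local_sum_matches_total(log: str) -> bool:
    """True if the local expert counts over all ranks sum to the reported total.

    This assumes experts are meant to be partitioned across ranks, which
    the report does not confirm.
    """
    counts = expert_counts(log)
    totals = {total for _, total in counts.values()}
    if len(totals) != 1:
        return False  # ranks disagree on the total, or no lines matched
    return sum(local for local, _ in counts.values()) == totals.pop()


if __name__ == "__main__":
    print("1 rank :", local_sum_matches_total(LOG_ONE_RANK))
    print("2 ranks:", local_sum_matches_total(LOG_TWO_RANKS))
```

If the worker is supposed to split the 64 experts across ranks, the two-rank log would be inconsistent with that; if each rank is meant to hold all 64 locally, this check does not apply.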
I am running on a node with 8 H100 GPUs (H100:8).
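To narrow down where the illegal access originates, the launch command printed above can be rerun by hand with CUDA_LAUNCH_BLOCKING=1 exported to every rank, so that kernel launches are synchronous and the error is reported closer to the launch that caused it. The following is a minimal sketch, assuming the command string from the log is reused unchanged; `with_exported_env` and `run_debug` are hypothetical helpers, not FlashMoE APIs. It uses `check=False` so that the ranks' stdout and stderr are returned instead of being wrapped in a CalledProcessError.

```python
import os
import shlex
import subprocess

# Launch command copied verbatim from the "Launching FlashMoE with:" log line.
LOGGED_CMD = (
    "/opt/amazon/openmpi/bin/mpirun -np 2 --map-by ppr:2:node "
    "-x NVSHMEM_HOME -x LD_LIBRARY_PATH -x CUDA_HOME -x PATH "
    "/home/ubuntu/miniconda3/envs/ziming/bin/python "
    "/home/ubuntu/ziming/FlashMoE/flashmoe/worker.py "
    "/home/ubuntu/ziming/FlashMoE/csrc/kleos_config.json"
)


def with_exported_env(cmd: list[str], names: list[str]) -> list[str]:
    """Return a copy of an mpirun command with an extra `-x NAME` per variable.

    Open MPI's `-x NAME` forwards the variable's value from the launching
    environment to each rank. The flags are placed right after the
    mpirun executable, before the existing options.
    """
    exports: list[str] = []
    for name in names:
        exports += ["-x", name]
    return [cmd[0], *exports, *cmd[1:]]


def run_debug(cmd_str: str = LOGGED_CMD) -> subprocess.CompletedProcess:
    """Rerun the logged command with synchronous CUDA kernel launches."""
    cmd = with_exported_env(shlex.split(cmd_str), ["CUDA_LAUNCH_BLOCKING"])
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    # check=False: keep the captured output even when a rank exits non-zero.
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    result = run_debug()
    print("exit code:", result.returncode)
    print(result.stdout)
    print(result.stderr)
```

This may not change where the error surfaces if the forward pass runs as a single long-lived kernel, but it is a cheap first check before reaching for heavier tooling.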