flashmoe.run_moe runs into cudaErrorIllegalAddress: an illegal memory access was encountered #20

@MaoZiming

Calling flashmoe.run_moe with n_processes=2 fails with cudaErrorIllegalAddress (an illegal memory access was encountered):

(ziming) ubuntu@ip-10-0-0-122:~/ziming/FlashMoE$ python -c "import flashmoe; flashmoe.run_moe(n_processes=2)"
Launching FlashMoE with: /opt/amazon/openmpi/bin/mpirun -np 2 --map-by ppr:2:node -x NVSHMEM_HOME -x LD_LIBRARY_PATH -x CUDA_HOME -x PATH /home/ubuntu/miniconda3/envs/ziming/bin/python /home/ubuntu/ziming/FlashMoE/flashmoe/worker.py /home/ubuntu/ziming/FlashMoE/csrc/kleos_config.json
============================================================
ERROR: Command failed with exit code 1
============================================================
STDOUT:
Process 0/2 using GPU 0
Process 0: Creating 64 local experts (total 64)
Process 0: Calling moe_forward...
Process 1/2 using GPU 1
Process 1: Creating 64 local experts (total 64)
Process 1: Calling moe_forward...
Process 0: FlashMoE forward pass took 5.31 ms
Process 0: Completed! Output: torch.Size([1, 8192, 2048])


STDERR:
</home/ubuntu/ziming/FlashMoE/csrc/include/flashmoe/moe/moe.cuh:179> cudaStreamSynchronize(flashmoeStream):
    cudaErrorIllegalAddress: an illegal memory access was encountered
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[36782,1],1]
  Exit code:    1
--------------------------------------------------------------------------

============================================================
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import flashmoe; flashmoe.run_moe(n_processes=2)
                     ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "/home/ubuntu/ziming/FlashMoE/flashmoe/ops.py", line 54, in run_moe
    nvshmrun_launcher(
    ~~~~~~~~~~~~~~~~~^
        config_path=config_path,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        hostfile=hostfile
        ^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/ziming/FlashMoE/flashmoe/launcher.py", line 89, in nvshmrun_launcher
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
  File "/home/ubuntu/miniconda3/envs/ziming/lib/python3.13/subprocess.py", line 577, in run
    raise CalledProcessError(retcode, process.args,
                             output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/opt/amazon/openmpi/bin/mpirun', '-np', '2', '--map-by', 'ppr:2:node', '-x', 'NVSHMEM_HOME', '-x', 'LD_LIBRARY_PATH', '-x', 'CUDA_HOME', '-x', 'PATH', '/home/ubuntu/miniconda3/envs/ziming/bin/python', '/home/ubuntu/ziming/FlashMoE/flashmoe/worker.py', '/home/ubuntu/ziming/FlashMoE/csrc/kleos_config.json']' returned non-zero exit status 1.

However, with n_processes=1 this works.

(ziming) ubuntu@ip-10-0-0-122:~/ziming/FlashMoE$ python -c "import flashmoe; flashmoe.run_moe(n_processes=1)"
Launching FlashMoE with: /opt/amazon/openmpi/bin/mpirun -np 1 --map-by ppr:1:node -x NVSHMEM_HOME -x LD_LIBRARY_PATH -x CUDA_HOME -x PATH /home/ubuntu/miniconda3/envs/ziming/bin/python /home/ubuntu/ziming/FlashMoE/flashmoe/worker.py /home/ubuntu/ziming/FlashMoE/csrc/kleos_config.json
Process 0/1 using GPU 0
Process 0: Creating 64 local experts (total 64)
Process 0: Calling moe_forward...
Process 0: FlashMoE forward pass took 5.21 ms
Process 0: Completed! Output: torch.Size([1, 8192, 2048])

(ziming) ubuntu@ip-10-0-0-122:~/ziming/FlashMoE$ 

I am running on a single node with 8 H100 GPUs.
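For anyone trying to reproduce this: the traceback shows the launcher calling subprocess.run(cmd, capture_output=True, text=True, check=True), so the worker output only surfaces through the CalledProcessError. Below is a minimal sketch of that pattern, which catches the exception and pulls out the CUDA error lines. The helper names run_and_report and cuda_error_lines are hypothetical and not part of flashmoe, and the child command is a stand-in that imitates a failing worker rather than the real mpirun invocation.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (returncode, stdout, stderr) instead of raising."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.returncode, result.stdout, result.stderr
    except subprocess.CalledProcessError as err:
        # With capture_output=True the child's streams are attached to the exception.
        return err.returncode, err.stdout, err.stderr


def cuda_error_lines(stderr):
    """Return the stderr lines that mention a CUDA error name (cudaError...)."""
    return [line.strip() for line in stderr.splitlines() if "cudaError" in line]


if __name__ == "__main__":
    # Stand-in child process (assumption): prints one progress line, writes a
    # CUDA-style error to stderr, and exits with status 1 like the failing rank.
    child = [
        sys.executable,
        "-c",
        "import sys; print('Process 1: Calling moe_forward...'); "
        "sys.stderr.write('cudaErrorIllegalAddress: an illegal memory access was encountered\\n'); "
        "sys.exit(1)",
    ]
    code, out, err = run_and_report(child)
    print("exit code:", code)
    for line in cuda_error_lines(err):
        print("cuda error:", line)
```

Swapping the stand-in child for the mpirun command printed after "Launching FlashMoE with:" should give the same split of stdout and stderr per run, which makes it easier to compare the n_processes=1 and n_processes=2 cases side by side.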
