[BUG]: Cannot run inference in applications/ColossalMoE #6175

Guodanding commented Jan 2, 2025

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

Hello ColossalAI team.
First I installed with
BUILD_EXT=1 pip install colossalai
and then ran ColossalAI/applications/ColossalMoE/infer.sh; the program crashed without producing any useful error report:

(clos) root@autodl-container-bce24a9cd7-bd6ad579:~/workspace-clos/ColossalAI/applications/ColossalMoE# bash infer.sh
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
  warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
  warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn(
[01/02/25 17:10:20] INFO     colossalai - colossalai - INFO: /root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/initialize.py:75 launch                                             
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 2                                                                                  
/root/miniconda3/envs/clos/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Set plugin as MoeHybridParallelPlugin
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:23<00:00,  1.26s/it]
Finish load model
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:207: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, tokens, mask, dest_idx, ec):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:229: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, output_grad):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, expert_tokens, logits, mask, dest_idx, ec):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:270: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, tokens_grad):
W0102 17:10:54.409000 140049982514304 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 13902 closing signal SIGTERM
E0102 17:11:03.597000 140049982514304 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 1 (pid: 13903) of binary: /root/miniconda3/envs/clos/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/clos/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
infer.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-02_17:10:54
  host      : autodl-container-bce24a9cd7-bd6ad579
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 13903)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 13903
======================================================

Then I installed with

cd ColossalAI/applications/ColossalMoE
pip install -e .

as the README says, but I still hit bugs.
With colossalai == 0.3.3 or 0.3.4, the error is:

Traceback (most recent call last):
  File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 14, in <module>
    from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
ModuleNotFoundError: No module named 'colossalai.booster.plugin.moe_hybrid_parallel_plugin'

With colossalai >= 0.3.5, the error is:

Traceback (most recent call last):
  File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 14, in <module>
    from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 92, in <module>
    class MoeHybridParallelPlugin(HybridParallelPlugin):
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 185, in MoeHybridParallelPlugin
    checkpoint_io: Optional[MoECheckpintIO] = None,
NameError: name 'MoECheckpintIO' is not defined. Did you mean: 'MoECheckpointIO'?
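
(The NameError message itself points at the fix: the annotation in moe_hybrid_parallel_plugin.py misspells the class name. What I changed is sketched below; the line number comes from the traceback, and the surrounding code is my paraphrase rather than a verbatim copy of the file.)

# colossalai/booster/plugin/moe_hybrid_parallel_plugin.py, around line 185 (per the traceback)
# before (raises NameError because the name is misspelled):
#     checkpoint_io: Optional[MoECheckpintIO] = None,
# after (use the correctly spelled class suggested by the "Did you mean" hint):
checkpoint_io: Optional[MoECheckpointIO] = None,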

After fixing that typo, the error becomes:

Traceback (most recent call last):
  File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 114, in <module>
    main()
  File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 62, in main
    colossalai.launch_from_torch(seed=args.seed)
TypeError: launch_from_torch() missing 1 required positional argument: 'config'
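
(The only workaround I could think of for this last error is passing a config dict explicitly, as older ColossalAI examples seem to do; an untested sketch of what infer.py line 62 would become, with the empty dict being my guess rather than something taken from the ColossalMoE README:)

colossalai.launch_from_torch(config={}, seed=args.seed)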

How can I fix this properly?

Environment

(clos) root@autodl-container-bce24a9cd7-bd6ad579:~/workspace-clos/ColossalAI/applications/ColossalMoE# colossalai check -i
#### Installation Report ####

------------ Environment ------------
Colossal-AI version: 0.3.6
PyTorch version: 2.4.1
System CUDA version: 12.4
CUDA version required by PyTorch: 12.1

Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is give by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable BUILD_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: x
System and Colossal-AI CUDA version match: N/A

Note:
1. The table above checks the version compatibility of the libraries/tools in the current environment
   - PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
   - System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
   - System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation