[BUG]: Cannot run inference in applications/ColossalMoE #6175

Guodanding commented Jan 2, 2025

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

Hello ColossalAI team.
First I installed with
BUILD_EXT=1 pip install colossalai
and then ran ColossalAI/applications/ColossalMoE/infer.sh; the program crashed without producing any useful error report:

(clos) root@autodl-container-bce24a9cd7-bd6ad579:~/workspace-clos/ColossalAI/applications/ColossalMoE# bash infer.sh
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
  warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
  warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn(
[01/02/25 17:10:20] INFO     colossalai - colossalai - INFO: /root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/initialize.py:75 launch                                             
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 2                                                                                  
/root/miniconda3/envs/clos/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Set plugin as MoeHybridParallelPlugin
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:23<00:00,  1.26s/it]
Finish load model
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:207: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, tokens, mask, dest_idx, ec):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:229: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, output_grad):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, expert_tokens, logits, mask, dest_idx, ec):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:270: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, tokens_grad):
W0102 17:10:54.409000 140049982514304 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 13902 closing signal SIGTERM
E0102 17:11:03.597000 140049982514304 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 1 (pid: 13903) of binary: /root/miniconda3/envs/clos/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/clos/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
infer.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-02_17:10:54
  host      : autodl-container-bce24a9cd7-bd6ad579
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 13903)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 13903
======================================================

Then I installed with

cd ColossalAI/applications/ColossalMoE
pip install -e .

as the README says, but I still hit bugs.
With colossalai == 0.3.3 or 0.3.4, the error is:

Traceback (most recent call last):
  File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 14, in <module>
    from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
ModuleNotFoundError: No module named 'colossalai.booster.plugin.moe_hybrid_parallel_plugin'

With colossalai >= 0.3.5, the error is:

Traceback (most recent call last):
  File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 14, in <module>
    from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 92, in <module>
    class MoeHybridParallelPlugin(HybridParallelPlugin):
  File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 185, in MoeHybridParallelPlugin
    checkpoint_io: Optional[MoECheckpintIO] = None,
NameError: name 'MoECheckpintIO' is not defined. Did you mean: 'MoECheckpointIO'?
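
(The NameError message itself points at the fix: the annotation in moe_hybrid_parallel_plugin.py misspells the class name. What I changed is sketched below; the line number comes from the traceback, and the surrounding code is my paraphrase rather than a verbatim copy of the file.)

# colossalai/booster/plugin/moe_hybrid_parallel_plugin.py, around line 185 (per the traceback)
# before (raises NameError because the name is misspelled):
#     checkpoint_io: Optional[MoECheckpintIO] = None,
# after (use the correctly spelled class suggested by the "Did you mean" hint):
checkpoint_io: Optional[MoECheckpointIO] = None,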

After fixing that typo, the error becomes:

Traceback (most recent call last):
  File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 114, in <module>
    main()
  File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 62, in main
    colossalai.launch_from_torch(seed=args.seed)
TypeError: launch_from_torch() missing 1 required positional argument: 'config'
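
(The only workaround I could think of for this last error is passing a config dict explicitly, as older ColossalAI examples seem to do; an untested sketch of what infer.py line 62 would become, with the empty dict being my guess rather than something taken from the ColossalMoE README:)

colossalai.launch_from_torch(config={}, seed=args.seed)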

How can I fix this properly?

Environment

(clos) root@autodl-container-bce24a9cd7-bd6ad579:~/workspace-clos/ColossalAI/applications/ColossalMoE# colossalai check -i
#### Installation Report ####

------------ Environment ------------
Colossal-AI version: 0.3.6
PyTorch version: 2.4.1
System CUDA version: 12.4
CUDA version required by PyTorch: 12.1

Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is give by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable BUILD_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: x
System and Colossal-AI CUDA version match: N/A

Note:
1. The table above checks the version compatibility of the libraries/tools in the current environment
   - PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
   - System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
   - System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation