Hello ColossalAI team.
First I installed with
BUILD_EXT=1 pip install colossalai
and then ran ColossalAI/applications/ColossalMoE/infer.sh. The program crashed without any useful error report:
(clos) root@autodl-container-bce24a9cd7-bd6ad579:~/workspace-clos/ColossalAI/applications/ColossalMoE# bash infer.sh
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/utils/safetensors.py:13: UserWarning: Please install the latest tensornvme to use async save. pip install git+https://github.com/hpcaitech/TensorNVMe.git
warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:48: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:93: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn(
[01/02/25 17:10:20] INFO colossalai - colossalai - INFO: /root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/initialize.py:75 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 2
/root/miniconda3/envs/clos/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/root/miniconda3/envs/clos/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Set plugin as MoeHybridParallelPlugin
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:23<00:00, 1.26s/it]
Finish load model
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:207: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, tokens, mask, dest_idx, ec):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:229: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, output_grad):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, expert_tokens, logits, mask, dest_idx, ec):
/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/moe/_operation.py:270: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, tokens_grad):
W0102 17:10:54.409000 140049982514304 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 13902 closing signal SIGTERM
E0102 17:11:03.597000 140049982514304 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 1 (pid: 13903) of binary: /root/miniconda3/envs/clos/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/clos/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
infer.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-02_17:10:54
host : autodl-container-bce24a9cd7-bd6ad579
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 13903)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 13903
======================================================
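For what it's worth, exitcode -9 means rank 1 was killed with SIGKILL. My guess is the kernel OOM killer on this container, though the log does not confirm it. Here is the stdlib-only snippet I used to check available host memory before launching (my own diagnostic, not part of infer.sh):

# Print total and available memory from /proc/meminfo (Linux only).
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(("MemTotal", "MemAvailable")):
            print(line.strip())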
Then I installed with
cd ColossalAI/applications/ColossalMoE
pip install -e .
as the README says, but ran into bugs.
If I install colossalai == 0.3.3 or 0.3.4, the error is:
Traceback (most recent call last):
File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 14, in <module>
from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
ModuleNotFoundError: No module named 'colossalai.booster.plugin.moe_hybrid_parallel_plugin'
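A quick check (assuming the usual package layout) confirms that the plugin module simply does not exist in these releases:

# List the plugin classes exported by colossalai.booster.plugin;
# MoeHybridParallelPlugin is absent in 0.3.3/0.3.4.
import colossalai.booster.plugin as plugin_pkg
print([name for name in dir(plugin_pkg) if name.endswith("Plugin")])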
If I install colossalai >= 0.3.5, the error is:
Traceback (most recent call last):
File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 14, in <module>
from colossalai.booster.plugin.moe_hybrid_parallel_plugin import MoeHybridParallelPlugin
File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 92, in <module>
class MoeHybridParallelPlugin(HybridParallelPlugin):
File "/root/miniconda3/envs/clos/lib/python3.10/site-packages/colossalai/booster/plugin/moe_hybrid_parallel_plugin.py", line 185, in MoeHybridParallelPlugin
checkpoint_io: Optional[MoECheckpintIO] = None,
NameError: name 'MoECheckpintIO' is not defined. Did you mean: 'MoECheckpointIO'?
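By "fix it" below I mean applying the rename that the error message itself suggests, i.e. changing line 185 of moe_hybrid_parallel_plugin.py to use the class that actually exists:

# Before: checkpoint_io: Optional[MoECheckpintIO] = None,  # NameError (typo)
checkpoint_io: Optional[MoECheckpointIO] = None,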
After applying that fix, the next error is:
Traceback (most recent call last):
File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 114, in <module>
main()
File "/root/workspace-clos/ColossalAI/applications/ColossalMoE/infer.py", line 62, in main
colossalai.launch_from_torch(seed=args.seed)
TypeError: launch_from_torch() missing 1 required positional argument: 'config'
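From the signature error, it looks like this colossalai version still expects the old launch API, where a config dict is the first positional argument. Passing an empty dict (my workaround, possibly not the intended fix) gets past the TypeError:

# Old-style launch: a config argument must be supplied; an empty dict works here.
colossalai.launch_from_torch(config={}, seed=args.seed)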
How can I fix this properly?
Environment
(clos) root@autodl-container-bce24a9cd7-bd6ad579:~/workspace-clos/ColossalAI/applications/ColossalMoE# colossalai check -i
#### Installation Report ####
------------ Environment ------------
Colossal-AI version: 0.3.6
PyTorch version: 2.4.1
System CUDA version: 12.4
CUDA version required by PyTorch: 12.1
Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is give by torch.version.cuda and you can go to https://pytorch.org/get-started/locally/ to download the correct version.
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable BUILD_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: x
System and Colossal-AI CUDA version match: N/A
Note:
1. The table above checks the version compatibility of the libraries/tools in the current environment
- PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
- System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
- System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation