[Bug]: GLM-5-w8a8 fails to launch in 910C #967
Open
Labels: bug (Something isn't working)
Description
Your environment
- Hardware: 910C with ARM
- xLLM version: preview/glm5
- startup parameters:
--max_memory_utilization=0.85
--max_tokens_per_batch=8192
--max_seqs_per_batch=16
--block_size=128
--enable_prefix_cache=true
--enable_chunked_prefill=true
--communication_backend="hccl"
--enable_schedule_overlap=true
--enable_graph=true
--enable_graph_no_padding=true
--enable_mla=true
--draft_model=$DRAFT_MODEL_PATH
--draft_devices="npu:$DEVICE"
--num_speculative_tokens=1
--ep_size=8
--dp_size=1
🐛 Describe the bug
- The rank-0 log shows that the word_embedding_layer execute plan fails with error code 28 during the prefill warmup:
I20260302 12:21:10.077145 521062 llm_engine.cpp:389] Initializing v cache with shape: [275 128 1 64]
I20260302 12:21:10.077220 521062 llm_engine.cpp:391] Initializing indexer cache with shape: [275 128 1 128]
I20260302 12:21:10.078318 521062 profile_manager.cpp:63] Starting ACL Graph/CUDA Graph warmup.
I20260302 12:21:10.078365 521062 profile_manager.cpp:771] Starting ACL Graph/CUDA Graph warmup with prefill and decode requests...
I20260302 12:21:10.078394 521062 profile_manager.cpp:809] Warming up prefill request: tokens=8192
mki_log mkdir /root/ascend/log/atb
E20260302 12:21:10.300601 525515 npu_base_layer.cpp:124] word_embedding_layer execute plan fail, error code: 28
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
what(): The Inner error is reported as above. The process exits for this inner error, and the current working operator name is word_embedding_layer0.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[ERROR] 2026-03-02-12:21:10 (PID:521062, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
- The ATB log shows that HcclGetRootInfo fails (error 7) before the all-gather in the word embedding runs, so hcclComm is null when it executes:
[2026-03-02 12:21:10.125760] [error] [525604] [hccl_runner.cpp:178] AllGatherHcclRunner:0 HcclGetRootInfo fail, error:7, rank:0
[2026-03-02 12:21:10.127881] [error] [525604] [comm_pool.h:42] CommPool commCreateFunc fail
[2026-03-02 12:21:10.127889] [error] [525604] [hccl_runner.cpp:81] AllGatherHcclRunner:0 get hccl comm fail by rank:0
[2026-03-02 12:21:10.300542] [error] [525515] [all_gather_hccl_runner.cpp:39] hcclComm is null, rank: 0
[2026-03-02 12:21:10.300575] [error] [525515] [runner.cpp:133] AllGatherHcclRunner_0_1:1 Execute Failed. st: 28
[2026-03-02 12:21:10.300583] [error] [525515] [graph_runner.cpp:972] WordEmbeddingRunner_0:0 node[1] execute fail, runner name:AllGatherHcclRunner
[2026-03-02 12:21:10.300588] [error] [525515] [runner.cpp:133] WordEmbeddingRunner_0:1 Execute Failed. st: 28
[2026-03-02 12:21:10.300593] [error] [525515] [operation_base.cpp:1018] WordEmbedding_0 execute WordEmbeddingRunner fail
[2026-03-02 12:21:10.300596] [error] [525515] [operation_base.cpp:1095] WordEmbedding_0 Launch fail, error code: 28
- The failure is reproducible on every launch.
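The ATB log lines above come from two different thread IDs, so reading them in timestamp order helps separate the first failure from the ones that follow from it. The sketch below is a small triage helper, not part of xLLM or ATB: the regular expression is only an assumption based on the line layout in the excerpt above, and `parse_errors` and `first_error` are hypothetical names.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the ATB excerpt in this issue:
# [YYYY-MM-DD HH:MM:SS.ffffff] [level] [thread id] [file:line] message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] "
    r"\[(?P<level>\w+)\] \[(?P<tid>\d+)\] \[(?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def parse_errors(lines):
    """Return (timestamp, thread_id, source, message) for error lines, oldest first."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") == "error":
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, int(m.group("tid")), m.group("src"), m.group("msg")))
    return sorted(events, key=lambda e: e[0])


def first_error(lines):
    """Return the earliest error event, or None if no line matches."""
    events = parse_errors(lines)
    return events[0] if events else None


# Two lines copied from the ATB log in this issue, deliberately out of order.
SAMPLE = [
    "[2026-03-02 12:21:10.300542] [error] [525515] [all_gather_hccl_runner.cpp:39] hcclComm is null, rank: 0",
    "[2026-03-02 12:21:10.125760] [error] [525604] [hccl_runner.cpp:178] AllGatherHcclRunner:0 HcclGetRootInfo fail, error:7, rank:0",
]

if __name__ == "__main__":
    ts, tid, src, msg = first_error(SAMPLE)
    print(f"earliest error at {ts.time()} (thread {tid}, {src}): {msg}")
```

Run against the full ATB log file (for example `parse_errors(open(path))`, with `path` pointing at the log under the ATB log directory), this puts the HcclGetRootInfo failure ahead of the WordEmbedding errors, which suggests investigating HCCL communicator setup rather than the embedding layer itself.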