
[Bug]: kimi k2.5: prefill node crashes after enabling VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE #7508

@wakalai

Description

Your current environment

Environment:
Machine: A3
Image: quay.io/ascend/vllm-ascend:v0.17.0rc1-a3

Launch command:
```bash
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_RPC_TIMEOUT=3600000
export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3600000
export HCCL_OP_EXPANSION_MODE="AIV"

export VLLM_WORKER_MULTIPROC_METHOD="fork"
export ASCEND_BUFFER_POOL=4:8
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
#export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export LD_PRELOAD=/lib/aarch64-linux-gnu/libjemalloc.so.2
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000

export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_BUFFSIZE=800
export VLLM_ASCEND_ENABLE_FUSED_MC2=2
#export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=0
export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=4
vllm serve Kimi-K25-W4A8 \
  --host 0.0.0.0 \
  --port 1025 \
  --quantization ascend \
  --served-model-name kimi \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --max-num-seqs 8 \
  --max-model-len 131072 \
  --max-num-batched-tokens 11264 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.9 \
  --allowed-local-media-path / \
  --seed 42 \
  --async-scheduling \
  --mm-processor-cache-type shm \
  --mm-encoder-tp-mode data \
  --additional-config '{"ascend_scheduler_config": {"enabled": false}, "torchair_graph_config": {"enabled": false}, "recompute_scheduler_enable": true, "layer_sharding": ["o_proj"]}' \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeLayerwiseConnector",
    "kv_role": "kv_producer",
    "kv_port": "30100",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
    "kv_connector_extra_config": {
      "prefill": {
        "dp_size": 2,
        "tp_size": 8
      },
      "decode": {
        "dp_size": 4,
        "tp_size": 4
      }
    }
  }' 2>&1 | tee p.log
```
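
For reference, the PD topology encoded in the `--kv-transfer-config` payload above works out to 16 ranks on each side of the prefill/decode split. A minimal sketch that checks this from the same JSON (plain stdlib, no vllm dependency; the check itself is ours, not something vllm runs):

```python
import json

# --kv-transfer-config payload copied from the launch command above.
cfg = json.loads("""
{"kv_connector": "MooncakeLayerwiseConnector",
 "kv_role": "kv_producer",
 "kv_port": "30100",
 "engine_id": "0",
 "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
 "kv_connector_extra_config": {
   "prefill": {"dp_size": 2, "tp_size": 8},
   "decode":  {"dp_size": 4, "tp_size": 4}
 }}
""")

for role, sizes in cfg["kv_connector_extra_config"].items():
    # dp_size * tp_size = total ranks on that side of the PD split.
    print(role, sizes["dp_size"] * sizes["tp_size"])  # prefill 16, decode 16
```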

🐛 Describe the bug

Error:

```
File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/layer_shard_linear.py", line 73, in post_process_after_loading
    assert layer.layer_idx == layer_idx, "layer_idx must be consecutive"
AssertionError: layer_idx must be consecutive
```
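
For context on the failure mode: the assertion text suggests post_process_after_loading walks the layers selected by "layer_sharding" and requires their layer_idx values to form a consecutive run, which apparently no longer holds once VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE is set. A minimal sketch of that style of check (illustrative only; the class and call below are assumptions, not the actual layer_shard_linear.py code):

```python
# Illustrative sketch only -- NOT the real vllm-ascend implementation.
# Assumption: each sharded layer carries a .layer_idx and the loader
# expects the indices 0, 1, 2, ... with no gaps or reordering.

class ShardedLayer:
    def __init__(self, layer_idx: int):
        self.layer_idx = layer_idx

def post_process_after_loading(layers):
    for layer_idx, layer in enumerate(layers):
        # Fires like the reported AssertionError when a layer is missing
        # or out of order (e.g. skipped by the sharding plan).
        assert layer.layer_idx == layer_idx, "layer_idx must be consecutive"

# Index 2 is missing, so the run is not consecutive -> AssertionError.
post_process_after_loading([ShardedLayer(0), ShardedLayer(1), ShardedLayer(3)])
```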

P.S.: Please write proper documentation for each optimization option. I have no idea what value VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE is supposed to take, or what recompute_scheduler_enable is actually for.
