[model] Add support for openPangu_Ultra_MoE #27521

yt0428 · 2025-10-26T03:39:56Z

Purpose

Add support for openPangu_Ultra_MoE models
FIX #27019

Test Plan

Test for openPangu-Ultra-MoE-718B-V1.1

Start serving:

vllm serve $LOCAL_CKPT_DIR/openPangu-Ultra-MoE-718B-V1.1 \
    --data-parallel-size 4 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank $NODE_RANK \
    --data-parallel-address $MASTER_NODE_IP \
    --data-parallel-rpc-port 13389 \
    --tensor-parallel-size 8 \
    --served-model-name pangu_ultra_moe \
    --enable-expert-parallel \
    --trust-remote-code

Test for openPangu-Embedded-7B-V1.1

Start serving:

Master node:

vllm serve FreedomIntelligence/openPangu-Embedded-7B-V1.1 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-num-batched-tokens 32768 \
    --max-model-len 32768 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --served-model-name pangu \
    --tensor-parallel-size 8 \
    --data-parallel-size 4 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank $NODE_RANK \
    --data-parallel-address $MASTER_NODE_IP \
    --data-parallel-rpc-port 13345

Other nodes:

vllm serve FreedomIntelligence/openPangu-Embedded-7B-V1.1 \
    --host 0.0.0.0 \
    --port 8000 \
    --headless \
    --max-num-batched-tokens 32768 \
    --max-model-len 32768 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --served-model-name pangu \
    --tensor-parallel-size 8 \
    --data-parallel-size 4 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank $NODE_RANK \
    --data-parallel-address $MASTER_NODE_IP \
    --data-parallel-rpc-port 13345
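
As a rough sanity check of the layout these flags imply, the sketch below derives the node and GPU counts. It assumes each data-parallel rank occupies `--tensor-parallel-size` GPUs and that `--data-parallel-size-local` ranks run on each node; the helper name `dp_layout` is hypothetical and not part of vLLM.

```python
def dp_layout(dp_size: int, dp_size_local: int, tp_size: int) -> dict:
    """Derive node and GPU counts from the data/tensor parallel sizes."""
    if dp_size % dp_size_local != 0:
        raise ValueError("dp_size must be a multiple of dp_size_local")
    nodes = dp_size // dp_size_local
    return {
        "nodes": nodes,
        "gpus_per_node": dp_size_local * tp_size,
        "total_gpus": dp_size * tp_size,
        # Assumed value of --data-parallel-start-rank on each node.
        "start_ranks": [n * dp_size_local for n in range(nodes)],
    }


# Values taken from the serve commands above.
layout = dp_layout(dp_size=4, dp_size_local=1, tp_size=8)
print(layout)
```

With one local rank per node, the start rank of each node equals its node index, which is why the commands pass `$NODE_RANK` directly.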

Test Result

Results for openPangu-Ultra-MoE-718B-V1.1

MATH500: 97.6

Results for openPangu-Embedded-7B-V1.1

Request test

python3 -c "
import requests

response = requests.post(
    'http://localhost:8000/v1/chat/completions',
    headers={'Content-Type': 'application/json'},
    json={
        'model': 'pangu',
        'temperature': 0.6,
        'top_p': 0.95,
        'max_tokens': 500,
        'messages': [
            {
                'role': 'user',
                'content': 'Let $S$ be the set of points $(a,b)$ with $0 \\le a,$ $b \\le 1$ such that the equation\n\\[x^4 + ax^3 - bx^2 + ax + 1 = 0\\]has at least one real root. Determine the area of the graph of $S.$'
            }
        ]
    }
)
result = response.json()
if 'choices' in result and result['choices']:
    print(result['choices'][0]['message']['content'])
else:
    print('No response')
"

Response Correctness

`
[unused16] The answer is ( \frac{2x + 1}{2} ), which matches the expected value ( \frac{2x + 1}{2} ).

So the final answer is ( \frac{2x + 1}{2} ).

But wait—there's a second way to compute the the number of real roots by checking all the possible combinations of x and y, but only counting those where the equation holds and there's at least one real root. So in the code, we can do:

count = 0
for x in x_list:
    for y in y_list:
        if (x^4 + ax^3 - b*x^2 + a*x + 1) == 0 and (x^4 + ax^3 - b*x^2 + a*x + 1) == x^2 + y^2:
            count += 1

But that's O(n*m), which is acceptable for small n and m......
`

Inference throughput

The throughput is ~180 tokens/s

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: yuantao <[email protected]>

github-actions · 2025-10-26T03:40:11Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

mergify · 2025-10-26T03:40:35Z

Documentation preview: https://vllm--27521.org.readthedocs.build/en/27521/

gemini-code-assist

Code Review

This pull request adds support for the openPangu_Ultra_MoE model. The changes include a new model implementation file and updates to various configuration and registry files to integrate the new model. The implementation appears to be largely adapted from the existing deepseek_v2 model.

I've identified a critical issue in the scaling logic within the OpenPanguMoE module, which seems to have been carried over from the deepseek_v2 implementation. This logic flaw could lead to incorrect computations, particularly in float16 precision, potentially affecting the model's output. A detailed comment with a suggested fix is provided below. The other changes appear to be correct and consistent with adding a new model to the framework.

vllm/model_executor/models/openpangu.py

…raMoEForCausalLM Signed-off-by: yuantao <[email protected]>

Bye-legumes · 2025-10-28T18:12:40Z

Hi, can you give us (me and https://github.com/kcmnd) access to your fork repo? We tested it and it doesn't work for now. We can fix some of the code.

jeejeelee · 2025-10-29T01:45:45Z

docs/models/supported_models.md

 | `OLMoEForCausalLM` | OLMoE | `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc. | | ✅︎ |
 | `OPTForCausalLM` | OPT, OPT-IML | `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc. | ✅︎ | ✅︎ |
 | `OrionForCausalLM` | Orion | `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc. | | ✅︎ |
+| `PanguUltraMoEForCausalLM` |openpangu-ultra-moe-718b-model | | ✅︎ | ✅︎ |


Does this model have a publicly accessible link?

There is a publicly accessible version at https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1. However, it has not been uploaded to Hugging Face yet. The config file in that repo needs to be modified to align with the common practice in vLLM. Therefore, I tested the model in my local environment, and it works well.

The model will be uploaded to Hugging Face soon :)

yt0428 · 2025-10-29T02:30:12Z

Hi, can you give us (me and https://github.com/kcmnd) access to your fork repo? We tested it and it doesn't work for now. We can fix some of the code.

Sure, I have sent the invitation. By the way, the reason it is not working may be the config file issue, as I mentioned above.

Kishanthan · 2025-10-29T14:35:47Z

Hi, can you give us (me and https://github.com/kcmnd) access to your fork repo? We tested it and it doesn't work for now. We can fix some of the code.

Sure, I have sent the invitation. By the way, the reason it is not working may be the config file issue, as I mentioned above.

Could you also share your config.json file for the above changes? We are currently using the following and we could not load the model with the changes suggested in this PR.

{
  "architectures": [
    "PanguUltraMoEForCausalLM"
  ],
  "attention_bias": false,
  "auto_map": {
    "AutoConfig": "configuration_openpangu_moe.PanguUltraMoEConfig",
    "AutoModel": "modeling_openpangu_moe.PanguUltraMoEModel",
    "AutoModelForCausalLM": "modeling_openpangu_moe.PanguUltraMoEForCausalLM"
  },
  "num_dense_layers": 3,
  "bos_token_id": 0,
  "eos_token_id": 1,
  "ep_size": 1,
  "first_k_dense_replace": 3,
  "hidden_act": "silu",
  "hidden_size": 7680,
  "initializer_range": 0.02,
  "intermediate_size": 18432,
  "kv_lora_rank": 512,
  "attention_kv_lora_dim": 512,
  "max_position_embeddings": 131072, 
  "model_type": "pangu_ultra_moe",
  "moe_intermediate_size": 2048,
  "num_routed_experts": 256,
  "num_shared_experts": 1,
  "moe_layer_freq": 1,
  "n_group": 8,
  "n_routed_experts": 256,
  "n_shared_experts": 1,
  "norm_topk_prob": true,
  "num_attention_heads": 128,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 62,
  "num_key_value_heads": 128,
  "num_nextn_predict_layers": 1,
  "q_lora_rank": 1536,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [
      128,
      128
    ]
  },
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 1.0,
    "mscale_all_dim": 1.0,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "num_mtp_layers": 1,
  "attention_q_lora_dim": 1536,
  "attention_qk_dim": 128,
  "attention_qk_rope_dim": 64,
  "rms_norm_eps": 1e-05,
  "rope_theta": 25600000,
  "routed_scaling_factor": 2.5,
  "sandwich_norm": true,
  "tie_word_embeddings": false,
  "topk_group": 4,
  "topk_method": "noaux_tc",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.2",
  "use_cache": true,
  "attention_v_dim": 128,
  "v_head_dim": 128,
  "vocab_size": 153600
}

vllm/model_executor/models/openpangu.py

MengqingCao · 2025-10-29T13:49:14Z

vllm/model_executor/models/openpangu.py

+
+class OpenPanguForCausalLM(nn.Module, SupportsPP, MixtureOfExperts, SupportsLoRA):
+    packed_modules_mapping = {
+        "gate_up_proj": ["gate_proj", "up_proj"],


I noticed QKVParallelLinear is used to create the self.qkv_proj layer. Why don't we add the mapping here?

Yes, you are right! We should add the mapping for qkv_proj here.
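
A minimal sketch of what the extended mapping could look like. The sub-module names `q_proj`, `k_proj`, and `v_proj` are an assumption based on the naming used by other decoder models in vLLM, and the helper `unpacked_names` is hypothetical, added only to show how the mapping is read.

```python
# Hypothetical mapping; q_proj/k_proj/v_proj are assumed checkpoint names.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def unpacked_names(packed_name: str) -> list[str]:
    """Return the sub-module names that are fused into a packed module."""
    # Modules that are not packed map to themselves.
    return packed_modules_mapping.get(packed_name, [packed_name])


print(unpacked_names("qkv_proj"))
print(unpacked_names("o_proj"))
```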

MengqingCao · 2025-10-29T14:08:29Z

vllm/model_executor/models/openpangu.py

+        else:
+            shared_output = None
+            final_hidden_states = fused_moe_out
+


Let's check shared_output to ensure it is not None when self.shared_experts is not None.

Good point! We will add an assertion here.
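
One way such an assertion could look, as a sketch only: plain floats stand in for tensors, and the function name `combine_moe_outputs` and the simple addition of the two outputs are assumptions, not the PR's actual code.

```python
from typing import Optional


def combine_moe_outputs(
    fused_moe_out: float,
    shared_output: Optional[float],
    has_shared_experts: bool,
) -> float:
    """Add the shared-expert output to the routed output when present."""
    if has_shared_experts:
        # Fail loudly if shared experts are configured but produced nothing.
        assert shared_output is not None, "shared_output must not be None"
        return fused_moe_out + shared_output
    return fused_moe_out


print(combine_moe_outputs(1.0, 2.0, has_shared_experts=True))
print(combine_moe_outputs(1.0, None, has_shared_experts=False))
```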

MengqingCao · 2025-10-29T14:12:53Z

vllm/model_executor/models/openpangu.py

+        self.tp_size = get_tensor_model_parallel_world_size()
+        self.tp_rank = get_tp_group().rank_in_group


Let's move this to line 136 to keep the parallel-related parameters in one place.

MengqingCao · 2025-10-29T14:52:09Z

vllm/model_executor/models/openpangu.py

+        self.num_heads = self.total_num_heads // tp_size
+        self.total_num_kv_heads = num_kv_heads
+        if (
+            self.total_num_kv_heads >= tp_size


Suggested change

-            self.total_num_kv_heads >= tp_size
+            self.total_num_kv_heads > tp_size

Good point! self.total_num_kv_heads cannot be equal to tp_size in this condition.

MengqingCao · 2025-10-29T14:55:10Z

vllm/model_executor/models/openpangu.py

+        elif (
+            self.total_num_kv_heads < tp_size and tp_size % self.total_num_kv_heads != 0
+        ):
+            # Number of KV heads is less than TP size, so we replicate


Is this replication a TODO?

No. When the number of KV heads is less than the TP size, we can simply set self.num_kv_heads to 1. The `QKVParallelLinear` module will do the replication automatically, as explained in the description of `QKVParallelLinear`.

Thanks for the details.
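
The partition-or-replicate rule discussed here can be sketched as follows. This follows the pattern used in other vLLM attention modules rather than the exact code in this PR, and the function name `kv_heads_per_rank` is hypothetical.

```python
def kv_heads_per_rank(total_num_kv_heads: int, tp_size: int) -> int:
    """Number of KV heads held by each tensor-parallel rank."""
    if total_num_kv_heads >= tp_size:
        # Partition: the KV heads are split evenly across ranks.
        assert total_num_kv_heads % tp_size == 0
    else:
        # Replicate: each KV head is shared by several ranks.
        assert tp_size % total_num_kv_heads == 0
    # Every rank keeps at least one KV head.
    return max(1, total_num_kv_heads // tp_size)


print(kv_heads_per_rank(128, 8))
print(kv_heads_per_rank(2, 8))
```

When the head count equals the TP size, the divisibility check in the first branch always passes, which is why `>=` and `>` behave the same in a check that only rejects non-divisible head counts.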

MengqingCao · 2025-10-29T15:05:28Z

vllm/model_executor/models/openpangu.py

+            config.hidden_size, eps=config.rms_norm_eps
+        )
+        self.tp_group = get_tp_group().device_group
+        if getattr(config, "sandwich_norm", False):


We could just set self.sandwich_norm = getattr(config, "sandwich_norm", False) and create the pre- and post-MLP norm layers only when self.sandwich_norm is True.

Yes! We can make this adjustment to make the code simpler.
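
A small sketch of that adjustment. The class name `DecoderLayerSketch` and the attribute names `pre_mlp_layernorm` and `post_mlp_layernorm` are hypothetical, and tuples stand in for the real norm layers so the example runs without PyTorch.

```python
from types import SimpleNamespace


class DecoderLayerSketch:
    """Stand-in for a decoder layer; the norm objects are placeholders."""

    def __init__(self, config):
        # Read the flag once; a missing attribute means no sandwich norm.
        self.sandwich_norm = getattr(config, "sandwich_norm", False)
        if self.sandwich_norm:
            # In the real module these would be RMSNorm layers.
            self.pre_mlp_layernorm = ("rmsnorm", config.hidden_size)
            self.post_mlp_layernorm = ("rmsnorm", config.hidden_size)


with_norm = DecoderLayerSketch(SimpleNamespace(hidden_size=7680, sandwich_norm=True))
without_norm = DecoderLayerSketch(SimpleNamespace(hidden_size=7680))
print(with_norm.sandwich_norm, without_norm.sandwich_norm)
```

Storing the flag on the layer lets the forward pass branch on `self.sandwich_norm` instead of repeating the `getattr` lookup.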

hmellor · 2025-10-29T16:27:43Z

vllm/config/speculative.py

+        if hf_config.model_type in ("pangu_ultra_moe"):
+            hf_config.model_type = "pangu_ultra_moe_mtp"
+        if hf_config.model_type == "pangu_ultra_moe_mtp":
+            n_predict = getattr(hf_config, "num_nextn_predict_layers", None)
+            hf_config.update(
+                {"n_predict": n_predict, "architectures": ["OpenPanguMTPModel"]}
+            )


Could this override be done in vllm/model_executor/models/openpangu.py and vllm/model_executor/models/openpangu_mtp.py instead of putting model specific config overrides in the global configs?

Hmm I see this is currently done for quite a few models... We should do this in a follow up

I followed the common practice (like qwen3_next_mtp, longcat_flash_mtp) and placed the override for MTP in vllm/config/speculative.py. I also thought about moving the override to the model definition file, but it seems the initialization workflow for MTP makes that infeasible. Could you please provide some suggestions on the implementation?

For this PR, please follow the existing pattern as you have already done. Refactoring MTP config is a separate task.

Cool! Looking forward to it.
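
For readers following the override pattern, here is a standalone sketch of the same logic. `SimpleNamespace` stands in for the Hugging Face config object (the real code calls `hf_config.update`), and the function name `apply_mtp_override` is hypothetical. One detail worth noting: in Python, `x in ("pangu_ultra_moe")` is a substring test, because parentheses without a trailing comma do not create a tuple, so the sketch uses a one-element tuple.

```python
from types import SimpleNamespace


def apply_mtp_override(hf_config):
    """Rewrite a base config so it selects the MTP draft architecture."""
    # ("pangu_ultra_moe",) is a one-element tuple; without the trailing
    # comma, `in` would perform a substring test on a plain string.
    if hf_config.model_type in ("pangu_ultra_moe",):
        hf_config.model_type = "pangu_ultra_moe_mtp"
    if hf_config.model_type == "pangu_ultra_moe_mtp":
        hf_config.n_predict = getattr(hf_config, "num_nextn_predict_layers", None)
        hf_config.architectures = ["OpenPanguMTPModel"]
    return hf_config


cfg = apply_mtp_override(
    SimpleNamespace(model_type="pangu_ultra_moe", num_nextn_predict_layers=1)
)
print(cfg.model_type, cfg.n_predict, cfg.architectures)
```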

yt0428 · 2025-10-30T01:33:59Z

Yes, the config file should be:

{
  "architectures": [
    "PanguUltraMoEForCausalLM"
  ],
  "attention_bias": false,
  "auto_map": {
    "AutoConfig": "configuration_openpangu_moe.PanguUltraMoEConfig",
    "AutoModel": "modeling_openpangu_moe.PanguUltraMoEModel",
    "AutoModelForCausalLM": "modeling_openpangu_moe.PanguUltraMoEForCausalLM"
  },
  "first_k_dense_replace": 3,
  "hidden_act": "silu",
  "hidden_size": 7680,
  "initializer_range": 0.02,
  "intermediate_size": 18432,
  "kv_lora_rank": 512,
  "max_position_embeddings": 131072, 
  "model_type": "pangu_ultra_moe",
  "moe_intermediate_size": 2048,
  "n_routed_experts": 256,
  "n_shared_experts": 1,
  "norm_topk_prob": true,
  "num_attention_heads": 128,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 61,
  "num_key_value_heads": 128,
  "num_nextn_predict_layers": 1,
  "q_lora_rank": 1536,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "rms_norm_eps": 1e-05,
  "rope_theta": 25600000,
  "routed_scaling_factor": 2.5,
  "sandwich_norm": true,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.2",
  "use_cache": true,
  "v_head_dim": 128,
  "vocab_size": 153600
}

You should also change the names in configuration_openpangu_moe.py so that they correspond to the new config file.
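
To spot which keys differ between a local config.json and the one above, a small comparison helper can be used. This is a sketch: the function name `diff_config_keys` is hypothetical, and the two dictionaries are short excerpts of the configs quoted in this thread rather than the full files.

```python
def diff_config_keys(current: dict, expected: dict) -> dict:
    """Report keys to drop, keys to add, and values that differ."""
    return {
        "extra": sorted(current.keys() - expected.keys()),
        "missing": sorted(expected.keys() - current.keys()),
        "changed": {
            key: (current[key], expected[key])
            for key in sorted(current.keys() & expected.keys())
            if current[key] != expected[key]
        },
    }


# Excerpts only; load the full files with json.load in practice.
current = {
    "num_dense_layers": 3,
    "first_k_dense_replace": 3,
    "num_hidden_layers": 62,
    "n_routed_experts": 256,
    "num_routed_experts": 256,
}
expected = {
    "first_k_dense_replace": 3,
    "num_hidden_layers": 61,
    "n_routed_experts": 256,
}
result = diff_config_keys(current, expected)
print(result)
```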

Signed-off-by: yuantao <[email protected]>

…odels, which is dense models Signed-off-by: yuantao <[email protected]>

Bye-legumes · 2025-10-30T16:15:52Z

Can you share the whole process, with a doc on how to use it? We have run into some problems and are trying to find which step is wrong, e.g. how to use the Pangu model, how to configure vLLM, and how to apply the current PR as a patch. Thanks! We still cannot run it with TP8 and PP4; it hangs.

Kishanthan · 2025-10-30T16:59:45Z

Can you share the whole process, with a doc on how to use it? We have run into some problems and are trying to find which step is wrong, e.g. how to use the Pangu model, how to configure vLLM, and how to apply the current PR as a patch. Thanks! We still cannot run it with TP8 and PP4; it hangs.

What we are trying to do is load this model with TP 8 and PP 4 on 4 nodes (32 H100 cards). We are not using DP. The model weights load fine on all four nodes, but when we send a request, inference fails with the error below. Our observation is that, although the weights are loaded on all nodes, only one node was seen using its GPUs for inference while the other nodes were idle. After some time, the request times out with the Ray CGraph timeout error below.

(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.1rc5.dev3+gd2c33c397) with config: model='/home/original_models/openPangu-Ultra-MoE-718B-model', speculative_config=None, tokenizer='/home/original_models/openPangu-Ultra-MoE-718B-model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/original_models/openPangu-Ultra-MoE-718B-model, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 
'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 16, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-838f640e-c966-48e6-ba3f-dd874733c65a'], resumed_from_preemption=[false], new_token_ids=[[45974]], resumed_req_token_ids=[null], new_block_ids=[null], num_computed_tokens=[15], num_output_tokens=[1]), num_scheduled_tokens={chatcmpl-838f640e-c966-48e6-ba3f-dd874733c65a: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_mm_hashes=[], structured_output_request_ids=[], grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=4.771675335213388e-05, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, spec_decoding_stats=None, kv_connector_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2525, in _execute_until
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     result = self._dag_output_fetcher.read(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 312, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     outputs = self._read_list(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]               ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 403, in _read_list
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     raise e
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 385, in _read_list
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     result = c.read(min(remaining_timeout, iteration_timeout))
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 776, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     return self._channel_dict[self._resolve_actor_id()].read(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 480, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     ret = self._worker.get_objects(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]           ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 998, in get_objects
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     ] = self.core_worker.get_objects(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "python/ray/_raylet.pyx", line 3141, in ray._raylet.CoreWorker.get_objects
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "python/ray/includes/common.pxi", line 120, in ray._raylet.check_status
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 005fed9a4d0da286e8b79c779c6273685d7b963e0200000002e1f505
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 772, in run_engine_core
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 799, in run_busy_loop
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     self._process_engine_step()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 828, in _process_engine_step
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 382, in step_with_batch_queue
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     model_output = future.result()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 149, in result
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     return self.refs[0].get()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]            ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 115, in get
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     self._dag._execute_until(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2535, in _execute_until
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     raise RayChannelTimeoutError(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 300 seconds. Otherwise, this may indicate that the execution is hanging.

yt0428 · 2025-10-31T00:58:21Z

Can you share the whole process, with a doc on how to use it? We have run into some problems and are trying to find which step is wrong, e.g. how to use the Pangu model, how to configure vLLM, and how to apply the current PR as a patch. Thanks! We still cannot run it with TP8 and PP4; it hangs.

What we are trying to do is load this model with TP 8 and PP 4 on 4 nodes (32 H100 cards). We are not using DP. The model weights load fine on all four nodes, but when we send a request, inference fails with the error below. Our observation is that, although the weights are loaded on all nodes, only one node was seen using its GPUs for inference while the other nodes were idle. After some time, the request times out with the Ray CGraph timeout error below.

(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.1rc5.dev3+gd2c33c397) with config: model='/home/original_models/openPangu-Ultra-MoE-718B-model', speculative_config=None, tokenizer='/home/original_models/openPangu-Ultra-MoE-718B-model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/original_models/openPangu-Ultra-MoE-718B-model, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 
'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 16, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-838f640e-c966-48e6-ba3f-dd874733c65a'], resumed_from_preemption=[false], new_token_ids=[[45974]], resumed_req_token_ids=[null], new_block_ids=[null], num_computed_tokens=[15], num_output_tokens=[1]), num_scheduled_tokens={chatcmpl-838f640e-c966-48e6-ba3f-dd874733c65a: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_mm_hashes=[], structured_output_request_ids=[], grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=4.771675335213388e-05, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, spec_decoding_stats=None, kv_connector_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2525, in _execute_until
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     result = self._dag_output_fetcher.read(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 312, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     outputs = self._read_list(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]               ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 403, in _read_list
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     raise e
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 385, in _read_list
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     result = c.read(min(remaining_timeout, iteration_timeout))
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 776, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     return self._channel_dict[self._resolve_actor_id()].read(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 480, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     ret = self._worker.get_objects(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]           ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 998, in get_objects
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     ] = self.core_worker.get_objects(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "python/ray/_raylet.pyx", line 3141, in ray._raylet.CoreWorker.get_objects
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "python/ray/includes/common.pxi", line 120, in ray._raylet.check_status
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 005fed9a4d0da286e8b79c779c6273685d7b963e0200000002e1f505
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 772, in run_engine_core
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 799, in run_busy_loop
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     self._process_engine_step()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 828, in _process_engine_step
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 382, in step_with_batch_queue
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     model_output = future.result()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 149, in result
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     return self.refs[0].get()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]            ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 115, in get
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     self._dag._execute_until(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2535, in _execute_until
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     raise RayChannelTimeoutError(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 300 seconds. Otherwise, this may indicate that the execution is hanging.

There seems to be a problem with Ray communication. My test script runs the following command on all four nodes:

uv run vllm serve $LOCAL_CKPT_PATH \
        --host 0.0.0.0 \
        --port 8000 \
        --max-num-batched-tokens 32768 \
        --max-model-len 32768 \
        --trust-remote-code \
        --gpu-memory-utilization 0.85 \
        --served-model-name pangu \
        --tensor-parallel-size 8 \
        --data-parallel-size 4 \
        --data-parallel-size-local 1 \
        --data-parallel-rank $LOCAL_NODE_RANK \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13345 \
        --enable-expert-parallel
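
Because only the data-parallel rank differs between nodes, the per-node invocation can be generated from one template. The sketch below is illustrative only: `build_serve_args` is a hypothetical helper, the checkpoint path and IP address are placeholders, and it reuses only flags that appear in the command above.

```python
import shlex


def build_serve_args(ckpt_path: str, node_rank: int, master_ip: str,
                     dp_size: int = 4, tp_size: int = 8) -> list[str]:
    """Build the argument list for one node; only --data-parallel-rank varies."""
    if not 0 <= node_rank < dp_size:
        raise ValueError(f"node_rank must be in [0, {dp_size})")
    return [
        "vllm", "serve", ckpt_path,
        "--host", "0.0.0.0",
        "--port", "8000",
        "--max-num-batched-tokens", "32768",
        "--max-model-len", "32768",
        "--trust-remote-code",
        "--gpu-memory-utilization", "0.85",
        "--served-model-name", "pangu",
        "--tensor-parallel-size", str(tp_size),
        "--data-parallel-size", str(dp_size),
        "--data-parallel-size-local", "1",
        "--data-parallel-rank", str(node_rank),
        "--data-parallel-address", master_ip,
        "--data-parallel-rpc-port", "13345",
        "--enable-expert-parallel",
    ]


# Print the command each of the four nodes would run (placeholder path and IP).
for rank in range(4):
    print(shlex.join(build_serve_args("/path/to/ckpt", rank, "10.0.0.1")))
```

Generating the arguments as a list rather than a shell string avoids the line-continuation mistakes that are easy to make in a hand-edited multi-line command.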

Signed-off-by: yt0428 <[email protected]>

Kishanthan · 2025-10-31T17:14:30Z

Could you share the full procedure, ideally with documentation, for using this? We have run into problems and are trying to find which step is wrong: how to use the Pangu model, how to configure vLLM, and how to apply the patch from this PR. Thanks! We still cannot run it with TP8 and PP4 because it hangs.

We are trying to load this model with TP 8 and PP 4 on 4 nodes (32 H100 cards). We are not using DP. The model weights load fine on all four nodes, but when we send a request, inference fails. We observed that although the weights are loaded on all nodes, only one node was using its GPUs for inference while the other nodes were idle. After some time, the request times out with a Ray Compiled Graph (cgraph) timeout error, RayChannelTimeoutError.
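
To make the topology concrete, here is a minimal sketch of how 32 ranks might map onto 4 nodes with TP 8 and PP 4. It assumes 8 GPUs per node, a rank layout in which tensor-parallel ranks are innermost (rank = pp_rank * tp_size + tp_rank), and contiguous placement of ranks on nodes. The function name `stage_to_nodes` is hypothetical and is not part of vLLM.

```python
def stage_to_nodes(tp_size: int, pp_size: int,
                   gpus_per_node: int) -> dict[int, set[int]]:
    """Map each pipeline stage to the set of node indices hosting its ranks.

    Assumptions: global rank = pp_rank * tp_size + tp_rank, and ranks are
    placed on nodes contiguously (node = rank // gpus_per_node).
    """
    mapping: dict[int, set[int]] = {}
    for pp_rank in range(pp_size):
        nodes: set[int] = set()
        for tp_rank in range(tp_size):
            rank = pp_rank * tp_size + tp_rank
            nodes.add(rank // gpus_per_node)
        mapping[pp_rank] = nodes
    return mapping


layout = stage_to_nodes(tp_size=8, pp_size=4, gpus_per_node=8)
world_size = 8 * 4
print(world_size)  # 32
print(layout)      # {0: {0}, 1: {1}, 2: {2}, 3: {3}}
```

Under these assumptions each node hosts exactly one pipeline stage, so every request has to cross all four nodes in sequence. A stall in the hand-off between stages would therefore look like one busy node and three idle ones.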

Yes, there seems to be a bug on the Ray cgraph side when using PP with the latest vLLM version. Related issues: #26899 and ray-project/ray#58062. We applied the fixes proposed in those issues, and we can now load and invoke the model with TP and PP. Thanks for the help.

Signed-off-by: yuantao <[email protected]>

jeejeelee

Thank you for your contribution.

Signed-off-by: yuantao <[email protected]> Signed-off-by: yt0428 <[email protected]> Co-authored-by: Jee Jee Li <[email protected]>

[model] Add support for openpangu

2aa3938

Signed-off-by: yuantao <[email protected]>

yt0428 requested review from ProExpertProg, WoosukKwon, benchislett, hmellor, houseroad, luccafong, mgoin, robertgshaw2-redhat, simon-mo, tlrmchlsmth, yewentao256 and youkaichao as code owners October 26, 2025 03:39

mergify bot added the documentation, new-model, speculative-decoding, and v1 labels Oct 26, 2025

gemini-code-assist bot reviewed Oct 26, 2025

vllm/model_executor/models/openpangu.py (resolved)

Doc fix, delete extra data column in supported_models.md for PanguUltraMoEForCausalLM

8cc5682

Signed-off-by: yuantao <[email protected]>

DarkLight1337 requested a review from jeejeelee October 27, 2025 10:14

jeejeelee reviewed Oct 29, 2025

MengqingCao reviewed Oct 29, 2025

hmellor reviewed Oct 29, 2025

Fix reviews

433a7ee

Signed-off-by: yuantao <[email protected]>

yt0428 added 2 commits October 30, 2025 19:09

Add links for model

8ae429b

Signed-off-by: yuantao <[email protected]>

Refactor openpangu.py to further add support for openPangu-Embedded models, which is dense models

307b825

Signed-off-by: yuantao <[email protected]>

yt0428 requested review from DarkLight1337 and ywang96 as code owners October 30, 2025 11:34

Merge branch 'main' into support_openpangu

6191527

Merge branch 'main' into support_openpangu

9804db4

Signed-off-by: yt0428 <[email protected]>

jeejeelee mentioned this pull request Nov 3, 2025

[Model] Add native OpenPangu Embedded 7B backend #27941

Closed

Update huggingface model links

5a959b6

Signed-off-by: yuantao <[email protected]>

jeejeelee approved these changes Nov 4, 2025

jeejeelee added the ready label Nov 4, 2025

Merge branch 'main' into support_openpangu

5a690e4

vllm-bot merged commit 05cae69 into vllm-project:main Nov 4, 2025
52 of 55 checks passed

		self.tp_size = get_tensor_model_parallel_world_size()
		self.tp_rank = get_tp_group().rank_in_group

	self.total_num_kv_heads >= tp_size
	self.total_num_kv_heads > tp_size

github-actions bot commented Oct 26, 2025

mergify bot commented Oct 26, 2025

gemini-code-assist bot left a comment (Code Review)

Bye-legumes commented Oct 28, 2025

yt0428 commented Oct 29, 2025

Kishanthan commented Oct 29, 2025

Kishanthan commented Oct 30, 2025 (edited)