-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
[model] Add support for openPangu_Ultra_MoE #27521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: yuantao <[email protected]>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Documentation preview: https://vllm--27521.org.readthedocs.build/en/27521/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for the openPangu_Ultra_MoE model. The changes include a new model implementation file and updates to various configuration and registry files to integrate the new model. The implementation appears to be largely adapted from the existing deepseek_v2 model.
I've identified a critical issue in the scaling logic within the OpenPanguMoE module, which seems to have been carried over from the deepseek_v2 implementation. This logic flaw could lead to incorrect computations, particularly in float16 precision, potentially affecting the model's output. A detailed comment with a suggested fix is provided below. The other changes appear to be correct and consistent with adding a new model to the framework.
…raMoEForCausalLM Signed-off-by: yuantao <[email protected]>
|
hi, can you give us (me and https://github.com/kcmnd )the access to your fork repo as we tested it it doent work now. We can fix some codes. |
docs/models/supported_models.md
Outdated
| | `OLMoEForCausalLM` | OLMoE | `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc. | | ✅︎ | | ||
| | `OPTForCausalLM` | OPT, OPT-IML | `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc. | ✅︎ | ✅︎ | | ||
| | `OrionForCausalLM` | Orion | `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc. | | ✅︎ | | ||
| | `PanguUltraMoEForCausalLM` |openpangu-ultra-moe-718b-model | | ✅︎ | ✅︎ | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this model have a publicly accessible link?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a publicly accessible version in https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1. However, it has not been upload to huggingface yet. The config file in this repo https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1 needs to be modified to align with the common practice in vllm. Therefore, I basically test the model in my local environments and it works well.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The model will be upload to Huggingface soon :)
Sure, I have sent the invitation. By the way, the reasoning for not working may the the config file issue, as I mentioned above. |
Could you also share your config.json file for the above changes? We are currently using the following and we could not load the model with the changes suggested in this PR. {
"architectures": [
"PanguUltraMoEForCausalLM"
],
"attention_bias": false,
"auto_map": {
"AutoConfig": "configuration_openpangu_moe.PanguUltraMoEConfig",
"AutoModel": "modeling_openpangu_moe.PanguUltraMoEModel",
"AutoModelForCausalLM": "modeling_openpangu_moe.PanguUltraMoEForCausalLM"
},
"num_dense_layers": 3,
"bos_token_id": 0,
"eos_token_id": 1,
"ep_size": 1,
"first_k_dense_replace": 3,
"hidden_act": "silu",
"hidden_size": 7680,
"initializer_range": 0.02,
"intermediate_size": 18432,
"kv_lora_rank": 512,
"attention_kv_lora_dim": 512,
"max_position_embeddings": 131072,
"model_type": "pangu_ultra_moe",
"moe_intermediate_size": 2048,
"num_routed_experts": 256,
"num_shared_experts": 1,
"moe_layer_freq": 1,
"n_group": 8,
"n_routed_experts": 256,
"n_shared_experts": 1,
"norm_topk_prob": true,
"num_attention_heads": 128,
"num_experts_per_tok": 8,
"num_hidden_layers": 62,
"num_key_value_heads": 128,
"num_nextn_predict_layers": 1,
"q_lora_rank": 1536,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"quantization_config": {
"activation_scheme": "dynamic",
"fmt": "e4m3",
"quant_method": "fp8",
"weight_block_size": [
128,
128
]
},
"rope_scaling": {
"beta_fast": 32,
"beta_slow": 1,
"factor": 40,
"mscale": 1.0,
"mscale_all_dim": 1.0,
"original_max_position_embeddings": 4096,
"type": "yarn"
},
"num_mtp_layers": 1,
"attention_q_lora_dim": 1536,
"attention_qk_dim": 128,
"attention_qk_rope_dim": 64,
"rms_norm_eps": 1e-05,
"rope_theta": 25600000,
"routed_scaling_factor": 2.5,
"sandwich_norm": true,
"tie_word_embeddings": false,
"topk_group": 4,
"topk_method": "noaux_tc",
"torch_dtype": "bfloat16",
"transformers_version": "4.48.2",
"use_cache": true,
"attention_v_dim": 128,
"v_head_dim": 128,
"vocab_size": 153600
}
|
|
|
||
| class OpenPanguForCausalLM(nn.Module, SupportsPP, MixtureOfExperts, SupportsLoRA): | ||
| packed_modules_mapping = { | ||
| "gate_up_proj": ["gate_proj", "up_proj"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed QKVParallelLinear is used for self.qkv_proj layer creation. Why we don't add the mapping here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right! We should add the mapping for qkv_proj here.
| else: | ||
| shared_output = None | ||
| final_hidden_states = fused_moe_out | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's check shared_ouput to ensure it is not None when self.shared_experts is not None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! We will add a assertion here.
| self.tp_size = get_tensor_model_parallel_world_size() | ||
| self.tp_rank = get_tp_group().rank_in_group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this to line 136, to make the parallel related parameters in one region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| self.num_heads = self.total_num_heads // tp_size | ||
| self.total_num_kv_heads = num_kv_heads | ||
| if ( | ||
| self.total_num_kv_heads >= tp_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self.total_num_kv_heads >= tp_size | |
| self.total_num_kv_heads > tp_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! self.total_num_kv_heads can not equal to tp_size in this condition.
| elif ( | ||
| self.total_num_kv_heads < tp_size and tp_size % self.total_num_kv_heads != 0 | ||
| ): | ||
| # Number of KV heads is less than TP size, so we replicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this replication a TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. When number of KV heads is less than TP size, we can simply set the self.num_kv_heads to 1. The 'QKVParallelLinearmodule will do the replication automatically, which can be found in the description ofQKVParallelLinear`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this details
| config.hidden_size, eps=config.rms_norm_eps | ||
| ) | ||
| self.tp_group = get_tp_group().device_group | ||
| if getattr(config, "sandwich_norm", False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could just set self.sandwich_norm = getattr(config, "sandwich_norm", False) and create the pre- and post- mlp layer when self.sandwich_norm is True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! We can make this adjustment to make the code simpler.
| if hf_config.model_type in ("pangu_ultra_moe"): | ||
| hf_config.model_type = "pangu_ultra_moe_mtp" | ||
| if hf_config.model_type == "pangu_ultra_moe_mtp": | ||
| n_predict = getattr(hf_config, "num_nextn_predict_layers", None) | ||
| hf_config.update( | ||
| {"n_predict": n_predict, "architectures": ["OpenPanguMTPModel"]} | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this override be done in vllm/model_executor/models/openpangu.py and vllm/model_executor/models/openpangu_mtp.py instead of putting model specific config overrides in the global configs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I see this is currently done for quite a few models... We should do this in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I follow the common practice (like qwen3_next_mtp, longcat_flash_mtp) and place the override for mtp in vllm/config/speculative.py. I also thought about move the override to the modeling definition file, but it seems the initialization work flow for mtp makes it infeasible. Could you please provide some suggestions on the implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, please follow the existing pattern as you have already done. Refactoring MTP config is a separate task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! Looking forward to it.
Yes, the config file should be: And you should also change the name in configuration_openpangu_moe.py, which should be corresponing to the new config file. |
Signed-off-by: yuantao <[email protected]>
Signed-off-by: yuantao <[email protected]>
…odels, which is dense models Signed-off-by: yuantao <[email protected]>
|
Can you share the whole process with the doc on how to use it? We have met some problems and trying to find which step is wrong. like how to use the pangu model and how to config the vllm and patch with current PR. Thanks! We still cannot run it with TP8 and PP4 as it will hang. |
What we are trying is to load this model with TP 8 and PP 4 on 4 nodes (32 H100 cards). We are not using DP. The model weights loads fine on all for nodes but when sending a request the inference fails with the below error. Our observation is that, though the weights are loaded on all nodes, only one node were seen using the GPU for inference while other nodes were idle. And after sometime, the request timeouts with the below CCGraph timeout error. |
It seems there is some problem in the communication of ray. My test script is running the following command on all four nodes: |
Signed-off-by: yt0428 <[email protected]>
Yes there seems to be a bug with latest vLLM version when using PP from ray cgraph side. Some related issues for this #26899 and ray-project/ray#58062. We applied the fixes proposed in those issues and we can now load and invoke the model with TP and PP. Thanks for the help. |
Signed-off-by: yuantao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for contribution
Signed-off-by: yuantao <[email protected]> Signed-off-by: yt0428 <[email protected]> Co-authored-by: Jee Jee Li <[email protected]>
Purpose
Add support for openPangu_Ultra_MoE models
FIX #27019
Test Plan
Test for openPangu-Ultra-MoE-718B-V1.1
Start serving:
vllm serve $LOCAL_CKPT_DIR/openPangu-Ultra-MoE-718B-V1.1 \ --data-parallel-size 4 \ --data-parallel-size-local 1 \ --data-parallel-start-rank $NODE_RANK \ --data-parallel-address $MASTER_NODE_IP \ --data-parallel-rpc-port 13389 \ --tensor-parallel-size 8 \ --served-model-name pangu_ultra_moe \ --enable-expert-parallel \ --trust-remote-code \Test for openPangu-Embedded-7B-V1.1
Start serving:
Master node:
vllm serve FreedomIntelligence/openPangu-Embedded-7B-V1.1 \ --host 0.0.0.0 \ --port 8000 \ --max-num-batched-tokens 32768 \ --max-model-len 32768 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ --served-model-name pangu \ --tensor-parallel-size 8 \ --data-parallel-size 4 \ --data-parallel-size-local 1 \ --data-parallel-start-rank $NODE_RANK \ --data-parallel-address $MASTER_NODE_IP \ --data-parallel-rpc-port 13345Other nodes:
vllm serve FreedomIntelligence/openPangu-Embedded-7B-V1.1 \ --host 0.0.0.0 \ --port 8000 \ --headless \ --max-num-batched-tokens 32768 \ --max-model-len 32768 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ --served-model-name pangu \ --tensor-parallel-size 8 \ --data-parallel-size 4 \ --data-parallel-size-local 1 \ --data-parallel-start-rank $NODE_RANK \ --data-parallel-address $MASTER_NODE_IP \ --data-parallel-rpc-port 13345Test Result
Results for openPangu-Ultra-MoE-718B-V1.1
MATH500: 97.6
Results for openPangu-Embedded-7B-V1.1
Request test
python3 -c " import requests; response = requests.post( 'http://localhost:8000/v1/chat/completions', headers={'Content-Type': 'application/json'}, json={ 'model': 'pangu', 'temperature': 0.6, 'top_p': 0.95, 'max_tokens': 500, 'messages': [ { 'role': 'user', 'content': 'Let $S$ be the set of points $(a,b)$ with $0 \\le a,$ $b \\le 1$ such that the equation\n\\[x^4 + ax^3 - bx^2 + ax + 1 = 0\\]has at least one real root. Determine the area of the graph of $S.$' } ] } ); result = response.json(); if 'choices' in result and result['choices']: print(result['choices'][0]['message']['content']) else: print('No response') "Response Correctness
`
[unused16] The answer is ( \frac{2x + 1}{2} ), which matches the expected value ( \frac{2x + 1}{2} ).
So the final answer is ( \frac{2x + 1}{2} ).
But wait—there's a second way to compute the the number of real roots by checking all the possible combinations of x and y, but only counting those where the equation holds and there's at least one real root. So in the code, we can do:
But that's O(n*m), which is acceptable for small n and m......
`
Inference throughout
The throuthoutput is ~180 Tokens/s

Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.