[Model] Add native OpenPangu Embedded 7B backend #27941
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only fastcheck CI, a small and essential subset of tests, runs automatically to catch errors quickly. You can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request adds native support for the OpenPangu Embedded 7B model. The implementation is well-structured and follows the existing patterns in vLLM, particularly drawing from the Llama model implementation. I've identified a critical bug in the model's forward pass that affects speculative decoding and a missing configuration in the test registry. Addressing these issues will ensure the model integrates correctly and performs as expected.
vllm/model_executor/models/pangu.py
Outdated
```python
aux_hidden_states: list[torch.Tensor] = []
for idx, layer in enumerate(self.layers[self.start_layer : self.end_layer]):
    if idx in self.aux_hidden_state_layers:
```
There is a bug in how auxiliary hidden states are collected for speculative decoding. The loop variable idx is a relative index for the layers on the current pipeline stage, starting from 0. However, self.aux_hidden_state_layers contains global layer indices. This means the check if idx in self.aux_hidden_state_layers: will almost always be incorrect, leading to auxiliary hidden states not being collected, which breaks features like EAGLE3 speculative decoding.
To fix this, you should use the global layer index for the check by adding self.start_layer to idx.
```diff
-    if idx in self.aux_hidden_state_layers:
+    if self.start_layer + idx in self.aux_hidden_state_layers:
```
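To make the relative-vs-global indexing issue concrete, here is a standalone toy illustration (the layer numbers are made up; only the indexing logic matters):

```python
# A pipeline stage owning global layers [16, 32) enumerates them from 0,
# so the stage-relative index must be offset by start_layer before it is
# compared against global auxiliary-layer indices.
start_layer, end_layer = 16, 32
aux_hidden_state_layers = {23, 29}  # global layer indices

for idx in range(end_layer - start_layer):
    buggy = idx in aux_hidden_state_layers                  # compares a relative index
    fixed = (start_layer + idx) in aux_hidden_state_layers  # compares the global index
    if buggy != fixed:
        print(f"relative idx {idx}: buggy={buggy}, fixed={fixed}")
```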
| "PanguEmbeddedForCausalLM": _HfExamplesInfo( | ||
| "FreedomIntelligence/openPangu-Embedded-7B" | ||
| ), |
The test plan in the pull request description indicates that trust_remote_code=True is required to run this model. This parameter is missing from the _HfExamplesInfo entry, which could cause tests for this model to fail. Please add trust_remote_code=True to ensure the tests can run correctly.
| "PanguEmbeddedForCausalLM": _HfExamplesInfo( | |
| "FreedomIntelligence/openPangu-Embedded-7B" | |
| ), | |
| "PanguEmbeddedForCausalLM": _HfExamplesInfo( | |
| "FreedomIntelligence/openPangu-Embedded-7B", trust_remote_code=True | |
| ), |
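As a usage note, trust_remote_code is the same flag users pass when loading the checkpoint themselves; a minimal sketch with vLLM's offline API (assumes vLLM is installed and the checkpoint is reachable):

```python
from vllm import LLM, SamplingParams

# trust_remote_code=True allows HF to execute the custom config/tokenizer
# code shipped with the checkpoint.
llm = LLM(
    model="FreedomIntelligence/openPangu-Embedded-7B",
    trust_remote_code=True,
)
outputs = llm.generate(
    ["What is the capital of Morocco?"],
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```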
/gemini review
Code Review
This pull request adds a native vLLM backend for the OpenPangu Embedded 7B model, which is a valuable performance enhancement. The implementation correctly mirrors the Hugging Face reference architecture, notably by using separate projections for QKV and MLP layers, and the code is well-structured, following existing patterns in the vLLM codebase. I have identified one critical latent bug related to speculative decoding that should be addressed to ensure future stability.
vllm/model_executor/models/pangu.py
Outdated
```python
if self.start_layer + idx in self.aux_hidden_state_layers:
    aux_hidden_states.append(hidden_states + residual)
```
There's a potential TypeError when using speculative decoding. The residual tensor can be None before the first decoder layer, which will cause hidden_states + residual to fail. This can happen if 0 is in aux_hidden_state_layers. While speculative decoding might not be the primary use case for this PR, it's best to fix this latent bug to prevent future issues.
```diff
 if self.start_layer + idx in self.aux_hidden_state_layers:
-    aux_hidden_states.append(hidden_states + residual)
+    if residual is None:
+        aux_hidden_states.append(hidden_states)
+    else:
+        aux_hidden_states.append(hidden_states + residual)
```
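A tiny standalone reproduction of the failure mode (toy tensors, not the PR's code): before the first decoder layer the residual stream has not been produced yet, so adding it to the hidden states raises.

```python
import torch

# residual is None until layer 0 returns it, so layer 0 being an
# aux-hidden-state layer would hit this TypeError without the guard.
hidden_states = torch.zeros(2, 4)
residual = None
try:
    _ = hidden_states + residual
except TypeError as exc:
    print(f"TypeError as expected: {exc}")
```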
Signed-off-by: YoussefEssDS <[email protected]>
Force-pushed from 6d694a4 to 232c9c5
Documentation preview: https://vllm--27941.org.readthedocs.build/en/27941/
Signed-off-by: YoussefEssDS <[email protected]>
DarkLight1337 left a comment
LGTM, thanks for your patience
FYI @DarkLight1337 @YoussefEssDS We already have a PR supporting this model, see: #27521
```python
if get_pp_group().is_first_rank or (
    getattr(config, "tie_word_embeddings", True) and get_pp_group().is_last_rank
):
    self.embed_tokens = VocabParallelEmbedding(
```
The missing prefix=f"{prefix}.embed_tokens" argument may disable quantization support. As @jeejeelee pointed out, this PR should be merged into #27521.
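For illustration, the call with the prefix threaded through might look like this (a sketch modeled on other vLLM models; the surrounding argument list is assumed, since the diff only shows the first line of the call):

```python
# Passing prefix gives the module a fully qualified name such as
# "model.embed_tokens", which quantization configs match against when
# deciding how to quantize (or skip) a layer.
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config,
    prefix=f"{prefix}.embed_tokens",
)
```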
Sorry I missed that, let's work on the existing PR #27521 then.
Pull request was closed
Purpose
Register the OpenPangu Embedded 7B architecture (PanguEmbeddedForCausalLM) and provide a native implementation that mirrors the HF reference.
Test Plan
Start instance:
vllm serve FreedomIntelligence/openPangu-Embedded-7B --tensor-parallel-size 4 --port 18051 --trust_remote_code
Functional correctness:
curl http://localhost:18051/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "FreedomIntelligence/openPangu-Embedded-7B", "messages": [ {"role": "user", "content": "What is the capital of Morocco?"} ], "temperature": 0.6, "top_p": 0.95, "top_k": 20, "max_tokens": 8192 }'
Run performance benchmark (used the script here: https://gist.github.com/YoussefEssDS/03c456ef5fed6f24e27394d2b2f2cc05).
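The same functional check can be scripted with the openai Python client against the local server (a sketch; assumes the serve command above is running and openai>=1.0 is installed):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is unused locally.
client = OpenAI(base_url="http://localhost:18051/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="FreedomIntelligence/openPangu-Embedded-7B",
    messages=[{"role": "user", "content": "What is the capital of Morocco?"}],
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},  # vLLM-specific sampling parameter
    max_tokens=256,
)
print(resp.choices[0].message.content)
```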
Test Result
Significant throughput improvement (TP=4 on NVIDIA H100 GPUs): vLLM native backend ~147.5 T/s vs. the transformers fallback ~50.8 T/s.