
Conversation

@YoussefEssDS commented Nov 2, 2025

Purpose

Add a native vLLM implementation for FreedomIntelligence/openPangu-Embedded-7B so the model no longer runs through the transformers fallback.

Test Plan

1. Start instance:

   ```bash
   vllm serve FreedomIntelligence/openPangu-Embedded-7B --tensor-parallel-size 4 --port 18051 --trust_remote_code
   ```

2. Functional correctness (see the Python client sketch after this list):

   ```bash
   curl http://localhost:18051/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "FreedomIntelligence/openPangu-Embedded-7B",
           "messages": [{"role": "user", "content": "What is the capital of Morocco?"}],
           "temperature": 0.6,
           "top_p": 0.95,
           "top_k": 20,
           "max_tokens": 8192
         }'
   ```

3. Run the performance benchmark (script: https://gist.github.com/YoussefEssDS/03c456ef5fed6f24e27394d2b2f2cc05).
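For reference, the same functional check can be driven from Python via the OpenAI-compatible client (a minimal sketch, assuming the server from step 1 is listening on port 18051; `top_k` is not a standard OpenAI parameter, so it is passed through `extra_body`):

```python
# Minimal sketch of the functional-correctness check using the openai client
# against vLLM's OpenAI-compatible server (assumed running on port 18051).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18051/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="FreedomIntelligence/openPangu-Embedded-7B",
    messages=[{"role": "user", "content": "What is the capital of Morocco?"}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
    extra_body={"top_k": 20},  # vLLM-specific sampling parameter
)
print(resp.choices[0].message.content)
```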

Test Result

Significant throughput improvement (TP=4 on NVIDIA H100 GPUs): ~147.5 tokens/s with the native vLLM backend vs. ~50.8 tokens/s with the transformers fallback.
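The linked gist holds the actual benchmark script; purely as an illustration (batch size, prompt, and `max_tokens` below are arbitrary assumptions, not the gist's settings), an offline throughput measurement with vLLM's Python API could look like:

```python
# Illustrative throughput sketch (NOT the linked gist): generate a batch of
# prompts offline and divide generated tokens by wall-clock time.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="FreedomIntelligence/openPangu-Embedded-7B",
    tensor_parallel_size=4,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=512)
prompts = ["What is the capital of Morocco?"] * 32  # arbitrary batch

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```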

github-actions bot commented Nov 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small but essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

mergify bot added the `new-model` (Requests to new models) label on Nov 2, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds native support for the OpenPangu Embedded 7B model. The implementation is well-structured and follows the existing patterns in vLLM, particularly drawing from the Llama model implementation. I've identified a critical bug in the model's forward pass that affects speculative decoding and a missing configuration in the test registry. Addressing these issues will ensure the model integrates correctly and performs as expected.


```python
aux_hidden_states: list[torch.Tensor] = []
for idx, layer in enumerate(self.layers[self.start_layer : self.end_layer]):
    if idx in self.aux_hidden_state_layers:
```

critical

There is a bug in how auxiliary hidden states are collected for speculative decoding. The loop variable `idx` is a relative index for the layers on the current pipeline stage, starting from 0. However, `self.aux_hidden_state_layers` contains global layer indices. This means the check `if idx in self.aux_hidden_state_layers:` will almost always be incorrect, leading to auxiliary hidden states not being collected, which breaks features like EAGLE3 speculative decoding.

To fix this, use the global layer index for the check by adding `self.start_layer` to `idx`.

Suggested change:

```diff
-if idx in self.aux_hidden_state_layers:
+if self.start_layer + idx in self.aux_hidden_state_layers:
```
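A concrete illustration of the mismatch (hypothetical numbers: 32 layers split across two pipeline stages, auxiliary hidden states wanted at global layers 20 and 31):

```python
# Hypothetical: on the second of two pipeline stages, start_layer = 16 and
# idx runs 0..15, so the unfixed check `idx in aux_hidden_state_layers`
# never fires for global target layers 20 and 31.
start_layer, end_layer = 16, 32
aux_hidden_state_layers = {20, 31}

for idx in range(end_layer - start_layer):
    relative_hit = idx in aux_hidden_state_layers              # always False
    global_hit = (start_layer + idx) in aux_hidden_state_layers
    if relative_hit != global_hit:
        print(f"global layer {start_layer + idx}: "
              f"relative={relative_hit}, global={global_hit}")
# -> global layer 20: relative=False, global=True
# -> global layer 31: relative=False, global=True
```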

Comment on lines 373 to 375:

```python
"PanguEmbeddedForCausalLM": _HfExamplesInfo(
    "FreedomIntelligence/openPangu-Embedded-7B"
),
```

high

The test plan in the pull request description indicates that `trust_remote_code=True` is required to run this model. This parameter is missing from the `_HfExamplesInfo` entry, which could cause tests for this model to fail. Please add `trust_remote_code=True` to ensure the tests can run correctly.

Suggested change:

```diff
 "PanguEmbeddedForCausalLM": _HfExamplesInfo(
-    "FreedomIntelligence/openPangu-Embedded-7B"
+    "FreedomIntelligence/openPangu-Embedded-7B", trust_remote_code=True
 ),
```

@DarkLight1337 (Member) commented:

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds a native vLLM backend for the OpenPangu Embedded 7B model, which is a valuable performance enhancement. The implementation correctly mirrors the Hugging Face reference architecture, notably by using separate projections for QKV and MLP layers, and the code is well-structured, following existing patterns in the vLLM codebase. I have identified one critical latent bug related to speculative decoding that should be addressed to ensure future stability.

Comment on lines 329 to 330:

```python
if self.start_layer + idx in self.aux_hidden_state_layers:
    aux_hidden_states.append(hidden_states + residual)
```

critical

There's a potential `TypeError` when using speculative decoding. The `residual` tensor can be `None` before the first decoder layer, which will cause `hidden_states + residual` to fail. This can happen if 0 is in `aux_hidden_state_layers`. While speculative decoding might not be the primary use case for this PR, it's best to fix this latent bug to prevent future issues.

Suggested change:

```diff
 if self.start_layer + idx in self.aux_hidden_state_layers:
-    aux_hidden_states.append(hidden_states + residual)
+    if residual is None:
+        aux_hidden_states.append(hidden_states)
+    else:
+        aux_hidden_states.append(hidden_states + residual)
```
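For context, `residual` starts as `None` because of how the Llama-style forward pass (which this model mirrors) initializes it; a condensed sketch under that assumption (the layer call signature is taken from that pattern, not from this diff):

```python
# Condensed Llama-style decoder loop: residual is None until layer 0 has
# produced one, so an aux-hidden-state hook at global index 0 on the first
# pipeline stage must handle residual is None.
hidden_states = self.get_input_embeddings(input_ids)
residual = None
for idx, layer in enumerate(self.layers[self.start_layer : self.end_layer]):
    if self.start_layer + idx in self.aux_hidden_state_layers:
        aux_hidden_states.append(
            hidden_states if residual is None else hidden_states + residual
        )
    hidden_states, residual = layer(positions, hidden_states, residual)
```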


mergify bot commented Nov 3, 2025

Documentation preview: https://vllm--27941.org.readthedocs.build/en/27941/

mergify bot added the `documentation` (Improvements or additions to documentation) label on Nov 3, 2025
Signed-off-by: YoussefEssDS <[email protected]>
@DarkLight1337 (Member) left a comment

LGTM, thanks for your patience

@DarkLight1337 enabled auto-merge (squash) on November 3, 2025 04:58
github-actions bot added the `ready` (ONLY add when PR is ready to merge/full CI is needed) label on Nov 3, 2025
@jeejeelee (Collaborator) commented:

FYI @DarkLight1337 @YoussefEssDS We already have a PR supporting this model, see: #27521

```python
if get_pp_group().is_first_rank or (
    getattr(config, "tie_word_embeddings", True) and get_pp_group().is_last_rank
):
    self.embed_tokens = VocabParallelEmbedding(
```

The missing `prefix=f"{prefix}.embed_tokens"` argument may disable quantization support. As @jeejeelee pointed out, this PR should be merged into #27521.
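A sketch of the fix (constructor arguments assumed from vLLM's Llama implementation, which this model follows, rather than taken from this diff):

```python
# Pass quant_config and a fully qualified prefix so quantization methods can
# match this module by name (signature assumed from the Llama implementation).
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config,
    prefix=f"{prefix}.embed_tokens",
)
```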

@DarkLight1337 (Member) commented:

Sorry I missed that, let's work on the existing PR #27521 then.

auto-merge was automatically disabled November 3, 2025 12:13

Pull request was closed


Labels

- `documentation`: Improvements or additions to documentation
- `new-model`: Requests to new models
- `ready`: ONLY add when PR is ready to merge/full CI is needed
