[EPLB]: Optimize export_load_view update
#24091
Conversation
Code Review
This pull request introduces a performance optimization for the Expert Parallelism Load Balancer (EPLB) by avoiding scatter_add_ for expert load calculation when a more direct method is available. The core idea is sound and should improve performance. However, the implementation introduces code duplication and some brittle logic that could be improved for better long-term maintainability. My review focuses on refactoring these areas to make the code more robust and easier to maintain.
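To make the comparison concrete, here is a minimal sketch (not the PR's actual code) of the two update paths. The tensor names follow identifiers mentioned in this thread; the shapes, and the use of `torch.bincount` as a stand-in for the kernel-provided counts, are illustrative assumptions.

```python
import torch

num_experts = 8
expert_load_view = torch.zeros(num_experts, dtype=torch.int64)

# Current path: scatter-add one count per (token, top-k slot) routing decision.
topk_ids = torch.randint(0, num_experts, (16, 2))  # [num_tokens, top_k]
flat = topk_ids.flatten()
expert_load_view.scatter_add_(0, flat, torch.ones_like(flat))

# Direct path: when the dispatch kernel already reports per-expert token
# counts (e.g. something like expert_tokens_meta.expert_num_tokens), the
# histogram above is redundant and a single elementwise add suffices.
expert_num_tokens = torch.bincount(flat, minlength=num_experts)  # stand-in
expert_load_view_direct = torch.zeros(num_experts, dtype=torch.int64)
expert_load_view_direct += expert_num_tokens

assert torch.equal(expert_load_view, expert_load_view_direct)
```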
Code Review
This pull request introduces a performance optimization for EPLB by allowing expert load updates to bypass the scatter_add_ operation when using specific modular kernels. The changes are logical and well-contained. However, I've identified a critical bug in the implementation that would prevent this optimization from ever being activated. Additionally, there are a couple of high-severity maintainability issues related to code duplication and a local import that should be addressed.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which runs only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Since #22167, we have been collecting the global physical expert loads, so the overhead of masking out non-local experts has already been removed. In addition, please see #24573, where the unnecessary local mask was removed. After that change, the only remaining operation for expert load collection is a `scatter_add_`.

PR #24573 does not conflict with our optimization for the `export_load_view` update.
Sorry for the delay. I see your point. You're right that previously we needed to perform the `scatter_add_` there. To proceed, I think we should do the following:

Thanks again for your contribution!
Thanks for your thoughts and suggestions! Next, we'll refactor this section and add an ABC.
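As a rough illustration of the direction discussed above, a hypothetical ABC might look like the sketch below. The class and method names are invented for illustration and are not vLLM's actual API.

```python
from abc import ABC, abstractmethod
from typing import Optional

import torch


class ExpertLoadSource(ABC):
    """Abstracts how a kernel reports per-expert token counts (hypothetical)."""

    @abstractmethod
    def expert_num_tokens(self) -> Optional[torch.Tensor]:
        """Per-expert token counts if the kernel tracks them, else None."""


def update_expert_load(
    source: ExpertLoadSource,
    expert_load_view: torch.Tensor,
    topk_ids: torch.Tensor,
) -> None:
    counts = source.expert_num_tokens()
    if counts is not None:
        # Fast path: counts already produced by the dispatch kernel.
        expert_load_view += counts
    else:
        # Fallback: histogram the routing decisions with scatter_add_.
        flat = topk_ids.flatten()
        expert_load_view.scatter_add_(0, flat, torch.ones_like(flat))
```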
In the future, please take more care with git and force pushes; almost all maintainers are now subscribed to this PR because of it. Also, please:
For the token count addition simplification, we've replaced the `scatter_add_` update with a direct add of `expert_tokens_meta.expert_num_tokens`. Initial tests have been completed to verify correctness. We'll follow up with the benchmarking and accuracy test experiments shortly.
Hi, I used the DeepSeek-V2-Lite model and the GSM8K dataset for an accuracy test. However, after enabling PPLX there were accuracy issues: the accuracy was only around 2%, both before and after our changes. Below are the results from my AISBench test.
@tlrmchlsmth It has been a long time without a reply on this PR. Could you please take a look when you have time?
Hi, @Skywalker-EP — could you please share the current status? Does the accuracy issue still persist? |
Hi, the accuracy issue does still persist. I pulled the latest code (without our PR), enabled the DeepEP backend, and tested the accuracy of the Qwen3 model. The accuracy is still problematic (it is 96% without DeepEP), and curl returns only garbled characters.






Purpose
In the EPLB (Expert Parallelism Load Balancer) feature, this PR optimizes how the expert load is updated during each forward pass. The current approach uses the `scatter_add_` method based on `topk_ids` results. When using DeepEP Low-Latency or PPLX on the CUDA platform, expert loads can be obtained directly from `expert_tokens_meta.expert_num_tokens`, which removes redundant computation of the expert load.

Test Plan
Since using a kernel such as DeepEP Low-Latency or PPLX changes parts of the inference process, the precision of intermediate values cannot be fully aligned. Instead, we show that the expert load update functions properly by comparing the load imbalance of a single model layer.
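One plausible way to quantify the load imbalance being compared (the exact metric is not shown in this thread) is the ratio of the busiest expert's load to the mean load in a layer:

```python
import torch


def load_imbalance(expert_load_view: torch.Tensor) -> float:
    """Max-over-mean load ratio for one layer; 1.0 means perfectly balanced."""
    loads = expert_load_view.float()
    return (loads.max() / loads.mean().clamp_min(1e-9)).item()
```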
We add the following code in `vllm/distributed/eplb/eplb_state.py` for data collection, and test the average time for a single update of the expert load.
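The referenced snippet is not preserved in this capture of the thread; below is a minimal sketch, under assumed setup, of how such a per-update timing could be measured with CUDA events:

```python
import torch


def avg_update_time_ms(update_fn, iters: int = 100) -> float:
    """Average wall time of update_fn in milliseconds, measured on CUDA."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        update_fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time() returns ms
```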
Test Result
The average time to update `expert_load_view` for one layer in every batch:

- Before modification: 0.058 ms
- After modification: 0.015 ms
Co-author
Co-authored-by: Skywalker-EP [email protected]