Description
Motivation
Currently, dynamic expert load balancing (EPLB) stops the world. Asynchronous expert load balancing would avoid the following problems:
- Host-bound latency: EPLB involves many CPU operations, such as running the EPLB algorithm, creating P2P ops, and converting the log2phy expert map. Together these take a long time on the CPU, about 1 s.
- Communication latency: Weight transfer is expensive without NVLink. Because the weights of one expert may be transferred to multiple new positions, a single expert can require N send/recv operations, which results in long latency. In our tests, batch_isend_irecv took more than 100 ms to transfer the weights of 16 experts on an Ascend A2 server.
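To illustrate the communication latency point, here is a minimal sketch of how one expert can turn into N transfers. The function `plan_transfers` and the placement dictionaries are hypothetical and are not taken from vLLM or vllm-ascend. The sketch assumes every expert in the new placement already exists somewhere in the old one, and it always picks the lowest-ranked holder as the source.

```python
from collections import defaultdict


def plan_transfers(old_placement, new_placement):
    """Return (src_rank, dst_rank, expert_id) weight transfers.

    Each placement maps rank -> list of expert ids hosted on that rank.
    """
    # Ranks that currently hold each expert, in ascending rank order.
    holders = defaultdict(list)
    for rank, experts in sorted(old_placement.items()):
        for expert in experts:
            holders[expert].append(rank)

    transfers = []
    for dst, experts in sorted(new_placement.items()):
        already_local = set(old_placement.get(dst, []))
        for expert in experts:
            if expert not in already_local:
                # Assumption: use the lowest-ranked holder as the source.
                transfers.append((holders[expert][0], dst, expert))
    return transfers


# Expert 0 becomes hot and is replicated onto ranks 1, 2, and 3.
old = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
new = {0: [0, 1], 1: [2, 0], 2: [4, 0], 3: [6, 0]}
transfers = plan_transfers(old, new)
print(transfers)
```

Here a single expert produces three send/recv pairs, all sourced from rank 0, which is the pattern that makes the transfer slow when no NVLink is available.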
We have implemented SwiftBalancer (Dynamic Expert Load Balance with Zero-like Overhead) on vllm-ascend 1943 (vllm-project/vllm-ascend#2186), and it is well tested.
SwiftBalancer no longer stops the world. In our tests on NPU, it costs 1~2 ms per layer while reducing decode latency by 5 to 8 ms with ep_size = 64.
The following updates have been made:
1. Expert distribution recording at lower cost.
2. Asynchronous CPU computation for the EPLB algorithm and other Python operations.
3. A new EPLB algorithm that rebalances fewer experts with almost the same effect.
Proposed Change
We would like to extend the current implementation. The overall workflow is:
- Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor and reduces the cost of recording to HBM and of the add operator.
- All-gather the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume.
- Wake up the EPLB worker process with the expert distribution when num_iterations is reached. Run the EPLB algorithm in the EPLB worker.
- Generate the P2P send/recv ops and run other operations such as log2phy, which take a long time on the CPU.
- Launch batch_isend_irecv in async_stream before forward.
- After forward, wait for batch_isend_irecv to finish, then update the expert map and expert weights.
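The control flow of the steps above can be sketched roughly as follows. This is a toy illustration, not the SwiftBalancer code: a thread stands in for the EPLB worker process, `rebalance` is a placeholder for the EPLB algorithm, the all-gather and the weight transfer are omitted, and `WAIT_STEPS`, `forward`, and `run` are hypothetical names introduced only for this example.

```python
import queue
import threading

NUM_EXPERTS = 4
NUM_ITERATIONS = 3  # wake the worker every 3 steps (illustrative value)
WAIT_STEPS = 2      # hypothetical: collect the result this many steps later


def rebalance(load):
    """Toy stand-in for the EPLB algorithm: rank experts, hottest first."""
    return sorted(range(len(load)), key=lambda e: (-load[e], e))


def eplb_worker(requests, results):
    """Runs off the critical path. A None request shuts the worker down."""
    while True:
        load = requests.get()
        if load is None:
            break
        results.put(rebalance(load))


def forward(step):
    """Toy forward pass: tokens routed to each expert; expert 2 is hot."""
    return [(step + e) % 3 + (5 if e == 2 else 0) for e in range(NUM_EXPERTS)]


def run(num_steps):
    requests, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=eplb_worker, args=(requests, results))
    worker.start()

    load = [0] * NUM_EXPERTS
    expert_order = list(range(NUM_EXPERTS))
    collect_at = None  # step at which a pending result is applied
    history = []

    for step in range(1, num_steps + 1):
        # Record the per-expert token counts produced by this step.
        counts = forward(step)
        load = [a + b for a, b in zip(load, counts)]

        # Hand the accumulated load to the worker and keep running forward
        # passes while it computes.
        if step % NUM_ITERATIONS == 0 and collect_at is None:
            requests.put(load)
            load = [0] * NUM_EXPERTS
            collect_at = step + WAIT_STEPS

        # After forward, apply the new placement once it is due.
        if collect_at == step:
            expert_order = results.get()  # blocks only if the worker is late
            history.append((step, expert_order))
            collect_at = None

    requests.put(None)
    worker.join()
    return expert_order, history


final_order, history = run(8)
print(final_order, [step for step, _ in history])
```

The main loop never waits for the algorithm itself. It only blocks at the collection step, and only if the worker has not finished by then, which is the property that removes the stop-the-world pause.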
Feedback Period
No response
CC List.
No response
Any Other Things.
We are currently working on an adaptor to integrate with the mainline EPLB code.