[RFC]: Dynamic Expert Load Balance with Zero-like-overhead

### Motivation

Currently, dynamic expert load balancing stops the world. Asynchronous expert load balancing would be better, because it avoids the following problems:

1. Host-bound latency: EPLB involves many CPU operations, such as running the EPLB algorithm, creating p2p ops, and converting the log2phy expert map, which together take a long CPU time (~1s).
2. Communication latency: Weight transfer is expensive without NVLink. Because the weights of one expert may be transferred to multiple new positions, a single expert can require N send/recv operations, which results in long latency. In our tests, batch_isend_irecv took more than 100ms to transmit the weights of 16 experts on an Ascend A2 server.

We have finished SwiftBalancer, a dynamic expert load balancer with zero-like overhead, on vllm-ascend 1943(https://github.com/vllm-project/vllm-ascend/pull/2186), and it is well tested.
SwiftBalancer no longer stops the world. In our tests on NPU, it costs 1~2ms per layer while improving decode latency by 5ms-8ms with ep_size = 64.
The following updates have been made:
1. Expert distribution recording with lower cost.
2. Async CPU computing for the EPLB algorithm and other Python operations.
3. A new EPLB algorithm that rebalances fewer experts while achieving almost the same effect.
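
To illustrate the lower-cost recording idea, here is a minimal pure-Python sketch, not the SwiftBalancer implementation. It assumes the dispatch step already yields a per-expert token count (called `expert_token_num` here, as in the workflow); `ExpertLoadRecorder` and `load_from_topk_ids` are hypothetical names used only for this example. The point is that the recorder adds one small vector per forward pass, whose length is the number of experts, instead of scanning a `(num_tokens, top_k)` array.

```python
def load_from_topk_ids(topk_ids, num_experts):
    """Baseline: count tokens per expert from router output of shape (num_tokens, top_k)."""
    load = [0] * num_experts
    for row in topk_ids:
        for expert_id in row:
            load[expert_id] += 1
    return load


class ExpertLoadRecorder:
    """Hypothetical recorder that accumulates per-expert token counts."""

    def __init__(self, num_experts):
        self.load = [0] * num_experts

    def record(self, expert_token_num):
        # expert_token_num has one entry per expert, so the cost of this add
        # does not grow with the number of tokens in the batch.
        for expert_id, count in enumerate(expert_token_num):
            self.load[expert_id] += count


# Toy batch: 4 tokens, top_k = 2, 4 experts.
topk_ids = [[0, 2], [1, 2], [2, 3], [0, 2]]
# Assumed to be what dispatch already provides for the same batch.
expert_token_num = [2, 1, 4, 1]

recorder = ExpertLoadRecorder(num_experts=4)
recorder.record(expert_token_num)

# Both approaches yield the same load statistics.
assert recorder.load == load_from_topk_ids(topk_ids, 4)
```

In a real deployment the counts would live in device tensors and be gathered across ranks; the sketch only shows why the recorded data can be much smaller than `topk_ids`.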

### Proposed Change

We would like to expand the current implementation:


![Image](https://github.com/user-attachments/assets/9abde310-c793-44fe-86ee-456adc957373)


The overall workflow involves:

<img width="801" height="302" alt="Image" src="https://github.com/user-attachments/assets/23b06f58-23bc-44a3-a1be-00f268aeb15c" />


1. Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor shape and reduces the cost of HBM recording and the add operator.
2. Do an all-gather of the expert distribution. We use all-gather instead of all-reduce because it has less traffic volume.
3. Wake up the EPLB worker process with the expert distribution when num_iterations is reached. Run the EPLB algorithm in the EPLB worker.
4. Generate the p2p send/recv ops and perform other operations, such as the log2phy conversion, that take a long CPU time.
5. Launch batch_isend_irecv in async_stream before forward.
6. After forward, wait for batch_isend_irecv to finish, then update the expert map and expert weights.
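
As a rough illustration of steps 3 and 4, the following pure-Python sketch derives a logical-to-physical map and a list of weight transfers from an old and a new expert placement, and runs that planning off the main thread. Everything here is an assumption made for the example: the placement format (`placement[rank][slot]` holds a logical expert id), the helper names `build_log2phy` and `plan_transfers`, the "first old holder is the source" rule, and the use of a thread in place of the EPLB worker process. It is not the SwiftBalancer code.

```python
from concurrent.futures import ThreadPoolExecutor


def build_log2phy(placement):
    """Map each logical expert id to the (rank, slot) pairs that hold a replica."""
    log2phy = {}
    for rank, slots in enumerate(placement):
        for slot, expert in enumerate(slots):
            log2phy.setdefault(expert, []).append((rank, slot))
    return log2phy


def plan_transfers(old, new):
    """List (src_rank, dst_rank, dst_slot, expert) for every slot whose expert changes.

    Simplification: the first old holder of an expert is always used as the source.
    """
    old_log2phy = build_log2phy(old)
    transfers = []
    for dst_rank, slots in enumerate(new):
        for dst_slot, expert in enumerate(slots):
            if old[dst_rank][dst_slot] == expert:
                continue  # slot keeps its weights, nothing to send
            src_rank, _src_slot = old_log2phy[expert][0]
            transfers.append((src_rank, dst_rank, dst_slot, expert))
    return transfers


# Two ranks, three physical slots each; the third slot is a redundant replica.
old = [[0, 1, 0], [2, 3, 3]]
new = [[0, 1, 2], [2, 3, 0]]

# A thread stands in for the EPLB worker process: planning runs in the
# background while the main loop would keep executing forward passes.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(plan_transfers, old, new)
    transfers = future.result()

new_log2phy = build_log2phy(new)
print(transfers)       # [(1, 0, 2, 2), (0, 1, 2, 0)]
print(new_log2phy[0])  # [(0, 0), (1, 2)]
```

Each entry in `transfers` would correspond to one send on the source rank and one recv on the destination rank; only the slots that actually change produce traffic.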


### Feedback Period

_No response_

### CC List.

_No response_

### Any Other Things.

Currently we are working on adding an adaptor to work with the EPLB mainline code.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
