Question about expert choice routing #46

Hello, thank you for your work. I'd like to ask some questions about **expert choice routing**:
1. While looking for code implementations of this part, I found that Google's implementation mentions in a comment that **expert choice routing is not suitable for decoder-only architectures** (https://github.com/google/flaxformer/blob/main/flaxformer/architectures/moe/routing.py#L655). I'm curious about how you handled this issue in your experiments.
2. I'd like to know how you perform expert choice routing **during inference**. With a KV cache, the FFN layer typically processes only one new token per decoding step, so it is unclear to me which set of tokens each expert selects from. Or does the expert choice routing model not use a KV cache during inference?
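
To make both questions concrete, here is a minimal sketch of how I understand expert choice routing. It is not taken from your code or from flaxformer: the function name `expert_choice_route`, the toy scores, and the use of raw scores without a softmax are all my own assumptions for illustration. Each expert picks its top-`capacity` tokens over the whole sequence, so whether an earlier token is processed can change once a later token arrives.

```python
def expert_choice_route(scores, capacity):
    """Hypothetical helper: scores[t][e] is the router score of token t for expert e.

    Returns, for each expert, the sorted indices of the `capacity` tokens it selects.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = []
    for e in range(num_experts):
        # Each expert ranks all tokens by its own score column and keeps the top ones.
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment.append(sorted(ranked[:capacity]))
    return assignment


# Toy router scores (made up): 4 tokens, 2 experts.
scores = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.95, 0.05],
]

# Routing over the first 3 tokens only, as if token 3 had not been generated yet.
prefix = expert_choice_route(scores[:3], capacity=1)
# Routing over all 4 tokens.
full = expert_choice_route(scores, capacity=1)

print("prefix:", prefix)
print("full:  ", full)
```

In this toy case expert 0 selects token 0 when only the first three tokens exist, but drops it in favor of token 3 once the full sequence is routed. That dependence on future tokens is what I assume the flaxformer comment refers to, and it is also why I don't see how the selection works when decoding one token at a time.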
