Hi OLMoE team,
I am examining the implementation of the auxiliary MoE losses, specifically the Router Z-Loss ($L_z$).
### Observed Implementations
The Z-Loss formula is generally based on the squared LogSumExp of the raw router logits, averaged over tokens:

$$L_z = \frac{1}{N_T} \sum_{i=1}^{N_T} \left( \log \sum_{j=1}^{N_E} e^{x_j^{(i)}} \right)^2$$

where $N_T$ is the number of tokens, $N_E$ the number of experts, and $x^{(i)}$ the router logits for token $i$.
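For concreteness, a minimal per-layer sketch of this formula (assuming PyTorch; the function name and shapes are mine, not taken from either codebase):

```python
import torch

def router_z_loss_per_layer(logits: torch.Tensor) -> torch.Tensor:
    # logits: raw router logits, shape (num_tokens, num_experts)
    lse = torch.logsumexp(logits, dim=-1)  # LogSumExp per token, shape (num_tokens,)
    return (lse ** 2).mean()               # mean squared LSE over this layer's tokens
```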
| Implementation Style | Z-Loss Normalization | Code Reference |
|---|---|---|
| MegaBlocks/OLMoE Style | Normalized by $N_L \cdot N_T \cdot k$ (layers × tokens × Top-K) | Link to OLMoE Code |
| OpenMoE Style | Normalized by $N_L \cdot N_T$ (layers × tokens) | Link to OpenMoE Code |
### Specific Question
In the MegaBlocks/OLMoE-style implementation, the `scale_denominator` includes the $\text{Top-K}$ factor:

```python
# N_L * N_T * k: MoE layers x tokens per layer x experts selected per token
scale_denominator = num_total_moe_layers * T_layer * top_k
# ...
zloss_normalized = zloss_sum_squared / scale_denominator
```

Why is the `top_k` factor part of the denominator? Since the Z-Loss aims to regularize the magnitude of the raw logits themselves, and the formula does not inherently depend on how many experts ($k$) are selected per token, dividing by $k$ is not an obvious choice.
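To make the practical consequence concrete, here is a toy comparison of the two denominators (all numbers hypothetical, chosen only to show the ratio; for identical logits the two conventions differ by exactly a factor of $k$, which effectively rescales the z-loss coefficient):

```python
num_total_moe_layers = 16   # N_L
T_layer = 4096              # N_T: tokens routed per layer
top_k = 8                   # k: experts selected per token
zloss_sum_squared = 2.0e6   # summed (LogSumExp)^2 over all layers and tokens

openmoe_style = zloss_sum_squared / (num_total_moe_layers * T_layer)
olmoe_style = zloss_sum_squared / (num_total_moe_layers * T_layer * top_k)

print(openmoe_style / olmoe_style)  # == top_k: the losses differ by exactly k
```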
Could you provide any insight or documentation on the following:
- **Reasoning:** What was the primary motivation for including the $\text{Top-K}$ factor in the Z-Loss normalization? Was it for stability, better hyperparameter transfer, or maintaining loss parity with $L_{aux}$?
- **Ablation Studies:** Were any ablation studies performed to compare model performance (e.g., perplexity, convergence speed) between Z-Loss scaled by $\frac{1}{N_L \cdot N_T}$ versus $\frac{1}{N_L \cdot N_T \cdot k}$?
Thank you for your time and insights!