Question on Router Z-Loss Normalization: Why Include Top-K in the Scaling Denominator? #39

@james-yw

Hi OLMoE team,

I am examining the implementation of the auxiliary MoE losses, specifically the Router Z-Loss ($L_Z$). I've noticed a significant difference in the normalization factor used in the OLMoE implementation compared to other common open-source implementations (e.g., OpenMoE/ColossalAI).

Observed Implementations

The Z-Loss is generally defined from the squared LogSumExp of the router logits, summed over tokens: $L_Z \propto \sum_{t} (\log \sum_{e} e^{W_{t,e}})^2$, where $W_{t,e}$ is the router logit of token $t$ for expert $e$.
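For concreteness, here is a minimal sketch of the unnormalized sum in that formula, written in plain Python. The helper names (`logsumexp`, `zloss_sum_squared`) and the toy logits are my own illustration, not taken from either codebase.

```python
import math

def logsumexp(row):
    """Numerically stable log(sum(exp(x))) over one token's expert logits."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))

def zloss_sum_squared(router_logits):
    """Sum over tokens of the squared LogSumExp of that token's logits.

    router_logits: list of rows, one row of per-expert logits per token.
    No normalization is applied here.
    """
    return sum(logsumexp(row) ** 2 for row in router_logits)

# Hypothetical toy input: 2 tokens, 2 experts.
logits = [[0.0, 0.0], [1.0, 1.0]]
total = zloss_sum_squared(logits)
print(total)
```

Nothing in this quantity refers to the number of selected experts; it only depends on the full logit vector of each token.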

| Implementation Style | Z-Loss Normalization | Code Reference |
| --- | --- | --- |
| MegaBlocks/OLMoE Style | Normalized by $N_{Layers}^{\text{total}} \cdot N_{Tokens} \cdot \mathbf{k}$ | Link to OLMoE Code |
| OpenMoE Style | Normalized by $N_{Layers} \cdot N_{Tokens}$ (excludes $\mathbf{k}$) | Link to OpenMoE Code |

Specific Question

In the MegaBlocks/OLMoE style implementation, the scale_denominator includes the $\text{Top-K}$ factor:

scale_denominator = num_total_moe_layers * T_layer * top_k
# ...
zloss_normalized = zloss_sum_squared / scale_denominator 

Why is the $\text{Top-K}$ factor ($k$) included in the Z-Loss denominator?

Since the Z-Loss regularizes the magnitude of the raw router logits, and its formula does not depend on how many experts ($k$) are selected per token, including $k$ seems to serve mainly to balance the loss magnitude against the Load Balancing Loss ($L_{aux}$).
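To make the difference concrete, here is a small sketch comparing the two normalizations on the same summed value. The function `normalize_zloss` and all numbers are hypothetical, chosen only for illustration; this is not the code of either implementation.

```python
def normalize_zloss(sum_sq, num_layers, num_tokens, top_k=None):
    """Divide a summed squared-LogSumExp value by the chosen denominator.

    top_k=None mimics the style that excludes k;
    passing top_k mimics the style that includes it.
    """
    denom = num_layers * num_tokens
    if top_k is not None:
        denom *= top_k
    return sum_sq / denom

# Hypothetical values: 4 MoE layers, 8 tokens per layer, top-2 routing.
sum_sq = 96.0
with_k = normalize_zloss(sum_sq, num_layers=4, num_tokens=8, top_k=2)
without_k = normalize_zloss(sum_sq, num_layers=4, num_tokens=8)
print(with_k, without_k, without_k / with_k)
```

For a fixed $k$, the two variants differ by exactly a factor of $k$, so including $k$ in the denominator is arithmetically the same as multiplying the Z-Loss coefficient by $1/k$.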

Could you provide any insight or documentation on the following:

  1. Reasoning: What was the primary motivation for including the $\text{Top-K}$ factor in the Z-Loss normalization? Was it for stability, better hyperparameter transfer, or maintaining loss parity with $L_{aux}$?
  2. Ablation Studies: Were any ablation studies performed to compare the model performance (e.g., perplexity, convergence speed) between Z-Loss scaled by $\frac{1}{N_{L} \cdot N_{T}}$ versus scaling by $\frac{1}{N_{L} \cdot N_{T} \cdot \mathbf{k}}$?

Thank you for your time and insights!
