ScatterMoE Gradient Norm Needs to be Properly Computed When Used With FSDP #109

@fabianlim

Description

When ScatterMoE is used together with FSDP, HF Accelerate calls FSDP's clip_grad_norm_ (see here), which does not know how to compute the gradient norm correctly for ScatterMoE shards.

We need to be able to hook in some logic to handle the grad norm properly.
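
To make the problem concrete, here is a minimal pure-Python sketch of the arithmetic a correct hook would have to reproduce. It is not the ScatterMoE, FSDP, or Accelerate implementation. It assumes that each rank holds a distinct shard of the expert gradients, that the remaining gradients are identical on every rank, and that ranks can be simulated with plain lists. The helper names (`local_sq_norm`, `global_grad_norm`, `clip_coefficient`) are hypothetical.

```python
import math


def local_sq_norm(grads):
    """Sum of squared gradient entries held on a single rank."""
    return sum(g * g for g in grads)


def global_grad_norm(sharded_grads_per_rank, replicated_grads):
    """L2 norm over all parameters (hypothetical helper).

    sharded_grads_per_rank: one list per rank, each a distinct shard
        (assumed layout of the expert gradients).
    replicated_grads: gradients identical on every rank, counted once.
    """
    # In a real distributed run this sum would be an all-reduce (SUM) of the
    # per-rank squared norms over the group that shards the experts.
    sharded_sq = sum(local_sq_norm(g) for g in sharded_grads_per_rank)
    return math.sqrt(sharded_sq + local_sq_norm(replicated_grads))


def clip_coefficient(total_norm, max_norm, eps=1e-6):
    """Scale factor applied to every gradient, capped at 1.0."""
    return min(1.0, max_norm / (total_norm + eps))


# Two ranks, each holding one expert-shard gradient, plus one replicated gradient.
total = global_grad_norm([[3.0], [4.0]], [12.0])  # sqrt(9 + 16 + 144) = 13.0

# A computation that only sees rank 0's shard underestimates the norm.
rank0_only = math.sqrt(local_sq_norm([3.0]) + local_sq_norm([12.0]))

coef = clip_coefficient(total, max_norm=1.0)
```

The point of the sketch is that squared norms of distinct shards must be summed across ranks before taking the square root, while replicated gradients must be counted only once. If the shards are treated as replicated, each rank computes a different and too-small norm, so the clipping coefficient differs across ranks.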

Metadata

Labels

bug (Something isn't working), help wanted (Extra attention is needed)
