ScatterMoE Gradient Norm Needs to be Properly Computed When Used With FSDP

When ScatterMoE is used together with FSDP, HF Accelerate calls FSDP's `clip_grad_norm_` (see [here](https://github.com/huggingface/accelerate/blob/c0552c9012a9bae7f125e1df89cf9ee0b0d250fd/src/accelerate/accelerator.py#L2367)), which does not know how to compute the gradient norm correctly for the ScatterMoE shards.

We need a way to hook in custom logic so that the gradient norm is computed correctly for these shards.
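
As a rough illustration of the arithmetic such a hook would need to perform, here is a minimal, framework-free sketch. It assumes that each rank holds one shard of the FSDP-managed gradients and one shard of the ScatterMoE gradients, and that the global L2 norm is recovered by summing the per-rank squared norms (in a real run this sum would be an all-reduce over the appropriate process group). The function names and the shard layout are hypothetical and are not part of ScatterMoE, FSDP, or Accelerate.

```python
import math
from typing import Iterable, Sequence


def local_sq_norm(grads: Iterable[Sequence[float]]) -> float:
    """Sum of squared gradient entries held locally by one rank."""
    return sum(g * g for grad in grads for g in grad)


def global_l2_norm(per_rank_sq_norms: Sequence[float]) -> float:
    """Combine per-rank squared norms into one L2 norm.

    The plain sum here stands in for an all-reduce(SUM) across ranks.
    """
    return math.sqrt(sum(per_rank_sq_norms))


def clip_coefficient(total_norm: float, max_norm: float, eps: float = 1e-6) -> float:
    """Scale factor to apply to every gradient, capped at 1.0 (no upscaling)."""
    return min(1.0, max_norm / (total_norm + eps))


# Hypothetical layout: two ranks, each with an FSDP shard and a ScatterMoE shard.
fsdp_shards = [[[3.0]], [[0.0]]]
moe_shards = [[[0.0]], [[4.0]]]

# Each rank adds its ScatterMoE contribution to its FSDP contribution
# before the cross-rank reduction, so no shard is left out of the norm.
per_rank = [
    local_sq_norm(f) + local_sq_norm(m) for f, m in zip(fsdp_shards, moe_shards)
]
total_norm = global_l2_norm(per_rank)
coef = clip_coefficient(total_norm, max_norm=1.0)
```

The point of the sketch is that the ScatterMoE contribution has to enter the squared-norm sum before the square root is taken, and the same clip coefficient then has to be applied to both groups of gradients. Which process group each reduction runs over depends on how the ScatterMoE shards are actually placed, which this sketch does not model.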