Various Corrdiff optimizations for drastic increase of training efficiency #809

Merged
merged 21 commits into from
May 1, 2025

Conversation

LostnEkko
Contributor

@LostnEkko LostnEkko commented Mar 14, 2025

Various Corrdiff optimizations for drastic increase of training efficiency

Description

  • Updated the CorrDiff training code to support multiple patch iterations,
    which amortize the regression cost, and to support torch.compile
  • Refactored modulus/models/diffusion/layers.py to streamline data type casting,
    avoiding unnecessary casts under autocast mode
  • Refactored Conv2d to enable fusion of conv2d with bias addition
  • Refactored GroupNorm, UNetBlock, SongUNet, SongUNetPosEmbd to support usage of
    Apex GroupNorm, fusion of activation with GroupNorm, and AMP workflow.
  • Updated SongUNetPosEmbd to avoid an unnecessary host-to-device (HtoD) memcpy of pos_embd
  • Updated from_checkpoint to accommodate usage of Apex GroupNorm
  • Refactored CorrDiff NVTX annotation workflow to be configurable
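
As an illustration of the first bullet, the idea of amortizing the regression cost over several patch iterations can be sketched in plain Python. This is only a toy sketch under stated assumptions, not the code in this PR: `regression_model`, `extract_patch`, `diffusion_loss`, and `training_step` are hypothetical stand-ins, and nested lists take the place of tensors. The point is the control flow: the expensive regression pass runs once per full image, and its output is reused for every patch drawn from it.

```python
import random


def regression_model(coarse_image):
    """Hypothetical stand-in for the expensive regression network."""
    regression_model.calls += 1
    return [[2.0 * value for value in row] for row in coarse_image]


# Count how often the expensive pass runs.
regression_model.calls = 0


def extract_patch(image, top, left, size):
    """Cut a size x size patch out of a 2D nested-list image."""
    return [row[left:left + size] for row in image[top:top + size]]


def diffusion_loss(patch):
    """Hypothetical stand-in for the per-patch diffusion loss (mean square)."""
    count = len(patch) * len(patch[0])
    return sum(value * value for row in patch for value in row) / count


def training_step(coarse_image, patch_size, patch_iterations, rng):
    # Run the expensive regression pass once on the full image.
    mean_prediction = regression_model(coarse_image)
    height, width = len(mean_prediction), len(mean_prediction[0])

    losses = []
    # Reuse the same regression output for several randomly placed patches.
    for _ in range(patch_iterations):
        top = rng.randrange(height - patch_size + 1)
        left = rng.randrange(width - patch_size + 1)
        patch = extract_patch(mean_prediction, top, left, patch_size)
        losses.append(diffusion_loss(patch))
    return losses


rng = random.Random(0)
image = [[float(r + c) for c in range(8)] for r in range(8)]
losses = training_step(image, patch_size=4, patch_iterations=3, rng=rng)
```

With `patch_iterations=3`, the stand-in regression function is called once while three patch losses are produced, so the cost of the regression pass is spread over three diffusion updates instead of one.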

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@CharlelieLrt CharlelieLrt self-requested a review March 14, 2025 18:07
@CharlelieLrt CharlelieLrt added the labels enhancement (New feature or request), 3 - Ready for Review (Ready for review by team), 5 - Merge After Dependencies (Depends on another PR: do not merge out of order), and Earth-2 on Mar 14, 2025
@mnabian
Collaborator

mnabian commented Mar 14, 2025

/blossom-ci

@CharlelieLrt CharlelieLrt mentioned this pull request Mar 18, 2025
5 tasks
@simonbyrne
Contributor

What's the status of this? I would love to make use of these in ReGen.

@CharlelieLrt
Collaborator

CharlelieLrt commented Apr 2, 2025

@simonbyrne it's currently blocked by #790 and under review, but that will be merged in the coming days.
AFAIK the current implementation of ReGen does not support these optimizations and there will be some work required to do so.

@CharlelieLrt
Collaborator

/blossom-ci

@CharlelieLrt
Collaborator

/blossom-ci

Signed-off-by: jialusui1102 <[email protected]>
@CharlelieLrt
Collaborator

/blossom-ci

@loliverhennigh
Collaborator

Hey @jialusui1102, @CharlelieLrt and I talked about the backward compatibility issues that this PR and PR #790 raised. For now we can get this in, but I will fix the backward compatibility issues ASAP afterwards. @CharlelieLrt and I discussed a solution that seems to solve all of them. I'll need you, @jialusui1102, to take a look at that PR when the time comes, to make sure it works with the CorrDiff model.

@jialusui1102
Contributor

Hey @loliverhennigh, thanks for letting me know and for merging my PR, and thanks @CharlelieLrt for coordinating. Let me know when the PR is ready and I will test the CorrDiff checkpoint to make sure everything works!

@CharlelieLrt CharlelieLrt self-requested a review May 1, 2025 19:20
Collaborator

@CharlelieLrt CharlelieLrt left a comment

LGTM!

@CharlelieLrt
Collaborator

/blossom-ci

@CharlelieLrt
Collaborator

/blossom-ci

Labels
3 - Ready for Review Ready for review by team 5 - Merge After Dependencies Depends on another PR: do not merge out of order Earth-2 enhancement New feature or request
6 participants