
Training cluster performance boost for bert models part 2 #408

Merged: 15 commits merged into main from jstjohn/optimize_cluster_part2 on Nov 8, 2024

Conversation

jstjohn (Collaborator) commented on Nov 7, 2024

See Slack thread: https://nvidia.slack.com/archives/C074Z808N05/p1730301123648729

The main issue I was hitting was the #SBATCH --overcommit setting in my personal Slurm script (see the sbatch sketch after the list below). As a side effect of this exploration, this PR also does a few things:

  • Identified that a bug in some combination of NeMo2 and Megatron-LM causes some fused kernels to be regularly re-compiled
  • Verified that even with this re-compilation bug, we still get the best performance when these kernels are enabled
  • Added an option for geneformer to turn on the torch debug mode that errors out if a kernel is recompiled; this is how I found which kernels were at fault and where/why the recompilations were happening (a minimal sketch follows this list)
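
To surface the recompilations, I leaned on PyTorch's recompile debugging. The sketch below shows one way such a debug switch can be wired up; the exact mechanism and the name of the geneformer option added in this PR are not visible in this thread, so treat torch._dynamo.config.error_on_recompile as my assumption about the underlying knob, not as the PR's actual API.

```python
# Minimal sketch, assuming the debug option maps onto torch._dynamo's
# error_on_recompile switch (the real geneformer flag name is not shown here).
import torch
import torch._dynamo


def enable_recompile_debugging(enabled: bool = True) -> None:
    """Make compiled kernels raise instead of silently recompiling.

    With error_on_recompile set, a guard failure that would normally trigger
    a recompilation raises an exception that points at the offending function,
    which is how the misbehaving fused kernels can be tracked down.
    """
    torch._dynamo.config.error_on_recompile = enabled


if __name__ == "__main__":
    enable_recompile_debugging(True)
    # ... build the model and run a few training steps; a recompiling fused
    # kernel now shows up as an exception instead of a silent slowdown.
```

If you want visibility without hard failures, running with the environment variable TORCH_LOGS=recompiles logs each recompilation and the guard that caused it instead of raising.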

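On the Slurm side, the actual fix for the main issue was simply deleting the overcommit directive from the submission script. A minimal sbatch header sketch is below; the resource values and script name are placeholders rather than the contents of my actual script, and the mechanism note paraphrases the sbatch documentation of --overcommit.

```bash
#!/bin/bash
# Minimal sbatch header sketch (placeholder resource values).
#SBATCH --nodes=2                # placeholder node count
#SBATCH --ntasks-per-node=8      # typically one task per GPU
#SBATCH --gpus-per-node=8        # placeholder GPU count
#
# Do NOT add "#SBATCH --overcommit": per the sbatch docs it allocates only one
# CPU per node to the job, which starves host-side work and was the main
# source of the slowdown this PR was chasing.

srun python train.py "$@"    # placeholder training entry point
```
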
Performance summary of different settings

| Name | Replicate | Num GPUs | Time per 10 steps | Average per setting |
| --- | --- | --- | --- | --- |
| no_recompile | 0 | 1 | 2.487 | 2.50875 |
| no_recompile | 0 | 2 | 2.527 | |
| no_recompile | 1 | 1 | 2.503 | |
| no_recompile | 1 | 2 | 2.518 | |
| fused_bias_act | 0 | 1 | 2.489 | 2.496666667 |
| fused_bias_act | 0 | 2 | 2.514 | |
| fused_bias_act | 1 | 1 | 2.487 | |
| fused_bias_act | 1 | 2 | | |
| fused_bias_act_do | 0 | 1 | 2.459 | 2.47425 |
| fused_bias_act_do | 0 | 2 | 2.478 | |
| fused_bias_act_do | 1 | 1 | 2.471 | |
| fused_bias_act_do | 1 | 2 | 2.489 | |
| fused_loss | 0 | 1 | 2.312 | 2.326 |
| fused_loss | 0 | 2 | 2.335 | |
| fused_loss | 1 | 1 | 2.323 | |
| fused_loss | 1 | 2 | 2.334 | |
| fused_bias_do | 0 | 1 | 2.467 | 2.4845 |
| fused_bias_do | 0 | 2 | 2.499 | |
| fused_bias_do | 1 | 1 | 2.472 | |
| fused_bias_do | 1 | 2 | 2.5 | |
| fused_bias_loss | 0 | 1 | 2.282 | 2.28775 |
| fused_bias_loss | 0 | 2 | 2.297 | |
| fused_bias_loss | 1 | 1 | 2.277 | |
| fused_bias_loss | 1 | 2 | 2.295 | |
| fused_bias_loss_arange_expand | 0 | 1 | 2.277 | 2.29075 |
| fused_bias_loss_arange_expand | 0 | 2 | 2.298 | |
| fused_bias_loss_arange_expand | 1 | 1 | 2.285 | |
| fused_bias_loss_arange_expand | 1 | 2 | 2.303 | |

jstjohn commented on Nov 7, 2024

/build-ci

jstjohn marked this pull request as draft on November 7, 2024, 00:14
jstjohn commented on Nov 7, 2024

/build-ci

jstjohn commented on Nov 8, 2024

/build-ci

jstjohn marked this pull request as ready for review on November 8, 2024, 00:16
jstjohn commented on Nov 8, 2024

/build-ci

jstjohn commented on Nov 8, 2024

/build-ci

jstjohn commented on Nov 8, 2024

/build-ci

malcolmgreaves (Collaborator) left a comment

LGTM -- only requesting one change to not use a known Python anti-pattern. Everything else is good to go though!

jstjohn commented on Nov 8, 2024

/build-ci

jstjohn enabled auto-merge (squash) on November 8, 2024, 19:29
jstjohn commented on Nov 8, 2024

/build-ci

jstjohn merged commit c054c22 into main on Nov 8, 2024 (4 checks passed).
jstjohn deleted the jstjohn/optimize_cluster_part2 branch on November 8, 2024, 20:46.
jstjohn mentioned this pull request on Nov 9, 2024 (13 tasks).