
Training cluster performance boost for bert models part 2 #408

Merged: 15 commits merged into main from jstjohn/optimize_cluster_part2 on Nov 8, 2024

Conversation

jstjohn (Collaborator) commented on Nov 7, 2024

See Slack thread: https://nvidia.slack.com/archives/C074Z808N05/p1730301123648729

The main issue I was hitting was the #SBATCH --overcommit setting in my personal Slurm script (see the sbatch sketch after the list below). As a side effect of this exploration, this PR also does a few things:

  • Identified that a bug in some combination of NeMo2 and Megatron-LM causes some fused kernels to be regularly re-compiled
  • Verified that even with this re-compilation bug, we still get the best performance when these kernels are enabled
  • Added an option for geneformer to turn on the torch debug mode that errors out if a kernel is recompiled; this is how I found which kernels were at fault and where/why the recompilations were happening (a minimal sketch follows this list)
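
To surface the recompilations, I leaned on PyTorch's recompile debugging. The sketch below shows one way such a debug switch can be wired up; the exact mechanism and the name of the geneformer option added in this PR are not visible in this thread, so treat torch._dynamo.config.error_on_recompile as my assumption about the underlying knob, not as the PR's actual API.

```python
# Minimal sketch, assuming the debug option maps onto torch._dynamo's
# error_on_recompile switch (the real geneformer flag name is not shown here).
import torch
import torch._dynamo


def enable_recompile_debugging(enabled: bool = True) -> None:
    """Make compiled kernels raise instead of silently recompiling.

    With error_on_recompile set, a guard failure that would normally trigger
    a recompilation raises an exception that points at the offending function,
    which is how the misbehaving fused kernels can be tracked down.
    """
    torch._dynamo.config.error_on_recompile = enabled


if __name__ == "__main__":
    enable_recompile_debugging(True)
    # ... build the model and run a few training steps; a recompiling fused
    # kernel now shows up as an exception instead of a silent slowdown.
```

If you want visibility without hard failures, running with the environment variable TORCH_LOGS=recompiles logs each recompilation and the guard that caused it instead of raising.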

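On the Slurm side, the actual fix for the main issue was simply deleting the overcommit directive from the submission script. A minimal sbatch header sketch is below; the resource values and script name are placeholders rather than the contents of my actual script, and the mechanism note paraphrases the sbatch documentation of --overcommit.

```bash
#!/bin/bash
# Minimal sbatch header sketch (placeholder resource values).
#SBATCH --nodes=2                # placeholder node count
#SBATCH --ntasks-per-node=8      # typically one task per GPU
#SBATCH --gpus-per-node=8        # placeholder GPU count
#
# Do NOT add "#SBATCH --overcommit": per the sbatch docs it allocates only one
# CPU per node to the job, which starves host-side work and was the main
# source of the slowdown this PR was chasing.

srun python train.py "$@"    # placeholder training entry point
```
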
Performance summary of different settings

| Name | Replicate | Num GPUs | Time per 10 steps | Average per setting |
| --- | --- | --- | --- | --- |
| no_recompile | 0 | 1 | 2.487 | 2.50875 |
| no_recompile | 0 | 2 | 2.527 | |
| no_recompile | 1 | 1 | 2.503 | |
| no_recompile | 1 | 2 | 2.518 | |
| fused_bias_act | 0 | 1 | 2.489 | 2.496666667 |
| fused_bias_act | 0 | 2 | 2.514 | |
| fused_bias_act | 1 | 1 | 2.487 | |
| fused_bias_act | 1 | 2 | | |
| fused_bias_act_do | 0 | 1 | 2.459 | 2.47425 |
| fused_bias_act_do | 0 | 2 | 2.478 | |
| fused_bias_act_do | 1 | 1 | 2.471 | |
| fused_bias_act_do | 1 | 2 | 2.489 | |
| fused_loss | 0 | 1 | 2.312 | 2.326 |
| fused_loss | 0 | 2 | 2.335 | |
| fused_loss | 1 | 1 | 2.323 | |
| fused_loss | 1 | 2 | 2.334 | |
| fused_bias_do | 0 | 1 | 2.467 | 2.4845 |
| fused_bias_do | 0 | 2 | 2.499 | |
| fused_bias_do | 1 | 1 | 2.472 | |
| fused_bias_do | 1 | 2 | 2.5 | |
| fused_bias_loss | 0 | 1 | 2.282 | 2.28775 |
| fused_bias_loss | 0 | 2 | 2.297 | |
| fused_bias_loss | 1 | 1 | 2.277 | |
| fused_bias_loss | 1 | 2 | 2.295 | |
| fused_bias_loss_arange_expand | 0 | 1 | 2.277 | 2.29075 |
| fused_bias_loss_arange_expand | 0 | 2 | 2.298 | |
| fused_bias_loss_arange_expand | 1 | 1 | 2.285 | |
| fused_bias_loss_arange_expand | 1 | 2 | 2.303 | |

jstjohn commented on Nov 7, 2024

/build-ci

jstjohn marked this pull request as draft on November 7, 2024, 00:14
jstjohn commented on Nov 7, 2024

/build-ci

jstjohn commented on Nov 8, 2024

/build-ci

jstjohn marked this pull request as ready for review on November 8, 2024, 00:16
jstjohn commented on Nov 8, 2024

/build-ci

jstjohn commented on Nov 8, 2024

/build-ci

jstjohn commented on Nov 8, 2024

/build-ci

malcolmgreaves (Collaborator) left a comment

LGTM -- only requesting one change to not use a known Python anti-pattern. Everything else is good to go though!

jstjohn commented on Nov 8, 2024

/build-ci

jstjohn enabled auto-merge (squash) on November 8, 2024, 19:29
jstjohn commented on Nov 8, 2024

/build-ci

jstjohn merged commit c054c22 into main on Nov 8, 2024 (4 checks passed).
jstjohn deleted the jstjohn/optimize_cluster_part2 branch on November 8, 2024, 20:46.
jstjohn mentioned this pull request on Nov 9, 2024 (13 tasks).