Fix release notes links from v1.x and update broken site: links #413

Merged: 8 commits, Nov 8, 2024
6 changes: 3 additions & 3 deletions docs/docs/models/esm2.md
@@ -123,7 +123,7 @@ checkpoints is consistent with their outputs when evaluated with the HuggingFace
#### Single-node Training Performance

<figure markdown="span">
-![ESM-2 Single-Device Training Performance](site:assets/images/esm2/esm2_single_node_training_perf.svg){ width="350" }
+![ESM-2 Single-Device Training Performance](../assets/images/esm2/esm2_single_node_training_perf.svg){ width="350" }
</figure>

The pure-pytorch baseline (compiled with `torch.compile()`) raised an out-of-memory error for batch sizes larger than 16
@@ -133,7 +133,7 @@ at the ESM2-650M model size. The `bionemo2` model could handle batch sizes of 46
#### Model Scaling

<figure markdown="span">
-![ESM-2 Model Scaling](site:assets/images/esm2/esm2_model_scaling.svg)
+![ESM-2 Model Scaling](../assets/images/esm2/esm2_model_scaling.svg)
</figure>

Training ESM-2 at the 650M, 3B, and 15B model variants show improved performance with the BioNeMo2 framework over the
@@ -143,7 +143,7 @@ nodes.
#### Device Scaling

<figure markdown="span">
-![ESM-2 Device Scaling](site:assets/images/esm2/esm2_device_scaling.svg){ width="400" }
+![ESM-2 Device Scaling](../assets/images/esm2/esm2_device_scaling.svg){ width="400" }
</figure>

Training ESM-3B on 256 NVIDIA A100s on 32 nodes achieved 96.85% of the theoretical linear throughput expected from
12 changes: 6 additions & 6 deletions docs/docs/user-guide/appendix/releasenotes-fw.md
@@ -76,11 +76,11 @@
* **Beta** [Geneformer](https://www.nature.com/articles/s41586-023-06139-9) a foundation model for single-cell data that encodes each cell as represented by an ordered list of differentially expressed genes for that cell.

### New Features
-* **Beta** [Geneformer pretraining with custom datasets](notebooks/geneformer_cellxgene_tutorial.ipynb)
-* [Low-Rank Adaptation (LoRA) finetuning for ESM2](lora-finetuning-esm2.md)
+* **Beta** Geneformer pretraining with custom datasets
+* Low-Rank Adaptation (LoRA) finetuning for ESM2

### Bug fixes and Improvements
-* [OpenFold training improved benchmarks and validation of optimizations](models/openfold.md)
+* OpenFold training improved benchmarks and validation of optimizations

### Known Issues
* BioNeMo Framework v24.04 container is vulnerable to [GHSA-whh8-fjgc-qp73](https://github.com/advisories/GHSA-whh8-fjgc-qp73) in onnx 1.14.0. Users are advised not to open untrusted onnx files with this image. Restrict your mount point to minimize directory traversal impact. A fix for this is scheduled in the 24.05 (May) release.
@@ -91,9 +91,9 @@

### New Features
* [MolMIM](https://developer.nvidia.com/blog/new-models-molmim-and-diffdock-power-molecule-generation-and-molecular-docking-in-bionemo/) re-trained on more data is now available in the framework, and achieves [state of the art performance](models/molmim.md).
-* [MolMIM property guided tutorial notebook](notebooks/cma_es_guided_molecular_optimization_molmim.ipynb) covering property guided optimization using our new framework model.
-* [MolMIM training tutorial](notebooks/model_training_molmim.ipynb) available walking users through either training from scratch or from an existing checkpoint on your own data.
-* [MolMIM tutorial notebook covering molecular sampling and property prediction](notebooks/MolMIM_GenerativeAI_local_inference_with_examples.ipynb) is also now available.
+* MolMIM property guided tutorial notebook covering property guided optimization using our new framework model.
+* MolMIM training tutorial available walking users through either training from scratch or from an existing checkpoint on your own data.
+* MolMIM tutorial notebook covering molecular sampling and property prediction is also now available.
* Numerous optimizations from [NVIDIA's entry to the MLPerf competition](https://developer.nvidia.com/blog/optimizing-openfold-training-for-drug-discovery/) have been added to OpenFold. Documentation and detailed benchmarks are works in progress and will be published in upcoming releases. This release contains the following performance optimizations:
* Fused GEMMs in multi-head attention (MHA)
* Non-blocking data pipeline
16 changes: 8 additions & 8 deletions docs/docs/user-guide/background/nemo2.md
@@ -24,7 +24,7 @@ Synchronization of gradients occurs after the backward pass is complete for each
that ensures all GPUs have synchronized parameters for the next iteration. Here is an example of how this might appear
on your cluster with a small model:

-![Data Parallelism Diagram](site:assets/images/megatron_background/data_parallelism.png)
+![Data Parallelism Diagram](../../assets/images/megatron_background/data_parallelism.png)
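
As a rough sketch of what DDP looks like in plain PyTorch (illustrative only, not BioNeMo code; assumes a `torchrun` launch that sets `LOCAL_RANK`):

```python
# Minimal PyTorch DDP sketch: each rank holds a full copy of the model and
# gradients are all-reduced automatically during backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda()          # stand-in for a small model
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

x = torch.randn(8, 512, device="cuda")            # each rank sees its own shard of the data
loss = ddp_model(x).pow(2).mean()
loss.backward()                                   # gradients are averaged across ranks here
optimizer.step()                                  # every rank applies the same update
```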

### FSDP background
FSDP extends DDP by sharding (splitting) model weights across GPUs in your cluster to optimize memory usage.
@@ -40,8 +40,8 @@ Note that this process parallelizes the storage in a way that enables too large
layer is not too large to fit on a GPU). Megatron (next) co-locates both storage and compute.

The following two figures show two steps through the forward pass of a model that has been sharded with FSDP.
-![FSDP Diagram Step 1](site:assets/images/megatron_background/fsdp_slide1.png)
-![FSDP Diagram Step 2](site:assets/images/megatron_background/fsdp_slide2.png)
+![FSDP Diagram Step 1](../../assets/images/megatron_background/fsdp_slide1.png)
+![FSDP Diagram Step 2](../../assets/images/megatron_background/fsdp_slide2.png)
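
A minimal PyTorch FSDP sketch of the same idea (illustrative only; assumes a `torchrun` launch):

```python
# FSDP shards each parameter group across ranks and all-gathers the full
# weights just-in-time for each layer's forward/backward, freeing them after.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# Each rank stores only its shard of the parameters; compute still happens on
# full (temporarily gathered) layer weights.
fsdp_model = FSDP(model)

x = torch.randn(4, 1024, device="cuda")
fsdp_model(x).sum().backward()
```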

### Model Parallelism
Model parallelism is the catch-all term for the variety of different parallelism strategies
@@ -55,7 +55,7 @@ Pipeline parallelism is similar to FSDP, but the model blocks that are sharded a
nodes that own the model weight in question. You can think of this as a larger simulated GPU that happens to be spread
across several child GPUs. Examples of this include `parallel_state.is_pipeline_last_stage()` which is commonly
used to tell if a particular node is on last pipeline stage, where you compute the final head outputs, loss, etc.
-![Pipeline Parallelism](site:/assets/images/megatron_background/pipeline_parallelism.png). Similarly there are convenience
+![Pipeline Parallelism](../..//assets/images/megatron_background/pipeline_parallelism.png). Similarly there are convenience
environmental lookups for the first pipeline stage (where you compute the embedding for example)
`parallel_state.is_pipeline_first_stage()`.
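
The sketch below shows how these `parallel_state` checks typically gate stage-specific work; everything other than the two `parallel_state` calls is a hypothetical placeholder, not a Megatron or BioNeMo API.

```python
# Sketch only: stage-specific work in a pipeline-parallel forward step.
# compute_embeddings, transformer_layers, and compute_head_and_loss are
# hypothetical callables supplied by the caller.
from megatron.core import parallel_state

def forward_step(tokens, hidden_states, compute_embeddings, transformer_layers,
                 compute_head_and_loss):
    if parallel_state.is_pipeline_first_stage():
        # Only the first stage owns the embedding table.
        hidden_states = compute_embeddings(tokens)

    hidden_states = transformer_layers(hidden_states)  # layers owned by this stage

    if parallel_state.is_pipeline_last_stage():
        # Only the last stage owns the output head and computes the loss.
        return compute_head_and_loss(hidden_states)
    return hidden_states  # handed to the next pipeline stage
```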

Expand All @@ -64,7 +64,7 @@ Tensor parallelism represents splitting single layers across GPUs. This can also
layers could in theory be too large to fit on a single GPU, which would make FSDP not possible. This would still work
since individual layer weights (and computations) are distributed. Examples of this in megatron include `RowParallelLinear` and
`ColumnParallelLinear` layers.
-![Tensor Parallelism](site:assets/images/megatron_background/tensor_parallelism.png)
+![Tensor Parallelism](../../assets/images/megatron_background/tensor_parallelism.png)
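
The following pure-PyTorch sketch illustrates the column-parallel/row-parallel split that those Megatron layers implement (this is not the Megatron code itself; it assumes `torch.distributed` is initialized and each rank already holds its own weight shards):

```python
# Pure-PyTorch illustration of the ColumnParallelLinear / RowParallelLinear split.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def tensor_parallel_mlp(x, w1_shard, w2_shard):
    # Column parallel: each rank holds a slice of W1's output features and
    # computes only its slice of the hidden activation.
    h = F.gelu(x @ w1_shard)                  # [batch, hidden / tp_size]
    # Row parallel: each rank holds the matching slice of W2's input features;
    # partial outputs are summed across ranks with an all-reduce.
    partial_out = h @ w2_shard                # [batch, output]
    dist.all_reduce(partial_out, op=dist.ReduceOp.SUM)
    return partial_out
```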

#### Sequence Parallelism
In megatron, "sequence parallelism" refers to the parallelization of the dropout, and layernorm blocks of a transformer.
Expand All @@ -77,7 +77,7 @@ layers (which are typically set up for tensor parallelism). Next the result from
sequence parallel nodes which execute dropout, do a residual connection from the previous sequence parallel output, and
a layernorm. Next those results are again gathered for the final FFN and activation layers prior to a final scattering
across sequence parallel GPUs for the output of that transformer block.
-![Sequence Parallelism](site:assets/images/megatron_background/sp_korthikanti_2022_fig5.png)
+![Sequence Parallelism](../../assets/images/megatron_background/sp_korthikanti_2022_fig5.png)

As a user, if you know that your transformer is executed in parallel and you have custom losses or downstream layers,
you need to make sure that the appropriate gather operations are occurring for your loss computation etc.
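
For example, a custom head and loss on top of a sequence-parallel transformer usually has to gather the sequence dimension back first. A hedged sketch follows; the gather helper name matches recent Megatron-Core releases, but verify it against the version shipped in your container.

```python
# Sketch only: gather sequence-parallel activations before a custom per-token loss.
import torch
from megatron.core.tensor_parallel import gather_from_sequence_parallel_region

def custom_token_loss(sp_hidden, labels, output_head):
    # Each rank holds only a slice of the sequence dimension; gather so every
    # rank sees the full sequence before the head and loss are computed.
    full_hidden = gather_from_sequence_parallel_region(sp_hidden)
    logits = output_head(full_hidden)
    return torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1)
    )
```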
@@ -102,12 +102,12 @@ Below is a figure demonstrating how mixing strategies results in larger "virtual
fewer distinct micro-batches in flight across your cluster. Also note that the number of virtual GPUs is multiplicative
so if you have `TP=2` and `PP=2` then you are creating a larger virtual GPU out of `2*2=4` GPUs, so your cluster size
needs to be a multiple of 4 in this case.
-![Mixing Tensor and Pipeline Parallelism](site:assets/images/megatron_background/tensor_and_pipeline_parallelism.png)
+![Mixing Tensor and Pipeline Parallelism](../../assets/images/megatron_background/tensor_and_pipeline_parallelism.png)
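
A worked example of that multiplicative rule (plain arithmetic, no framework calls):

```python
# With TP=2 and PP=2, each model replica spans 2*2=4 GPUs, so the cluster size
# must be a multiple of 4 and the remaining factor becomes data parallelism.
world_size = 16                    # total GPUs in the job
tensor_parallel = 2                # TP=2
pipeline_parallel = 2              # PP=2

model_parallel = tensor_parallel * pipeline_parallel    # 4 GPUs per "virtual GPU"
assert world_size % model_parallel == 0, "cluster size must be a multiple of TP*PP"

data_parallel = world_size // model_parallel            # 4 model replicas training in parallel
print(model_parallel, data_parallel)                    # -> 4 4
```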

#### Scheduling model parallelism
You can improve on naive schedules by splitting up micro-batches into smaller pieces, executing multiple stages of the
model on single GPUs, and starting computing the backwards pass of one micro-batch while another is going through forward.
These optimizations allow for better cluster GPU utilization to be achieved. For example the following figure shows
how more advanced splitting techniques in megatron (eg the interleaved scheduler) provide better utilization when model
parallelism is used. Again when you can get away without using model parallelism (DDP), that is generally the best approach.
-![Execution Schedulers](site:assets/images/megatron_background/execution_schedulers.png)
+![Execution Schedulers](../../assets/images/megatron_background/execution_schedulers.png)
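
As a purely illustrative example of the interleaved idea (arithmetic only, no framework calls), each pipeline rank owns several non-adjacent model chunks, which gives the scheduler more work to overlap:

```python
# With PP=4 and 2 "virtual" chunks per rank, a 16-layer model is split into
# 8 chunks of 2 layers, and each pipeline rank owns two non-adjacent chunks.
num_layers = 16
pipeline_parallel = 4
virtual_chunks_per_rank = 2

num_chunks = pipeline_parallel * virtual_chunks_per_rank   # 8 chunks total
layers_per_chunk = num_layers // num_chunks                # 2 layers each
for rank in range(pipeline_parallel):
    owned = [rank + pipeline_parallel * v for v in range(virtual_chunks_per_rank)]
    print(f"pipeline rank {rank} owns chunks {owned} ({layers_per_chunk} layers each)")
# pipeline rank 0 owns chunks [0, 4], rank 1 owns [1, 5], and so on.
```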