Fix release notes links from v1.x and update broken site: links #413

Merged: 8 commits, Nov 8, 2024
6 changes: 3 additions & 3 deletions docs/docs/models/esm2.md
@@ -123,7 +123,7 @@ checkpoints is consistent with their outputs when evaluated with the HuggingFace
#### Single-node Training Performance

<figure markdown="span">
-![ESM-2 Single-Device Training Performance](site:assets/images/esm2/esm2_single_node_training_perf.svg){ width="350" }
+![ESM-2 Single-Device Training Performance](../assets/images/esm2/esm2_single_node_training_perf.svg){ width="350" }
</figure>

The pure-pytorch baseline (compiled with `torch.compile()`) raised an out-of-memory error for batch sizes larger than 16
@@ -133,7 +133,7 @@ at the ESM2-650M model size. The `bionemo2` model could handle batch sizes of 46
#### Model Scaling

<figure markdown="span">
-![ESM-2 Model Scaling](site:assets/images/esm2/esm2_model_scaling.svg)
+![ESM-2 Model Scaling](../assets/images/esm2/esm2_model_scaling.svg)
</figure>

Training ESM-2 at the 650M, 3B, and 15B model variants show improved performance with the BioNeMo2 framework over the
@@ -143,7 +143,7 @@ nodes.
#### Device Scaling

<figure markdown="span">
-![ESM-2 Device Scaling](site:assets/images/esm2/esm2_device_scaling.svg){ width="400" }
+![ESM-2 Device Scaling](../assets/images/esm2/esm2_device_scaling.svg){ width="400" }
</figure>

Training ESM-3B on 256 NVIDIA A100s on 32 nodes achieved 96.85% of the theoretical linear throughput expected from
12 changes: 6 additions & 6 deletions docs/docs/user-guide/appendix/releasenotes-fw.md
@@ -76,11 +76,11 @@
* **Beta** [Geneformer](https://www.nature.com/articles/s41586-023-06139-9) a foundation model for single-cell data that encodes each cell as represented by an ordered list of differentially expressed genes for that cell.

### New Features
-* **Beta** [Geneformer pretraining with custom datasets](notebooks/geneformer_cellxgene_tutorial.ipynb)
-* [Low-Rank Adaptation (LoRA) finetuning for ESM2](lora-finetuning-esm2.md)
+* **Beta** Geneformer pretraining with custom datasets
+* Low-Rank Adaptation (LoRA) finetuning for ESM2

### Bug fixes and Improvements
-* [OpenFold training improved benchmarks and validation of optimizations](models/openfold.md)
+* OpenFold training improved benchmarks and validation of optimizations

### Known Issues
* BioNeMo Framework v24.04 container is vulnerable to [GHSA-whh8-fjgc-qp73](https://github.com/advisories/GHSA-whh8-fjgc-qp73) in onnx 1.14.0. Users are advised not to open untrusted onnx files with this image. Restrict your mount point to minimize directory traversal impact. A fix for this is scheduled in the 24.05 (May) release.
@@ -91,9 +91,9 @@

### New Features
* [MolMIM](https://developer.nvidia.com/blog/new-models-molmim-and-diffdock-power-molecule-generation-and-molecular-docking-in-bionemo/) re-trained on more data is now available in the framework, and achieves [state of the art performance](models/molmim.md).
-* [MolMIM property guided tutorial notebook](notebooks/cma_es_guided_molecular_optimization_molmim.ipynb) covering property guided optimization using our new framework model.
-* [MolMIM training tutorial](notebooks/model_training_molmim.ipynb) available walking users through either training from scratch or from an existing checkpoint on your own data.
-* [MolMIM tutorial notebook covering molecular sampling and property prediction](notebooks/MolMIM_GenerativeAI_local_inference_with_examples.ipynb) is also now available.
+* MolMIM property guided tutorial notebook covering property guided optimization using our new framework model.
+* MolMIM training tutorial available walking users through either training from scratch or from an existing checkpoint on your own data.
+* MolMIM tutorial notebook covering molecular sampling and property prediction is also now available.
* Numerous optimizations from [NVIDIA's entry to the MLPerf competition](https://developer.nvidia.com/blog/optimizing-openfold-training-for-drug-discovery/) have been added to OpenFold. Documentation and detailed benchmarks are works in progress and will be published in upcoming releases. This release contains the following performance optimizations:
* Fused GEMMs in multi-head attention (MHA)
* Non-blocking data pipeline
16 changes: 8 additions & 8 deletions docs/docs/user-guide/background/nemo2.md
@@ -24,7 +24,7 @@ Synchronization of gradients occurs after the backward pass is complete for each
that ensures all GPUs have synchronized parameters for the next iteration. Here is an example of how this might appear
on your cluster with a small model:

-![Data Parallelism Diagram](site:assets/images/megatron_background/data_parallelism.png)
+![Data Parallelism Diagram](../../assets/images/megatron_background/data_parallelism.png)
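
As a rough sketch of what DDP looks like in plain PyTorch (illustrative only, not BioNeMo code; assumes a `torchrun` launch that sets `LOCAL_RANK`):

```python
# Minimal PyTorch DDP sketch: each rank holds a full copy of the model and
# gradients are all-reduced automatically during backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda()          # stand-in for a small model
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

x = torch.randn(8, 512, device="cuda")            # each rank sees its own shard of the data
loss = ddp_model(x).pow(2).mean()
loss.backward()                                   # gradients are averaged across ranks here
optimizer.step()                                  # every rank applies the same update
```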

### FSDP background
FSDP extends DDP by sharding (splitting) model weights across GPUs in your cluster to optimize memory usage.
@@ -40,8 +40,8 @@ Note that this process parallelizes the storage in a way that enables too large
layer is not too large to fit on a GPU). Megatron (next) co-locates both storage and compute.

The following two figures show two steps through the forward pass of a model that has been sharded with FSDP.
-![FSDP Diagram Step 1](site:assets/images/megatron_background/fsdp_slide1.png)
-![FSDP Diagram Step 2](site:assets/images/megatron_background/fsdp_slide2.png)
+![FSDP Diagram Step 1](../../assets/images/megatron_background/fsdp_slide1.png)
+![FSDP Diagram Step 2](../../assets/images/megatron_background/fsdp_slide2.png)
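
A minimal PyTorch FSDP sketch of the same idea (illustrative only; assumes a `torchrun` launch):

```python
# FSDP shards each parameter group across ranks and all-gathers the full
# weights just-in-time for each layer's forward/backward, freeing them after.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# Each rank stores only its shard of the parameters; compute still happens on
# full (temporarily gathered) layer weights.
fsdp_model = FSDP(model)

x = torch.randn(4, 1024, device="cuda")
fsdp_model(x).sum().backward()
```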

### Model Parallelism
Model parallelism is the catch-all term for the variety of different parallelism strategies
@@ -55,7 +55,7 @@ Pipeline parallelism is similar to FSDP, but the model blocks that are sharded a
nodes that own the model weight in question. You can think of this as a larger simulated GPU that happens to be spread
across several child GPUs. Examples of this include `parallel_state.is_pipeline_last_stage()` which is commonly
used to tell if a particular node is on last pipeline stage, where you compute the final head outputs, loss, etc.
-![Pipeline Parallelism](site:/assets/images/megatron_background/pipeline_parallelism.png). Similarly there are convenience
+![Pipeline Parallelism](../..//assets/images/megatron_background/pipeline_parallelism.png). Similarly there are convenience
environmental lookups for the first pipeline stage (where you compute the embedding for example)
`parallel_state.is_pipeline_first_stage()`.
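
The sketch below shows how these `parallel_state` checks typically gate stage-specific work; everything other than the two `parallel_state` calls is a hypothetical placeholder, not a Megatron or BioNeMo API.

```python
# Sketch only: stage-specific work in a pipeline-parallel forward step.
# compute_embeddings, transformer_layers, and compute_head_and_loss are
# hypothetical callables supplied by the caller.
from megatron.core import parallel_state

def forward_step(tokens, hidden_states, compute_embeddings, transformer_layers,
                 compute_head_and_loss):
    if parallel_state.is_pipeline_first_stage():
        # Only the first stage owns the embedding table.
        hidden_states = compute_embeddings(tokens)

    hidden_states = transformer_layers(hidden_states)  # layers owned by this stage

    if parallel_state.is_pipeline_last_stage():
        # Only the last stage owns the output head and computes the loss.
        return compute_head_and_loss(hidden_states)
    return hidden_states  # handed to the next pipeline stage
```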

Expand All @@ -64,7 +64,7 @@ Tensor parallelism represents splitting single layers across GPUs. This can also
layers could in theory be too large to fit on a single GPU, which would make FSDP not possible. This would still work
since individual layer weights (and computations) are distributed. Examples of this in megatron include `RowParallelLinear` and
`ColumnParallelLinear` layers.
-![Tensor Parallelism](site:assets/images/megatron_background/tensor_parallelism.png)
+![Tensor Parallelism](../../assets/images/megatron_background/tensor_parallelism.png)
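
The following pure-PyTorch sketch illustrates the column-parallel/row-parallel split that those Megatron layers implement (this is not the Megatron code itself; it assumes `torch.distributed` is initialized and each rank already holds its own weight shards):

```python
# Pure-PyTorch illustration of the ColumnParallelLinear / RowParallelLinear split.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def tensor_parallel_mlp(x, w1_shard, w2_shard):
    # Column parallel: each rank holds a slice of W1's output features and
    # computes only its slice of the hidden activation.
    h = F.gelu(x @ w1_shard)                  # [batch, hidden / tp_size]
    # Row parallel: each rank holds the matching slice of W2's input features;
    # partial outputs are summed across ranks with an all-reduce.
    partial_out = h @ w2_shard                # [batch, output]
    dist.all_reduce(partial_out, op=dist.ReduceOp.SUM)
    return partial_out
```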

#### Sequence Parallelism
In megatron, "sequence parallelism" refers to the parallelization of the dropout, and layernorm blocks of a transformer.
Expand All @@ -77,7 +77,7 @@ layers (which are typically set up for tensor parallelism). Next the result from
sequence parallel nodes which execute dropout, do a residual connection from the previous sequence parallel output, and
a layernorm. Next those results are again gathered for the final FFN and activation layers prior to a final scattering
across sequence parallel GPUs for the output of that transformer block.
-![Sequence Parallelism](site:assets/images/megatron_background/sp_korthikanti_2022_fig5.png)
+![Sequence Parallelism](../../assets/images/megatron_background/sp_korthikanti_2022_fig5.png)

As a user, if you know that your transformer is executed in parallel and you have custom losses or downstream layers,
you need to make sure that the appropriate gather operations are occurring for your loss computation etc.
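
For example, a custom head and loss on top of a sequence-parallel transformer usually has to gather the sequence dimension back first. A hedged sketch follows; the gather helper name matches recent Megatron-Core releases, but verify it against the version shipped in your container.

```python
# Sketch only: gather sequence-parallel activations before a custom per-token loss.
import torch
from megatron.core.tensor_parallel import gather_from_sequence_parallel_region

def custom_token_loss(sp_hidden, labels, output_head):
    # Each rank holds only a slice of the sequence dimension; gather so every
    # rank sees the full sequence before the head and loss are computed.
    full_hidden = gather_from_sequence_parallel_region(sp_hidden)
    logits = output_head(full_hidden)
    return torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1)
    )
```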
@@ -102,12 +102,12 @@ Below is a figure demonstrating how mixing strategies results in larger "virtual
fewer distinct micro-batches in flight across your cluster. Also note that the number of virtual GPUs is multiplicative
so if you have `TP=2` and `PP=2` then you are creating a larger virtual GPU out of `2*2=4` GPUs, so your cluster size
needs to be a multiple of 4 in this case.
-![Mixing Tensor and Pipeline Parallelism](site:assets/images/megatron_background/tensor_and_pipeline_parallelism.png)
+![Mixing Tensor and Pipeline Parallelism](../../assets/images/megatron_background/tensor_and_pipeline_parallelism.png)
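
A worked example of that multiplicative rule (plain arithmetic, no framework calls):

```python
# With TP=2 and PP=2, each model replica spans 2*2=4 GPUs, so the cluster size
# must be a multiple of 4 and the remaining factor becomes data parallelism.
world_size = 16                    # total GPUs in the job
tensor_parallel = 2                # TP=2
pipeline_parallel = 2              # PP=2

model_parallel = tensor_parallel * pipeline_parallel    # 4 GPUs per "virtual GPU"
assert world_size % model_parallel == 0, "cluster size must be a multiple of TP*PP"

data_parallel = world_size // model_parallel            # 4 model replicas training in parallel
print(model_parallel, data_parallel)                    # -> 4 4
```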

#### Scheduling model parallelism
You can improve on naive schedules by splitting up micro-batches into smaller pieces, executing multiple stages of the
model on single GPUs, and starting computing the backwards pass of one micro-batch while another is going through forward.
These optimizations allow for better cluster GPU utilization to be achieved. For example the following figure shows
how more advanced splitting techniques in megatron (eg the interleaved scheduler) provide better utilization when model
parallelism is used. Again when you can get away without using model parallelism (DDP), that is generally the best approach.
-![Execution Schedulers](site:assets/images/megatron_background/execution_schedulers.png)
+![Execution Schedulers](../../assets/images/megatron_background/execution_schedulers.png)
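
As a purely illustrative example of the interleaved idea (arithmetic only, no framework calls), each pipeline rank owns several non-adjacent model chunks, which gives the scheduler more work to overlap:

```python
# With PP=4 and 2 "virtual" chunks per rank, a 16-layer model is split into
# 8 chunks of 2 layers, and each pipeline rank owns two non-adjacent chunks.
num_layers = 16
pipeline_parallel = 4
virtual_chunks_per_rank = 2

num_chunks = pipeline_parallel * virtual_chunks_per_rank   # 8 chunks total
layers_per_chunk = num_layers // num_chunks                # 2 layers each
for rank in range(pipeline_parallel):
    owned = [rank + pipeline_parallel * v for v in range(virtual_chunks_per_rank)]
    print(f"pipeline rank {rank} owns chunks {owned} ({layers_per_chunk} layers each)")
# pipeline rank 0 owns chunks [0, 4], rank 1 owns [1, 5], and so on.
```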