Commit 92a3a2a

lbliii and sarahyurick authored
Llane/readme image fix (#939)
* image fixes

Signed-off-by: Lawrence Lane <[email protected]>

* marketing verbiage

Signed-off-by: Lawrence Lane <[email protected]>

* Update README.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
1 parent 90f70c5 commit 92a3a2a

4 files changed: +12 -4 lines changed

README.md

Lines changed: 12 additions & 4 deletions
@@ -10,7 +10,7 @@

 # Accelerate Data Processing and Streamline Synthetic Data Generation with NVIDIA NeMo Curator

-NeMo Curator is a Python library specifically designed for fast and scalable data processing and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT).
+NeMo Curator, part of the NVIDIA NeMo software suite for managing the AI agent lifecycle, is a Python library specifically designed for fast and scalable data processing and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT).

 It greatly accelerates data processing and curation by leveraging GPUs with [Dask](https://www.dask.org/) and [RAPIDS](https://developer.nvidia.com/rapids), resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.

@@ -94,12 +94,20 @@ The modules within NeMo Curator were primarily designed to process and curate hi
 The following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance.

 <p align="center">
-<img src="./docs/user-guide/assets/readme/chart.png" alt="drawing" width="700"/>
+<img src="./docs/_images/ablation.png" alt="drawing" width="700"/>
 </p>

-NeMo Curator leverages NVIDIA RAPIDS™ libraries like cuDF, cuML, and cuGraph along with Dask to scale workloads across multi-node, multi-GPU environments, significantly reducing data processing time. With NeMo Curator, developers can achieve 16X faster processing for text. Refer to the chart below to learn more details.
+NeMo Curator leverages NVIDIA RAPIDS™ libraries like cuDF, cuML, and cuGraph along with Dask to scale workloads across multi-node, multi-GPU environments, significantly reducing data processing time. With NeMo Curator, developers achieve approximately 16× faster fuzzy‑deduplication on an 8 TB RedPajama‑v2 subset, with ~40% lower TCO and near‑linear scaling on 1–4 H100 80 GB nodes. Refer to the chart below to learn more details.

-NeMo Curator scales near linearly which means that developers can accelerate their data processing by adding more compute. For deduplicating the 1.96 Trillion token subset of the RedPajama V2 dataset, NeMo Curator took 0.5 hours with 32 NVIDIA H100 GPUs. Refer to the scaling chart below to learn more
+<p align="center">
+<img src="./docs/_images/text-benchmarks.png" alt="drawing" width="700"/>
+</p>
+
+NeMo Curator exhibits near‑linear scaling for fuzzy deduplication. On an 8 TB RedPajama‑v2 subset (~1.78 trillion tokens), processing time drops from 2.05 hours on one H100 80 GB node to 0.50 hours on four nodes. Refer to the scaling chart below to learn more:
+
+<p align="center">
+<img src="./docs/_images/scaling.png" alt="drawing" width="700"/>
+</p>

 ## Contribute to NeMo Curator
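
Reviewer note on the "near-linear scaling" wording added in this diff: the two timings quoted in the new README text (2.05 hours on one node, 0.50 hours on four nodes) can be checked with simple arithmetic. The sketch below is not part of the commit and does not use NeMo Curator; the function names `speedup` and `scaling_efficiency` are hypothetical helpers introduced only for illustration, and the only inputs are the figures quoted in the README text.

```python
# Figures quoted in the updated README text (fuzzy deduplication,
# 8 TB RedPajama-v2 subset). Everything else here is illustrative.
BASELINE_NODES = 1
BASELINE_HOURS = 2.05
SCALED_NODES = 4
SCALED_HOURS = 0.50


def speedup(baseline_hours: float, scaled_hours: float) -> float:
    """Ratio of the baseline runtime to the scaled-out runtime."""
    return baseline_hours / scaled_hours


def scaling_efficiency(
    baseline_nodes: int,
    baseline_hours: float,
    scaled_nodes: int,
    scaled_hours: float,
) -> float:
    """Observed speedup divided by the ideal (perfectly linear) speedup."""
    ideal_speedup = scaled_nodes / baseline_nodes
    return speedup(baseline_hours, scaled_hours) / ideal_speedup


if __name__ == "__main__":
    observed = speedup(BASELINE_HOURS, SCALED_HOURS)
    efficiency = scaling_efficiency(
        BASELINE_NODES, BASELINE_HOURS, SCALED_NODES, SCALED_HOURS
    )
    print(f"observed speedup: {observed:.2f}x on {SCALED_NODES} nodes")
    print(f"scaling efficiency vs. linear: {efficiency:.1%}")
```

With the quoted numbers, the observed speedup is 2.05 / 0.50 = 4.1 on four nodes, which is slightly above the ideal factor of 4 and so consistent with the "near-linear" description, within whatever measurement noise the benchmark has.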

docs/_images/ablation.png

233 KB

docs/_images/scaling.png

575 KB

docs/_images/text-benchmarks.png

353 KB
