18 changes: 18 additions & 0 deletions config/translation_config.yaml
@@ -0,0 +1,18 @@
# Example YAML configuration for TranslationDataGenerator
# You can use this file with TranslationDataGenerator.from_yaml('config/translation_config.yaml')

base_url: "http://localhost:11434/v1"
api_key: ""
init_translate_model: "gpt-oss:latest"
reflection_model: "gpt-oss:latest"
improvement_model: "gpt-oss:latest"
hf_tokenizer: "openai/gpt-oss-20b"
hf_token: ""
max_token_per_chunk: 5000
temperature: 1.0
top_p: 1.0
stop: ["<|return|>","<|endoftext|>", "<|call|>"]
max_tokens: 24576
source_lang: "English"
target_lang: "Traditional Chinese"
country: "Taiwan"
219 changes: 219 additions & 0 deletions docs/curate-text/generate-data/pipelines/translation.md
@@ -0,0 +1,219 @@
---
description: "Translate text documents using a multi-step LLM pipeline with reflection and improvement"
categories: ["how-to-guides"]
tags: ["translation", "llm", "synthetic-data", "reflection", "improvement"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

(text-gen-data-pipelines-translation)=
# Translation Pipeline

This pipeline translates text documents using a multi-step process with large language models (LLMs), including initial translation, reflection, and improvement. It leverages the NeMo Curator framework and supports both function-based and YAML-based configuration for flexible workflows.

## Before You Start

- **LLM Client Setup**: The `TranslationDataGenerator` requires an `OpenAIClient` instance to interface with language models. See the [LLM services documentation](text-generate-data-connect-service) for details on configuring your client and model provider.

---

## Setup Steps

### Set up the LLM Client

Configure your LLM client (example with OpenAI):

```python
from openai import OpenAI

openai_client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="<insert API key>"
)
```

### Create the NeMo Curator Client Wrapper

Wrap the client with NeMo Curator's client wrapper:

```python
from nemo_curator import OpenAIClient

client = OpenAIClient(openai_client)
```

### Initialize the Translation Generator

Create the `TranslationDataGenerator` instance:

```python
from nemo_curator.synthetic.translate import TranslationDataGenerator

# Create a TranslationDataGenerator instance with specified parameters
generator = TranslationDataGenerator(
    base_url="http://localhost:11434/v1",  # (Change this) Base URL for a local API (Ollama exposes an OpenAI-compatible endpoint)
    api_key="",  # API key (empty if not required)
    init_translate_model="gpt-oss:latest",  # Initial translation model
    reflection_model="gpt-oss:latest",  # Reflection model for improvement
    improvement_model="gpt-oss:latest",  # Model for translation improvement
    hf_tokenizer="openai/gpt-oss-20b",  # (Change this) HuggingFace model for tokenization
    hf_token=None,  # (Change this) HuggingFace authentication token
    temperature=1.0,  # Sampling temperature for generation
    top_p=1.0,  # Nucleus sampling parameter
    max_tokens=24576,  # Maximum tokens for input
    stop=["<|return|>", "<|endoftext|>", "<|call|>"],  # (Change this) Stop token sequences
    max_token_per_chunk=5000,  # Max tokens per chunk for translation
    source_lang="English",  # Source language
    target_lang="Traditional Chinese",  # Target language
    country="Taiwan",  # (Optional) Country context for translation
)
```
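The `max_token_per_chunk` setting bounds how much text each translation request carries. As a rough illustration (a hypothetical helper, not the generator's actual implementation), token-based chunking amounts to slicing the tokenized input into fixed-size windows:

```python
# Hypothetical helper illustrating max_token_per_chunk-style splitting;
# the real chunking logic lives inside TranslationDataGenerator and may differ.
def chunk_token_ids(token_ids, max_tokens_per_chunk):
    """Split a token-id sequence into windows of at most max_tokens_per_chunk."""
    return [
        token_ids[start:start + max_tokens_per_chunk]
        for start in range(0, len(token_ids), max_tokens_per_chunk)
    ]

# In practice the ids would come from the configured HF tokenizer
# (for example AutoTokenizer.from_pretrained("openai/gpt-oss-20b")).
ids = list(range(12))  # stand-in for tokenizer output
print([len(c) for c in chunk_token_ids(ids, max_tokens_per_chunk=5)])  # → [5, 5, 2]
```

Each window is then translated independently, which is why very long documents still fit within the model's context limit.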

### YAML-Based Configuration

You can also configure the generator using a YAML file:

```python
# config/translation_config.yaml
# See the provided example in the repository

generator_yaml = TranslationDataGenerator.from_yaml("config/translation_config.yaml")
```
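Before handing a file to `from_yaml`, it can be worth sanity-checking it with PyYAML. The required-key list below is an assumption for illustration, not an API contract:

```python
import yaml

REQUIRED_KEYS = {  # assumed minimal set; adjust to your configuration
    "base_url", "init_translate_model", "reflection_model",
    "improvement_model", "source_lang", "target_lang",
}

def check_translation_config(yaml_text):
    """Parse YAML text and return any missing required keys."""
    cfg = yaml.safe_load(yaml_text)
    return sorted(REQUIRED_KEYS - cfg.keys())

sample = """
base_url: "http://localhost:11434/v1"
init_translate_model: "gpt-oss:latest"
reflection_model: "gpt-oss:latest"
improvement_model: "gpt-oss:latest"
source_lang: "English"
target_lang: "Traditional Chinese"
"""
print(check_translation_config(sample))  # → []
```

An empty list means every required key is present; otherwise the missing keys are listed so you can fix the file before constructing the generator.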

---

## Translation Workflow

### Translate a Single Text

```python
text = "Once upon a time, there were three little pig brothers..."
translations = generator.generate(text)
print(generator.parse_response(translations))
```

### Translate Using YAML Configuration

```python
translations = generator_yaml.generate_from_yaml("config/translation_config.yaml", text)
print(translations)
```

### Batch Translation with DataFrames

Efficiently translate multiple texts in a pandas DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "text": [
        "Once upon a time, there were three little pig brothers...",
        "The quick brown fox jumps over the lazy dog."
    ]
})

df_translated = generator_yaml.generate_from_dataframe(df, text_column="text", batch_size=16)
print(df_translated.head())
```

### Asynchronous Batch Translation

For large-scale translation tasks, use the async pipeline:

```python
import asyncio

async def async_translate():
    df_translated = await generator_yaml.async_generate_from_dataframe(df, text_column="text", batch_size=16)
    print("[Async]", df_translated.head())

asyncio.run(async_translate())
```

---

## Pipeline Steps Explained

1. **Initial Translation**: The input text is translated by the initial translation model.
2. **Reflection**: The reflection model critiques the initial translation, optionally using country-specific context.
3. **Improvement**: The improvement model produces the final translation by applying the reflection feedback.

This multi-step approach improves translation quality and contextual accuracy.
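The three steps above can be sketched as a single pass over one chat-completion helper. The prompts and the `chat` callable are illustrative placeholders, not NeMo Curator's exact prompts or API:

```python
# Minimal sketch of the translate → reflect → improve loop.
# `chat` is any callable that takes a prompt string and returns the model's reply.
def three_step_translate(chat, text, source_lang, target_lang, country=None):
    # Step 1: initial translation
    draft = chat(f"Translate this {source_lang} text to {target_lang}:\n{text}")

    # Step 2: reflection, optionally grounded in a country/region context
    context = f" for readers in {country}" if country else ""
    critique = chat(
        f"Review this {target_lang} translation{context} and list concrete "
        f"improvements:\nSource: {text}\nTranslation: {draft}"
    )

    # Step 3: improvement using the reflection feedback
    final = chat(
        f"Rewrite the translation applying this feedback:\n"
        f"Source: {text}\nDraft: {draft}\nFeedback: {critique}"
    )
    return final
```

Because each step conditions on the previous one, the final output benefits from an explicit self-review rather than a single-shot translation.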

---

## Example YAML Configuration

```yaml
base_url: "http://localhost:11434/v1"
api_key: ""
init_translate_model: "gpt-oss:latest"
reflection_model: "gpt-oss:latest"
improvement_model: "gpt-oss:latest"
hf_tokenizer: "openai/gpt-oss-20b"
hf_token: ""
max_token_per_chunk: 5000
temperature: 1.0
top_p: 1.0
stop: ["<|return|>", "<|endoftext|>", "<|call|>"]
max_tokens: 24576
source_lang: "English"
target_lang: "Traditional Chinese"
country: "Taiwan"
```

---

## Dataset Translation Example (using HuggingFace Datasets as a template)

### Initialize the TranslationDataGenerator

```python
# (Optional) Use pydantic's BaseSettings for configuration management
# (on Pydantic v2, prefer BaseSettings from the pydantic-settings package)
from typing import Optional

from pydantic.v1 import BaseSettings

# (Optional) Define a Settings class to store model and API configuration
class Settings(BaseSettings):
    hf_token: Optional[str] = None  # (Change this) HuggingFace token for authentication
    hf_model: str = "openai/gpt-oss-20b"  # (Change this) HuggingFace model for tokenization
    model_name: str = "gpt-oss:latest"  # (Change this) Local model name
    base_url: str = "http://localhost:11434/v1"  # (Change this) Base URL for a local API (Ollama exposes an OpenAI-compatible endpoint)

# Instantiate the Settings object to access configuration
setting = Settings()

# Import the TranslationDataGenerator for synthetic translation tasks
from nemo_curator.synthetic.translate import TranslationDataGenerator

# Create a TranslationDataGenerator instance with specified parameters
generator = TranslationDataGenerator(
    base_url=setting.base_url,  # API endpoint
    api_key="",  # API key (empty if not required)
    init_translate_model=setting.model_name,  # Initial translation model
    reflection_model=setting.model_name,  # Reflection model for improvement
    improvement_model=setting.model_name,  # Model for translation improvement
    hf_tokenizer=setting.hf_model,  # Tokenizer model from HuggingFace
    hf_token=setting.hf_token,  # HuggingFace authentication token
    temperature=1.0,  # Sampling temperature for generation
    top_p=1.0,  # Nucleus sampling parameter
    max_tokens=24576,  # Maximum tokens for input
    stop=["<|return|>", "<|endoftext|>", "<|call|>"],  # Stop token sequences
    max_token_per_chunk=5000,  # Max tokens per chunk for translation
    source_lang="English",  # Source language
    target_lang="Traditional Chinese",  # Target language
    country="Taiwan",  # (Optional) Country context for translation
)
```


---

## References

- [NeMo Curator Documentation](https://github.com/NVIDIA/NeMo-Curator)
- [Translation Agent Reference](https://github.com/andrewyng/translation-agent)

---
1 change: 1 addition & 0 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -20,6 +20,7 @@ These include:
| raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. |
| semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. |
| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. |
| translation.py | Translate text into any language supported by the base LLM using `TranslationDataGenerator`. |

Before running any of these scripts, we strongly recommend displaying `python <script name>.py --help` to ensure that any needed or relevant arguments are specified.
