@TsukiSama9292 commented Aug 23, 2025

Description

Added an LLM-based multi-step text translation feature.

  • TranslationDataGenerator workflows:
    • Create a class instance (from keyword arguments or a YAML config): initializes an OpenAIClient (used for the initial translation, reflection, and refinement steps) and a Tokenizer (used to count tokens in long texts).
    • Perform translation:
      • Long-text segmentation: softly segments the text using the tokenizer and configuration parameters; the method first splits by sentence, then packs sentences into chunks under a soft token-length limit.
      • Translates each chunk one by one, following the implementation of GitHub: andrewyng/translation-agent (a minimal sketch of this loop appears after the list):
        • Initial Translation Agent: uses the LLM to produce an initial translation from the source language, source text, and target language.
        • Reflection Agent: uses the LLM with the same inputs, plus the initial translation and target country, to reflect on and evaluate the initial translation.
        • Improvement Agent: uses all of the above to refine the translation and improve overall quality.
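Conceptually, the three-agent loop for each chunk looks like the following minimal sketch (the client setup mirrors the Usage example below; the prompts are illustrative placeholders, not the exact prompts used in this PR):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def _chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def translate_chunk(chunk: str, src: str, tgt: str, country: str, model: str) -> str:
    # Step 1: initial translation from source to target language.
    draft = _chat(model, f"Translate the following {src} text into {tgt}:\n\n{chunk}")
    # Step 2: reflection -- critique the draft, using the target country as context.
    critique = _chat(
        model,
        f"You are reviewing a {src} to {tgt} translation for readers in {country}. "
        f"List concrete suggestions to improve it.\n\nSource:\n{chunk}\n\nDraft:\n{draft}",
    )
    # Step 3: improvement -- rewrite the draft by applying the critique.
    return _chat(
        model,
        f"Rewrite the draft translation below, applying the suggestions.\n\n"
        f"Source ({src}):\n{chunk}\n\nDraft ({tgt}):\n{draft}\n\nSuggestions:\n{critique}",
    )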

Usage

from nemo_curator.synthetic.translate import TranslationDataGenerator

# Example usage for both function and YAML config
if __name__ == "__main__":
    # Function-based usage
    text = "Once upon a time, there were three little pig brothers..."
    generator = TranslationDataGenerator(
        base_url="http://localhost:11434/v1",                   # (Change this) Base URL for local API (P.S: Ollama supports the OpenAI API format.)
        api_key="",                                             # API key (empty if not required)
        init_translate_model="gpt-oss:latest",                  # Initial translation model
        reflection_model="gpt-oss:latest",                      # Reflection model for improvement
        improvement_model="gpt-oss:latest",                     # Model for translation improvement
        hf_tokenizer="openai/gpt-oss-20b",                      # (Change this) HuggingFace model for tokenization
        hf_token=None,                                          # (Change this) HuggingFace authentication token
        temperature=1.0,                                        # Sampling temperature for generation
        top_p=1.0,                                              # Nucleus sampling parameter
        max_tokens=24576,                                       # Maximum tokens for input
        stop=["<|return|>","<|endoftext|>", "<|call|>"],        # (Change this) Stop TOKEN sequences
        max_token_per_chunk=5000,                               # Max tokens per chunk for translation
        source_lang="English",                                  # Source language
        target_lang="Traditional Chinese",                      # Target language
        country="Taiwan",                                       # (Optional) Country context for translation
    )
    translations = generator.generate(text)
    print(generator.parse_response(translations))

    # YAML-based usage (parameters only, text provided in code)
    generator_yaml = TranslationDataGenerator.from_yaml("config/translation_config.yaml")
    print(generator_yaml.generate_from_yaml("config/translation_config.yaml", text))

    # Pipeline DataFrame usage
    import pandas as pd
    df = pd.DataFrame(
        {
            "text": [
                "Once upon a time, there were three little pig brothers...",
                "The quick brown fox jumps over the lazy dog.",
            ]
        }
    )
    df_translated = generator_yaml.generate_from_dataframe(df, text_column="text", batch_size=16)
    print(df_translated.head())

    # Async test for async_generate_from_dataframe
    import asyncio
    async def test_async_generate_from_dataframe():
        df_translated = await generator_yaml.async_generate_from_dataframe(df, text_column="text", batch_size=16)
        print("[Async]\n", df_translated.head())

    asyncio.run(test_async_generate_from_dataframe())
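
For reference, a plausible config/translation_config.yaml for the YAML-based path above might look like the following (the key names are assumed to mirror the constructor keyword arguments; the actual schema may differ):

# Hypothetical config; keys assumed to mirror TranslationDataGenerator's kwargs.
base_url: "http://localhost:11434/v1"
api_key: ""
init_translate_model: "gpt-oss:latest"
reflection_model: "gpt-oss:latest"
improvement_model: "gpt-oss:latest"
hf_tokenizer: "openai/gpt-oss-20b"
hf_token: null
temperature: 1.0
top_p: 1.0
max_tokens: 24576
stop: ["<|return|>", "<|endoftext|>", "<|call|>"]
max_token_per_chunk: 5000
source_lang: "English"
target_lang: "Traditional Chinese"
country: "Taiwan"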

Requesting review from NLP maintainers:
@ericharper @ekmb @yzhang123 @VahidooX @vladgets @okuchaiev

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot bot commented Aug 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


github-actions bot added the ray-api and community-request labels on Aug 23, 2025
shuoyangd self-assigned this on Aug 25, 2025
@shuoyangd (Contributor) commented

Hi @TsukiSama9292! Thanks for the great contribution. We have a few requests before we can merge this one, but @ayushdg, @abhinavg4, and I will support you along the way.

The main request we have as of now is that we are migrating away from Dask and switching to Ray as the new backend, so it would be nice if you could refactor your current workflow into something compatible with the rest of the repo. Good starting points would be the API design doc and our quick start guide. Please take a look and let us know if you have questions!

After finishing the refactoring, let's do some benchmarks to ensure the default recipe generates good results.

In addition, I have two high-level questions upon glancing over your changes:

  • Could you please switch to an existing sentence segmenter rather than inventing your own? Regex-based solutions tend to break as you scale to more languages. I'm fine with anything like spaCy or ersatz as long as it has good multilingual support.
  • Correct me if I'm wrong, but your pipeline currently seems to assume that an LLM server has already been started (I only see client queries through the OpenAI client, but nothing that starts an LLM server, e.g. vllm or trtllm). Have you given any thought to self-hosting cases?

@TsukiSama9292 (Author) commented Aug 27, 2025

> Hi @TsukiSama9292! Thanks for the great contribution. We have a few requests before we can merge this one, but @ayushdg, @abhinavg4, and I will support you along the way.
>
> The main request we have as of now is that we are migrating away from Dask and switching to Ray as the new backend, so it would be nice if you could refactor your current workflow into something compatible with the rest of the repo. Good starting points would be the API design doc and our quick start guide. Please take a look and let us know if you have questions!
>
> After finishing the refactoring, let's do some benchmarks to ensure the default recipe generates good results.
>
> In addition, I have two high-level questions upon glancing over your changes:
>
>   • Could you please switch to an existing sentence segmenter rather than inventing your own? Regex-based solutions tend to break as you scale to more languages. I'm fine with anything like spaCy or ersatz as long as it has good multilingual support.
>   • Correct me if I'm wrong, but your pipeline currently seems to assume that an LLM server has already been started (I only see client queries through the OpenAI client, but nothing that starts an LLM server, e.g. vllm or trtllm). Have you given any thought to self-hosting cases?

I’m planning to migrate llm-translate from Dask to ray-api. I also updated split_sentences to use spaCy for sentence splitting, allowing users to choose different spaCy models.
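
For illustration, spaCy-based sentence splitting with a soft token budget could look roughly like this (the model name, tokenizer, and chunking heuristic are illustrative assumptions, not the exact code in this PR):

import spacy
from transformers import AutoTokenizer

# xx_sent_ud_sm is spaCy's multilingual sentence-segmentation model;
# install it first with: python -m spacy download xx_sent_ud_sm
nlp = spacy.load("xx_sent_ud_sm")
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

def split_sentences(text: str, max_tokens_per_chunk: int = 5000) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sent in nlp(text).sents:
        n = len(tokenizer.encode(sent.text))
        # Soft limit: close the current chunk once the budget would be
        # exceeded, but never split inside a sentence.
        if current and current_len + n > max_tokens_per_chunk:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks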

…tation.

However, in practical use cases, it is recommended to use a source-language-specific spaCy model that supports sentence segmentation.

Signed-off-by: TsukiSama9292 <[email protected]>
