@TsukiSama9292 commented Aug 23, 2025

Description

Added an LLM-based multi-step text translation feature.

  • TranslationDataGenerator workflows:
    • Create a class instance (from keyword arguments or a YAML config): initializes an OpenAIClient (used for the initial translation, reflection, and refinement steps) and a Tokenizer (used to count tokens in long texts).
    • Perform translation:
      • Long-text segmentation: softly segments the text using the tokenizer and configuration parameters; the method first splits by sentence, then packs sentences into chunks under a soft token-length limit.
      • Translates each chunk one by one, following the implementation of GitHub: andrewyng/translation-agent (a minimal sketch of this loop appears after the list):
        • Initial Translation Agent: uses the LLM to produce an initial translation from the source language, source text, and target language.
        • Reflection Agent: uses the LLM with the same inputs, plus the initial translation and target country, to reflect on and evaluate the initial translation.
        • Improvement Agent: uses all of the above to refine the translation and improve overall quality.
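Conceptually, the three-agent loop for each chunk looks like the following minimal sketch (the client setup mirrors the Usage example below; the prompts are illustrative placeholders, not the exact prompts used in this PR):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def _chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def translate_chunk(chunk: str, src: str, tgt: str, country: str, model: str) -> str:
    # Step 1: initial translation from source to target language.
    draft = _chat(model, f"Translate the following {src} text into {tgt}:\n\n{chunk}")
    # Step 2: reflection -- critique the draft, using the target country as context.
    critique = _chat(
        model,
        f"You are reviewing a {src} to {tgt} translation for readers in {country}. "
        f"List concrete suggestions to improve it.\n\nSource:\n{chunk}\n\nDraft:\n{draft}",
    )
    # Step 3: improvement -- rewrite the draft by applying the critique.
    return _chat(
        model,
        f"Rewrite the draft translation below, applying the suggestions.\n\n"
        f"Source ({src}):\n{chunk}\n\nDraft ({tgt}):\n{draft}\n\nSuggestions:\n{critique}",
    )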

Usage

from nemo_curator.synthetic.translate import TranslationDataGenerator

# Example usage for both function and YAML config
if __name__ == "__main__":
    # Function-based usage
    text = "Once upon a time, there were three little pig brothers..."
    generator = TranslationDataGenerator(
        base_url="http://localhost:11434/v1",                   # (Change this) Base URL for local API (P.S: Ollama supports the OpenAI API format.)
        api_key="",                                             # API key (empty if not required)
        init_translate_model="gpt-oss:latest",                  # Initial translation model
        reflection_model="gpt-oss:latest",                      # Reflection model for improvement
        improvement_model="gpt-oss:latest",                     # Model for translation improvement
        hf_tokenizer="openai/gpt-oss-20b",                      # (Change this) HuggingFace model for tokenization
        hf_token=None,                                          # (Change this) HuggingFace authentication token
        temperature=1.0,                                        # Sampling temperature for generation
        top_p=1.0,                                              # Nucleus sampling parameter
        max_tokens=24576,                                       # Maximum tokens for input
        stop=["<|return|>","<|endoftext|>", "<|call|>"],        # (Change this) Stop TOKEN sequences
        max_token_per_chunk=5000,                               # Max tokens per chunk for translation
        source_lang="English",                                  # Source language
        target_lang="Traditional Chinese",                      # Target language
        country="Taiwan",                                       # (Optional) Country context for translation
    )
    translations = generator.generate(text)
    print(generator.parse_response(translations))

    # YAML-based usage (parameters only, text provided in code)
    generator_yaml = TranslationDataGenerator.from_yaml("config/translation_config.yaml")
    print(generator_yaml.generate_from_yaml("config/translation_config.yaml", text))

    # Pipeline DataFrame usage
    import pandas as pd
    df = pd.DataFrame(
        {
            "text": [
                "Once upon a time, there were three little pig brothers...",
                "The quick brown fox jumps over the lazy dog.",
            ]
        }
    )
    df_translated = generator_yaml.generate_from_dataframe(df, text_column="text", batch_size=16)
    print(df_translated.head())

    # Async test for async_generate_from_dataframe
    import asyncio
    async def test_async_generate_from_dataframe():
        df_translated = await generator_yaml.async_generate_from_dataframe(df, text_column="text", batch_size=16)
        print("[Async]\n", df_translated.head())

    asyncio.run(test_async_generate_from_dataframe())
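
For reference, a plausible config/translation_config.yaml for the YAML-based path above might look like the following (the key names are assumed to mirror the constructor keyword arguments; the actual schema may differ):

# Hypothetical config; keys assumed to mirror TranslationDataGenerator's kwargs.
base_url: "http://localhost:11434/v1"
api_key: ""
init_translate_model: "gpt-oss:latest"
reflection_model: "gpt-oss:latest"
improvement_model: "gpt-oss:latest"
hf_tokenizer: "openai/gpt-oss-20b"
hf_token: null
temperature: 1.0
top_p: 1.0
max_tokens: 24576
stop: ["<|return|>", "<|endoftext|>", "<|call|>"]
max_token_per_chunk: 5000
source_lang: "English"
target_lang: "Traditional Chinese"
country: "Taiwan"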

Requesting review from NLP maintainers:
@ericharper @ekmb @yzhang123 @VahidooX @vladgets @okuchaiev

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot bot commented Aug 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


github-actions bot added the ray-api and community-request labels on Aug 23, 2025
shuoyangd self-assigned this on Aug 25, 2025
@shuoyangd (Contributor) commented

Hi @TsukiSama9292! Thanks for the great contribution. We have a few requests before we can merge this one, but @ayushdg, @abhinavg4, and I will support you along the way.

The main request we have as of now is that we are migrating away from Dask and switching to Ray as the new backend, so it would be nice if you could refactor your current workflow into something compatible with the rest of the repo. Good starting points would be the API design doc and our quick start guide. Please take a look and let us know if you have questions!

After finishing the refactoring, let's do some benchmarks to ensure the default recipe generates good results.

In addition, I have two high-level questions upon glancing over your changes:

  • Could you please switch to an existing sentence segmenter rather than inventing your own? Regex-based solutions tend to break as you scale to more languages. I'm fine with anything like spaCy or ersatz as long as it has good multilingual support.
  • Correct me if I'm wrong, but your pipeline currently seems to assume that an LLM server has already been started (I only see client queries through the OpenAI client, but nothing that starts an LLM server, e.g. vllm or trtllm). Have you given any thought to self-hosting cases?

@TsukiSama9292 (Author) commented Aug 27, 2025

> Hi @TsukiSama9292! Thanks for the great contribution. We have a few requests before we can merge this one, but @ayushdg, @abhinavg4, and I will support you along the way.
>
> The main request we have as of now is that we are migrating away from Dask and switching to Ray as the new backend, so it would be nice if you could refactor your current workflow into something compatible with the rest of the repo. Good starting points would be the API design doc and our quick start guide. Please take a look and let us know if you have questions!
>
> After finishing the refactoring, let's do some benchmarks to ensure the default recipe generates good results.
>
> In addition, I have two high-level questions upon glancing over your changes:
>
>   • Could you please switch to an existing sentence segmenter rather than inventing your own? Regex-based solutions tend to break as you scale to more languages. I'm fine with anything like spaCy or ersatz as long as it has good multilingual support.
>   • Correct me if I'm wrong, but your pipeline currently seems to assume that an LLM server has already been started (I only see client queries through the OpenAI client, but nothing that starts an LLM server, e.g. vllm or trtllm). Have you given any thought to self-hosting cases?

I’m planning to migrate llm-translate from Dask to ray-api. I also updated split_sentences to use spaCy for sentence splitting, allowing users to choose different spaCy models.
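
For illustration, spaCy-based sentence splitting with a soft token budget could look roughly like this (the model name, tokenizer, and chunking heuristic are illustrative assumptions, not the exact code in this PR):

import spacy
from transformers import AutoTokenizer

# xx_sent_ud_sm is spaCy's multilingual sentence-segmentation model;
# install it first with: python -m spacy download xx_sent_ud_sm
nlp = spacy.load("xx_sent_ud_sm")
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

def split_sentences(text: str, max_tokens_per_chunk: int = 5000) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sent in nlp(text).sents:
        n = len(tokenizer.encode(sent.text))
        # Soft limit: close the current chunk once the budget would be
        # exceeded, but never split inside a sentence.
        if current and current_len + n > max_tokens_per_chunk:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks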

…tation.

However, in practical use cases, it is recommended to use a source-language-specific spaCy model that supports sentence segmentation.

Signed-off-by: TsukiSama9292 <[email protected]>
