
longmemeval-zh

longmemeval-zh is a utility project for creating a Chinese version of the LongMemEval benchmark dataset.

The goal is to make it easier to evaluate how agent memory systems perform in a Chinese-language environment. The project translates the English LongMemEval data into Chinese while preserving the original JSON structure, record order, and non-translatable metadata.

What This Project Does

The translation pipeline uses Argos Translate to translate only these JSON fields:

  • question
  • answer
  • content

All other fields are preserved exactly as they appear in the source dataset.
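The field-selective walk described above can be sketched as a small recursive function. This is an illustration, not the project's actual code; `translate` stands in for the Argos Translate call:

```python
# Walk a LongMemEval record and translate only the "question", "answer",
# and "content" string fields; everything else is copied through unchanged.
TRANSLATABLE_KEYS = {"question", "answer", "content"}

def translate_record(node, translate):
    if isinstance(node, dict):
        return {
            key: translate(value)
            if key in TRANSLATABLE_KEYS and isinstance(value, str)
            else translate_record(value, translate)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [translate_record(item, translate) for item in node]
    return node  # IDs, timestamps, and other metadata pass through as-is
```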

In addition to the local Argos pipeline, the repository also includes a separate LLM-based subset translation pipeline for higher-quality targeted translation on selected records. It uses the same repository entry point but writes to its own output file.

Markdown code blocks wrapped in paired triple backticks, along with other structured blocks, are preserved verbatim and never sent to the translator. This avoids corrupting code, tables, JSON snippets, SQL queries, HTML/XML fragments, logs, formulas, and other structured text that must remain unchanged.

Project Status

This repository is intended to support dataset construction and reproducible translation runs. The generated Chinese dataset can then be used to benchmark agent memory systems under Chinese user interactions and Chinese retrieval contexts.

The default source file is:

datasets/longmemeval_s_cleaned.json.gz

The default generated Chinese output is:

datasets/longmemeval_s_cleaned.zh.json.gz

The repository stores dataset files as gzip-compressed JSON to stay within GitHub's regular file size limits. The pipeline reads .json.gz files directly and writes gzip-compressed output when the output path ends with .gz.
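The extension-aware I/O behavior can be sketched as a pair of small helpers (illustrative names, not the repository's actual functions):

```python
import gzip
import json

def load_json(path):
    """Read JSON, transparently decompressing when the path ends with .gz."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)

def dump_json(data, path):
    """Write JSON, gzip-compressing when the path ends with .gz."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
```

`ensure_ascii=False` keeps Chinese text as readable UTF-8 instead of `\uXXXX` escapes.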

Requirements

This project uses uv for Python dependency management.

uv sync

If uv was installed with pip install --user uv, make sure ~/.local/bin is on your PATH:

export PATH="$HOME/.local/bin:$PATH"

Usage

Translate the full dataset with the default settings:

uv run python main.py

Run a smaller translation job for testing:

uv run python main.py --limit 64 --processes 64

Translate the first 500 records and write the result to the canonical Chinese output file:

uv run python main.py --limit 500 --processes 50 --output datasets/longmemeval_s_cleaned.zh.json.gz

Translate only the records listed in config/llm_question_ids.txt with the LLM pipeline:

uv run python main.py --backend llm

Command-Line Options

  • --input: source JSON file. Defaults to datasets/longmemeval_s_cleaned.json.gz.
  • --output: output JSON file. If omitted and --limit N is used, the default output becomes datasets/longmemeval_s_cleaned.zh.firstN.json.gz.
  • --shard-dir: directory for intermediate shards and logs. Defaults to datasets/argos_shards.
  • --processes: number of shards and independent worker processes. Defaults to 64.
  • --limit: translate only the first N records.

--limit must be greater than or equal to --processes; otherwise the program exits early to avoid empty shards.
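The guard described above amounts to a simple pre-flight check; a minimal sketch (the real error message and exit path may differ):

```python
def check_shard_plan(limit, processes):
    """Refuse to create more shards than there are records to fill them."""
    if limit is not None and limit < processes:
        raise SystemExit(
            f"--limit {limit} is smaller than --processes {processes}; "
            "this would produce empty shards."
        )
```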

LLM Subset Translation

The LLM pipeline uses the same main.py entry point as the Argos pipeline:

  • Entry point: main.py --backend llm
  • Default question list: config/llm_question_ids.txt
  • Default output: datasets/longmemeval_s_cleaned.llm.zh.json.gz
  • Default record state directory: datasets/llm_state/
  • API requirement: an OpenAI-compatible text generation endpoint
  • Model requirement: provide a model name via CLI or environment variables

This pipeline is designed for iterative, higher-quality translation of a selected subset of records identified by question_id. The current expected use case is to keep adding curated IDs into config/llm_question_ids.txt over time and rerun the LLM pipeline to incrementally expand the higher-quality subset.

The LLM pipeline:

  • Reads the full source dataset from datasets/longmemeval_s_cleaned.json.gz
  • Resolves the question_id list from config/llm_question_ids.txt
  • Translates only the selected records
  • Writes only those selected translated records into the LLM output file
  • Reuses existing LLM results in the output file and skips already translated question_ids
  • Stores per-record temporary progress in datasets/llm_state/ so failed runs can resume within a record
  • Deletes a record's temporary progress file only after that record has been successfully written into the final LLM output file

Per-record resume is batch-based. The checkpoint stores the completed batch index together with a signature of the current batch plan. If batching parameters or batching logic change, the old in-progress checkpoint is invalidated instead of being reused incorrectly.

This means the LLM output is a translated subset file, not a full 500-record replacement dataset.
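The skip-and-cleanup bookkeeping above can be sketched as follows. Names and file layout are illustrative, not the pipeline's actual implementation:

```python
import os

def select_pending_ids(question_ids, existing_output):
    """Skip question_ids that already have a translation in the output file."""
    done = {rec["question_id"] for rec in existing_output}
    return [qid for qid in question_ids if qid not in done]

def finish_record(qid, translated, output_records, state_dir):
    """Append the finished record, then drop its temporary progress file."""
    output_records.append(translated)
    state_path = os.path.join(state_dir, f"{qid}.json")
    if os.path.exists(state_path):
        os.remove(state_path)  # safe: the record is already in the output
```

The ordering matters: the state file is deleted only after the record is in the output, so a crash between the two steps loses no work.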

Recommended .env file:

ONEAPI_BASE_URL=https://your-openai-compatible-endpoint.example
ONEAPI_API_KEY=your_api_key_here
ONEAPI_MODEL=your_model_name_here

Example:

uv run python main.py --backend llm \
  --question-list config/llm_question_ids.txt \
  --output datasets/longmemeval_s_cleaned.llm.zh.json.gz

Important options:

  • --env-file: .env file path. Defaults to .env.
  • --question-list: plain-text file containing one question_id per line
  • --output: standalone LLM translation output file
  • --state-dir: directory for per-record temporary progress files
  • --base-url: override the API endpoint from .env
  • --model: override the model from .env
  • --api-key: override the API key from .env
  • --max-unit-chars: maximum characters for one translated unit before additional splitting. Default: 6000
  • --group-char-budget: maximum total input characters for one LLM request batch. Default: 24000
  • --output-token-budget: estimated output-token safety budget used when packing batches. Default: 7000
  • --max-output-tokens: hard responses API output cap sent to the model. Default: 8192
  • --llm-workers: number of worker processes for parallel record-level LLM translation. Default: 4
  • --limit: translate only the first N IDs from the question list

LLM Translation Granularity

The LLM pipeline does not send an entire LongMemEval record in a single request. That would exceed practical context limits for many records.

Instead, it uses this hierarchy:

  1. The logical translation unit is one selected record.
  2. Within a record, only question, answer, and content fields are collected.
  3. Structured blocks are protected before translation.
  4. Long fields are further split into safe sub-units.
  5. Multiple sub-units from the same record are packed into one batch request up to both an input character budget and an estimated output-token budget.

Each batch still includes the record-level question_id, source question, and source answer as consistency context so the model can keep terminology and event order aligned.
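The dual-budget packing in step 5 can be sketched greedily. The budgets mirror the documented defaults; the chars-per-token ratio is a rough heuristic stand-in for the pipeline's real output-token estimate:

```python
def pack_batches(units, group_char_budget=24000, output_token_budget=7000,
                 chars_per_token=2.0):
    """Pack translatable sub-units into batches under both budgets."""
    batches, current, chars, tokens = [], [], 0, 0
    for unit in units:  # each unit is one translatable sub-string
        est_tokens = len(unit) / chars_per_token
        if current and (chars + len(unit) > group_char_budget
                        or tokens + est_tokens > output_token_budget):
            batches.append(current)  # close the batch before it overflows
            current, chars, tokens = [], 0, 0
        current.append(unit)
        chars += len(unit)
        tokens += est_tokens
    if current:
        batches.append(current)
    return batches
```

Whichever budget fills first closes the batch, so dense text (high tokens per character) still stays under the output cap.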

LLM translation is parallelized across records, not within a single record. Each worker process handles one question_id at a time, while batches inside that record remain sequential to preserve per-record checkpointing and recovery semantics.

LLM Structured Outputs

The LLM pipeline uses an OpenAI-compatible structured-output API with JSON Schema validation. This is stricter than plain "please return JSON" prompting and helps ensure that each translation batch returns:

  • the same number of items as the input batch
  • the same item order as the input batch
  • exact unit_id, path, and field values
  • one validated translation string per unit

The program still performs local validation after each LLM response. Structured outputs reduce formatting failures, but they do not replace application-side verification.

To reduce the risk of overlong model responses, the pipeline also:

  • estimates output tokens before each request batch
  • splits oversized batches before sending them
  • sets max_output_tokens explicitly on every request
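A plausible shape for such a batch-response schema, together with the local post-check, is sketched below. This mirrors the bullets above but is an assumption about the schema's layout, not the repository's exact definition; strict structured-output modes generally require `additionalProperties: false` and every property listed in `required`:

```python
# Illustrative JSON Schema for one translation batch response.
BATCH_SCHEMA = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "unit_id": {"type": "string"},
                    "path": {"type": "string"},
                    "field": {"type": "string"},
                    "translation": {"type": "string"},
                },
                "required": ["unit_id", "path", "field", "translation"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["items"],
    "additionalProperties": False,
}

def validate_batch(request_units, response_items):
    """Local check: same count, same order, matching unit_ids."""
    if len(request_units) != len(response_items):
        return False
    return all(u["unit_id"] == r["unit_id"]
               for u, r in zip(request_units, response_items))
```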

LLM Translation Rules

The LLM pipeline follows the same structural protection policy as the Argos pipeline. The following blocks are not translated directly and are preserved verbatim:

  • Paired triple-backtick code blocks
  • Markdown tables
  • Indented code blocks
  • SQL blocks
  • JSON-like blocks
  • HTML/XML blocks
  • CSV/TSV multi-line table blocks

These protected blocks are replaced with placeholders before an LLM request and restored afterward. This avoids breaking code, tables, or machine-readable content while still letting the model translate the surrounding prose.
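The placeholder round trip can be sketched for the simplest case, paired triple-backtick fences. The real pipeline protects all the block types listed above; the placeholder format here is illustrative:

```python
import re

FENCE_RE = re.compile(r"```.*?```", re.DOTALL)

def protect(text):
    """Swap protected blocks for placeholders; return text plus the blocks."""
    blocks = []
    def stash(match):
        blocks.append(match.group(0))
        return f"\x00BLOCK{len(blocks) - 1}\x00"
    return FENCE_RE.sub(stash, text), blocks

def restore(text, blocks):
    """Put the original blocks back after translation."""
    for i, block in enumerate(blocks):
        text = text.replace(f"\x00BLOCK{i}\x00", block)
    return text
```

The placeholder uses a control character unlikely to appear in prose, so the translator has nothing meaningful to alter inside it.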

Output Naming

Without --limit, the default output is:

datasets/longmemeval_s_cleaned.zh.json.gz

With --limit N and no explicit --output, the default output is:

datasets/longmemeval_s_cleaned.zh.firstN.json.gz

For example:

datasets/longmemeval_s_cleaned.zh.first64.json.gz

If --output is explicitly provided, that path is always used exactly as given.
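The naming rule can be sketched as a small helper (illustrative, not the project's actual function name):

```python
def default_output_path(input_path, limit=None):
    """Derive the default Chinese output path from the input path."""
    base = input_path
    for suffix in (".json.gz", ".json"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    if limit is None:
        return f"{base}.zh.json.gz"
    return f"{base}.zh.first{limit}.json.gz"
```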

Sharding And Parallelism

Each run creates an isolated run directory:

datasets/argos_shards/run-YYYYMMDD-HHMMSS/

The run directory contains:

  • *.en.json: English input shards.
  • *.zh.json: translated Chinese shards.
  • *.log: per-shard progress logs.
  • manifest.json: shard metadata.

The pipeline splits the dataset into exactly --processes shards and starts the same number of independent worker processes. Shards are balanced by estimated translatable character count, not by record count.

The final merge restores the original record order using an internal _longmemeval_original_index field. This helper field is removed from the final merged output.
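One way to implement character-balanced sharding plus the order-restoring merge is a greedy assignment to the lightest shard; this is a sketch under that assumption, not necessarily the pipeline's exact strategy:

```python
import heapq

def split_into_shards(records, n_shards, weight):
    """Assign each record to the shard with the fewest characters so far."""
    heap = [(0, i) for i in range(n_shards)]  # (total_chars, shard_index)
    heapq.heapify(heap)
    shards = [[] for _ in range(n_shards)]
    for idx, record in enumerate(records):
        total, shard = heapq.heappop(heap)
        tagged = dict(record, _longmemeval_original_index=idx)
        shards[shard].append(tagged)
        heapq.heappush(heap, (total + weight(record), shard))
    return shards

def merge_shards(shards):
    """Restore the original record order and strip the helper field."""
    merged = sorted(
        (r for shard in shards for r in shard),
        key=lambda r: r["_longmemeval_original_index"],
    )
    for r in merged:
        del r["_longmemeval_original_index"]
    return merged
```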

Translation Rules

Within each shard, fields are translated one by one.

The pipeline first splits each field into translatable text and protected structured blocks. Protected blocks are copied to the output verbatim and are never sent to Argos Translate. This is intentionally conservative: preserving structure is more important than translating every token inside tables, code, or machine-readable snippets.

Protected block types:

  • Paired triple-backtick code blocks.
  • Markdown tables.
  • Indented Markdown code blocks.
  • SQL query blocks.
  • JSON-like object or array blocks.
  • HTML/XML blocks.
  • CSV/TSV multi-line table blocks.

If a field does not contain protected blocks, the full field is sent to the translator as one unit unless it exceeds the maximum chunk length.

Triple-backtick code blocks are handled like this:

text before the code block -> translated
```code block``` -> preserved verbatim
text after the code block -> translated

Markdown tables are handled like this:

text before the table -> translated
| Header | Header |
| --- | --- |
| Cell | Cell |
table block -> preserved verbatim
text after the table -> translated

Other structured blocks are handled the same way:

text before the structured block -> translated
structured block -> preserved verbatim
text after the structured block -> translated

When translatable text exceeds the chunk length limit, the splitter first tries to cut at the nearest newline before the limit. If no suitable newline exists, it falls back to a hard split.
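The newline-first splitting logic above can be sketched like this (the chunk limit shown is illustrative):

```python
def split_chunks(text, limit=1000):
    """Split text at the nearest newline before `limit`, else hard-split."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind("\n", 0, limit)
        if cut <= 0:      # no usable newline: fall back to a hard split
            cut = limit
        chunks.append(text[:cut])
        text = text[cut:]
    chunks.append(text)
    return chunks
```

Because the chunks are later rejoined, no characters are dropped at the cut points.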

Progress Logs

The program prints per-shard progress while workers are running. Example:

[STATUS] done=49/50 alive=1 zh=49 active={shard023 fields=5170 field=5166/5170 key=content chunk=1/1}

Meaning:

  • done=49/50: 49 of 50 workers have finished.
  • alive=1: one worker is still running.
  • zh=49: 49 translated shard files have been written.
  • active={...}: current progress for an active shard.

Validation

After generating a translated dataset, validate that it still matches the source schema and order.

The expected properties are:

  • The translated file has the same number of records as the selected source range.
  • Dictionary keys match exactly.
  • List lengths match exactly.
  • Non-translatable fields are unchanged.
  • _longmemeval_original_index does not appear in the final output.
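A structural comparison implementing these checks might look like the following sketch. The function names are illustrative; only translated string fields may differ, and everything else must match exactly:

```python
TRANSLATED = {"question", "answer", "content"}

def same_structure(src, out, key=None):
    """True if `out` matches `src` except inside translated string fields."""
    if isinstance(src, dict):
        return (isinstance(out, dict) and src.keys() == out.keys()
                and all(same_structure(src[k], out[k], k) for k in src))
    if isinstance(src, list):
        return (isinstance(out, list) and len(src) == len(out)
                and all(same_structure(a, b, key) for a, b in zip(src, out)))
    if key in TRANSLATED and isinstance(src, str):
        return isinstance(out, str)   # translated text is allowed to differ
    return src == out                 # everything else must be unchanged
```

Because dictionary keys must match exactly, a stray `_longmemeval_original_index` in the output would also fail this check.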

Performance Notes

The dataset is large. Even a 500-record run can contain hundreds of thousands of translatable fields and hundreds of millions of characters.

Argos Translate runs locally and can be CPU-intensive. Increasing --processes increases parallelism, but each worker loads its own model instance. Too many processes can increase CPU and memory-bandwidth contention and may not improve total throughput.

A practical starting point is:

uv run python main.py --limit 500 --processes 50 --output datasets/longmemeval_s_cleaned.zh.json.gz

Depending on the machine, lower values such as --processes 32 may be more efficient.

Interrupting A Run

Use Ctrl+C to stop a run.

The parent process catches the interrupt and stops worker processes to avoid leaving translation workers running in the background.

Data And License Notes

This repository is a translation pipeline and dataset-construction utility. If you publish generated datasets or benchmark results, make sure to comply with the license and usage terms of the original LongMemEval dataset and any upstream data sources it depends on.

If you distribute generated Chinese data, clearly document:

  • The source dataset version.
  • The translation tool and version.
  • The exact command used to generate the output.
  • Any manual filtering, correction, or validation steps.

License

This project's code is released under the MIT License. See LICENSE for details.

The dataset files and any generated translations may be subject to the license and usage terms of the original LongMemEval dataset and its upstream sources. The MIT License for this repository does not override those dataset-specific terms.
