longmemeval-zh is a utility project for creating a Chinese version of the LongMemEval benchmark dataset.
The goal is to make it easier to evaluate how agent memory systems perform in a Chinese-language environment. The project translates the English LongMemEval data into Chinese while preserving the original JSON structure, record order, and non-translatable metadata.
The translation pipeline uses Argos Translate to translate only these JSON fields:

- `question`
- `answer`
- `content`
All other fields are preserved exactly as they appear in the source dataset.
In addition to the local Argos pipeline, the repository also includes a separate LLM-based subset translation pipeline for higher-quality targeted translation on selected records. It uses the same repository entry point but writes to its own output file.
Markdown code blocks wrapped by paired triple backticks and other structured blocks are also preserved verbatim and are not sent to the translator. This avoids corrupting code, tables, JSON snippets, SQL queries, HTML/XML fragments, logs, formulas, and other structured text that should remain unchanged.
This repository is intended to support dataset construction and reproducible translation runs. The generated Chinese dataset can then be used to benchmark agent memory systems under Chinese user interactions and Chinese retrieval contexts.
The default source file is:
datasets/longmemeval_s_cleaned.json.gz
The default generated Chinese output is:
datasets/longmemeval_s_cleaned.zh.json.gz
The repository stores dataset files as gzip-compressed JSON to stay within GitHub's regular file size limits. The pipeline reads .json.gz files directly and writes gzip-compressed output when the output path ends with .gz.
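The `.json.gz` handling described above can be sketched roughly as follows; `load_dataset` and `save_dataset` are illustrative names, not the repository's actual functions:

```python
import gzip
import json

def load_dataset(path: str):
    """Read a dataset file, transparently handling gzip compression."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)

def save_dataset(records, path: str):
    """Write gzip-compressed JSON when the output path ends with .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
```

`ensure_ascii=False` keeps Chinese text readable in the output instead of escaping it to `\uXXXX` sequences.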
This project uses uv for Python dependency management.
```
uv sync
```

If uv was installed with `pip install --user uv`, make sure `~/.local/bin` is on your PATH:
```
export PATH="$HOME/.local/bin:$PATH"
```

Translate the full dataset with the default settings:
```
uv run python main.py
```

Run a smaller translation job for testing:
```
uv run python main.py --limit 64 --processes 64
```

Translate the first 500 records and write the result to the canonical Chinese output file:
```
uv run python main.py --limit 500 --processes 50 --output datasets/longmemeval_s_cleaned.zh.json.gz
```

Translate only the records listed in `config/llm_question_ids.txt` with the LLM pipeline:
```
uv run python main.py --backend llm
```

- `--input`: source JSON file. Defaults to `datasets/longmemeval_s_cleaned.json.gz`.
- `--output`: output JSON file. If omitted and `--limit N` is used, the default output becomes `datasets/longmemeval_s_cleaned.zh.firstN.json.gz`.
- `--shard-dir`: directory for intermediate shards and logs. Defaults to `datasets/argos_shards`.
- `--processes`: number of shards and independent worker processes. Defaults to `64`.
- `--limit`: translate only the first `N` records.
`--limit` must be greater than or equal to `--processes`; otherwise the program exits early to avoid empty shards.
The LLM pipeline uses the same main.py entry point as the Argos pipeline:
- Entry point: `main.py --backend llm`
- Default question list: `config/llm_question_ids.txt`
- Default output: `datasets/longmemeval_s_cleaned.llm.zh.json.gz`
- Default record state directory: `datasets/llm_state/`
- API requirement: an OpenAI-compatible text generation endpoint
- Model requirement: provide a model name via CLI or environment variables
This pipeline is designed for iterative, higher-quality translation of a selected subset of records identified by
question_id. The current expected use case is to keep adding curated IDs into config/llm_question_ids.txt over
time and rerun the LLM pipeline to incrementally expand the higher-quality subset.
The LLM pipeline:
- Reads the full source dataset from `datasets/longmemeval_s_cleaned.json.gz`
- Resolves the `question_id` list from `config/llm_question_ids.txt`
- Translates only the selected records
- Writes only those selected translated records into the LLM output file
- Reuses existing LLM results in the output file and skips already translated `question_id`s
- Stores per-record temporary progress in `datasets/llm_state/` so failed runs can resume within a record
- Deletes a record's temporary progress file only after that record has been successfully written into the final LLM output file
Per-record resume is batch-based. The checkpoint stores the completed batch index together with a signature of the current batch plan. If batching parameters or batching logic change, the old in-progress checkpoint is invalidated instead of being reused incorrectly.
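The checkpoint-invalidation idea can be sketched like this; the function names and checkpoint fields are illustrative, not the repository's actual format:

```python
import hashlib
import json

def batch_plan_signature(batches) -> str:
    """Hash the batch plan (the unit IDs in each batch) so a stored
    checkpoint can detect that batching parameters or logic changed."""
    payload = json.dumps([[unit["unit_id"] for unit in batch] for batch in batches])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def resume_batch_index(checkpoint: dict, batches) -> int:
    """Return the batch index to resume from, or 0 if the plan changed."""
    if checkpoint.get("signature") != batch_plan_signature(batches):
        return 0  # stale checkpoint: the batch plan differs, start over
    return checkpoint.get("completed_batches", 0)
```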
This means the LLM output is a translated subset file, not a full 500-record replacement dataset.
Recommended `.env` file:

```
ONEAPI_BASE_URL=https://your-openai-compatible-endpoint.example
ONEAPI_API_KEY=your_api_key_here
ONEAPI_MODEL=your_model_name_here
```

Example:
```
uv run python main.py --backend llm \
  --question-list config/llm_question_ids.txt \
  --output datasets/longmemeval_s_cleaned.llm.zh.json.gz
```

Important options:

- `--env-file`: `.env` file path. Defaults to `.env`.
- `--question-list`: plain-text file containing one `question_id` per line
- `--output`: standalone LLM translation output file
- `--state-dir`: directory for per-record temporary progress files
- `--base-url`: override the API endpoint from `.env`
- `--model`: override the model from `.env`
- `--api-key`: override the API key from `.env`
- `--max-unit-chars`: maximum characters for one translated unit before additional splitting. Default: `6000`
- `--group-char-budget`: maximum total input characters for one LLM request batch. Default: `24000`
- `--output-token-budget`: estimated output-token safety budget used when packing batches. Default: `7000`
- `--max-output-tokens`: hard `responses` API output cap sent to the model. Default: `8192`
- `--llm-workers`: number of worker processes for parallel record-level LLM translation. Default: `4`
- `--limit`: translate only the first `N` IDs from the question list
The LLM pipeline does not send an entire LongMemEval record in a single request. That would exceed practical context limits for many records.
Instead, it uses this hierarchy:
- The logical translation unit is one selected record.
- Within a record, only `question`, `answer`, and `content` fields are collected.
- Structured blocks are protected before translation.
- Long fields are further split into safe sub-units.
- Multiple sub-units from the same record are packed into one batch request up to both an input character budget and an estimated output-token budget.
Each batch still includes the record-level question_id, source question, and source answer as consistency
context so the model can keep terminology and event order aligned.
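Under the stated character and token budgets, the packing step might look roughly like this greedy sketch; `pack_batches` and the simple characters-per-token estimate are assumptions, not the repository's actual code:

```python
def pack_batches(units, char_budget=24000, token_budget=7000, chars_per_token=2):
    """Greedily pack sub-units into batches under both an input character
    budget and an estimated output-token budget.

    Each unit is a dict with a "text" field; output tokens are estimated
    from input length, so the real estimator may differ.
    """
    batches, current, chars = [], [], 0
    for unit in units:
        n = len(unit["text"])
        est_tokens = (chars + n) // chars_per_token
        if current and (chars + n > char_budget or est_tokens > token_budget):
            batches.append(current)  # current batch is full: start a new one
            current, chars = [], 0
        current.append(unit)
        chars += n
    if current:
        batches.append(current)
    return batches
```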
LLM translation is parallelized across records, not within a single record. Each worker process handles one
question_id at a time, while batches inside that record remain sequential to preserve per-record checkpointing and
recovery semantics.
The LLM pipeline uses an OpenAI-compatible structured-output API with JSON Schema validation. This is stricter than plain "please return JSON" prompting and helps ensure that each translation batch returns:
- the same number of items as the input batch
- the same item order as the input batch
- exact `unit_id`, `path`, and `field` values
- one validated `translation` string per unit
The program still performs local validation after each LLM response. Structured outputs reduce formatting failures, but they do not replace application-side verification.
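That application-side check could look roughly like this; the function and the exact field names it inspects are illustrative:

```python
def validate_batch_response(request_units, response_items):
    """Verify an LLM batch response against the request, item by item:
    same count, same order, matching identifiers, and a translation
    string for every unit. Returns the translations on success."""
    if len(response_items) != len(request_units):
        raise ValueError("item count mismatch")
    for unit, item in zip(request_units, response_items):
        for key in ("unit_id", "path", "field"):
            if item.get(key) != unit[key]:
                raise ValueError(f"{key} mismatch for unit {unit['unit_id']}")
        if not isinstance(item.get("translation"), str):
            raise ValueError(f"missing translation for unit {unit['unit_id']}")
    return [item["translation"] for item in response_items]
```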
To reduce the risk of overlong model responses, the pipeline also:
- estimates output tokens before each request batch
- splits oversized batches before sending them
- sets `max_output_tokens` explicitly on every request
The LLM pipeline follows the same structural protection policy as the Argos pipeline. The following blocks are not translated directly and are preserved verbatim:
- Paired triple-backtick code blocks
- Markdown tables
- Indented code blocks
- SQL blocks
- JSON-like blocks
- HTML/XML blocks
- CSV/TSV multi-line table blocks
These protected blocks are replaced with placeholders before an LLM request and restored afterward. This avoids breaking code, tables, or machine-readable content while still letting the model translate the surrounding prose.
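The placeholder round trip can be sketched for the triple-backtick case; the regex, sentinel format, and function names are illustrative, and the real pipeline covers more block types:

```python
import re

CODE_BLOCK = re.compile(r"```.*?```", re.DOTALL)

def protect_blocks(text):
    """Replace paired triple-backtick blocks with placeholders so the
    translator never sees them. Returns the masked text and the blocks."""
    blocks = []
    def stash(match):
        blocks.append(match.group(0))
        return f"\x00BLOCK{len(blocks) - 1}\x00"
    return CODE_BLOCK.sub(stash, text), blocks

def restore_blocks(text, blocks):
    """Put the protected blocks back verbatim after translation."""
    for i, block in enumerate(blocks):
        text = text.replace(f"\x00BLOCK{i}\x00", block)
    return text
```

The surrounding prose still reaches the model, while the masked blocks are restored byte-for-byte afterward.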
Without --limit, the default output is:
datasets/longmemeval_s_cleaned.zh.json.gz
With --limit N and no explicit --output, the default output is:
datasets/longmemeval_s_cleaned.zh.firstN.json.gz
For example:
datasets/longmemeval_s_cleaned.zh.first64.json.gz
If --output is explicitly provided, that path is always used exactly as given.
Each run creates an isolated run directory:
datasets/argos_shards/run-YYYYMMDD-HHMMSS/
The run directory contains:
- `*.en.json`: English input shards.
- `*.zh.json`: translated Chinese shards.
- `*.log`: per-shard progress logs.
- `manifest.json`: shard metadata.
The pipeline splits the dataset into exactly --processes shards and starts the same number of independent worker processes. Shards are balanced by estimated translatable character count, not by record count.
The final merge restores the original record order using an internal `_longmemeval_original_index` field. This helper field is removed from the final merged output.
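Character-weighted sharding can be sketched with a standard greedy longest-processing-time assignment; `balance_shards` is an illustrative sketch, and the repository's actual heuristic may differ:

```python
import heapq

def balance_shards(records, num_shards, weight):
    """Assign records to shards, balancing by estimated translatable
    character count rather than record count.

    `weight(record)` returns the estimated character weight. Each record
    is tagged with its original index so the merge can restore order."""
    shards = [[] for _ in range(num_shards)]
    heap = [(0, i) for i in range(num_shards)]  # (total_weight, shard_id)
    # Heaviest records first, each going to the currently lightest shard.
    for index, record in sorted(enumerate(records), key=lambda p: -weight(p[1])):
        total, shard_id = heapq.heappop(heap)
        shards[shard_id].append({"_longmemeval_original_index": index, **record})
        heapq.heappush(heap, (total + weight(record), shard_id))
    return shards
```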
Within each shard, fields are translated one by one.
The pipeline first splits each field into translatable text and protected structured blocks. Protected blocks are copied to the output verbatim and are never sent to Argos Translate. This is intentionally conservative: preserving structure is more important than translating every token inside tables, code, or machine-readable snippets.
Protected block types:
- Paired triple-backtick code blocks.
- Markdown tables.
- Indented Markdown code blocks.
- SQL query blocks.
- JSON-like object or array blocks.
- HTML/XML blocks.
- CSV/TSV multi-line table blocks.
If a field does not contain protected blocks, the full field is sent to the translator as one unit unless it exceeds the maximum chunk length.
Triple-backtick code blocks are handled like this:
text before the code block -> translated
```code block``` -> preserved verbatim
text after the code block -> translated
Markdown tables are handled like this:
text before the table -> translated
| Header | Header |
| --- | --- |
| Cell | Cell |
table block -> preserved verbatim
text after the table -> translated
Other structured blocks are handled the same way:
text before the structured block -> translated
structured block -> preserved verbatim
text after the structured block -> translated
When translatable text exceeds the chunk length limit, the splitter first tries to cut at the nearest newline before the limit. If no suitable newline exists, it falls back to a hard split.
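A minimal sketch of that newline-aware splitter (illustrative, not the repository's actual implementation):

```python
def split_text(text: str, max_len: int):
    """Split text into chunks of at most max_len characters, cutting at
    the nearest newline before the limit when one exists, otherwise
    falling back to a hard split. Joining the chunks restores the input."""
    chunks = []
    while len(text) > max_len:
        cut = text.rfind("\n", 0, max_len)
        if cut == -1:
            cut = max_len  # no newline before the limit: hard split
        else:
            cut += 1  # keep the newline with the left chunk
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks
```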
The program prints per-shard progress while workers are running. Example:
```
[STATUS] done=49/50 alive=1 zh=49 active={shard023 fields=5170 field=5166/5170 key=content chunk=1/1}
```
Meaning:
- `done=49/50`: 49 of 50 workers have finished.
- `alive=1`: one worker is still running.
- `zh=49`: 49 translated shard files have been written.
- `active={...}`: current progress for an active shard.
After generating a translated dataset, validate that it still matches the source schema and order.
The expected properties are:
- The translated file has the same number of records as the selected source range.
- Dictionary keys match exactly.
- List lengths match exactly.
- Non-translatable fields are unchanged.
- `_longmemeval_original_index` does not appear in the final output.
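A structural comparison for these checks can be sketched as a recursive walk; `same_structure` is illustrative, not a tool shipped with the repository:

```python
def same_structure(src, dst, path="$"):
    """Check that dict keys and list lengths match between a source record
    and its translation. Leaf values may differ (they were translated);
    only the structure is compared. Returns (ok, path_of_first_mismatch)."""
    if isinstance(src, dict):
        if not isinstance(dst, dict) or src.keys() != dst.keys():
            return False, path
        for key in src:
            ok, where = same_structure(src[key], dst[key], f"{path}.{key}")
            if not ok:
                return False, where
    elif isinstance(src, list):
        if not isinstance(dst, list) or len(src) != len(dst):
            return False, path
        for i, (a, b) in enumerate(zip(src, dst)):
            ok, where = same_structure(a, b, f"{path}[{i}]")
            if not ok:
                return False, where
    return True, path
```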
The dataset is large. Even a 500-record run can contain hundreds of thousands of translatable fields and hundreds of millions of characters.
Argos Translate runs locally and can be CPU-intensive. Increasing --processes increases parallelism, but each worker loads its own model instance. Too many processes can increase CPU and memory-bandwidth contention and may not improve total throughput.
A practical starting point is:
```
uv run python main.py --limit 500 --processes 50 --output datasets/longmemeval_s_cleaned.zh.json.gz
```

Depending on the machine, lower values such as `--processes 32` may be more efficient.
Use Ctrl+C to stop a run.
The parent process catches the interrupt and stops worker processes to avoid leaving translation workers running in the background.
This repository is a translation pipeline and dataset-construction utility. If you publish generated datasets or benchmark results, make sure to comply with the license and usage terms of the original LongMemEval dataset and any upstream data sources it depends on.
If you distribute generated Chinese data, clearly document:
- The source dataset version.
- The translation tool and version.
- The exact command used to generate the output.
- Any manual filtering, correction, or validation steps.
This project's code is released under the MIT License. See LICENSE for details.
The dataset files and any generated translations may be subject to the license and usage terms of the original LongMemEval dataset and its upstream sources. The MIT License for this repository does not override those dataset-specific terms.