Commit 22d082f

[recipe] feat: add open math reasoning (#3767)
### What does this PR do?

- Add open math reasoning recipe using sft trainer with model engine
- Support setting none to val dataset in sft trainer
- Fix main_eval
- Using aiohttp for main_generation_server to avoid hang in AsyncOpenAI

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 8ec9bf6 commit 22d082f

File tree: 12 files changed, +459 / -56 lines
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@

# Open math reasoning

## Introduction

In this recipe, we perform SFT on the [open math reasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) dataset using the new SFT trainer with the backend-agnostic model engine. Note that our goal is not to replicate the [AIMO-2 Winning Solution](https://arxiv.org/abs/2504.16891) work, but to demonstrate an end-to-end SFT workflow.

Note that you may need to modify the paths in the following scripts.
## Dataset Preprocessing

### Download Dataset

```bash
hf download nvidia/OpenMathReasoning --repo-type dataset --include data/cot* --local-dir /path/to/dataset/nvidia/OpenMathReasoning
hf download math-ai/aime24 --repo-type dataset --local-dir /path/to/dataset/math-ai/aime24
hf download math-ai/aime25 --repo-type dataset --local-dir /path/to/dataset/math-ai/aime25
```

### Preprocess the dataset

```bash
python3 recipe/open_math_reasoning/prepare_nvidia-OpenMathReasoning_sft.py --local_dataset_path /path/to/nvidia/OpenMathReasoning --local_save_dir /path/to/open_math_reasoning
```
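The preprocessing script writes one SFT record per row: a `messages` list (user problem, assistant solution, each carrying a `loss_mask`) plus an `extra_info` dict. Below is a minimal sketch for spot-checking the output, assuming `pandas` and `pyarrow` are installed; the path is a placeholder.

```python
# Spot-check the preprocessed SFT parquet (sketch only; adjust the path as needed).
import pandas as pd

df = pd.read_parquet("/path/to/open_math_reasoning/cot_dataset.parquet")
print(len(df), "records")

for msg in df.iloc[0]["messages"]:
    # loss_mask is 0 for the user turn and 1 for the assistant turn,
    # so only the solution tokens contribute to the SFT loss.
    print(msg["role"], msg["loss_mask"], str(msg["content"])[:80])
```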
### Prepare the eval dataset

```bash
python3 recipe/open_math_reasoning/prepare_eval_dataset.py --local_dataset_path /path/to/dataset --local_save_dir /path/to/eval_dataset
```
## Train the model using SFT

### FSDP backend

```bash
export CKPT_HOME=/path/to/ckpt
export BACKEND=fsdp2
export MODEL_ID=Qwen/Qwen3-8B-Base
export TRAIN_FILES=/path/to/open_math_reasoning/cot_dataset.parquet
bash recipe/open_math_reasoning/run_sft_qwen3_8b.sh
```

### Megatron backend

TODO
## Eval the model

### Merge the checkpoint into Hugging Face format

```bash
python -m verl.model_merger merge --backend fsdp --local_dir /path/to/ckpt/global_step_19751 --target_dir /path/to/ckpt/global_step_19751/huggingface
```

### Generate the responses

```bash
export MODEL_PATH=/path/to/ckpt/global_step_19751/huggingface
bash recipe/open_math_reasoning/run_generation.sh
```

### Evaluate the responses

```bash
bash recipe/open_math_reasoning/run_eval.sh
```

You should see results like:

```python
{'test_score/aime24': 0.584375, 'test_score/aime25': 0.43333333333333335}
```
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@

# Copyright 2025 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


def compute_score_data_source(data_source, response, ground_truth):
    from verl.utils.reward_score.math_reward import compute_score

    if data_source in ["aime24", "aime25"]:
        return compute_score(response, ground_truth)
    else:
        raise ValueError(f"Unknown data source: {data_source}")
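main_eval hands each generated response to this function together with its data source and ground truth. A hedged sketch of calling it directly: the response text is made up, the return-value comment reflects verl's rule-based math reward, and the import assumes you run from the repo root with it on `PYTHONPATH`.

```python
# Hypothetical direct call, mirroring how main_eval passes
# (data_source, response, ground_truth) to the custom reward function.
from recipe.open_math_reasoning.compute_score import compute_score_data_source

score = compute_score_data_source(
    data_source="aime24",
    response="... step-by-step reasoning ... The final answer is \\boxed{204}.",
    ground_truth="204",
)
print(score)  # expected to be 1.0 when the boxed answer matches the ground truth
```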
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@

# Copyright 2025 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# prepare eval dataset including AIME'24, AIME'25

# hf download math-ai/aime24 --repo-type dataset --local-dir /opt/tiger/datasets/math-ai/aime24
# hf download math-ai/aime25 --repo-type dataset --local-dir /opt/tiger/datasets/math-ai/aime25

import os

import datasets

from verl.utils.reward_score.math_reward import remove_boxed

instruction_following = "Please reason step by step, and put your final answer within \\boxed{}."


def make_map_fn(data_source):
    def process_fn(example, idx):
        question_raw = example.pop("problem")

        question = question_raw + " " + instruction_following

        if "solution" not in example:
            example["solution"] = example["answer"]

        answer_raw = example.pop("solution")

        example.clear()

        try:
            solution = remove_boxed(answer_raw)
        except Exception:
            solution = answer_raw

        data = {
            "data_source": data_source,
            "prompt": [
                {
                    "role": "user",
                    "content": question,
                }
            ],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": solution},
            "extra_info": {
                "index": idx,
                "answer": answer_raw,
                "question": question_raw,
            },
        }
        return data

    return process_fn


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dataset_path", default=None, help="The local path to the raw dataset, if it exists.")
    parser.add_argument(
        "--local_save_dir", default="~/data/math-ai", help="The save directory for the preprocessed dataset."
    )

    args = parser.parse_args()

    if args.local_dataset_path is not None:
        aime24_dataset_path = os.path.join(args.local_dataset_path, "math-ai/aime24")
        aime25_dataset_path = os.path.join(args.local_dataset_path, "math-ai/aime25")
    else:
        aime24_dataset_path = "math-ai/aime24"
        aime25_dataset_path = "math-ai/aime25"

    aime24_dataset = datasets.load_dataset(aime24_dataset_path, split="test")
    aime25_dataset = datasets.load_dataset(aime25_dataset_path, split="test")

    aime24_dataset = aime24_dataset.map(function=make_map_fn("aime24"), with_indices=True)
    aime25_dataset = aime25_dataset.map(function=make_map_fn("aime25"), with_indices=True)

    local_save_dir = os.path.expanduser(args.local_save_dir)
    os.makedirs(local_save_dir, exist_ok=True)

    aime24_dataset.to_parquet(os.path.join(local_save_dir, "aime24_test.parquet"))
    aime25_dataset.to_parquet(os.path.join(local_save_dir, "aime25_test.parquet"))
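Each processed row follows the rule-based reward schema: the instruction is appended to the problem, and the boxed answer becomes the ground truth (falling back to the raw answer when unboxing fails). As a rough illustration, with made-up field values, a record looks like the sketch below.

```python
# Illustrative shape of one preprocessed AIME evaluation record (values are made up).
example_record = {
    "data_source": "aime24",
    "prompt": [
        {
            "role": "user",
            "content": "Find the number of ... Please reason step by step, and put your final answer within \\boxed{}.",
        }
    ],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "204"},
    "extra_info": {"index": 0, "answer": "204", "question": "Find the number of ..."},
}
```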
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@

# Copyright 2025 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
huggingface-cli download nvidia/OpenMathReasoning --repo-type dataset --include data/cot* \
    --local-dir /path/to/nvidia/OpenMathReasoning
huggingface-cli download nvidia/OpenMathReasoning --repo-type dataset --include data/cot* \
    --local-dir /opt/tiger/nvidia/OpenMathReasoning
"""

import argparse
import os

import datasets

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dataset_path", default=None, help="The local path to the raw dataset, if it exists.")
    parser.add_argument(
        "--local_save_dir",
        default="~/data/open_math_reasoning",
        help="The save directory for the preprocessed dataset.",
    )

    args = parser.parse_args()
    local_dataset_path = args.local_dataset_path

    data_source = "nvidia/OpenMathReasoning"

    if local_dataset_path is not None:
        dataset = datasets.load_dataset(local_dataset_path, split="cot")
    else:
        dataset = datasets.load_dataset(data_source, split="cot")

    def make_map_fn(split):
        def process_fn(example, idx):
            question = example.pop("problem")
            solution = example.pop("generated_solution")

            extra_info = {}
            for key, value in example.items():
                extra_info[key] = value
            example.clear()

            data = {
                "messages": [
                    {"role": "user", "content": question, "loss_mask": 0},
                    {"role": "assistant", "content": solution, "loss_mask": 1},
                ],
                "extra_info": extra_info,
            }
            return data

        return process_fn

    # filter out data where the problem_type is not has_answer_extracted
    dataset = dataset.filter(lambda example: example["problem_type"] == "has_answer_extracted")
    dataset = dataset.map(function=make_map_fn("cot"), with_indices=True)
    local_save_dir = os.path.expanduser(args.local_save_dir)
    os.makedirs(local_save_dir, exist_ok=True)
    dataset.to_parquet(os.path.join(local_save_dir, "cot_dataset.parquet"))
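The per-message `loss_mask` (0 on the user problem, 1 on the generated solution) marks which turn is trained on. The sketch below only illustrates the general idea of turn-level loss masking after tokenization; it is not verl's actual trainer code.

```python
# Conceptual sketch of turn-level loss masking (not verl's actual SFT trainer code).
import torch
import torch.nn.functional as F


def masked_sft_loss(logits: torch.Tensor, labels: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """logits: [T, V], labels: [T], loss_mask: [T] with 1 on assistant tokens."""
    per_token = F.cross_entropy(logits, labels, reduction="none")  # [T]
    mask = loss_mask.float()
    # Average only over tokens that belong to the assistant (loss_mask == 1) turn.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


# Toy usage with random tensors: the first three tokens play the role of the user turn.
T, V = 8, 32
loss = masked_sft_loss(
    torch.randn(T, V),
    torch.randint(0, V, (T,)),
    torch.tensor([0, 0, 0, 1, 1, 1, 1, 1]),
)
print(loss.item())
```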
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@

#!/usr/bin/env bash

# Evaluation
python3 -m verl.trainer.main_eval \
    data.path=$HOME/data/gen/qwen_8b_gen_test.parquet \
    custom_reward_function.path=recipe/open_math_reasoning/compute_score.py \
    custom_reward_function.name=compute_score_data_source
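main_eval loads the generated parquet, scores every sampled response with the custom reward function configured above, and reports a mean per data source; this is where the `test_score/aime24` and `test_score/aime25` numbers in the README come from. The following is only a rough sketch of that aggregation: the `responses` column name and the parquet path are assumptions, and it is not verl's actual implementation.

```python
# Rough sketch of the per-data-source aggregation performed at eval time
# (not verl's actual code; the "responses" column and output path are assumptions).
from collections import defaultdict

import pandas as pd

from recipe.open_math_reasoning.compute_score import compute_score_data_source  # run from repo root

df = pd.read_parquet("/path/to/gen/qwen_8b_gen_test.parquet")  # placeholder path
scores = defaultdict(list)
for _, row in df.iterrows():
    ground_truth = row["reward_model"]["ground_truth"]
    for response in row["responses"]:  # rollout.n responses generated per prompt
        scores[row["data_source"]].append(
            compute_score_data_source(row["data_source"], response, ground_truth)
        )

print({f"test_score/{src}": sum(vals) / len(vals) for src, vals in scores.items()})
```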
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@

#!/usr/bin/env bash

MODEL_PATH=${MODEL_PATH:-/path/to/ckpt/global_step_19751/huggingface}

NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}
NNODES=${NNODES:-1}
OUTPUT_PATH=${OUTPUT_PATH:-$HOME/data/gen/qwen_8b_gen_test.parquet}
GEN_TP=${GEN_TP:-1}  # Tensor parallel size for generation (defaults to 1)

aime24_test_path=${HOME}/data/math-ai/aime24_test.parquet
aime25_test_path=${HOME}/data/math-ai/aime25_test.parquet
train_files="['$aime24_test_path', '$aime25_test_path']"

python3 -m verl.trainer.main_generation_server \
    trainer.nnodes="${NNODES}" \
    trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
    actor_rollout_ref.model.path="${MODEL_PATH}" \
    actor_rollout_ref.model.trust_remote_code=True \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.top_p=0.7 \
    actor_rollout_ref.rollout.prompt_length=2048 \
    actor_rollout_ref.rollout.response_length=20480 \
    actor_rollout_ref.rollout.tensor_model_parallel_size="${GEN_TP}" \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.n=32 \
    data.train_files="$train_files" \
    data.prompt_key=prompt \
    +data.output_path="${OUTPUT_PATH}"
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
#!/usr/bin/env bash
2+
set -xeuo pipefail
3+
4+
ENTRYPOINT=${ENTRYPOINT:-"-m verl.trainer.sft_trainer"}
5+
6+
TRAIN_FILES=${TRAIN_FILES:-/path/to/cot_dataset.parquet}
7+
8+
backend=${BACKEND:-fsdp}
9+
10+
project_name=verl_sft_test
11+
12+
RESUME_MODE=auto
13+
MODEL_ID=${MODEL_ID:-Qwen/Qwen3-8B-Base}
14+
15+
SP_SIZE=${SP_SIZE:-8}
16+
FSDP_SIZE=${FSDP_SIZE:-16}
17+
FSDP_STRATEGY=${FSDP_STRATEGY:-"fsdp2"}
18+
19+
TP_SIZE=${TP_SIZE:-1}
20+
PP_SIZE=${PP_SIZE:-1}
21+
VPP_SIZE=${VPP_SIZE:-null}
22+
CP_SIZE=${CP_SIZE:-1}
23+
24+
PAD_MODE=${PAD_MODE:-no_padding}
25+
26+
USE_REMOVE_PADDING=${USE_REMOVE_PADDING:-True}
27+
28+
FSDP_ENGINE_CONFIG="\
29+
engine=${backend} \
30+
optim=${backend} \
31+
optim.lr=2e-5 \
32+
optim.lr_warmup_steps_ratio=0.01 \
33+
optim.weight_decay=0.1 \
34+
optim.betas="[0.9,0.95]" \
35+
optim.clip_grad=1.0 \
36+
optim.min_lr_ratio=0.1 \
37+
optim.warmup_style=cosine \
38+
engine.ulysses_sequence_parallel_size=${SP_SIZE} \
39+
engine.strategy=${FSDP_STRATEGY} \
40+
engine.fsdp_size=${FSDP_SIZE}"
41+
42+
43+
MEGATRON_ENGINE_CONFIG="\
44+
engine=${backend} \
45+
optim=${backend} \
46+
optim.lr=1e-5 \
47+
optim.lr_warmup_steps_ratio=0.2 \
48+
optim.weight_decay=0.1 \
49+
optim.betas="[0.9,0.95]" \
50+
optim.clip_grad=1.0 \
51+
optim.lr_warmup_init=0 \
52+
optim.lr_decay_style=cosine \
53+
optim.min_lr=1e-6 \
54+
engine.tensor_model_parallel_size=${TP_SIZE} \
55+
engine.pipeline_model_parallel_size=${PP_SIZE} \
56+
engine.virtual_pipeline_model_parallel_size=${VPP_SIZE} \
57+
engine.context_parallel_size=${CP_SIZE}"
58+
59+
if [ "$backend" = "fsdp" ]; then
60+
ENGINE_CONFIG="$FSDP_ENGINE_CONFIG"
61+
echo "Using fsdp engine"
62+
exp_name=nvidia-openmathreasoning-qwen3-8b-${backend}-${FSDP_STRATEGY}-sp${SP_SIZE}-fsdp-1008a1
63+
else
64+
ENGINE_CONFIG="$MEGATRON_ENGINE_CONFIG"
65+
echo "Using megatron engine"
66+
exp_name=nvidia-openmathreasoning-${backend}-tp${TP_SIZE}-pp${PP_SIZE}-vpp${VPP_SIZE}-cp${CP_SIZE}-pad-${PAD_MODE}-use_remove_padding-${USE_REMOVE_PADDING}
67+
fi
68+
69+
CKPT_HOME=${CKPT_HOME:-$HOME/open_verl/sft/${project_name}/${exp_name}}
70+
mkdir -p "${CKPT_HOME}"
71+
72+
torchrun --standalone --nnodes=1 --nproc-per-node=${NUM_TRAINERS:-8} \
73+
${ENTRYPOINT} \
74+
data.train_files="${TRAIN_FILES}" \
75+
data.train_batch_size=96 \
76+
data.max_length=32768 \
77+
data.pad_mode=${PAD_MODE} \
78+
data.truncation=error \
79+
data.use_dynamic_bsz=True \
80+
data.max_token_len_per_gpu=65536 \
81+
data.messages_key=messages \
82+
model.path=$MODEL_ID \
83+
model.use_remove_padding=${USE_REMOVE_PADDING} \
84+
${ENGINE_CONFIG} \
85+
trainer.test_freq=-1 \
86+
trainer.save_freq=4000 \
87+
trainer.logger=['console','wandb'] \
88+
trainer.project_name="${project_name}" \
89+
trainer.experiment_name="${exp_name}" \
90+
trainer.total_epochs=1 \
91+
trainer.default_local_dir="${CKPT_HOME}" \
92+
trainer.resume_mode=${RESUME_MODE} \
93+
trainer.max_ckpt_to_keep=5 \
94+
checkpoint.save_contents=[model,optimizer,extra]

verl/trainer/config/sft_trainer_engine.yaml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ data:
   max_token_len_per_gpu: 8192
   use_dynamic_bsz: True
   train_files: ~/data/gsm8k/train.parquet
-  val_files: ~/data/gsm8k/test.parquet
+  val_files: null
   # Multi-turn settings
   messages_key: messages # Key for messages list in multi-turn mode
   tools_key: tools # Key for tools list in multi-turn mode
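The new `null` default pairs with the run script's `trainer.test_freq=-1`, letting SFT run without any validation set. As a small illustrative check of what `val_files: null` resolves to on the Python side, assuming the config is loaded with OmegaConf (sketch only, not the trainer's actual code):

```python
# Sketch: `val_files: null` in YAML arrives as None, so validation can simply be skipped.
from omegaconf import OmegaConf

cfg = OmegaConf.create({"data": {"train_files": "~/data/gsm8k/train.parquet", "val_files": None}})
if cfg.data.val_files is None:
    print("No val_files configured; skipping validation (see trainer.test_freq=-1).")
else:
    print("Validation enabled on", cfg.data.val_files)
```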
