Skip to content

Releases: NVIDIA-NeMo/Gym

v0.4.0

Choose a tag to compare

@nemo-automation-bot nemo-automation-bot released this 02 Jul 19:28
d67ad66

Release Summary

NeMo Gym v0.4.0 expands evaluation tooling and agent integrations. It establishes a new monthly release cadence; we will continue to provide day-zero support for Nemotron models, datasets, and environments.

Highlights:

  • Unified gym CLI: find agents and benchmarks by name with gym list, and catch config mistakes early with gym env validate
  • Diagnose evaluations with BLADE, an analysis skill for agents that reads your evaluation results and produces an evidence-backed report of which tasks failed, why, and the highest-impact fix (e.g. to the agent harness, training, verifier, or prompt)
  • Measure the impact of agent skills: run the same tasks with different skill sets and compare how each changes agent performance
  • Run agents in isolated sandboxes through a new pluggable provider framework
  • More agent harnesses out of the box, including OpenClaw, Pi, and OpenCode
  • Connect to hosted inference providers: Fireworks, Together.ai, OpenRouter, and more
  • New benchmarks across science, long-context, and interactive tasks

First-Time Contributors

We welcomed 20+ new contributors to this release! A few highlights:

  • @marta-sd and @wprazuch led the CLI refactor and clearer config errors
  • @hemildesai added the pluggable sandbox provider infrastructure and OpenSandbox as the first built-in
  • @adil-a laid the groundwork for Gym-owned MCP resources servers, letting a server expose its tools over MCP
  • @eric-tramel added the BunsenChem chemistry benchmark
  • @jeffwillette added the long machine translation datasets and servers

Thank you to all the new contributors for helping make NeMo Gym better!

Command Line Interface

  • One gym command for the full workflow, with gym env, gym eval, gym list, and gym dataset subcommands
  • Reference agents, benchmarks, and environments by name: use gym list to see what is available
  • gym env validate checks your config for missing, malformed, or empty values before a run and reports actionable errors

Evaluation & Diagnostics

  • Skill evaluation: measure how agent skills affect performance by running the same tasks with different skill sets. Skills apply at rollout time as a run-level knob, so one dataset works across all skill variants and every rollout is tagged for comparison
  • BLADE (Benchmark Level Analysis and Diagnostics Engine): a built-in analysis skill that reads an agent run's rollouts, metrics, and configs and produces an evidence-backed report of which tasks failed, why, and the highest-impact fix (e.g. harness, training, verifier, or prompt)

Sandboxing

  • Run tool-using and coding agents in isolated sandboxes through a pluggable provider framework
  • Built-in OpenSandbox and Apptainer providers, with third-party providers discoverable via entry points

Configure Agent Harnesses

New harnesses join the existing built-in set (Claude Code, Hermes, OpenHands, and more):

  • Added OpenCode, OpenClaw, and Pi agents for evaluation
  • Claude Code runtime capabilities (tool access, MCP servers, and bare vs. native auto-discovery mode) are now easily set via the server config

Configure Models

  • New inference_provider model server connects to any OpenAI-compatible hosted provider (Fireworks, Together.ai, OpenRouter, DeepInfra, Gemini, and more) with ready-made configs
  • Every Gym model server now speaks the Anthropic Messages API, so Anthropic-native harnesses like the Claude Code CLI can run against any model you serve with Gym

New Benchmarks

  • Science: CritPt (research-level physics), SciCode (scientific coding), BunsenChem (chemistry multiple-choice), and FrontierScience Research (rubric-scored science)
  • Long context: Graphwalks (long-context graph reasoning) and Long Machine Translation (PG19, WMT24++)
  • Interactive: TALES, a text-adventure game suite

See the Available Environments table for the full list.

Deprecation Notices

  • The legacy ng_* and nemo_gym_* CLI commands (such as ng_run and ng_collect_rollouts) are deprecated in favor of the unified gym CLI. They still work for now but will be removed in a future release.

Bug Fixes

  • Fixed intermittent connection errors during high-concurrency rollout collection
  • Clear error messages instead of crashes when a config file contains invalid YAML

Documentation

  • New Build Verifiers section with verification patterns and multi-reward verification
  • New Evaluate section covering benchmarks, evaluation metrics, and a guide to agent-native results diagnostics
  • New page for configuring and evaluating agent skills
Full Changelog
Read more

v0.3.0

Choose a tag to compare

@nemo-automation-bot nemo-automation-bot released this 04 Jun 15:53
4c44cf9

Release Summary

NeMo Gym v0.3.0 ships alongside the NVIDIA Nemotron 3 Ultra model release, open sourcing the environments and corresponding datasets used during training.

Highlights:

  • 70+ new environments, including benchmarks such as Tau2 and Nemotron RL training environments
  • Popular harness available out-of-the-box such as Claude Code and Hermes
  • Integrations with OpenEnv and Harbor - use environments from these libraries directly with NeMo Gym
  • Integration with VeRL - train with VeRL and scale rollout collection with NeMo Gym

First-Time Contributors

We welcomed 30+ new contributors to this release! Here are a few highlights:

  • @grace-lam added the integration to run Harbor environments with NeMo Gym
  • @aleksficek — added Competitive Coding Challenges environment
  • @jthomson04 improved rollout resilience when models emit malformed tool-call arguments or missing message content

Thank you to all the new contributors for helping make NeMo Gym better!

New Environments & Benchmarks

Added 70+ new environments including novel datasets and integrations of popular benchmarks. New coverage spans:

  • Coding — competitive programming, code infilling, SQL generation, and software-engineering benchmarks with execution-based verification
  • Math & proofs — olympiad-style problems, proof grading and validation, and formal verification (including Lean)
  • Knowledge & science — graduate-level QA, chemistry and physics tasks, and lab-style reasoning (including multimodal figure, table, and protocol tasks)
  • Agentic — multi-turn tool use, search, sandboxed execution, finance workflows, and tau-bench-style conversational agents
  • Instruction following — format constraints, citation compliance, and IFBench-style rule verification
  • Safety & RLHF — jailbreak detection, abstention calibration, prompt-injection resistance, and generative reward modeling
  • Multimodal, speech & translation — VLM benchmarks, visual grounding, ASR evaluation, and machine-translation quality metrics
  • Chat & broad knowledge — arena-style preference evaluation and MMLU-family benchmarks
  • Interactive RL — Gymnasium-style multi-step environments for spatial and game-based training

See the Available Environments table for the full list.

Configure Agent Harnesses

  • Claude Code — available out of the box in NeMo Gym
  • Hermes — available out of the box in NeMo Gym
  • LangGraph agent — an adapter that lets you build custom agents using LangGraph patterns (reflection, subagent orchestration, parallel thinking, rewoo)
  • Gymnasium agent — generic multi-turn harness for use with OpenAI Gym-style environments

Configure Models

  • Optional max_concurrent_requests on the OpenAI model server to cap in-flight API calls — useful for rate-limited external endpoints when rollout concurrency is high

Rollout Collection & Profiling

  • New ng_aggregate_rollouts command to merge rollout shards collected independently across multiple nodes, enabling distributed eval without requiring a single coordinated collection job

Environment Library Integrations

  • OpenEnv — combine OpenEnv environments with NeMo Gym environments
  • Harbor — combine Harbor environments with NeMo Gym environments

Deprecation Notices

  • Documentation has moved from Sphinx to Fern. Old Sphinx URLs redirect to the new site at docs.nvidia.com/nemo/gym. The docs/ directory is no longer used for publishing.

Bug Fixes

  • Fixed aiohttp connection limit exhaustion under FastAPI/Uvicorn with multiple workers
  • Fixed session cookie propagation for Starlette >= 1.0.0
  • Fixed duplicated usage counting and errors on empty usage in subsequent model calls
  • Improved rollout resilience when models emit malformed tool-call arguments or missing message content
  • Fixed prompt-key hashing when inputs contain Pydantic BaseModel objects

Documentation

  • New concepts pages for environments, evaluation, and training
  • Improved Architecture page to clarify how environments map to NeMo Gym components
  • Consolidated detailed setup and quickstart into a single improved quickstart with clearer descriptions
  • Expanded Ecosystem page with environment library, training framework, and agent harness integrations
Changelog Details
Read more

v0.2.1

Choose a tag to compare

@chtruong814 chtruong814 released this 15 Apr 22:52
27e9211

v0.2.0

Choose a tag to compare

@bxyu-nvidia bxyu-nvidia released this 11 Mar 15:03
3e587db

Release Summary

NeMo Gym v0.2.0 ships alongside the NVIDIA Nemotron 3 Super model release, open sourcing the RL environments and corresponding datasets used during training. Highlights:

  • 17 new training environments across coding, math, science, reasoning, agentic tasks, and safety.
  • Integrations with Future House Aviary, Open-Thought Reasoning Gym, and Prime Intellect Verifiers let you use environments from these libraries directly within NeMo Gym
  • End-to-end rollout collection with a locally managed vLLM server
  • Install directly from PyPI with pip install nemo-gym

First-Time Contributors

We welcomed 15 new contributors to this release! Here are a few highlights:

  • @sidnarayanan added the Aviary integration to enable training on any Aviary environment, a library of interactive RL environments spanning math, science, biology, and more
  • @3mei added the text-to-SQL environment to generate SQL queries from natural language across multiple SQL dialects
  • @Kelvin0110 added the NewtonBench environment to discover scientific laws through interactive experimentation

Thank you to all the new contributors for helping make NeMo Gym better!

Major Features & Improvements

New Environments

  • Added 17 new resources servers spanning:
    • Coding: Text to SQL (#648), SWE RL Gen (#561), SWE RL LLM Judge (#561)
    • Math: Lean4 Mathematical Proofs (#563)
    • Science: Aviary (#55), NewtonBench (#650)
    • Reasoning: MultiChallenge (#654), ARC-AGI (#105), Reasoning Gym (#113)
    • Agent tasks: xLAM Function Calling (#262), Tavily Search (#825), Single Step Tool Use with Argument Comparison (#825), Terminus Judge (#594), NeMo Skills Tools (#571)
    • Safety: Jailbreak Detection (#825), Over Refusal Detection (#825)
    • RLHF: Generative Reward Model Compare (#674)
  • Added 5 new agent servers: Aviary agent (#55), proof refinement agent (#563), SWE agents (#343), tool simulation agent (#826), and verifiers agent (#573)

Environment Library Integrations
Combine environments from other libraries with NeMo Gym environments

  • Future House Aviary (#55, #590)
  • Open-Thought Reasoning Gym (#113)
  • Prime Intellect Verifiers (#573)

Model Serving

  • Local vLLM model server with end-to-end rollout collection without an external API (#558, #762)
  • vLLM 0.16+ support for the reasoning field in responses (#816)
  • VLLMModel chat template kwargs support (#538, #636)
  • Per-task chat template and extra body args, enabling per-task control of reasoning mode and thinking budget (#672)

Rollout Collection & Profiling

  • New ng_reward_profile command to compute per-task pass rates and aggregate metrics (#83, #621)
  • CPU profiling for rollout performance analysis (#763)
  • Add option for seeding on num_repeats for rollouts (#740)

Infrastructure & Developer Experience

  • PyPI compatibility: install via pip install nemo-gym (#649)
  • Dry run mode: ng_run +dryrun=true to validate configs and install environments without starting servers (#743)
  • ng_status command to list running servers and their health (#290)
  • Server stdout/stderr redirection with server name prefixes (#703)
  • FastAPI worker support for higher throughput across multiple workers (#566)

Model Recipes

  • Nemotron 3 Nano training recipe (#699)
  • Nemotron 3 Super training recipe (#863)

Deprecation Notices

  • Deprecated ng_viewer due to a Gradio security vulnerability. We plan to revisit rollout viewing with a more robust solution in a future release.

Bug Fixes

  • Fixed 0.1.1 environments to work correctly with RL training pipelines (#768)
  • Fixed crash when server receives malformed JSON during rollout collection (#770)
  • Fixed dry run mode failing (#746)
  • Fixed nested responses_create_params overrides not merging correctly from CLI (#827)
  • Fixed ng_prepare_data failing when multiple environments define overlapping metrics (#738)
  • Fixed reward profiling failing when model response doesn't include usage stats (#824)
  • Fixed NeMo-Skills python tool to use HTTP calls instead of subprocess execution (#606)
  • Bumped Pillow and other packages to address security vulnerabilities (#667, #739)
  • ng_dump_config now redacts API key values from output (#567)

Documentation

  • New training tutorials: Unsloth training with NeMo Gym, multi-environment training
  • New environment tutorials: creating a training environment, custom data preparation, integrating external environment libraries, environment best practices
  • Model recipes: reproduce the training for Nemotron 3 Nano and Nemotron 3 Super
  • Concepts & architecture overhaul: rewrote concepts docs, added architecture diagrams, added agent server and resources server docs
  • Training approaches: added training approaches docs page covering SFT, RL (GRPO), and RLVR
  • Ecosystem page: revamped ecosystem page with training framework integrations and environment library integrations
  • Infrastructure: added SWE RL infrastructure case study, deployment topology docs
  • Quality pass: redirect sweep, style guide sweep, consistent naming, FAQ additions, broken link fixes

Looking Ahead

  • VLM support: add support for VLM models and environments with images, e.g. browser environments and computer use agent (CUA) environments
  • Benchmark environments: add popular OSS environments such as OSWorld, Tau Bench, BrowseComp
  • Integrate existing agents: integrate popular existing agents, e.g. coding harnesses, as well as agents developed via popular agent frameworks, e.g. LangGraph
  • Environment tutorials: incorporate more complex agentic loops during training such as multi-turn conversation and user modeling

Release Assets

GitHub Release: https://github.com/NVIDIA-NeMo/Gym/releases/tag/v0.2.0
Container: nvcr.io/nvidia/nemo-rl:v0.5.0.nemotron_3_super

What's Changed

Read more

v0.1.1

Choose a tag to compare

@bxyu-nvidia bxyu-nvidia released this 15 Dec 00:32

What's Changed

New Contributors

Full Changelog: v0.1.0...v0.1.1

v0.1.0

Choose a tag to compare

@bxyu-nvidia bxyu-nvidia released this 15 Nov 01:06

What's Changed

Read more