GitHub - camel-ai/loong: 🐉 Loong: Synthesize Long CoTs at Scale through Verifiers.

Community | Cookbook | Datasets | Loong Blog | Contributing | CAMEL-AI

🐉 Project Loong is a collaborative effort to explore whether reasoning-capable models can bootstrap themselves from small, high-quality seed datasets by generating synthetic data and verifying LLM agent responses.

🌟 Star Loong on GitHub to stay updated, or join our Initiative Program

We invite researchers and developers to contribute seed datasets, verifiers, and ideas to help improve and extend our project. Ready to join? Click the link below to apply now.

Agent-Environment Loop

Project Loong leverages a Generator to create synthetic questions/answers from seed datasets, while a Verifier evaluates the correctness of those responses. A Trainable Agent then learns iteratively from these verified Q&As, enabling scalable self-improvement through reinforcement learning and more advanced strategies.
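The loop described above can be sketched in a few lines. This is a minimal illustration only, not Loong's actual API: `collect_verified` and the `toy_*` stand-ins are hypothetical names, and a real setup would plug in an LLM-backed generator and agent plus a domain-specific verifier.

```python
from typing import Callable

# Hypothetical callables for illustration; none of these names come from the Loong codebase.
GenerateFn = Callable[[dict], dict]    # seed datapoint -> {"question", "final_answer"}
AgentFn = Callable[[str], str]         # question -> agent response
VerifyFn = Callable[[str, str], bool]  # (response, reference answer) -> is it correct?


def collect_verified(seeds: list[dict], generate: GenerateFn,
                     agent: AgentFn, verify: VerifyFn) -> list[dict]:
    """Run one pass of the loop and keep only responses the verifier accepts."""
    verified = []
    for seed in seeds:
        item = generate(seed)                      # Generator: synthesize a Q&A pair
        response = agent(item["question"])         # Agent: attempt the question
        if verify(response, item["final_answer"]):  # Verifier: filter by correctness
            verified.append({"question": item["question"], "response": response})
    return verified


# Toy stand-ins so the sketch runs without any model.
def toy_generate(seed: dict) -> dict:
    a, b = seed["a"], seed["b"]
    return {"question": f"{a} + {b}", "final_answer": str(a + b)}


def toy_agent(question: str) -> str:
    # Deliberately imperfect: gives up when the first operand is 10 or larger.
    a, b = (int(x) for x in question.split(" + "))
    return str(a + b) if a < 10 else "unknown"


def toy_verify(response: str, reference: str) -> bool:
    return response.strip() == reference.strip()


verified = collect_verified(
    [{"a": 1, "b": 2}, {"a": 20, "b": 3}], toy_generate, toy_agent, toy_verify
)
```

Only the verified pairs would be passed on as training signal for the agent; the rejected response for `20 + 3` is dropped.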

🔍 What's in this Repo?

1. 📊 Seed Datasets — Real, human-vetted data from computable domains like math, physics, finance, etc.

2. 📘 Cookbooks — Modular scripts for synthetic data generation, verification, and RL training loops.

📊 Seed Datasets →

A collection of seed datasets, structured for generation and verification and organized by domain.

Each datapoint includes:

question
final_answer
rationale (typically code)
metadata (license, source, domain, required_dependencies, name, contributor, date_created, difficulty, tags, anything else)

Each dataset is designed to allow automatic evaluation via verifiers, usually by executing the rationale code and comparing the output to the known answer.
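As a rough sketch of execution-based verification, the snippet below runs a datapoint's rationale in a subprocess and compares its printed output with `final_answer`. It assumes the rationale is a Python script that prints the answer to stdout; the helper names (`run_rationale`, `answers_match`, `verify_datapoint`) and the example metadata values are hypothetical, and the subprocess call is not a sandbox.

```python
import math
import subprocess
import sys


def run_rationale(code: str, timeout_s: float = 10.0) -> str:
    """Execute rationale code in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()


def answers_match(produced: str, expected: str, rel_tol: float = 1e-6) -> bool:
    """Compare numerically when both values parse as floats, else as strings."""
    try:
        return math.isclose(float(produced), float(expected), rel_tol=rel_tol)
    except ValueError:
        return produced.strip() == expected.strip()


def verify_datapoint(datapoint: dict) -> bool:
    """Return True if running the rationale reproduces the known answer."""
    produced = run_rationale(datapoint["rationale"])
    return answers_match(produced, str(datapoint["final_answer"]))


# Illustrative datapoint using the fields listed above; values are made up.
example = {
    "question": "What is the sum of the integers from 1 to 100?",
    "final_answer": "5050",
    "rationale": "print(sum(range(1, 101)))",
    "metadata": {"domain": "advanced_math", "license": "MIT"},
}

is_valid = verify_datapoint(example)
```

Untrusted rationale code should be run in an isolated environment, with the packages named in `required_dependencies` installed there.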

Dataset Overview

The repository currently includes a total of 3,551 questions spanning 8 diverse domains (and growing!):

🧮 Advanced Math: 1,615 questions
⚛️ Advanced Physics: 434 questions
🧬 Computational Biology: 304 questions
💹 Finance: 320 questions
📈 Graph & Discrete Math: 179 questions
🧠 Logic: 110 questions
📐 Mathematical Programming: 68 questions
🔒 Security & Safety: 521 questions

We have combined all the datasets into a single file: data/all_seed_dataset.json. You can also find each domain's dataset in the corresponding folder.
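To check the per-domain counts yourself, something like the following may help. It is a sketch under two assumptions that are not confirmed here: that the combined file is a JSON list of datapoints, and that each datapoint stores its domain under `metadata["domain"]`. Adjust the accessors if the actual layout differs.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Load the combined seed file; assumes a top-level JSON list."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_by_domain(records: list[dict]) -> Counter:
    """Tally datapoints per domain, using 'unknown' when the field is missing."""
    return Counter(
        record.get("metadata", {}).get("domain", "unknown") for record in records
    )


# Small in-memory sample standing in for the real file.
sample = [
    {"question": "q1", "metadata": {"domain": "finance"}},
    {"question": "q2", "metadata": {"domain": "logic"}},
    {"question": "q3", "metadata": {"domain": "finance"}},
    {"question": "q4"},
]
counts = count_by_domain(sample)
```

On the real data, `count_by_domain(load_records("data/all_seed_dataset.json"))` should total 3,551 across the 8 domains if the assumptions above hold.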

Tip

Want to contribute your own? See the CONTRIBUTING.md for seed datasets.

📘 Cookbooks →

Reusable scripts and notebooks for:

Few-shot prompting from seed data
Generating synthetic questions, rationales, and answers
Running verifiers over generations
Exporting datasets for supervised fine-tuning or RL

These pipelines allow you to condition generations on real data, verify outputs, and build consistent synthetic traces.
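A minimal sketch of two of these steps, few-shot prompting from seed data and exporting verified results for fine-tuning, might look like this. The function names, the prompt wording, and the `prompt`/`completion` JSONL fields are illustrative assumptions, not the format used by the cookbooks.

```python
import json
import random


def build_few_shot_prompt(seeds: list[dict], k: int = 2, seed: int = 0) -> str:
    """Sample k seed datapoints and format them as examples for a generator model."""
    rng = random.Random(seed)  # fixed seed keeps the sampled shots reproducible
    shots = rng.sample(seeds, k=min(k, len(seeds)))
    blocks = [
        f"Question: {s['question']}\n"
        f"Rationale:\n{s['rationale']}\n"
        f"Final answer: {s['final_answer']}"
        for s in shots
    ]
    instruction = (
        "Write one new problem in the same style, "
        "with a rationale and a final answer."
    )
    return "\n\n".join(blocks + [instruction])


def to_sft_jsonl(verified: list[dict]) -> str:
    """Serialize verified question/response pairs as one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": v["question"], "completion": v["response"]})
        for v in verified
    )


# Toy seed data using the datapoint fields described earlier.
seeds = [
    {"question": "2 + 2?", "rationale": "print(2 + 2)", "final_answer": "4"},
    {"question": "3 * 3?", "rationale": "print(3 * 3)", "final_answer": "9"},
    {"question": "10 - 7?", "rationale": "print(10 - 7)", "final_answer": "3"},
]
prompt = build_few_shot_prompt(seeds, k=2)
jsonl = to_sft_jsonl([{"question": "2 + 2?", "response": "4"}])
```

Conditioning on sampled seed examples keeps generations close to the seed distribution, and exporting only verified pairs keeps unverified outputs out of the training set.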

🧬 Contributing to Project Loong 🐉

We're looking for:

Seed datasets in verifiable domains
New verifiers
Cookbook improvements
Experimental environments for RL

We greatly appreciate your interest in contributing to our open-source initiative. To ensure a smooth collaboration and the success of contributions, we adhere to a set of contributing guidelines similar to those established by CAMEL. For a comprehensive understanding of the steps involved in contributing to our project, please refer to the CAMEL Contributing Guidelines. 🤝

📜 License

Code: LICENSE
Data: Per-dataset license in metadata.json

👥 Maintainers & Contact

Project Loong is led by the CAMEL team, with contributors from across the open-source AI research community.

If you're keen on exploring new research opportunities or discoveries with our platform and wish to dive deeper or suggest new features, we're here to talk. Feel free to get in touch for more details at camel-ai@eigent.ai.

Join us on Discord or WeChat to help push the boundaries of finding the scaling laws of agents.

