Project Loong is a collaborative effort to explore whether reasoning-capable models can bootstrap themselves from small, high-quality seed datasets by generating synthetic data and verifying LLM agent responses.
Star Loong on GitHub to stay updated, or join our Initiative Program.


We invite researchers and developers to contribute seed datasets, verifiers, and ideas to help improve and extend our project. Ready to join? Click the link below to apply now.

Project Loong leverages a Generator to create synthetic questions/answers from seed datasets, while a Verifier evaluates the correctness of those responses. A Trainable Agent then learns iteratively from these verified Q&As, enabling scalable self-improvement through reinforcement learning and more advanced strategies.
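The generate-then-verify step described above can be sketched in a few lines. This is a toy illustration, not Project Loong's actual API: `collect_verified`, `toy_generate`, and `toy_verify` are hypothetical names, and the generator here is a plain function standing in for an LLM.

```python
from typing import Callable, Iterable, List


def collect_verified(
    seeds: Iterable[dict],
    generate: Callable[[dict], dict],
    verify: Callable[[dict], bool],
) -> List[dict]:
    """Generate one synthetic Q&A per seed and keep only those the verifier accepts."""
    verified = []
    for seed in seeds:
        candidate = generate(seed)
        if verify(candidate):
            verified.append(candidate)
    return verified


# Hypothetical stand-in for an LLM generator: derives a new addition
# problem from a seed by doubling its operands.
def toy_generate(seed: dict) -> dict:
    a, b = (2 * x for x in seed["operands"])
    return {
        "question": f"What is {a} + {b}?",
        "operands": (a, b),
        "final_answer": str(a + b),
    }


# Hypothetical verifier: recomputes the answer independently.
def toy_verify(qa: dict) -> bool:
    a, b = qa["operands"]
    return str(a + b) == qa["final_answer"]


verified = collect_verified([{"operands": (1, 2)}], toy_generate, toy_verify)
```

In a real pipeline the verified Q&As would then feed the training step; that part is omitted here because it depends on the RL setup in use.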
1. Seed Datasets: Real, human-vetted data from computable domains such as math, physics, and finance.
2. Cookbooks: Modular scripts for synthetic data generation, verification, and RL training loops.
Seed Datasets
A collection of seed datasets, structured for generation and verification and divided by domain.
Each datapoint includes:
- question
- final_answer
- rationale (typically code)
- metadata (license, source, domain, required_dependencies, name, contributor, date_created, difficulty, tags, anything else)
Each dataset is designed to allow automatic evaluation via verifiers, usually by executing the rationale code and comparing the output to the known answer.
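A minimal sketch of that verification step, assuming the rationale is Python code that prints its result to stdout. The function name `verify_datapoint` and the example datapoint are hypothetical, not taken from the repository.

```python
import subprocess
import sys


def verify_datapoint(datapoint: dict, timeout_s: float = 10.0) -> bool:
    """Run the rationale in a fresh interpreter and compare its stdout to final_answer."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", datapoint["rationale"]],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    if result.returncode != 0:
        # The rationale crashed, so it cannot confirm the answer.
        return False
    return result.stdout.strip() == str(datapoint["final_answer"]).strip()


# Hypothetical datapoint using the field names listed above.
example = {
    "question": "What is the sum of the integers from 1 to 100?",
    "final_answer": "5050",
    "rationale": "print(sum(range(1, 101)))",
}
is_correct = verify_datapoint(example)
```

Exact string comparison is the simplest possible check; numeric answers would likely need a tolerance or symbolic comparison, and untrusted rationale code should be run in a sandbox rather than a plain subprocess.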
The repository currently includes a total of 3,551 questions spanning 8 diverse domains (and growing!):
- Advanced Math: 1,615 questions
- Advanced Physics: 434 questions
- Computational Biology: 304 questions
- Finance: 320 questions
- Graph & Discrete Math: 179 questions
- Logic: 110 questions
- Mathematical Programming: 68 questions
- Security & Safety: 521 questions
We have combined all the datasets into a single file: data/all_seed_dataset.json. You can also find each domain's dataset in the corresponding folder.
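As a rough sketch, the combined file could be loaded and tallied per domain as follows. The file layout is an assumption: this code expects a JSON list of datapoint objects with the domain stored under `metadata["domain"]`, which may not match the actual schema. `load_seed_file` and `count_by_domain` are hypothetical helper names.

```python
import json
from collections import Counter
from pathlib import Path


def load_seed_file(path):
    """Load a seed dataset file, assumed to hold a JSON list of datapoint objects."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_by_domain(datapoints):
    """Count datapoints per domain, assuming the domain lives at metadata['domain']."""
    return Counter(
        dp.get("metadata", {}).get("domain", "unknown") for dp in datapoints
    )


if __name__ == "__main__":
    path = Path("data/all_seed_dataset.json")
    if path.exists():
        for domain, n in count_by_domain(load_seed_file(path)).most_common():
            print(f"{domain}: {n}")
```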
Tip: Want to contribute your own? See CONTRIBUTING.md for seed datasets.
Cookbooks
Reusable scripts and notebooks for:
- Few-shot prompting from seed data
- Generating synthetic questions, rationales, and answers
- Running verifiers over generations
- Exporting datasets for supervised fine-tuning or RL
These pipelines allow you to condition generations on real data, verify outputs, and build consistent synthetic traces.
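To illustrate what conditioning generations on real data can look like, here is a minimal few-shot prompt builder. It is a sketch, not one of the project's cookbooks: `build_few_shot_prompt` is a hypothetical helper, and the prompt layout is only one of many reasonable choices.

```python
def build_few_shot_prompt(seed_examples, instruction, k=2):
    """Format the first k seed datapoints as worked examples, followed by an instruction."""
    parts = []
    for i, ex in enumerate(seed_examples[:k], start=1):
        parts.append(
            f"Example {i}\n"
            f"Question: {ex['question']}\n"
            f"Rationale:\n{ex['rationale']}\n"
            f"Final answer: {ex['final_answer']}"
        )
    parts.append(instruction)
    return "\n\n".join(parts)


# Hypothetical seed datapoints using the field names from the Seed Datasets section.
seeds = [
    {"question": "What is 2 + 3?", "rationale": "print(2 + 3)", "final_answer": "5"},
    {"question": "What is 4 * 6?", "rationale": "print(4 * 6)", "final_answer": "24"},
    {"question": "What is 9 - 1?", "rationale": "print(9 - 1)", "final_answer": "8"},
]
prompt = build_few_shot_prompt(
    seeds,
    "Write one new question in the same format, with a rationale and final answer.",
)
```

The resulting string would be sent to the generator model, and its output parsed back into the same fields before being passed to a verifier.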
We're looking for:
- Seed datasets in verifiable domains
- New verifiers
- Cookbook improvements
- Experimental environments for RL
We greatly appreciate your interest in contributing to our open-source initiative. To ensure a smooth collaboration and the success of contributions, we adhere to a set of contributing guidelines similar to those established by CAMEL. For a comprehensive understanding of the steps involved in contributing to our project, please refer to the CAMEL Contributing Guidelines.
- Code: LICENSE
- Data: Per-dataset license in metadata.json
Project Loong is led by the CAMEL team, with contributors from across the open-source AI research community.
If you're keen on exploring new research opportunities or discoveries with our platform and wish to dive deeper or suggest new features, we're here to talk. Feel free to get in touch for more details at camel-ai@eigent.ai.
- Join us (Discord or WeChat) in pushing the boundaries of finding the scaling laws of agents.
- Join our WeChat group for further discussions!