A curated collection of 50+ open-source projects that use AI agents for machine learning research, training, and experimentation.
LLM-powered agents are changing how ML research and engineering get done, from autonomous scientific discovery to automated Kaggle competitions. This list tracks the best open-source projects in this fast-moving space.
AI Agents for ML Landscape
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│          Research         Training          Data Science        │
│        ┌──────────┐     ┌──────────┐        ┌──────────┐        │
│        │AI-Scien- │     │   AIDE   │        │  Deep-   │        │
│        │tist, auto│     │  AutoML  │        │ Analyze  │        │
│        │research  │     │  Agent   │        │ DS-Agent │        │
│        └────┬─────┘     └────┬─────┘        └────┬─────┘        │
│             │                │                   │              │
│             └────────────────┼───────────────────┘              │
│                              │                                  │
│                   ┌──────────┴──────────┐                       │
│                   │  Agent Frameworks   │                       │
│                   │  AutoGen / CrewAI   │                       │
│                   │   MetaGPT / DSPy    │                       │
│                   └──────────┬──────────┘                       │
│                              │                                  │
│             ┌────────────────┼────────────────┐                 │
│             │                │                │                 │
│        ┌────┴─────┐     ┌────┴─────┐    ┌─────┴────┐            │
│        │  MLOps   │     │Benchmarks│    │ RL Agent │            │
│        │  MLflow  │     │ MLE-bench│    │ Training │            │
│        │  ZenML   │     │ ML-Bench │    │ rllm,R1  │            │
│        └──────────┘     └──────────┘    └──────────┘            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
- Automated ML Research Agents — autonomous scientific discovery & paper generation
- ML Training & Engineering Agents — model training, AutoML, Kaggle automation
- Data Science Agents — data analysis, feature engineering, EDA
- Agent Frameworks for ML — general frameworks powering ML workflows
- RL for Training LLM Agents — reinforcement learning to train better agents
- ML Agent Benchmarks & Evaluation — evaluating agent ML capabilities
- MLOps & Platform Agents — deployment, monitoring, pipelines
- Research Assistants & Paper Agents — paper search, reading, literature review
- AI for Science — scientific discovery beyond ML
- Software Engineering Agents — code agents applicable to ML
- Key Trends (2024-2026)
Agents that autonomously conduct ML research — from generating hypotheses to running experiments and writing papers.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| AI-Scientist | 12.8k | First fully autonomous system for end-to-end scientific discovery. Published in Nature. | Idea generation -> experiment -> paper writing -> automated peer review |
| AI-Scientist-v2 | 2.7k | Enhanced version using agentic tree search. First AI-generated paper accepted at ICLR 2025 workshop. | Template-free, open-ended exploration across ML domains |
| autoresearch | 58k | Karpathy's 630-line tool for autonomous ML experiments. Ran ~700 experiments, found ~20 genuine improvements cutting GPT-2 training time by 11%. | Single GPU, markdown-defined research, git-tracked results |
| AI-Researcher | 5k | Autonomously identifies research gaps and executes full research pipeline. NeurIPS 2025 Spotlight. | Writer Agent for hierarchical paper generation, web GUI |
| AgentLaboratory | 5.4k | End-to-end autonomous research workflow with specialized LLM agents. Introduced AgentRxiv preprint server. | Literature review -> experimentation -> report writing |
| Auto-Research | 13 | Framework for fully automated research agents across the entire scientific lifecycle. | Dual-layer memory, Docker/SSH sandbox, session persistence |
What makes a good ML research agent?
The best ML research agents share these traits:
- End-to-end automation: from idea to validated result
- Tree search over hypotheses: exploring multiple directions, not just one linear path
- Self-evaluation: automated review/critique of generated results
- Reproducibility: git-tracked experiments with clear provenance
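To make the "tree search over hypotheses" and "self-evaluation" traits concrete, here is a minimal best-first search sketch. Everything in it is hypothetical: `Node`, `expand`, `evaluate`, and the toy integer "hypotheses" are illustrative stand-ins, not the API of any project listed here. In a real agent, `expand` would likely ask an LLM for variations of an idea and `evaluate` would run an experiment and score the result.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Node:
    # heapq is a min-heap, so the score is stored negated to pop the best node first.
    neg_score: float
    hypothesis: int = field(compare=False)
    depth: int = field(compare=False, default=0)


def best_first_search(
    root: int,
    expand: Callable[[int], List[int]],
    evaluate: Callable[[int], float],
    budget: int = 20,
    max_depth: int = 5,
) -> Node:
    """Explore a tree of hypotheses, always expanding the most promising one."""
    root_node = Node(-evaluate(root), root, 0)
    frontier = [root_node]
    best = root_node
    evaluations = 1
    while frontier and evaluations < budget:
        node = heapq.heappop(frontier)
        if node.depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for child in expand(node.hypothesis):
            if evaluations >= budget:
                break
            child_node = Node(-evaluate(child), child, node.depth + 1)
            evaluations += 1
            if child_node.neg_score < best.neg_score:
                best = child_node
            heapq.heappush(frontier, child_node)
    return best


# Toy problem (assumption): hypotheses are integers and the "experiment"
# rewards closeness to 7.
def toy_expand(h: int) -> List[int]:
    return [h - 1, h + 1, h * 2]


def toy_evaluate(h: int) -> float:
    return -abs(h - 7)


result = best_first_search(1, toy_expand, toy_evaluate, budget=30)
print(result.hypothesis, -result.neg_score)
```

The `budget` argument caps the number of evaluations, which matters when each evaluation is a full training run. A linear pipeline corresponds to `expand` returning a single child.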
Agents that automate model training, hyperparameter tuning, ML code generation, and experiment management.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| aideml (AIDE) | 1.2k | ML engineering agent using tree-structured search over solution space. SOTA on Kaggle/MLE-Bench. ICLR 2025. | Surpasses 50% of Kaggle participants across 60+ competitions |
| ML-Agent | 58 | First LLM agent trained via online RL for autonomous ML engineering. 7B model outperforms DeepSeek-R1 (671B). | RL-based training, cross-task generalization |
| automl-agent | 112 | Multi-agent LLM framework for full-pipeline AutoML. ICML 2025. | Data retrieval -> preprocessing -> NAS -> deployment |
| autogluon-assistant | 263 | Multi-agent system (MLZero) for end-to-end multimodal ML automation. NeurIPS 2025. | 6 gold medals on MLE-Bench Lite, works with 8B LLM |
| AutoKaggle | 287 | Multi-agent system with 5 specialized agents for automating Kaggle competitions. | Reader -> Planner -> Developer -> Reviewer -> Summarizer |
| FLAML | 4.3k | Microsoft's fast library for AutoML and tuning. Economical automation for ML workflows. | Low cost, MLflow integration, foundation model tuning |
| DATAGEN | 1.7k | AI-driven multi-agent assistant automating hypothesis generation, data analysis, and report writing. | LangChain + LangGraph, specialized agents, visualization |
Agents for data analysis, feature engineering, data preprocessing, and end-to-end data science workflows.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| ai-data-science-team | 5.1k | Library of specialized agents for data science workflows + AI Pipeline Studio. | Loading/cleaning/EDA/SQL/feature engineering agents |
| DeepAnalyze | 1.9k | First end-to-end agentic LLM (8B) for autonomous data science. Analyst-grade reports. | Full DS pipeline, multi-format support (CSV/Excel/JSON/XML) |
| DS-Agent | 231 | Automated data science via LLMs with case-based reasoning. ICML 2024. | Case-based reasoning for pipeline construction |
| DataMind | 73 | Scalable agent training for generalist data-analytic agents. 14B model outperforms GPT-5. ICLR/AAAI 2026. | DataMind-12K trajectories, open-source 7B/14B models |
| LAMBDA | — | Large Model Based Data Agent. Published in Journal of the American Statistical Association (2025). | Statistical analysis powered by LLM |
General-purpose agent frameworks widely used for building ML workflows and multi-agent ML systems.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| OpenHands | 70k | AI-driven development platform. 72% on SWE-Bench Verified. ICLR 2025. | Agent SDK v1.0, sandboxed execution, MCP integration |
| MetaGPT | 66k | Multi-agent framework with Data Interpreter achieving SOTA on ML tasks. | AFlow for automated workflow generation (ICLR 2025 oral) |
| autogen | 56k | Microsoft's framework for agentic AI with multi-agent conversations. | Code execution, tool use, no-code Studio, .NET support |
| crewAI | 47k | Framework for orchestrating role-playing autonomous AI agents. | Role-based design, A2A support, fast setup |
| dspy | 33k | Framework for programming — not prompting — language models. Stanford. | Automatic prompt optimization, composable modules |
| langgraph | 28k | Build resilient language agents as graphs with durable execution. | Stateful workflows, checkpointing, human-in-the-loop |
| camel | 16.5k | First multi-agent framework. Role-playing collaboration. NeurIPS 2023. | OWL multi-agent, OASIS million-agent simulation |
| AutoAgent | 8.7k | Fully-automated zero-code LLM agent framework. | Self-developing agent systems, auto orchestration |
Projects that use reinforcement learning to train better LLM agents for ML tasks and beyond.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| rllm | 5.3k | Democratizing RL for LLMs. Agents beat models 50x their size. | GRPO/REINFORCE/RLOO, multi-GPU + single-machine |
| RLinf | 2.9k | RL infrastructure for embodied and agentic AI. | PPO/GRPO/SAC, scalable to large GPU clusters |
| Agent-R1 | 1.3k | Training powerful LLM agents with end-to-end RL. | Multi-turn tool calling, process rewards per tool call |
| AgentGym-RL | 650 | Training LLM agents for long-horizon decision making via multi-turn RL. | Long-horizon task training, multi-turn RL |
| MARTI | 467 | Multi-agent reinforced training and inference. Tsinghua. | Tree search-augmented RL, multi-agent collaboration |
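Several of the projects above list GRPO among their supported algorithms. As a rough illustration of its core idea, the sketch below computes group-relative advantages: each rollout's reward is normalized against the other rollouts sampled for the same task. This is only the advantage step under simplifying assumptions (population standard deviation, no clipping or KL term), and `group_relative_advantages` is a hypothetical helper, not a function from rllm, RLinf, or any other listed library.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout in the group got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for the same task; reward 1.0 means the task succeeded.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Successful rollouts get a positive advantage and failed ones a negative advantage, so no separate value network is needed to provide a baseline.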
Benchmarks and tools for evaluating how well AI agents perform ML tasks.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| mle-bench | 1.4k | OpenAI's benchmark of 75 Kaggle competitions for ML engineering agents. | AIDE/MLAB/OpenHands scaffolds, pass@k evaluation |
| MLAgentBench | 335 | Stanford. 13 end-to-end ML experimentation tasks (CIFAR-10, BabyLM, etc.). | LangChain/AutoGPT agents, multi-LLM support |
| ML-Bench | 316 | Yale. Evaluating LLMs and agents on repository-level ML code. | Real-world ML codebase evaluation |
| mlrbench | 24 | 201 tasks from ICLR/ICML/NeurIPS workshops for open-ended ML research evaluation. | MLR-Agent scaffold, MLR-Judge automated review |
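The mle-bench row mentions pass@k evaluation. A common way to estimate pass@k from `n` attempts of which `c` succeeded is the combinatorial estimator popularized by code-generation benchmarks, sketched below. Whether a given benchmark computes the metric exactly this way is an assumption, and `pass_at_k` is an illustrative helper rather than part of any listed tool.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n attempts (c successful) succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With 3 successes out of 10 attempts, pass@1 is 3/10 and pass@5 is 1 - C(7,5)/C(10,5) = 11/12, which shows how quickly the metric rises with more attempts per task.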
Agents and platforms for ML deployment, monitoring, and pipeline management.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| mlflow | 25k | Open-source AI engineering platform. 30M+ monthly downloads. | Experiment tracking, agent tracing/evaluation/monitoring |
| opik | 18.5k | Debug, evaluate, and monitor LLM apps and agentic workflows. Agent Optimizer SDK. | Comprehensive tracing, automated evaluations, self-hostable |
| metaflow | 10k | Netflix's framework for data science and ML pipelines. Agentic support since 2025. | Recursive/conditional steps for agents, Kubernetes |
| zenml | 5.3k | "One AI Platform from Pipelines to Agents." Run on any infrastructure. | Infrastructure-agnostic, MLflow/W&B integration |
| weave | — | W&B toolkit for developing, evaluating, and monitoring AI apps and agents. | LLM-as-judge, execution metrics, GenAI observability |
Agents that help read, search, summarize, and manage ML research papers.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| gpt-researcher | 26k | Autonomous deep research agent producing factual reports with citations. | Parallelized work, MCP server, multi-LLM support |
| open_deep_research | 11k | LangChain's open-source deep research solution with Open Agent Platform UI. | Any LLM via init_chat_model, customizable MCP tools |
| pasa | 1.4k | ByteDance's paper search agent. Surpasses Google Scholar by 37.78% in recall@20. | Autonomous search/read/reference selection, RL-optimized |
| openpaper | 243 | Research library workbench with AI assistant for literature review. | Annotation, AI-powered paper understanding |
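The pasa row reports results in recall@20. For readers unfamiliar with the metric, here is a minimal sketch of recall@k for a ranked list of search results. The paper IDs and the `recall_at_k` helper are made up for illustration and do not come from the pasa codebase.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k of a ranked list."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical search results and ground-truth relevant papers.
ranked = ["p3", "p9", "p1", "p7", "p2"]
relevant = {"p1", "p2", "p4"}
print(recall_at_k(ranked, relevant, k=3))
print(recall_at_k(ranked, relevant, k=5))
```

Here one of the three relevant papers is in the top 3 and two are in the top 5, so recall rises from 1/3 to 2/3; "p4" is never retrieved, so recall cannot reach 1.0 at any k.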
Agents designed for scientific research and discovery powered by ML.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| virtual-lab | 652 | Stanford's virtual lab of LLM agents for science. Nature (2025) — SARS-CoV-2 nanobody design. | PI agent + specialist team, AlphaFold/Rosetta integration |
| chemcrow-public | 888 | LLM agent with 18 chemistry tools for synthesis and drug discovery. Nature Machine Intelligence (2024). | RDKit/PubChem tools, autonomous synthesis planning |
| SciToolAgent | 399 | Agent framework integrating scientific tools via knowledge graph. Nature Computational Science (2025). | Planner/Executor/Summarizer, SciToolKG |
Agents originally built for software engineering that are increasingly used on ML codebases and research.
| Project | Stars | Description | Highlights |
|---|---|---|---|
| SWE-agent | 19k | Takes a GitHub issue and automatically fixes it. NeurIPS 2024. | Custom agent-computer interface, multi-LLM |
| mini-swe-agent | 3.5k | 100-line AI agent scoring >74% on SWE-bench Verified. | Minimal, hackable, high-performance |
| SWE-Gym | 651 | First environment for training real-world SWE agents. ICML 2025. | Training data generation for SWE agents |
| SWE-smith | 606 | Toolkit for scaling training data for SWE-agents. NeurIPS 2025 Spotlight. | Automated training data generation |
| Archon | 190 | Stanford. Architecture search for inference-time techniques. Outperforms GPT-4o by 11-15%. | Generators/fusers/critics/rankers/verifiers |

Key Trends (2024-2026)

| Trend | Signal | Representative Projects |
|---|---|---|
| Autonomous Research is Real | AI-generated papers pass peer review; agents find genuine ML improvements | AI-Scientist, autoresearch |
| Tree Search > Linear Pipelines | Tree-structured exploration outperforms sequential approaches | AI-Scientist-v2, AIDE |
| RL-Trained Agents Scale Down | Small RL-trained agents outperform 50-100x larger models | ML-Agent (7B > 671B), rllm |
| Multi-Agent = ML Teams | Specialized agent roles mirror real research team dynamics | MetaGPT, AutoGen, AutoML-Agent |
| Benchmarks Maturing | Standardized evaluation from Kaggle to open-ended research | MLE-bench, MLR-Bench |
| Code Agents + ML Converge | SWE agents increasingly applied to ML research & debugging | OpenHands, SWE-agent |
If you find this collection useful, please consider giving it a star!
Contributions are welcome! Please read the contributing guidelines before submitting a PR.
To add a project:
- Ensure it is open-source and related to AI agents for ML
- Add it to the appropriate category in README.md
- Include: project link, stars, description, and highlights
- Submit a pull request