    Repositories list

    • Python
      0100 Updated Apr 10, 2026
    • Shell
      0000 Updated Apr 10, 2026
    • Python
      1000 Updated Apr 10, 2026
    • VLMEvalKit

      Public
      Open-source evaluation toolkit of large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
      Python
      Apache License 2.0
      6774k20327 Updated Apr 10, 2026
    • CNFinBench — the first comprehensive benchmark for high-stakes financial scenarios. It spans 29 subtasks grounded in authoritative financial corpora and real bu…
      Python
      0100 Updated Apr 10, 2026
    • OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…
      Python
      Apache License 2.0
      7556.8k37371 Updated Apr 9, 2026
    • The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.
      Jupyter Notebook
      MIT License
      44100 Updated Apr 5, 2026
    • Python
      0000 Updated Apr 3, 2026
    • TextEdit

      Public
      We provide TextEdit, a high-quality, multi-scenario text editing benchmark for generation models.
      Python
      MIT License
      01900 Updated Mar 16, 2026
    • GTA

      Public
      [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
      Python
      Apache License 2.0
      913800 Updated Feb 16, 2026
    • MiroFlow

      Public
      MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench.
      Python
      Apache License 2.0
      303000 Updated Dec 30, 2025
    • RePro

      Public
      [ICLR 2026] Rectifying LLM Thought From Lens of Optimization
      Python
      MIT License
      41510 Updated Dec 5, 2025
    • SAGA

      Public
      The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."
      01110 Updated Nov 27, 2025
    • ATLAS

      Public
      ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
      1600 Updated Nov 20, 2025
    • OASIS

      Public
      Python
      0300 Updated Nov 12, 2025
    • JavaScript
      Apache License 2.0
      0800 Updated Oct 31, 2025
    • Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414
      Python
      Other
      50000 Updated Oct 27, 2025
    • Jupyter Notebook
      711650 Updated Oct 7, 2025
    • .github

      Public
      1000 Updated Sep 9, 2025
    • Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent with a hierarchical mann…
      Python
      610350 Updated Sep 8, 2025
    • ReasonZoo

      Public
      Python
      Apache License 2.0
      0300 Updated Aug 27, 2025
    • [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
      Jupyter Notebook
      26800 Updated Aug 10, 2025
    • GPassK

      Public
      [ACL 2025] Are Your LLMs Capable of Stable Reasoning?
      Python
      23220 Updated Aug 5, 2025
    • Assessing Context-Aware Creative Intelligence in MLLMs
      JavaScript
      02310 Updated Jul 22, 2025
    • The all-in-one judge models introduced by OpenCompass
      Apache License 2.0
      611910 Updated Jul 15, 2025
    • RaML

      Public
      [Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
      Jupyter Notebook
      2800 Updated May 27, 2025
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      Apache License 2.0
      716220 Updated May 22, 2025
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      35600 Updated May 22, 2025
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      Apache License 2.0
      111350 Updated May 22, 2025
    • MMBench

      Public
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      Apache License 2.0
      17296140 Updated May 22, 2025