    Repositories list

    • SWE-bench-server

      Public
      Python
      0000 Updated Feb 4, 2026
    • VLMEvalKit

      Public
      Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
      Python
      6233.8k20246 Updated Feb 3, 2026
    • opencompass

      Public
      OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…
      Python
      7356.6k36565 Updated Jan 22, 2026
    • MiroFlow

      Public
      MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench.
      Python
      249000 Updated Dec 30, 2025
    • RePro

      Public
      [ICLR 2026] Rectifying LLM Thought From Lens of Optimization
      Python
      41310 Updated Dec 5, 2025
    • SAGA

      Public
      The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."
      01000 Updated Nov 27, 2025
    • ATLAS

      Public
      ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
      1600 Updated Nov 20, 2025
    • OASIS

      Public
      Python
      0300 Updated Nov 12, 2025
    • InteractScience

      Public
      JavaScript
      0800 Updated Oct 31, 2025
    • CognitiveKernel-Pro

      Public
      Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414
      Python
      47000 Updated Oct 27, 2025
    • GAOKAO-Eval

      Public
      Jupyter Notebook
      711350 Updated Oct 7, 2025
    • .github

      Public
      1000 Updated Sep 9, 2025
    • MMBench-GUI

      Public
      Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent with a hierarchical mann…
      Python
      69850 Updated Sep 8, 2025
    • ReasonZoo

      Public
      Python
      0300 Updated Aug 27, 2025
    • CompassVerifier

      Public
      [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
      Jupyter Notebook
      26300 Updated Aug 10, 2025
    • GPassK

      Public
      [ACL 2025] Are Your LLMs Capable of Stable Reasoning?
      Python
      23220 Updated Aug 5, 2025
    • Creation-MMBench

      Public
      Assessing Context-Aware Creative Intelligence in MLLMs
      JavaScript
      02310 Updated Jul 22, 2025
    • CompassJudger

      Public
      The All-in-one Judge Models introduced by Opencompass
      611710 Updated Jul 15, 2025
    • RaML

      Public
      [Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
      Jupyter Notebook
      2600 Updated May 27, 2025
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      716120 Updated May 22, 2025
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      35600 Updated May 22, 2025
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      111050 Updated May 22, 2025
    • MMBench

      Public
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      15285120 Updated May 22, 2025
    • ProSA

      Public
      [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
      Python
      22900 Updated May 22, 2025
    • ANAH

      Public
      [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
      Python
      46210 Updated Apr 30, 2025
    • GTA

      Public
      [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
      Python
      913310 Updated Mar 28, 2025
    • 0000 Updated Feb 12, 2025
    • CriticEval

      Public
      [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
      Python
      24920 Updated Nov 29, 2024
    • lagent-cibench

      Public
      Python
      1200 Updated Sep 23, 2024
    • hinode

      Public
      A clean documentation and blog theme for your Hugo site based on Bootstrap 5
      HTML
      65000 Updated Sep 1, 2024