    Repositories list

    • OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
      Python
      Apache License 2.0
      5074.8k26234 Updated Feb 25, 2025
    • Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
      Python
      Apache License 2.0
      2771.9k628 Updated Feb 25, 2025
    • GPassK

      Public
      Official Repository of "Are Your LLMs Capable of Stable Reasoning?"
      Python
      12020 Updated Feb 25, 2025
    • 58400 Updated Feb 25, 2025
    • GTA

      Public
      [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
      Python
      Apache License 2.0
      67500 Updated Feb 13, 2025
    • 0000 Updated Feb 12, 2025
    • Jupyter Notebook
      510150 Updated Dec 16, 2024
    • ANAH

      Public
      [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2
      Python
      Apache License 2.0
      33000 Updated Dec 11, 2024
    • [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
      Python
      Apache License 2.0
      23900 Updated Nov 29, 2024
    • ProSA

      Public
      [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
      Python
      Apache License 2.0
      22400 Updated Oct 22, 2024
    • Python
      Apache License 2.0
      1200 Updated Sep 23, 2024
    • MMBench

      Public
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      Apache License 2.0
      1018440 Updated Sep 1, 2024
    • hinode

      Public
      A clean documentation and blog theme for your Hugo site based on Bootstrap 5
      HTML
      MIT License
      60000 Updated Sep 1, 2024
    • storage

      Public
      Apache License 2.0
      0000 Updated Aug 18, 2024
    • Demo data of CompassBench
      3720 Updated Aug 7, 2024
    • CIBench

      Public
      Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
      Python
      Apache License 2.0
      21000 Updated Jul 19, 2024
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      Apache License 2.0
      19660 Updated Jul 12, 2024
    • .github

      Public
      1000 Updated May 31, 2024
    • DevEval

      Public
      A Comprehensive Benchmark for Software Development.
      Python
      Apache License 2.0
      69400 Updated May 30, 2024
    • CodeBench

      Public
      0200 Updated May 21, 2024
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      25300 Updated Apr 22, 2024
    • T-Eval

      Public
      [ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
      Python
      Apache License 2.0
      15261362 Updated Apr 3, 2024
    • Code for the paper "Evaluating Large Language Models Trained on Code"
      Python
      MIT License
      370300 Updated Mar 14, 2024
    • Apache License 2.0
      24830 Updated Mar 8, 2024
    • A multi-language code evaluation tool.
      Python
      Apache License 2.0
      82101 Updated Jan 26, 2024
    • evalplus

      Public
      EvalPlus for rigorous evaluation of LLM-synthesized code
      Python
      Apache License 2.0
      129100 Updated Dec 20, 2023
    • A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
      Python
      Apache License 2.0
      80767120 Updated Dec 15, 2023
    • LawBench

      Public
      Benchmarking Legal Knowledge of Large Language Models
      Python
      Apache License 2.0
      4929530 Updated Nov 13, 2023
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      Apache License 2.0
      614510 Updated Nov 2, 2023
    • Sphinx Theme for OpenCompass - Modified from PyTorch
      CSS
      MIT License
      139000 Updated Aug 30, 2023