OpenCompass

30 repositories

opencompass
Public
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
benchmark evaluation openai llm chatgpt large-language-model llama2 llama3
Python • Apache License 2.0 • 507 • 4.8k • 262 • 34 • Updated Feb 25, 2025
VLMEvalKit
Public
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip
Python • Apache License 2.0 • 277 • 1.9k • 62 • 8 • Updated Feb 25, 2025
GPassK
Public
Official Repository of "Are Your LLMs Capable of Stable Reasoning?"
Python • 1 • 20 • 2 • 0 • Updated Feb 25, 2025
CompassJudger
Public
5 • 84 • 0 • 0 • Updated Feb 25, 2025
GTA
Public
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
llm-agent llm-evaluation
Python • Apache License 2.0 • 6 • 75 • 0 • 0 • Updated Feb 13, 2025
oc_doc_website
Public
0 • 0 • 0 • 0 • Updated Feb 12, 2025
GAOKAO-Eval
Public
Jupyter Notebook • 5 • 101 • 5 • 0 • Updated Dec 16, 2024
ANAH
Public
[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2
acl gpt neurips llms hallucination-detection
Python • Apache License 2.0 • 3 • 30 • 0 • 0 • Updated Dec 11, 2024
CriticEval
Public
[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
Python • Apache License 2.0 • 2 • 39 • 0 • 0 • Updated Nov 29, 2024
ProSA
Public
[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Python • Apache License 2.0 • 2 • 24 • 0 • 0 • Updated Oct 22, 2024
lagent-cibench
Public
Python • Apache License 2.0 • 1 • 2 • 0 • 0 • Updated Sep 23, 2024
MMBench
Public
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
Apache License 2.0 • 10 • 184 • 4 • 0 • Updated Sep 1, 2024
hinode
Public
A clean documentation and blog theme for your Hugo site based on Bootstrap 5
HTML • MIT License • 60 • 0 • 0 • 0 • Updated Sep 1, 2024
storage
Public
Apache License 2.0 • 0 • 0 • 0 • 0 • Updated Aug 18, 2024
CompassBench
Public
Demo data of CompassBench
3 • 7 • 2 • 0 • Updated Aug 7, 2024
CIBench
Public
Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
Python • Apache License 2.0 • 2 • 10 • 0 • 0 • Updated Jul 19, 2024
MathBench
Public
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
Apache License 2.0 • 1 • 96 • 6 • 0 • Updated Jul 12, 2024
.github
Public
1 • 0 • 0 • 0 • Updated May 31, 2024
DevEval
Public
A Comprehensive Benchmark for Software Development.
Python • Apache License 2.0 • 6 • 94 • 0 • 0 • Updated May 30, 2024
CodeBench
Public
0 • 2 • 0 • 0 • Updated May 21, 2024
Ada-LEval
Public
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
gpt4 llm long-context
Python • 2 • 53 • 0 • 0 • Updated Apr 22, 2024
T-Eval
Public
[ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
Python • Apache License 2.0 • 15 • 261 • 36 • 2 • Updated Apr 3, 2024
human-eval
Public
Code for the paper "Evaluating Large Language Models Trained on Code"
Python • MIT License • 370 • 3 • 0 • 0 • Updated Mar 14, 2024
OpenFinData
Public
Apache License 2.0 • 2 • 48 • 3 • 0 • Updated Mar 8, 2024
code-evaluator
Public
A multi-language code evaluation tool.
Python • Apache License 2.0 • 8 • 21 • 0 • 1 • Updated Jan 26, 2024
evalplus
Public
EvalPlus for rigorous evaluation of LLM-synthesized code
Python • Apache License 2.0 • 129 • 1 • 0 • 0 • Updated Dec 20, 2023
MixtralKit
Public
A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
moe mistral llm
Python • Apache License 2.0 • 80 • 767 • 12 • 0 • Updated Dec 15, 2023
LawBench
Public
Benchmarking Legal Knowledge of Large Language Models
law benchmark llm chatgpt
Python • Apache License 2.0 • 49 • 295 • 3 • 0 • Updated Nov 13, 2023
BotChat
Public
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
Jupyter Notebook • Apache License 2.0 • 6 • 145 • 1 • 0 • Updated Nov 2, 2023
pytorch_sphinx_theme
Public
Sphinx Theme for OpenCompass - Modified from PyTorch
CSS • MIT License • 139 • 0 • 0 • 0 • Updated Aug 30, 2023