OpenCompass

26 repositories

VLMEvalKit
Public
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip
Python
•
Apache License 2.0
•157•1.1k•30•8•Updated Oct 3, 2024
ProSA
Public
Apache License 2.0
•0•0•0•0•Updated Oct 2, 2024
opencompass
Public
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) on 100+ datasets.
benchmark evaluation openai llm chatgpt large-language-model llama2 llama3
Python
•
Apache License 2.0
•406•3.8k•188•27•Updated Oct 2, 2024
MMBench
Public
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
Apache License 2.0
•10•152•1•0•Updated Sep 1, 2024
hinode
Public
A clean documentation and blog theme for your Hugo site based on Bootstrap 5
HTML
•
MIT License
•52•0•0•0•Updated Sep 1, 2024
storage
Public
Apache License 2.0
•0•0•0•0•Updated Aug 18, 2024
CompassBench
Public
Demo data of CompassBench
2•2•2•0•Updated Aug 7, 2024
CIBench
Public
Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
Python
•
Apache License 2.0
•1•6•1•0•Updated Jul 19, 2024
GAOKAO-Eval
Public
Jupyter Notebook
•6•89•2•0•Updated Jul 17, 2024
ANAH
Public
[ACL 2024] ANAH: Analytical Annotation of Hallucinations in Large Language Models
acl gpt llms hallucination-detection
Python
•
Apache License 2.0
•1•19•0•0•Updated Jul 12, 2024
MathBench
Public
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
Apache License 2.0
•1•79•5•0•Updated Jul 12, 2024
GTA
Public
Official repository for paper "GTA: A Benchmark for General Tool Agents" (NeurIPS 2024 D&B Track)
Python
•
Apache License 2.0
•3•33•0•0•Updated Jul 12, 2024
.github
Public
1•0•0•0•Updated May 31, 2024
DevBench
Public
A Comprehensive Benchmark for Software Development.
Python
•
Apache License 2.0
•5•84•1•0•Updated May 30, 2024
CodeBench
Public
0•2•0•0•Updated May 21, 2024
Ada-LEval
Public
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
gpt4 llm long-context
Python
•2•49•0•0•Updated Apr 22, 2024
T-Eval
Public
[ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
Python
•
Apache License 2.0
•13•215•32•2•Updated Apr 3, 2024
human-eval
Public
Code for the paper "Evaluating Large Language Models Trained on Code"
Python
•
MIT License
•334•2•0•0•Updated Mar 14, 2024
OpenFinData
Public
Apache License 2.0
•2•33•3•0•Updated Mar 8, 2024
CriticBench
Public
[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
Python
•
Apache License 2.0
•1•26•0•0•Updated Feb 24, 2024
code-evaluator
Public
A multi-language code evaluation tool.
Python
•
Apache License 2.0
•6•17•0•1•Updated Jan 26, 2024
evalplus
Public
EvalPlus for rigorous evaluation of LLM-synthesized code
Python
•
Apache License 2.0
•102•1•0•0•Updated Dec 20, 2023
MixtralKit
Public
A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
moe mistral llm
Python
•
Apache License 2.0
•81•762•12•0•Updated Dec 15, 2023
LawBench
Public
Benchmarking Legal Knowledge of Large Language Models
law benchmark llm chatgpt
Python
•
Apache License 2.0
•36•238•5•0•Updated Nov 13, 2023
BotChat
Public
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
Jupyter Notebook
•
Apache License 2.0
•6•136•1•0•Updated Nov 2, 2023
pytorch_sphinx_theme
Public
Sphinx Theme for OpenCompass - Modified from PyTorch
CSS
•
MIT License
•138•0•0•0•Updated Aug 30, 2023