    Repositories list

    • Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
      Python
      Apache License 2.0
      1571.1k308 Updated Oct 3, 2024
    • ProSA

      Public
      Apache License 2.0
      0000 Updated Oct 2, 2024
    • OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
      Python
      Apache License 2.0
      4063.8k18827 Updated Oct 2, 2024
    • MMBench

      Public
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      Apache License 2.0
      1015210 Updated Sep 1, 2024
    • hinode

      Public
      A clean documentation and blog theme for your Hugo site based on Bootstrap 5
      HTML
      MIT License
      52000 Updated Sep 1, 2024
    • storage

      Public
      Apache License 2.0
      0000 Updated Aug 18, 2024
    • Demo data of CompassBench
      2220 Updated Aug 7, 2024
    • CIBench

      Public
      Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
      Python
      Apache License 2.0
      1610 Updated Jul 19, 2024
    • Jupyter Notebook
      68920 Updated Jul 17, 2024
    • ANAH

      Public
      [ACL 2024] ANAH: Analytical Annotation of Hallucinations in Large Language Models
      Python
      Apache License 2.0
      11900 Updated Jul 12, 2024
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      Apache License 2.0
      17950 Updated Jul 12, 2024
    • GTA

      Public
      Official repository for paper "GTA: A Benchmark for General Tool Agents" (NeurIPS 2024 D&B Track)
      Python
      Apache License 2.0
      33300 Updated Jul 12, 2024
    • .github

      Public
      1000 Updated May 31, 2024
    • DevBench

      Public
      A Comprehensive Benchmark for Software Development.
      Python
      Apache License 2.0
      58410 Updated May 30, 2024
    • CodeBench

      Public
      0200 Updated May 21, 2024
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      24900 Updated Apr 22, 2024
    • T-Eval

      Public
      [ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
      Python
      Apache License 2.0
      13215322 Updated Apr 3, 2024
    • Code for the paper "Evaluating Large Language Models Trained on Code"
      Python
      MIT License
      334200 Updated Mar 14, 2024
    • Apache License 2.0
      23330 Updated Mar 8, 2024
    • [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
      Python
      Apache License 2.0
      12600 Updated Feb 24, 2024
    • A multi-language code evaluation tool.
      Python
      Apache License 2.0
      61701 Updated Jan 26, 2024
    • evalplus

      Public
      EvalPlus for rigorous evaluation of LLM-synthesized code
      Python
      Apache License 2.0
      102100 Updated Dec 20, 2023
    • A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
      Python
      Apache License 2.0
      81762120 Updated Dec 15, 2023
    • LawBench

      Public
      Benchmarking Legal Knowledge of Large Language Models
      Python
      Apache License 2.0
      3623850 Updated Nov 13, 2023
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      Apache License 2.0
      613610 Updated Nov 2, 2023
    • Sphinx Theme for OpenCompass - Modified from PyTorch
      CSS
      MIT License
      138000 Updated Aug 30, 2023