CriticBench

Modality: Texts · License: MIT · Introduced: 2024-02-22

CriticBench is a comprehensive benchmark designed to assess the ability of Large Language Models (LLMs) to critique and correct their reasoning across a variety of tasks. It spans five reasoning domains:

  1. Mathematical
  2. Commonsense
  3. Symbolic
  4. Coding
  5. Algorithmic

CriticBench compiles 15 datasets and incorporates responses from three LLM families. Using CriticBench, the authors evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning (collectively, GQC reasoning). Notable findings include:

  1. A linear relationship in GQC capabilities, with critique-focused training significantly enhancing performance.
  2. Task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction.
  3. GQC knowledge inconsistencies that decrease as model size increases.
  4. An intriguing inter-model critiquing dynamic, where stronger models excel at critiquing weaker ones, while weaker models surprisingly surpass stronger ones in self-critique.
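The generation-critique-correction pipeline described above can be sketched as a simple evaluation loop. This is a minimal illustrative sketch, not CriticBench's actual harness: the model interface (a dict of `generate`/`critique`/`correct` callables), the exact-match scoring, and the toy stub model are all assumptions made for demonstration.

```python
# Hypothetical sketch of a GQC (generation-critique-correction) evaluation
# loop in the spirit of CriticBench. The model interface and scoring rules
# below are illustrative assumptions, not the benchmark's real API.

def evaluate_gqc(model, dataset):
    """Score a model separately on generation, critique, and correction."""
    scores = {"generation": 0, "critique": 0, "correction": 0}
    for item in dataset:
        q, gold = item["question"], item["answer"]

        # Stage 1: generation — produce an answer and check exact match.
        ans = model["generate"](q)
        scores["generation"] += int(ans == gold)

        # Stage 2: critique — judge whether the answer is correct;
        # the critique is scored on agreeing with the ground truth.
        crit = model["critique"](q, ans)
        scores["critique"] += int(crit["is_correct"] == (ans == gold))

        # Stage 3: correction — revise the answer given the critique.
        fixed = model["correct"](q, ans, crit)
        scores["correction"] += int(fixed == gold)

    n = len(dataset)
    return {k: v / n for k, v in scores.items()}


# Toy stub model for illustration only: it always answers "4", believes
# any "4" is correct, and replaces answers it flags as wrong with "4".
toy_model = {
    "generate": lambda q: "4",
    "critique": lambda q, a: {"is_correct": a == "4"},
    "correct": lambda q, a, c: a if c["is_correct"] else "4",
}

dataset = [
    {"question": "2+2", "answer": "4"},
    {"question": "1+2", "answer": "3"},
]
print(evaluate_gqc(toy_model, dataset))
```

Scoring each stage against the ground truth independently, as above, is what lets one observe stage-specific effects such as finding (2): correction accuracy can differ from generation accuracy on the same items.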

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning. https://arxiv.org/abs/2402.14809.