NPHardEval
NPHardEval is a dynamic benchmark designed to assess the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of algorithmic questions. Here are the key details:
Benchmark Purpose:
- Complex Reasoning Ability: The capacity for complex reasoning is one of the most important features of current LLMs, playing an integral role in real-world decision-making tasks.
- Inadequacy of Existing Benchmarks: While several benchmarks exist to evaluate LLMs' reasoning abilities, they fall short in providing a rigorous assessment of the full extent of these abilities. Additionally, publicly accessible and static benchmarks risk overfitting, allowing models to tailor their responses to specific metrics.
- Introducing NPHardEval: To address these limitations, NPHardEval was introduced. It rigorously evaluates LLMs' reasoning abilities with questions extending up to the NP-Hard complexity class.
Key Features of NPHardEval:
- 900 Algorithmic Questions: NPHardEval comprises 900 algorithmic questions, carefully chosen to span a wide range of complexity classes up to NP-Hard. These questions serve as a rigorous measure of LLMs' reasoning abilities.
- Dynamic Update Mechanism: Unlike static benchmarks, NPHardEval refreshes its datapoints monthly. Regular updates mitigate the risk of overfitting, ensuring a more accurate and reliable assessment of LLMs' reasoning capabilities.
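To make the dynamic-update idea concrete, here is a minimal sketch of how a benchmark might regenerate its question set each month from a month-derived seed, so models cannot overfit to a fixed, public set of instances. This is a hypothetical illustration, not NPHardEval's actual code: the function names (`monthly_seed`, `generate_knapsack_instance`, `monthly_batch`) and the single knapsack task are assumptions chosen for brevity; the real benchmark covers many tasks across complexity classes.

```python
# Hypothetical sketch of a monthly dynamic-update mechanism
# (illustrative only -- not the NPHardEval implementation).
import random
from datetime import date

def monthly_seed(today: date) -> int:
    # One seed per calendar month -> one fresh, reproducible batch per month.
    return today.year * 100 + today.month

def generate_knapsack_instance(rng: random.Random, n_items: int = 8) -> dict:
    # A small knapsack question (an NP-hard optimization task) as an example.
    weights = [rng.randint(1, 20) for _ in range(n_items)]
    values = [rng.randint(1, 50) for _ in range(n_items)]
    capacity = sum(weights) // 2
    return {"weights": weights, "values": values, "capacity": capacity}

def monthly_batch(today: date, n_instances: int = 5) -> list:
    rng = random.Random(monthly_seed(today))
    return [generate_knapsack_instance(rng) for _ in range(n_instances)]

# Two calls within the same month reproduce the same batch;
# a new month seeds a fresh set of questions.
jan = monthly_batch(date(2024, 1, 15))
jan_again = monthly_batch(date(2024, 1, 31))
feb = monthly_batch(date(2024, 2, 1))
```

Seeding by month keeps each monthly release reproducible for fair model comparison, while still rotating the questions often enough that responses tuned to a past batch lose their advantage.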
Research Contribution:
- Objective Perspective: NPHardEval sheds light on the current state of reasoning in LLMs by comparing their performance across complexity classes.
- Available Resources: The benchmark dataset and code for NPHardEval are publicly accessible via the paper (1) and the GitHub repository (2).
In summary, NPHardEval provides a comprehensive evaluation framework for assessing LLMs' reasoning abilities through the lens of computational complexity classes. 🌟
(1) NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes. https://arxiv.org/abs/2312.14890
(2) NPHardEval/README.md at main · casmlab/NPHardEval · GitHub. https://github.com/casmlab/NPHardEval/blob/main/README.md
(3) NPHardEval: Benchmarking Reasoning Ability of Large Language Models via Complexity Classes. https://frankling2020.github.io/publication/nphardeval/