ACCORD CSQA 0-5
ACCORD CSQA is an extension of the popular CommonsenseQA (CSQA) dataset built with ACCORD, a scalable framework for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals.

ACCORD closes the measurability gap between commonsense and formal reasoning tasks for LLMs. Our detailed understanding of LLMs' commonsense reasoning abilities lags far behind our understanding of their formal reasoning abilities, because commonsense benchmarks are difficult to construct in a rigorously quantifiable manner. Specifically, prior commonsense reasoning benchmarks and datasets are limited to one or two reasoning hops, or include an unknown (i.e., non-measurable) number of reasoning hops and/or distractors. Arbitrary scalability via compositional construction, typical of formal reasoning tasks, is likewise lacking in commonsense reasoning. Finally, most prior commonsense benchmarks either are limited to a single reasoning skill or do not control which skills are exercised.

ACCORD addresses these gaps by introducing formal elements into commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical one or two hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, so it scales with future LLM improvements.

ACCORD CSQA is a benchmark suite comprising problems at six levels of reasoning difficulty, ACCORD CSQA 0 through ACCORD CSQA 5. Experiments on state-of-the-art LLMs show performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement.
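As a minimal sketch of how the suite's per-level structure might be consumed, the snippet below loads each difficulty level separately and reports its size. It assumes the suite is published as a Hugging Face dataset with one configuration per level; the repository id (`your-org/accord-csqa`), configuration names, and split name are placeholders, not confirmed identifiers from this card.

```python
# Minimal sketch: iterate over the six ACCORD CSQA difficulty levels.
# The repository id, configuration names, and split name below are
# ASSUMPTIONS for illustration; substitute the actual identifiers.
from datasets import load_dataset

for level in range(6):  # ACCORD CSQA 0 (baseline) through ACCORD CSQA 5
    ds = load_dataset("your-org/accord-csqa", f"accord_csqa_{level}", split="test")
    print(f"ACCORD CSQA {level}: {len(ds)} problems")
```

Evaluating each level separately is the point of the suite: because reasoning complexity is explicitly controlled per level, model accuracy can be plotted against hop count to quantify how quickly performance degrades toward random chance.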