
ArcBench is a dataset of 158 logically challenging English question–answer pairs derived from the RoR-Bench benchmark. It targets deductive and multi-step reasoning in LLMs and multi-agent systems. The dataset was curated by translating the original riddles into accessible English, removing multi-modal complexities, and validating each pair through automated reasoning workflows (e.g., Nexus Architect). ArcBench supports evaluating reasoning performance, refining workflows with feedback loops, comparing language models, and prompt-engineering research.
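As a rough illustration of the evaluation use case, the sketch below scores a model on question–answer pairs with normalized exact match. It is not an official ArcBench harness: the field names `question` and `answer`, the helper names, and the toy items are all assumptions made for illustration, and exact match is a crude metric for free-form riddle answers.

```python
import re
from typing import Callable, Dict, Iterable


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match_accuracy(
    pairs: Iterable[Dict[str, str]],
    answer_fn: Callable[[str], str],
) -> float:
    """Fraction of pairs where the model's normalized answer equals the reference.

    `pairs` is assumed to hold dicts with hypothetical keys "question" and
    "answer"; the real dataset schema may differ.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(
        normalize(answer_fn(p["question"])) == normalize(p["answer"])
        for p in pairs
    )
    return correct / len(pairs)


# Toy items invented for this example; they are NOT taken from ArcBench.
toy_pairs = [
    {"question": "All A are B. All B are C. Are all A also C?", "answer": "Yes"},
    {"question": "Tom is taller than Ann. Ann is taller than Bo. Who is shortest?", "answer": "Bo"},
]


def toy_model(question: str) -> str:
    """Stand-in for an LLM call; always answers 'Yes.'"""
    return "Yes."


accuracy = exact_match_accuracy(toy_pairs, toy_model)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Replacing `toy_model` with a function that calls a language model or a multi-agent workflow would allow comparing systems on the same pairs. A more lenient scorer would likely be needed where several phrasings of an answer are acceptable.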