Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang

2025-06-05 · Logical Reasoning


Abstract

Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.

Related Papers

FEVO: Financial Knowledge Expansion and Reasoning Evolution for Large Language Models (2025-07-08)
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning (2025-06-27)
Discrete JEPA: Learning Discrete Token Representations without Reconstruction (2025-06-17)
SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models (2025-06-15)
CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making (2025-06-15)
TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving (2025-06-12)
Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation (2025-06-12)
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games (2025-06-11)