Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Large Language Models Can Self-Improve

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han

2022-10-20 | Question Answering, Natural Language Inference, Common Sense Reasoning, GSM8K, Arithmetic Reasoning

Abstract

Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4% -> 82.1% on GSM8K, 78.2% -> 83.0% on DROP, 90.0% -> 94.4% on OpenBookQA, and 63.4% -> 67.9% on ANLI-A3) and achieves state-of-the-art-level performance without any ground-truth labels. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
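
The selection step described in the abstract (sample several chain-of-thought answers per unlabeled question, then keep the "high-confidence" ones by self-consistency) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function name `select_high_confidence`, the `(rationale, answer)` input format, and the `min_agreement` threshold are all assumptions made for the example, and the sampling and fine-tuning steps are left out entirely.

```python
from collections import Counter
from typing import List, Optional, Tuple


def select_high_confidence(
    sampled: List[Tuple[str, str]],
    min_agreement: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> Optional[Tuple[str, List[str], float]]:
    """Majority-vote over sampled (rationale, final_answer) pairs for one question.

    Returns (majority_answer, rationales_that_reached_it, agreement_fraction),
    or None when the samples are empty or agree too weakly.
    """
    if not sampled:
        return None
    # Count how often each final answer appears across the sampled paths.
    counts = Counter(answer for _, answer in sampled)
    top_answer, top_count = counts.most_common(1)[0]
    # Use the share of paths agreeing with the majority as a confidence proxy.
    confidence = top_count / len(sampled)
    if confidence < min_agreement:
        return None
    # Keep only the rationales that lead to the majority answer; these would
    # serve as self-generated fine-tuning targets.
    kept = [rationale for rationale, answer in sampled if answer == top_answer]
    return top_answer, kept, confidence


# Toy example with three hand-written "sampled" reasoning paths.
samples = [
    ("3 + 4 = 7", "7"),
    ("4 + 3 = 7", "7"),
    ("3 * 4 = 12", "12"),
]
result = select_high_confidence(samples)
```

Here `result` holds the majority answer `"7"`, the two rationales that reached it, and an agreement of 2/3. Raising `min_agreement` above that fraction makes the function discard the question instead.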

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | DROP | Accuracy | 83 | PaLM 540B (Self Improvement, Self Consistency) |
| Question Answering | DROP | Accuracy | 78.2 | PaLM 540B (Self Consistency) |
| Question Answering | DROP | Accuracy | 76.2 | PaLM 540B (Self Improvement, CoT Prompting) |
| Question Answering | DROP | Accuracy | 71.7 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Question Answering | DROP | Accuracy | 70.6 | PaLM 540B (CoT Prompting) |
| Question Answering | DROP | Accuracy | 60 | PaLM 540B (Standard-Prompting) |
| Question Answering | OpenBookQA | Accuracy | 94.4 | PaLM 540B (Self Improvement, Self Consistency) |
| Question Answering | OpenBookQA | Accuracy | 93 | PaLM 540B (Self Improvement, CoT Prompting) |
| Question Answering | OpenBookQA | Accuracy | 92 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Question Answering | OpenBookQA | Accuracy | 90 | PaLM 540B (Self Consistency) |
| Question Answering | OpenBookQA | Accuracy | 86.4 | PaLM 540B (CoT Prompting) |
| Question Answering | OpenBookQA | Accuracy | 84.4 | PaLM 540B (Standard-Prompting) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 89.8 | PaLM 540B (Self Improvement, Self Consistency) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 88.7 | PaLM 540B (Self Consistency) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 88.3 | PaLM 540B (Self Improvement, CoT Prompting) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 87.2 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 87.1 | PaLM 540B (Standard-Prompting) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 85.2 | PaLM 540B (CoT Prompting) |
| Natural Language Inference | ANLI test | A2 | 66.5 | PaLM 540B (Self Improvement, Self Consistency) |
| Natural Language Inference | ANLI test | A3 | 67.9 | PaLM 540B (Self Improvement, Self Consistency) |
| Natural Language Inference | ANLI test | A2 | 65.3 | PaLM 540B (Self Improvement, CoT Prompting) |
| Natural Language Inference | ANLI test | A3 | 67.3 | PaLM 540B (Self Improvement, CoT Prompting) |
| Natural Language Inference | ANLI test | A2 | 64.8 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Natural Language Inference | ANLI test | A3 | 66.9 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Natural Language Inference | ANLI test | A2 | 64.5 | PaLM 540B (Self Consistency) |
| Natural Language Inference | ANLI test | A3 | 63.4 | PaLM 540B (Self Consistency) |
| Natural Language Inference | ANLI test | A2 | 58.9 | PaLM 540B (CoT Prompting) |
| Natural Language Inference | ANLI test | A3 | 60.6 | PaLM 540B (CoT Prompting) |
| Natural Language Inference | ANLI test | A2 | 55.8 | PaLM 540B (Standard-Prompting) |
| Natural Language Inference | ANLI test | A3 | 55.8 | PaLM 540B (Standard-Prompting) |
| Arithmetic Reasoning | GSM8K | Accuracy | 82.1 | PaLM 540B (Self Improvement, Self Consistency) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Self Improvement, Self Consistency) |
| Arithmetic Reasoning | GSM8K | Accuracy | 74.4 | PaLM 540B (Self Consistency) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Self Consistency) |
| Arithmetic Reasoning | GSM8K | Accuracy | 73.5 | PaLM 540B (Self Improvement, CoT Prompting) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Self Improvement, CoT Prompting) |
| Arithmetic Reasoning | GSM8K | Accuracy | 56.5 | PaLM 540B (CoT Prompting) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (CoT Prompting) |
| Arithmetic Reasoning | GSM8K | Accuracy | 32.2 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Arithmetic Reasoning | GSM8K | Accuracy | 17.9 | PaLM 540B (Standard-Prompting) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Standard-Prompting) |
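
To make the size of the improvement easier to read off, the short sketch below computes the absolute accuracy gain of "Self Improvement, Self Consistency" over plain "Self Consistency" for each benchmark. The before/after numbers are copied from the results listed above; the dictionary layout and the variable names `results` and `gains` are just illustrative choices.

```python
# (Self Consistency, Self Improvement + Self Consistency) accuracy pairs,
# copied from the results listed above.
results = {
    "GSM8K": (74.4, 82.1),
    "DROP": (78.2, 83.0),
    "OpenBookQA": (90.0, 94.4),
    "ARC (Challenge)": (88.7, 89.8),
    "ANLI A2": (64.5, 66.5),
    "ANLI A3": (63.4, 67.9),
}

# Absolute gain in percentage points, rounded to one decimal place to
# avoid floating-point noise such as 7.699999999999989.
gains = {
    name: round(after - before, 1)
    for name, (before, after) in results.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

Under this comparison the largest gain is on GSM8K (7.7 points) and the smallest on ARC (Challenge) (1.1 points).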

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)