Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities by self-thinking without external input. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought (CoT) prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4% → 82.1% on GSM8K, 78.2% → 83.0% on DROP, 90.0% → 94.4% on OpenBookQA, and 63.4% → 67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground-truth labels. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
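The core of the approach is self-consistency filtering: sample multiple CoT rationales per unlabeled question, majority-vote over the parsed final answers, and keep only the rationales that agree with a sufficiently confident majority as fine-tuning targets. A minimal sketch of that filtering step, assuming sampled decodes end with the phrase "The answer is X." and using a hypothetical confidence threshold (the answer-parsing convention and threshold value are illustrative, not from the paper):

```python
from collections import Counter

def parse_answer(output: str) -> str:
    """Extract the final answer after the trailing 'The answer is' marker."""
    return output.rsplit("The answer is", 1)[-1].strip().rstrip(".")

def most_consistent_answer(sampled_outputs):
    """Majority-vote over answers parsed from temperature-sampled CoT decodes.

    Returns (answer, confidence), where confidence is the vote share."""
    answers = [parse_answer(out) for out in sampled_outputs]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

def build_self_training_example(question, sampled_outputs, threshold=0.5):
    """Keep (question, rationale) pairs only when the majority is confident."""
    answer, confidence = most_consistent_answer(sampled_outputs)
    if confidence < threshold:
        return None  # discard low-confidence questions entirely
    # keep every sampled rationale whose answer matches the majority vote
    rationales = [out for out in sampled_outputs if parse_answer(out) == answer]
    return {"question": question, "targets": rationales, "answer": answer}

# Toy decodes standing in for sampled CoT outputs from the LLM.
samples = [
    "2 + 2 doubles 2 to 4. The answer is 4.",
    "Two plus two equals four. The answer is 4.",
    "I think it is five. The answer is 5.",
]
example = build_self_training_example("What is 2 + 2?", samples)
print(example["answer"], len(example["targets"]))  # -> 4 2
```

Fine-tuning then treats each kept rationale (not just the bare answer) as a target output, which the ablations above identify as critical for self-improvement.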
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | DROP | Accuracy | 83.0 | PaLM 540B (Self Improvement, Self Consistency) |
| Question Answering | DROP | Accuracy | 78.2 | PaLM 540B (Self Consistency) |
| Question Answering | DROP | Accuracy | 76.2 | PaLM 540B (Self Improvement, CoT Prompting) |
| Question Answering | DROP | Accuracy | 71.7 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Question Answering | DROP | Accuracy | 70.6 | PaLM 540B (CoT Prompting) |
| Question Answering | DROP | Accuracy | 60.0 | PaLM 540B (Standard-Prompting) |
| Question Answering | OpenBookQA | Accuracy | 94.4 | PaLM 540B (Self Improvement, Self Consistency) |
| Question Answering | OpenBookQA | Accuracy | 93.0 | PaLM 540B (Self Improvement, CoT Prompting) |
| Question Answering | OpenBookQA | Accuracy | 92.0 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Question Answering | OpenBookQA | Accuracy | 90.0 | PaLM 540B (Self Consistency) |
| Question Answering | OpenBookQA | Accuracy | 86.4 | PaLM 540B (CoT Prompting) |
| Question Answering | OpenBookQA | Accuracy | 84.4 | PaLM 540B (Standard-Prompting) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 89.8 | PaLM 540B (Self Improvement, Self Consistency) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 88.7 | PaLM 540B (Self Consistency) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 88.3 | PaLM 540B (Self Improvement, CoT Prompting) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 87.2 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 87.1 | PaLM 540B (Standard-Prompting) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 85.2 | PaLM 540B (CoT Prompting) |
| Natural Language Inference | ANLI test | A2 | 66.5 | PaLM 540B (Self Improvement, Self Consistency) |
| Natural Language Inference | ANLI test | A3 | 67.9 | PaLM 540B (Self Improvement, Self Consistency) |
| Natural Language Inference | ANLI test | A2 | 65.3 | PaLM 540B (Self Improvement, CoT Prompting) |
| Natural Language Inference | ANLI test | A3 | 67.3 | PaLM 540B (Self Improvement, CoT Prompting) |
| Natural Language Inference | ANLI test | A2 | 64.8 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Natural Language Inference | ANLI test | A3 | 66.9 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Natural Language Inference | ANLI test | A2 | 64.5 | PaLM 540B (Self Consistency) |
| Natural Language Inference | ANLI test | A3 | 63.4 | PaLM 540B (Self Consistency) |
| Natural Language Inference | ANLI test | A2 | 58.9 | PaLM 540B (CoT Prompting) |
| Natural Language Inference | ANLI test | A3 | 60.6 | PaLM 540B (CoT Prompting) |
| Natural Language Inference | ANLI test | A2 | 55.8 | PaLM 540B (Standard-Prompting) |
| Natural Language Inference | ANLI test | A3 | 55.8 | PaLM 540B (Standard-Prompting) |
| Arithmetic Reasoning | GSM8K | Accuracy | 82.1 | PaLM 540B (Self Improvement, Self Consistency) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Self Improvement, Self Consistency) |
| Arithmetic Reasoning | GSM8K | Accuracy | 74.4 | PaLM 540B (Self Consistency) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Self Consistency) |
| Arithmetic Reasoning | GSM8K | Accuracy | 73.5 | PaLM 540B (Self Improvement, CoT Prompting) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Self Improvement, CoT Prompting) |
| Arithmetic Reasoning | GSM8K | Accuracy | 56.5 | PaLM 540B (CoT Prompting) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (CoT Prompting) |
| Arithmetic Reasoning | GSM8K | Accuracy | 32.2 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Self Improvement, Standard-Prompting) |
| Arithmetic Reasoning | GSM8K | Accuracy | 17.9 | PaLM 540B (Standard-Prompting) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 540 | PaLM 540B (Standard-Prompting) |