Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, Minjoon Seo
Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when solving unseen tasks. In this work, we aim to equip smaller LMs with step-by-step reasoning capability through instruction tuning with CoT rationales. To achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (which includes only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables smaller LMs to achieve better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in an improvement of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT utilizing demonstrations up to the maximum input length by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.
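As a minimal sketch (not the authors' released code), CoT instruction tuning trains a seq2seq LM on pairs where the source is the instruction-formatted instance and the target contains a rationale followed by the final answer. The formatter below is a hypothetical illustration of building such (source, target) pairs; the `[ANSWER]` delimiter is an assumed choice, not the paper's exact format:

```python
def format_cot_example(instruction, input_text, rationale, answer,
                       answer_prefix="[ANSWER]"):
    """Build one (source, target) pair for CoT instruction tuning.

    The source holds the task instruction and instance; the target is the
    step-by-step rationale followed by the final answer, separated by
    `answer_prefix` (a hypothetical delimiter choice for this sketch).
    """
    source = f"{instruction}\n\n{input_text}"
    target = f"{rationale} {answer_prefix} {answer}"
    return source, target


# Example usage: one binary-QA instance rendered as a training pair.
source, target = format_cot_example(
    instruction="Answer the question with yes or no.",
    input_text="Question: Is the sky blue on a clear day?",
    rationale=("On a clear day, sunlight scatters in the atmosphere, "
               "making the sky appear blue."),
    answer="yes",
)
```

During fine-tuning, the model would be trained with a standard sequence-to-sequence loss to generate `target` given `source`, so that at inference time it emits a rationale before its answer.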
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Few-Shot Learning | PubMedQA | Accuracy | 73.42 | CoT-T5-11B (1024 Shot) |
| Few-Shot Learning | CaseHOLD | Accuracy | 68.3 | CoT-T5-11B (1024 Shot) |
| Few-Shot Learning | MedNLI | Accuracy | 78.02 | CoT-T5-11B (1024 Shot) |
| Question Answering | COPA | Accuracy | 90.9 | T0-3B (CoT fine-tuned) |
| Question Answering | PubMedQA | Accuracy | 73.42 | CoT-T5-11B (1024 Shot) |
| Question Answering | StoryCloze | Accuracy | 94.5 | T0-3B (CoT fine-tuned) |
| Common Sense Reasoning | WinoGrande | Accuracy | 57.5 | T0-3B (CoT fine-tuned) |
| Word Sense Disambiguation | Words in Context | Accuracy | 56.7 | T0-3B (CoT fine-tuned) |
| Natural Language Inference | ANLI test (R1) | Accuracy | 41.7 | T0-3B (CoT fine-tuned) |
| Natural Language Inference | ANLI test (R2) | Accuracy | 37.2 | T0-3B (CoT fine-tuned) |
| Natural Language Inference | ANLI test (R3) | Accuracy | 41.9 | T0-3B (CoT fine-tuned) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 66.0 | T0-3B (CoT fine-tuned) |
| Meta-Learning | PubMedQA | Accuracy | 73.42 | CoT-T5-11B (1024 Shot) |
| Meta-Learning | CaseHOLD | Accuracy | 68.3 | CoT-T5-11B (1024 Shot) |
| Meta-Learning | MedNLI | Accuracy | 78.02 | CoT-T5-11B (1024 Shot) |
| Sentence Completion | HellaSwag | Accuracy | 41.1 | T0-3B (CoT fine-tuned) |