Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, Minjoon Seo

Published: 2023-05-23

Tasks: Question Answering, Few-Shot Learning, Common Sense Reasoning (Zero-Shot), Sentence Completion, Coreference Resolution, Natural Language Inference, Common Sense Reasoning, Natural Language Inference (Zero-Shot), Word Sense Disambiguation

Links: Paper, PDF, Code (official)

Abstract

Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when solving unseen tasks. In this work, we aim to equip smaller LMs with step-by-step reasoning capability by instruction tuning with CoT rationales. To achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (which includes only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables smaller LMs to exhibit better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks, yielding an improvement of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT using demonstrations up to the maximum input length by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.
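The abstract describes instruction tuning with CoT rationales: each training example pairs an instruction-plus-question input with a target that gives the step-by-step rationale before the final answer. The sketch below illustrates one plausible way such a seq2seq training pair could be formatted; the exact prompt template, field names, and trigger phrase are assumptions, not the CoT Collection's actual format.

```python
# Hedged sketch: formatting a chain-of-thought (CoT) instruction-tuning
# example for a seq2seq model such as Flan-T5. The template, field names,
# and "step by step" trigger phrase are illustrative assumptions, not the
# CoT Collection's actual schema.

def make_cot_example(instruction: str, question: str,
                     rationale: str, answer: str) -> dict:
    """Pair an instruction/question input with a rationale-then-answer target."""
    source = f"{instruction}\n\nQuestion: {question}\nLet's think step by step."
    target = f"{rationale}\nTherefore, the answer is {answer}."
    return {"input": source, "target": target}

example = make_cot_example(
    instruction="Answer the multiple-choice question.",
    question="If ice is heated above 0°C, what happens? (a) it melts (b) it freezes",
    rationale="Ice is solid water; above 0°C it changes to liquid.",
    answer="(a) it melts",
)
print(example["target"])
```

Training on targets shaped like this (rationale first, answer last) is what lets a fine-tuned model emit its reasoning before committing to an answer at inference time.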

Results

Task                        | Dataset                    | Metric   | Value | Model
Few-Shot Learning           | PubMedQA                   | Accuracy | 73.42 | CoT-T5-11B (1024 Shot)
Few-Shot Learning           | CaseHOLD                   | Accuracy | 68.3  | CoT-T5-11B (1024 Shot)
Few-Shot Learning           | MedNLI                     | Accuracy | 78.02 | CoT-T5-11B (1024 Shot)
Question Answering          | COPA                       | Accuracy | 90.9  | T0-3B (CoT fine-tuned)
Question Answering          | PubMedQA                   | Accuracy | 73.42 | CoT-T5-11B (1024 Shot)
Question Answering          | StoryCloze                 | Accuracy | 94.5  | T0-3B (CoT fine-tuned)
Common Sense Reasoning      | WinoGrande                 | Accuracy | 57.5  | T0-3B (CoT fine-tuned)
Word Sense Disambiguation   | Words in Context           | Accuracy | 56.7  | T0-3B (CoT fine-tuned)
Natural Language Inference  | ANLI test                  | A1       | 41.7  | T0-3B (CoT fine-tuned)
Natural Language Inference  | ANLI test                  | A2       | 37.2  | T0-3B (CoT fine-tuned)
Natural Language Inference  | ANLI test                  | A3       | 41.9  | T0-3B (CoT fine-tuned)
Coreference Resolution      | Winograd Schema Challenge  | Accuracy | 66    | T0-3B (CoT fine-tuned)
Meta-Learning               | PubMedQA                   | Accuracy | 73.42 | CoT-T5-11B (1024 Shot)
Meta-Learning               | CaseHOLD                   | Accuracy | 68.3  | CoT-T5-11B (1024 Shot)
Meta-Learning               | MedNLI                     | Accuracy | 78.02 | CoT-T5-11B (1024 Shot)
Sentence Completion         | HellaSwag                  | Accuracy | 41.1  | T0-3B (CoT fine-tuned)

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)