Orca 2: Teaching Small Language Models How to Reason

Arindam Mitra, Luciano del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, Ahmed Awadallah

2023-11-18

Reading Comprehension, Question Answering, Mathematical Reasoning, Multi-task Language Understanding, Counterfactual Reasoning, Imitation Learning, Common Sense Reasoning, Crass AI, Arithmetic Reasoning

Abstract

Orca 1 learns from rich signals, such as explanation traces, allowing it to outperform conventional instruction-tuned models on benchmarks like BigBench Hard and AGIEval. In Orca 2, we continue exploring how improved training signals can enhance smaller LMs' reasoning abilities. Research on training small LMs has often relied on imitation learning to replicate the output of more capable models. We contend that excessive emphasis on imitation may restrict the potential of smaller models. We seek to teach small LMs to employ different solution strategies for different tasks, potentially different from the one used by the larger model. For example, while larger models might provide a direct answer to a complex task, smaller models may not have the same capacity. In Orca 2, we teach the model various reasoning techniques (step-by-step, recall then generate, recall-reason-generate, direct answer, etc.). More crucially, we aim to help the model learn to determine the most effective solution strategy for each task. We evaluate Orca 2 using a comprehensive set of 15 diverse benchmarks (corresponding to approximately 100 tasks and over 36,000 unique prompts). Orca 2 significantly surpasses models of similar size and attains performance levels similar to or better than those of models 5-10x larger, as assessed on complex tasks that test advanced reasoning abilities in zero-shot settings. We make Orca 2 weights publicly available at aka.ms/orca-lm to support research on the development, evaluation, and alignment of smaller LMs.

Results

Task	Dataset	Metric	Value	Model
Reading Comprehension	RACE	Accuracy	82.87	Orca 2-13B
Reading Comprehension	RACE	Accuracy	80.79	Orca 2-7B
Transfer Learning	BBH-nlp	Average (%)	50.18	Orca 2-13B
Transfer Learning	BBH-nlp	Average (%)	45.93	Orca 2-7B
Question Answering	DROP Test	F1	60.26	Orca 2-7B
Question Answering	DROP Test	F1	57.97	Orca 2-13B
Question Answering	AGI Eval	Accuracy	49.93	Orca 2-13B
Question Answering	AGI Eval	Accuracy	45.1	Orca 2-7B
Common Sense Reasoning	BIG-bench	Accuracy	86.86	Orca 2-13B
Common Sense Reasoning	BIG-bench	Accuracy	84.31	Orca 2-7B
Multi-Task Learning	BBH-nlp	Average (%)	50.18	Orca 2-13B
Multi-Task Learning	BBH-nlp	Average (%)	45.93	Orca 2-7B
Arithmetic Reasoning	GSM8K	Accuracy	59.14	Orca 2-13B
Arithmetic Reasoning	GSM8K	Parameters (Billion)	13	Orca 2-13B
Arithmetic Reasoning	GSM8K	Accuracy	47.23	Orca 2-7B
Arithmetic Reasoning	GSM8K	Parameters (Billion)	7	Orca 2-7B
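
To compare the two model sizes row by row, the scores in the table can be differenced per dataset. The following is a minimal sketch, not part of the paper: the `RESULTS` dictionary and the `size_gaps` helper are hypothetical names introduced here, the values are copied from the table, model names are written in the hyphenated form, and the BBH-nlp score that appears under two task labels is entered once.

```python
# Scores copied from the Results table: (dataset, metric) -> {model: value}.
# RESULTS and size_gaps are illustrative names, not from the paper.
RESULTS = {
    ("RACE", "Accuracy"): {"Orca 2-13B": 82.87, "Orca 2-7B": 80.79},
    ("BBH-nlp", "Average (%)"): {"Orca 2-13B": 50.18, "Orca 2-7B": 45.93},
    ("DROP Test", "F1"): {"Orca 2-13B": 57.97, "Orca 2-7B": 60.26},
    ("AGI Eval", "Accuracy"): {"Orca 2-13B": 49.93, "Orca 2-7B": 45.1},
    ("BIG-bench", "Accuracy"): {"Orca 2-13B": 86.86, "Orca 2-7B": 84.31},
    ("GSM8K", "Accuracy"): {"Orca 2-13B": 59.14, "Orca 2-7B": 47.23},
}


def size_gaps(results):
    """Return {(dataset, metric): 13B score minus 7B score}, rounded to 2 decimals."""
    return {
        key: round(scores["Orca 2-13B"] - scores["Orca 2-7B"], 2)
        for key, scores in results.items()
    }


if __name__ == "__main__":
    # Print rows from the smallest gap to the largest.
    for (dataset, metric), gap in sorted(size_gaps(RESULTS).items(), key=lambda kv: kv[1]):
        print(f"{dataset:10s} {metric:12s} {gap:+.2f}")
```

With the values as listed, the largest gap is on GSM8K (11.91 points in favor of the 13B model), and DROP Test is the only row where the 7B entry is higher (by 2.29 F1).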
