Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji
Large language models (LLMs) with instruction fine-tuning demonstrate superior generative capabilities, but they are resource-intensive. To alleviate this issue, we explore distilling knowledge from instruction-tuned LLMs into much smaller ones. To this end, we carefully develop a large set of 2.58M instructions based on both existing and newly generated instructions. Beyond its size, we design our instruction set to cover a broad range of topics to ensure diversity. Extensive analysis of our instruction dataset confirms its diversity, and we generate responses for these instructions using gpt-3.5-turbo. Leveraging these instructions, we fine-tune a diverse herd of models, collectively referred to as LaMini-LM, which includes models of varying sizes from both the encoder-decoder and decoder-only families. We evaluate the performance of our models using automatic metrics on 15 natural language processing (NLP) benchmarks, as well as through human assessment. The results demonstrate that our proposed LaMini-LM models are comparable to competitive baselines while being much smaller in size.
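The distillation recipe described above can be sketched as a simple data pipeline: collect a teacher response for each instruction and format the pair as a fine-tuning example for the student. The prompt template and the `query_teacher` stub below are illustrative assumptions, not the paper's exact implementation; in the paper the teacher is gpt-3.5-turbo, accessed via API.

```python
# Minimal sketch of the instruction-distillation data pipeline:
# each instruction is sent to a teacher model, and the (prompt, response)
# pair becomes one training example for the small student model.

def query_teacher(instruction: str) -> str:
    """Stand-in for a teacher LLM call (gpt-3.5-turbo in the paper)."""
    return f"[teacher response to: {instruction}]"

# Assumed Alpaca-style template; the actual template may differ.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task.\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_distillation_pairs(instructions):
    """Turn raw instructions into (prompt, target) fine-tuning pairs."""
    pairs = []
    for instruction in instructions:
        prompt = PROMPT_TEMPLATE.format(instruction=instruction)
        target = query_teacher(instruction)
        pairs.append({"prompt": prompt, "target": target})
    return pairs

pairs = build_distillation_pairs(["Name three primary colors."])
print(pairs[0]["prompt"])
```

The resulting pairs would then be fed to a standard supervised fine-tuning loop for the student model (T5- or GPT-family in LaMini-LM).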
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | PIQA | Accuracy | 72.2 | FLAN-T5-Large 783M |
| Question Answering | PIQA | Accuracy | 71.3 | LaMini-GPT 1.5B |
| Question Answering | PIQA | Accuracy | 70.6 | LaMini-F-T5 783M |
| Question Answering | PIQA | Accuracy | 70.5 | GPT-2-XL 1.5B |
| Question Answering | PIQA | Accuracy | 67.2 | LaMini-T5 738M |
| Question Answering | PIQA | Accuracy | 55.9 | T5-Large 738M |
| Question Answering | OpenBookQA | Accuracy | 39.8 | LaMini-GPT 1.5B |
| Question Answering | OpenBookQA | Accuracy | 36.0 | LaMini-T5 738M |
| Question Answering | OpenBookQA | Accuracy | 34.0 | LaMini-F-T5 783M |
| Question Answering | OpenBookQA | Accuracy | 32.8 | T5-Large 738M |
| Question Answering | OpenBookQA | Accuracy | 32.0 | GPT-2-XL 1.5B |
| Question Answering | OpenBookQA | Accuracy | 31.2 | FLAN-T5-Large 783M |
| Common Sense Reasoning | WinoGrande | Accuracy | 59.9 | FLAN-T5-Large 783M |
| Common Sense Reasoning | WinoGrande | Accuracy | 58.3 | GPT-2-XL 1.5B |
| Common Sense Reasoning | WinoGrande | Accuracy | 56.0 | LaMini-F-T5 783M |
| Common Sense Reasoning | WinoGrande | Accuracy | 56.0 | LaMini-GPT 1.5B |
| Common Sense Reasoning | WinoGrande | Accuracy | 55.2 | T5-Large 738M |
| Common Sense Reasoning | WinoGrande | Accuracy | 54.9 | LaMini-T5 738M |
| Word Sense Disambiguation | Words in Context | Accuracy | 64.7 | FLAN-T5-Large 783M |
| Word Sense Disambiguation | Words in Context | Accuracy | 63.8 | LaMini-F-T5 783M |
| Word Sense Disambiguation | Words in Context | Accuracy | 52.4 | LaMini-GPT 1.5B |
| Word Sense Disambiguation | Words in Context | Accuracy | 50.5 | LaMini-T5 738M |
| Word Sense Disambiguation | Words in Context | Accuracy | 49.8 | GPT-2-XL 1.5B |
| Natural Language Inference | MultiNLI | Matched Accuracy | 72.4 | T5-Large 738M |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 72.0 | T5-Large 738M |
| Natural Language Inference | MultiNLI | Matched Accuracy | 67.5 | LaMini-GPT 1.5B |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 69.3 | LaMini-GPT 1.5B |
| Natural Language Inference | MultiNLI | Matched Accuracy | 61.4 | LaMini-F-T5 783M |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 61.0 | LaMini-F-T5 783M |
| Natural Language Inference | MultiNLI | Matched Accuracy | 54.7 | LaMini-T5 738M |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 55.8 | LaMini-T5 738M |
| Natural Language Inference | MultiNLI | Matched Accuracy | 36.5 | GPT-2-XL 1.5B |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 37.0 | GPT-2-XL 1.5B |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 73.3 | GPT-2-XL 1.5B |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 69.6 | LaMini-GPT 1.5B |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 66.7 | T5-Large 738M |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 64.1 | LaMini-F-T5 783M |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 59.0 | LaMini-T5 738M |
| Sentence Completion | HellaSwag | Accuracy | 50.9 | GPT-2-XL 1.5B |
| Sentence Completion | HellaSwag | Accuracy | 48.7 | FLAN-T5-Large 783M |
| Sentence Completion | HellaSwag | Accuracy | 48.3 | LaMini-GPT 1.5B |
| Sentence Completion | HellaSwag | Accuracy | 43.7 | LaMini-F-T5 783M |
| Sentence Completion | HellaSwag | Accuracy | 40.6 | LaMini-T5 738M |
| Sentence Completion | HellaSwag | Accuracy | 38.9 | T5-Large 738M |
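The accuracy numbers above come from multiple-choice benchmarks (PIQA, OpenBookQA, HellaSwag, etc.), which are typically scored by having the model rank each candidate answer and counting how often the top-ranked candidate matches the gold label. A minimal sketch of that scoring loop, with a toy stand-in for the model's scoring function (real harnesses use per-token log-likelihoods):

```python
# Generic multiple-choice accuracy: for each example, score every candidate
# continuation and predict the highest-scoring one.

def evaluate_accuracy(examples, score):
    """examples: dicts with 'context', 'choices', integer 'label'.
    score(context, candidate) -> float; higher means more likely."""
    correct = 0
    for ex in examples:
        scores = [score(ex["context"], cand) for cand in ex["choices"]]
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += int(prediction == ex["label"])
    return correct / len(examples)

# Toy usage: a length-based scorer picks the longer candidate.
toy = [{"context": "Q", "choices": ["short", "a much longer answer"], "label": 1}]
print(evaluate_accuracy(toy, lambda ctx, cand: len(cand)))  # → 1.0
```

In practice the candidate scores would be model log-likelihoods (often length-normalized), which is where design choices across evaluation harnesses can shift the reported numbers slightly.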