Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks

Haoyuan Wu, Haisheng Zheng, Zhuolun He, Bei Yu

2024-01-05 · Question Answering · Math Word Problem Solving · Multi-task Language Understanding · Sentence Completion · Common Sense Reasoning · Arithmetic Reasoning · Code Generation
Paper · PDF · Code (official) · Code

Abstract

Large language models (LLMs) have demonstrated considerable proficiency in general natural language processing (NLP) tasks. Instruction tuning, a successful paradigm, enhances the ability of LLMs to follow natural language instructions and exhibit robust generalization across general tasks. However, these models often encounter performance limitations across multiple tasks due to constrained model capacity. Expanding this capacity during the instruction tuning phase poses significant challenges. To address this issue, we introduce parameter-efficient sparsity crafting (PESC), which crafts dense models into sparse models using the mixture-of-experts (MoE) architecture. PESC integrates adapters into the MoE layers of sparse models, differentiating experts without altering the individual weights within these layers. This method significantly reduces computational costs and GPU memory requirements, facilitating model capacity expansion through a minimal parameter increase while guaranteeing the quality of approximation in function space compared to original sparse upcycling. Our empirical evaluation demonstrates the effectiveness of the PESC method. Using PESC during instruction tuning, our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GPT-3.5. Our code is available at https://github.com/wuhy68/Parameter-Efficient-MoE.
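The core idea in the abstract — experts that share the dense model's FFN weights and differ only through small per-expert adapters, selected by a top-k router — can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: all dimensions, the ReLU activation, the zero-initialized low-rank adapters, and the name `pesc_moe_layer` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, rank, top_k = 16, 32, 4, 2, 2

# Shared FFN weights: copies of the dense model's MLP, identical for every expert.
W_up = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

# Per-expert low-rank adapters: the only expert-specific parameters.
# Zero-initializing B makes all experts start out identical to the dense FFN.
A = rng.standard_normal((n_experts, d_model, rank)) * 0.02
B = np.zeros((n_experts, rank, d_model))

W_gate = rng.standard_normal((d_model, n_experts)) * 0.02  # router weights

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def pesc_moe_layer(x):
    """x: (tokens, d_model) -> (tokens, d_model)."""
    shared = np.maximum(x @ W_up, 0.0) @ W_down        # shared FFN (ReLU)
    logits = x @ W_gate                                 # router scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        idx = np.argsort(logits[t])[-top_k:]            # top-k experts per token
        w = softmax(logits[t, idx])
        for weight, e in zip(w, idx):
            # An "expert" is the shared FFN plus its own low-rank adapter.
            adapted = shared[t] + (x[t] @ A[e]) @ B[e]
            out[t] += weight * adapted
    return out

x = rng.standard_normal((3, d_model))
y = pesc_moe_layer(x)
print(y.shape)  # (3, 16)
```

Note the parameter-efficiency claim in miniature: the experts add only `n_experts * 2 * d_model * rank` adapter weights on top of one shared FFN, and with zero-initialized adapters the crafted sparse layer initially reproduces the dense layer's output exactly.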

Results

Task | Dataset | Metric | Value | Model
Question Answering | PIQA | Accuracy | 82.7 | Camelidae-8×34B
Question Answering | MATH | Accuracy | 29.9 | Qwen2idae-16x14B (4-shot)
Question Answering | MATH | Accuracy | 22.6 | Camelidae-8×34B (4-shot)
Code Generation | MBPP | Accuracy | 48.6 | Qwen2idae-16x14B (4-shot)
Code Generation | MBPP | Accuracy | 41.4 | Camelidae-8×34B (4-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 80.9 | Camelidae-8×34B
Common Sense Reasoning | ARC (Challenge) | Accuracy | 65.2 | Camelidae-8×34B
Common Sense Reasoning | ARC (Easy) | Accuracy | 86.2 | Camelidae-8×34B
Math Word Problem Solving | MATH | Accuracy | 29.9 | Qwen2idae-16x14B (4-shot)
Math Word Problem Solving | MATH | Accuracy | 22.6 | Camelidae-8×34B (4-shot)
Mathematical Question Answering | MATH | Accuracy | 29.9 | Qwen2idae-16x14B (4-shot)
Mathematical Question Answering | MATH | Accuracy | 22.6 | Camelidae-8×34B (4-shot)
Mathematical Reasoning | MATH | Accuracy | 29.9 | Qwen2idae-16x14B (4-shot)
Mathematical Reasoning | MATH | Accuracy | 22.6 | Camelidae-8×34B (4-shot)
Sentence Completion | HellaSwag | Accuracy | 83.2 | Camelidae-8×34B (10-shot)
Sentence Completion | HellaSwag | Accuracy | 82.3 | Qwen2idae-16x14B (10-shot)
Arithmetic Reasoning | GSM8K | Accuracy | 78.3 | Camelidae-8×34B (5-shot)
Arithmetic Reasoning | GSM8K | Accuracy | 77.8 | Qwen2idae-16x14B (5-shot)

Related Papers

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Towards Formal Verification of LLM-Generated Code from Natural Language Prompts (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)