Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li

2024-03-12 · Question Answering · Math · Math Word Problem Solving · Multi-task Language Understanding · Common Sense Reasoning · World Knowledge · Arithmetic Reasoning · Code Generation
Paper · PDF · Code

Abstract

We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning, and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in an embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Experts (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases: the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
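The merge step described in the abstract — keep each expert's feedforward weights as separate MoE experts, average everything else — can be sketched as follows. This is an illustrative sketch only: the function name `btx_merge`, the `is_ffn_param` predicate, and the flat state-dict layout are assumptions, not the paper's actual code.

```python
import torch

def btx_merge(expert_state_dicts, is_ffn_param):
    """Combine N domain-expert checkpoints BTX-style.

    Feedforward parameters are stacked into shape (num_experts, *shape),
    becoming the experts of an MoE layer; all remaining parameters
    (attention, embeddings, norms) are simply averaged.
    """
    merged = {}
    n = len(expert_state_dicts)
    for name in expert_state_dicts[0]:
        if is_ffn_param(name):
            # One copy per expert: these become the MoE layer's experts.
            merged[name] = torch.stack([sd[name] for sd in expert_state_dicts])
        else:
            # Shared parameters are averaged across the branched experts.
            merged[name] = sum(sd[name] for sd in expert_state_dicts) / n
    return merged

# Tiny demo with two 2x2 "checkpoints".
sd_code = {"attn.w": torch.ones(2, 2), "ffn.w": torch.zeros(2, 2)}
sd_math = {"attn.w": 3 * torch.ones(2, 2), "ffn.w": torch.ones(2, 2)}
merged = btx_merge([sd_code, sd_math], lambda name: name.startswith("ffn"))
```

After this merge, a further MoE-finetuning stage (not shown) trains a router to select experts per token.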

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Transfer Learning | MMLU | Average (%) | 53.2 | Branch-Train-MiX 4x7B (sampling top-1 expert) |
| Question Answering | TriviaQA | EM | 57.1 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Question Answering | MATH | Accuracy | 17.8 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Code Generation | MBPP | Accuracy | 42.6 | Branch-Train-Merge 4x7B (top-2) |
| Code Generation | MBPP | Accuracy | 39.4 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Common Sense Reasoning | WinoGrande | Accuracy | 70.6 | Branch-Train-MiX 4x7B (sampling top-1 expert) |
| Math Word Problem Solving | MATH | Accuracy | 17.8 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Mathematical Question Answering | MATH | Accuracy | 17.8 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Multi-Task Learning | MMLU | Average (%) | 53.2 | Branch-Train-MiX 4x7B (sampling top-1 expert) |
| Mathematical Reasoning | MATH | Accuracy | 17.8 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Arithmetic Reasoning | GSM8K | Accuracy | 37.1 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
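The "top-1 / top-2 experts" qualifiers in the model names refer to token-level routing: each token's hidden state is scored by a learned router and dispatched to its k highest-scoring experts. A minimal sketch of such top-k MoE routing is below; the function name, the plain SiLU feedforward shape, and all tensor names are assumptions for illustration, not the paper's implementation.

```python
import torch

def moe_topk_forward(x, router_w, w_in, w_out, top_k=2):
    """Route each token to its top_k experts and mix their FFN outputs.

    Shapes: x (tokens, d); router_w (d, E); w_in (E, d, h); w_out (E, h, d).
    """
    probs = torch.softmax(x @ router_w, dim=-1)       # (tokens, E) routing weights
    top_p, top_idx = probs.topk(top_k, dim=-1)        # keep the k best experts per token
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                       # explicit loops for clarity, not speed
        for s in range(top_k):
            e = top_idx[t, s]
            hidden = torch.nn.functional.silu(x[t] @ w_in[e])
            out[t] += top_p[t, s] * (hidden @ w_out[e])
    return out

torch.manual_seed(0)
x = torch.randn(4, 8)                                 # 4 tokens, model dim 8
y = moe_topk_forward(x, torch.randn(8, 4),            # router over 4 experts
                     torch.randn(4, 8, 16),           # per-expert up-projections
                     torch.randn(4, 16, 8),           # per-expert down-projections
                     top_k=2)
```

Setting `top_k=1` corresponds to the "sampling top-1 expert" rows above; larger k mixes more experts per token at higher compute cost.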

Related Papers

- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
- QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)