Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

2024-01-08

Tasks: Question Answering · Math Word Problem Solving · Multi-task Language Understanding · Common Sense Reasoning · Code Generation · Language Modelling

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
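The routing described in the abstract (a per-layer router picks 2 of 8 feedforward experts per token and mixes their outputs) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function and variable names (`moe_layer`, `gate_w`, `experts`) are hypothetical, and the experts here are stand-in linear maps rather than real feedforward blocks.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Sketch of a sparse MoE feedforward layer with top-k routing.

    x: (d,) hidden state for one token
    gate_w: (n_experts, d) router weights (hypothetical name)
    experts: list of callables, each mapping (d,) -> (d,)
    """
    logits = gate_w @ x                # one router score per expert
    top = np.argsort(logits)[-k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the selected experts only
    # Weighted combination of only the k selected experts' outputs.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: 8 "experts" (random linear maps standing in for the 8 blocks).
rng = np.random.default_rng(0)
d, n = 4, 8
experts = [lambda v, W=rng.standard_normal((d, d)): W @ v for _ in range(n)]
gate_w = rng.standard_normal((n, d))
y = moe_layer(rng.standard_normal(d), gate_w, experts)
```

Because only `k` of `n` expert blocks run per token, compute scales with the active parameters (13B in Mixtral's case) even though the selected pair, and hence the parameters touched, can differ at every layer and timestep.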

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Transfer Learning | MMLU | Average (%) | 70.6 | Mixtral 8x7B (5-shot) |
| Transfer Learning | MMLU | Average (%) | 62.5 | Mistral 7B (5-shot) |
| Question Answering | PIQA | Accuracy | 83.6 | Mixtral 8x7B (0-shot) |
| Question Answering | PIQA | Accuracy | 82.2 | Mistral 7B (0-shot) |
| Question Answering | MATH | Accuracy | 28.4 | Mixtral 8x7B (maj@4) |
| Question Answering | MATH | Accuracy | 12.7 | Mistral 7B (maj@4) |
| Question Answering | MATH | Parameters (Billions) | 7 | Mistral 7B (maj@4) |
| Code Generation | MBPP | Accuracy | 60.7 | Mixtral 8x7B (3-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 77.2 | Mixtral 8x7B (0-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 74.2 | Mistral 7B (0-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 83.1 | Mixtral 8x7B (0-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 80.5 | Mistral 7B (0-shot) |
| Math Word Problem Solving | MATH | Accuracy | 28.4 | Mixtral 8x7B (maj@4) |
| Math Word Problem Solving | MATH | Accuracy | 12.7 | Mistral 7B (maj@4) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 7 | Mistral 7B (maj@4) |
| Mathematical Question Answering | MATH | Accuracy | 28.4 | Mixtral 8x7B (maj@4) |
| Mathematical Question Answering | MATH | Accuracy | 12.7 | Mistral 7B (maj@4) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 7 | Mistral 7B (maj@4) |
| Multi-Task Learning | MMLU | Average (%) | 70.6 | Mixtral 8x7B (5-shot) |
| Multi-Task Learning | MMLU | Average (%) | 62.5 | Mistral 7B (5-shot) |
| Mathematical Reasoning | MATH | Accuracy | 28.4 | Mixtral 8x7B (maj@4) |
| Mathematical Reasoning | MATH | Accuracy | 12.7 | Mistral 7B (maj@4) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 7 | Mistral 7B (maj@4) |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- Towards Formal Verification of LLM-Generated Code from Natural Language Prompts (2025-07-17)