Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

2024-01-08

Tasks: Question Answering · Math Word Problem Solving · Multi-task Language Understanding · Common Sense Reasoning · Code Generation · Language Modelling

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
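The routing described in the abstract (a per-layer router picks 2 of 8 feedforward experts per token and mixes their outputs) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function and variable names (`moe_layer`, `gate_w`, `experts`) are hypothetical, and the experts here are stand-in linear maps rather than real feedforward blocks.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Sketch of a sparse MoE feedforward layer with top-k routing.

    x: (d,) hidden state for one token
    gate_w: (n_experts, d) router weights (hypothetical name)
    experts: list of callables, each mapping (d,) -> (d,)
    """
    logits = gate_w @ x                # one router score per expert
    top = np.argsort(logits)[-k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the selected experts only
    # Weighted combination of only the k selected experts' outputs.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: 8 "experts" (random linear maps standing in for the 8 blocks).
rng = np.random.default_rng(0)
d, n = 4, 8
experts = [lambda v, W=rng.standard_normal((d, d)): W @ v for _ in range(n)]
gate_w = rng.standard_normal((n, d))
y = moe_layer(rng.standard_normal(d), gate_w, experts)
```

Because only `k` of `n` expert blocks run per token, compute scales with the active parameters (13B in Mixtral's case) even though the selected pair, and hence the parameters touched, can differ at every layer and timestep.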

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Transfer Learning | MMLU | Average (%) | 70.6 | Mixtral 8x7B (5-shot) |
| Transfer Learning | MMLU | Average (%) | 62.5 | Mistral 7B (5-shot) |
| Question Answering | PIQA | Accuracy | 83.6 | Mixtral 8x7B (0-shot) |
| Question Answering | PIQA | Accuracy | 82.2 | Mistral 7B (0-shot) |
| Question Answering | MATH | Accuracy | 28.4 | Mixtral 8x7B (maj@4) |
| Question Answering | MATH | Accuracy | 12.7 | Mistral 7B (maj@4) |
| Question Answering | MATH | Parameters (Billions) | 7 | Mistral 7B (maj@4) |
| Code Generation | MBPP | Accuracy | 60.7 | Mixtral 8x7B (3-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 77.2 | Mixtral 8x7B (0-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 74.2 | Mistral 7B (0-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 83.1 | Mixtral 8x7B (0-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 80.5 | Mistral 7B (0-shot) |
| Math Word Problem Solving | MATH | Accuracy | 28.4 | Mixtral 8x7B (maj@4) |
| Math Word Problem Solving | MATH | Accuracy | 12.7 | Mistral 7B (maj@4) |
| Math Word Problem Solving | MATH | Parameters (Billions) | 7 | Mistral 7B (maj@4) |
| Mathematical Question Answering | MATH | Accuracy | 28.4 | Mixtral 8x7B (maj@4) |
| Mathematical Question Answering | MATH | Accuracy | 12.7 | Mistral 7B (maj@4) |
| Mathematical Question Answering | MATH | Parameters (Billions) | 7 | Mistral 7B (maj@4) |
| Multi-Task Learning | MMLU | Average (%) | 70.6 | Mixtral 8x7B (5-shot) |
| Multi-Task Learning | MMLU | Average (%) | 62.5 | Mistral 7B (5-shot) |
| Mathematical Reasoning | MATH | Accuracy | 28.4 | Mixtral 8x7B (maj@4) |
| Mathematical Reasoning | MATH | Accuracy | 12.7 | Mistral 7B (maj@4) |
| Mathematical Reasoning | MATH | Parameters (Billions) | 7 | Mistral 7B (maj@4) |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- Towards Formal Verification of LLM-Generated Code from Natural Language Prompts (2025-07-17)