Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li

2024-03-12 · Question Answering · Math · Math Word Problem Solving · Multi-task Language Understanding · Common Sense Reasoning · World Knowledge · Arithmetic Reasoning · Code Generation
Paper · PDF · Code

Abstract

We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning, and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in an embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Experts (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases: the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
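The merge step described in the abstract — keep each expert's feedforward weights as separate MoE experts, average everything else — can be sketched as follows. This is an illustrative sketch only: the function name `btx_merge`, the `is_ffn_param` predicate, and the flat state-dict layout are assumptions, not the paper's actual code.

```python
import torch

def btx_merge(expert_state_dicts, is_ffn_param):
    """Combine N domain-expert checkpoints BTX-style.

    Feedforward parameters are stacked into shape (num_experts, *shape),
    becoming the experts of an MoE layer; all remaining parameters
    (attention, embeddings, norms) are simply averaged.
    """
    merged = {}
    n = len(expert_state_dicts)
    for name in expert_state_dicts[0]:
        if is_ffn_param(name):
            # One copy per expert: these become the MoE layer's experts.
            merged[name] = torch.stack([sd[name] for sd in expert_state_dicts])
        else:
            # Shared parameters are averaged across the branched experts.
            merged[name] = sum(sd[name] for sd in expert_state_dicts) / n
    return merged

# Tiny demo with two 2x2 "checkpoints".
sd_code = {"attn.w": torch.ones(2, 2), "ffn.w": torch.zeros(2, 2)}
sd_math = {"attn.w": 3 * torch.ones(2, 2), "ffn.w": torch.ones(2, 2)}
merged = btx_merge([sd_code, sd_math], lambda name: name.startswith("ffn"))
```

After this merge, a further MoE-finetuning stage (not shown) trains a router to select experts per token.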

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Transfer Learning | MMLU | Average (%) | 53.2 | Branch-Train-MiX 4x7B (sampling top-1 expert) |
| Question Answering | TriviaQA | EM | 57.1 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Question Answering | MATH | Accuracy | 17.8 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Code Generation | MBPP | Accuracy | 42.6 | Branch-Train-Merge 4x7B (top-2) |
| Code Generation | MBPP | Accuracy | 39.4 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Common Sense Reasoning | WinoGrande | Accuracy | 70.6 | Branch-Train-MiX 4x7B (sampling top-1 expert) |
| Math Word Problem Solving | MATH | Accuracy | 17.8 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Mathematical Question Answering | MATH | Accuracy | 17.8 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Multi-Task Learning | MMLU | Average (%) | 53.2 | Branch-Train-MiX 4x7B (sampling top-1 expert) |
| Mathematical Reasoning | MATH | Accuracy | 17.8 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
| Arithmetic Reasoning | GSM8K | Accuracy | 37.1 | Branch-Train-MiX 4x7B (sampling top-2 experts) |
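The "top-1 / top-2 experts" qualifiers in the model names refer to token-level routing: each token's hidden state is scored by a learned router and dispatched to its k highest-scoring experts. A minimal sketch of such top-k MoE routing is below; the function name, the plain SiLU feedforward shape, and all tensor names are assumptions for illustration, not the paper's implementation.

```python
import torch

def moe_topk_forward(x, router_w, w_in, w_out, top_k=2):
    """Route each token to its top_k experts and mix their FFN outputs.

    Shapes: x (tokens, d); router_w (d, E); w_in (E, d, h); w_out (E, h, d).
    """
    probs = torch.softmax(x @ router_w, dim=-1)       # (tokens, E) routing weights
    top_p, top_idx = probs.topk(top_k, dim=-1)        # keep the k best experts per token
    top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                       # explicit loops for clarity, not speed
        for s in range(top_k):
            e = top_idx[t, s]
            hidden = torch.nn.functional.silu(x[t] @ w_in[e])
            out[t] += top_p[t, s] * (hidden @ w_out[e])
    return out

torch.manual_seed(0)
x = torch.randn(4, 8)                                 # 4 tokens, model dim 8
y = moe_topk_forward(x, torch.randn(8, 4),            # router over 4 experts
                     torch.randn(4, 8, 16),           # per-expert up-projections
                     torch.randn(4, 16, 8),           # per-expert down-projections
                     top_k=2)
```

Setting `top_k=1` corresponds to the "sampling top-1 expert" rows above; larger k mixes more experts per token at higher compute cost.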

Related Papers

- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
- QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)