Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Llama 3 Meets MoE: Efficient Upcycling

Aditya Vavre, Ethan He, Dennis Liu, Zijie Yan, June Yang, Nima Tajbakhsh, Ashwath Aithal

Published: 2024-12-13 · Task: Multi-task Language Understanding · Benchmark: MMLU
Links: Paper · PDF · Code (official)

Abstract

Scaling large language models (LLMs) significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1\%$ of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a $\textbf{2\%}$ improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of $\textbf{46.8\%}$ during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
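The upcycling recipe described above (initializing each of the 8 experts from a pre-trained dense FFN checkpoint and adding a fresh Top-2 router trained from scratch) can be sketched in PyTorch. This is a minimal illustrative sketch, not the paper's NeMo implementation; the class names, layer sizes, and routing loop are all assumptions made for clarity.

```python
# Illustrative sketch of dense-to-MoE "upcycling": experts are copies of a
# pre-trained dense FFN, and only the router starts from random init.
# Names/sizes are hypothetical, not the paper's actual NeMo code.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Stand-in for a pre-trained dense transformer FFN block."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class UpcycledMoE(nn.Module):
    """Top-k MoE layer whose experts start as copies of a dense FFN."""

    def __init__(self, dense: DenseFFN, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        d_model = dense.up.in_features
        # Upcycling: every expert inherits the pre-trained dense weights.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense) for _ in range(num_experts)
        )
        # The router is the only newly initialized component.
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                          # (tokens, experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # per-token top-k
        weights = F.softmax(weights, dim=-1)             # normalize top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out
```

One convenient property of this initialization: because every expert is an exact copy of the dense FFN and the Top-2 routing weights are softmax-normalized to sum to 1, the upcycled layer initially reproduces the dense layer's output, so training starts from the dense checkpoint's quality and lets the experts specialize from there.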

Results

Task                 Dataset  Metric       Value  Model
Transfer Learning    MMLU     Average (%)  86.6   Llama 3.1 (405B)
Transfer Learning    MMLU     Average (%)  86     Llama 3.1 (70B)
Multi-Task Learning  MMLU     Average (%)  86.6   Llama 3.1 (405B)
Multi-Task Learning  MMLU     Average (%)  86     Llama 3.1 (70B)

Related Papers

Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning (2025-07-16)
Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs (2025-07-15)
Lizard: An Efficient Linearization Framework for Large Language Models (2025-07-11)
Integrating External Tools with Large Language Models to Improve Accuracy (2025-07-09)
The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains (2025-07-08)
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate (2025-07-08)
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations (2025-07-07)
Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training (2025-07-07)