Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Llama 3 Meets MoE: Efficient Upcycling

Aditya Vavre, Ethan He, Dennis Liu, Zijie Yan, June Yang, Nima Tajbakhsh, Ashwath Aithal

Published: 2024-12-13 · Task: Multi-task Language Understanding · Benchmark: MMLU
Links: Paper · PDF · Code (official)

Abstract

Scaling large language models (LLMs) significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1\%$ of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a $\textbf{2\%}$ improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of $\textbf{46.8\%}$ during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
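The upcycling recipe described above (initializing each of the 8 experts from a pre-trained dense FFN checkpoint and adding a fresh Top-2 router trained from scratch) can be sketched in PyTorch. This is a minimal illustrative sketch, not the paper's NeMo implementation; the class names, layer sizes, and routing loop are all assumptions made for clarity.

```python
# Illustrative sketch of dense-to-MoE "upcycling": experts are copies of a
# pre-trained dense FFN, and only the router starts from random init.
# Names/sizes are hypothetical, not the paper's actual NeMo code.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Stand-in for a pre-trained dense transformer FFN block."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class UpcycledMoE(nn.Module):
    """Top-k MoE layer whose experts start as copies of a dense FFN."""

    def __init__(self, dense: DenseFFN, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        d_model = dense.up.in_features
        # Upcycling: every expert inherits the pre-trained dense weights.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense) for _ in range(num_experts)
        )
        # The router is the only newly initialized component.
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                          # (tokens, experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # per-token top-k
        weights = F.softmax(weights, dim=-1)             # normalize top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out
```

One convenient property of this initialization: because every expert is an exact copy of the dense FFN and the Top-2 routing weights are softmax-normalized to sum to 1, the upcycled layer initially reproduces the dense layer's output, so training starts from the dense checkpoint's quality and lets the experts specialize from there.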

Results

Task                 Dataset  Metric       Value  Model
Transfer Learning    MMLU     Average (%)  86.6   Llama 3.1 (405B)
Transfer Learning    MMLU     Average (%)  86     Llama 3.1 (70B)
Multi-Task Learning  MMLU     Average (%)  86.6   Llama 3.1 (405B)
Multi-Task Learning  MMLU     Average (%)  86     Llama 3.1 (70B)

Related Papers

Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning (2025-07-16)
Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs (2025-07-15)
Lizard: An Efficient Linearization Framework for Large Language Models (2025-07-11)
Integrating External Tools with Large Language Models to Improve Accuracy (2025-07-09)
The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains (2025-07-08)
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate (2025-07-08)
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations (2025-07-07)
Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training (2025-07-07)