Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ST-MoE: Designing Stable and Transferable Sparse Expert Models

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

Published: 2022-02-17
Tasks: Winogrande, Question Answering, Coreference Resolution, Natural Language Inference, Common Sense Reasoning, Natural Questions, Transfer Learning, Word Sense Disambiguation
Links: Paper · PDF · Code (official)

Abstract

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).

Results

Task | Dataset | Metric | Value | Model
Question Answering | COPA | Accuracy | 99.2 | ST-MoE-32B 269B (fine-tuned)
Question Answering | COPA | Accuracy | 91 | ST-MoE-L 4.1B (fine-tuned)
Question Answering | MultiRC | F1 | 89.6 | ST-MoE-32B 269B (fine-tuned)
Question Answering | MultiRC | F1 | 86 | ST-MoE-L 4.1B (fine-tuned)
Question Answering | BoolQ | Accuracy | 92.4 | ST-MoE-32B 269B (fine-tuned)
Question Answering | BoolQ | Accuracy | 88.6 | ST-MoE-L 4.1B (fine-tuned)
Common Sense Reasoning | WinoGrande | Accuracy | 96.1 | ST-MoE-32B 269B (fine-tuned)
Common Sense Reasoning | WinoGrande | Accuracy | 81.7 | ST-MoE-L 4.1B (fine-tuned)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 86.5 | ST-MoE-32B 269B (fine-tuned)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 56.9 | ST-MoE-L 4.1B (fine-tuned)
Common Sense Reasoning | ARC (Easy) | Accuracy | 95.2 | ST-MoE-32B 269B (fine-tuned)
Common Sense Reasoning | ARC (Easy) | Accuracy | 75.4 | ST-MoE-L 4.1B (fine-tuned)
Common Sense Reasoning | ReCoRD | EM | 95.1 | ST-MoE-32B 269B (fine-tuned)
Common Sense Reasoning | ReCoRD | EM | 88.9 | ST-MoE-L 4.1B (fine-tuned)
Word Sense Disambiguation | Words in Context | Accuracy | 77.7 | ST-MoE-32B 269B (fine-tuned)
Word Sense Disambiguation | Words in Context | Accuracy | 74 | ST-MoE-L 4.1B (fine-tuned)
Natural Language Inference | CommitmentBank | Accuracy | 98.2 | ST-MoE-L 4.1B (fine-tuned)
Natural Language Inference | CommitmentBank | Accuracy | 98 | ST-MoE-32B 269B (fine-tuned)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 96.6 | ST-MoE-32B 269B (fine-tuned)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 93.3 | ST-MoE-L 4.1B (fine-tuned)

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)