Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Efficient Language Modeling with Sparse all-MLP

Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

Published: 2022-03-14
Tasks: Question Answering · Sentence Completion · Common Sense Reasoning · Zero-Shot Learning · Language Modelling

Abstract

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2$\times$ improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
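The core mechanism the abstract describes — sparsely activating experts so capacity grows while per-token compute stays constant — can be illustrated with a toy top-1 token-routing MoE layer. This is a minimal NumPy sketch of the general technique, not the paper's sMLP implementation; all names (`moe_mlp_layer`, the weight shapes, the deterministic argmax router) are illustrative assumptions.

```python
import numpy as np

def moe_mlp_layer(x, gate_w, expert_w1, expert_w2):
    """Toy sparsely activated MLP layer with top-1 token routing.

    x:         (tokens, d_model) input activations
    gate_w:    (d_model, n_experts) router weights
    expert_w1: (n_experts, d_model, d_hidden) expert up-projections
    expert_w2: (n_experts, d_hidden, d_model) expert down-projections
    """
    logits = x @ gate_w                      # (tokens, n_experts)
    expert_ids = logits.argmax(axis=-1)      # each token picks one expert
    # softmax gate value of the chosen expert scales its output
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for e in range(gate_w.shape[1]):
        idx = np.where(expert_ids == e)[0]
        if idx.size == 0:
            continue                         # expert receives no tokens this step
        h = np.maximum(x[idx] @ expert_w1[e], 0.0)       # ReLU MLP
        out[idx] = probs[idx, e:e + 1] * (h @ expert_w2[e])
    return out

rng = np.random.default_rng(0)
tokens, d_model, d_hidden, n_experts = 8, 16, 32, 4
x = rng.standard_normal((tokens, d_model))
y = moe_mlp_layer(
    x,
    rng.standard_normal((d_model, n_experts)),
    rng.standard_normal((n_experts, d_model, d_hidden)) * 0.1,
    rng.standard_normal((n_experts, d_hidden, d_model)) * 0.1,
)
print(y.shape)  # (8, 16)
```

Because each token flows through exactly one expert MLP, adding experts increases parameter count (capacity) without increasing per-token FLOPs — the property the abstract attributes to sparse all-MLPs. The paper additionally routes in the feature dimension and studies two routing strategies, which this sketch does not cover.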

Results

Task                   | Dataset    | Metric   | Value | Model
-----------------------|------------|----------|-------|------------------------------------
Question Answering     | COPA       | Accuracy | 79    | sMLP – deterministic 9.4B (0-shot)
Question Answering     | COPA       | Accuracy | 76    | GShard 9B
Question Answering     | COPA       | Accuracy | 75    | Switch Transformer 9B
Question Answering     | COPA       | Accuracy | 64    | HASH Layers 10B (0-shot)
Question Answering     | COPA       | Accuracy | 63    | Base Layers 10B (0-shot)
Question Answering     | PIQA       | Accuracy | 73    | sMLP – deterministic 9.4B (0-shot)
Question Answering     | PIQA       | Accuracy | 68.1  | GShard 9B
Question Answering     | PIQA       | Accuracy | 63.8  | Base Layers 10B (0-shot)
Question Answering     | PIQA       | Accuracy | 63.8  | HASH Layers 10B (0-shot)
Question Answering     | StoryCloze | Accuracy | 74.7  | sMLP – deterministic 9.4B (0-shot)
Question Answering     | StoryCloze | Accuracy | 73.3  | Switch Transformer 9B
Question Answering     | StoryCloze | Accuracy | 67.9  | GShard 9B
Question Answering     | StoryCloze | Accuracy | 64.7  | HASH Layers 10B (0-shot)
Question Answering     | StoryCloze | Accuracy | 61.4  | Base Layers 10B (0-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 54.3  | sMLP – deterministic 9.4B (0-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 53.4  | Switch Transformer 9B (0-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 51.7  | HASH Layers 10B (0-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 51.1  | GShard 9B (0-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 51    | Base Layers 10B (0-shot)
Common Sense Reasoning | ReCoRD     | EM       | 79.9  | Switch Transformer 9B
Common Sense Reasoning | ReCoRD     | EM       | 73.4  | sMLP – deterministic 9.4B (0-shot)
Common Sense Reasoning | ReCoRD     | EM       | 72.4  | GShard 9B
Common Sense Reasoning | ReCoRD     | EM       | 67.2  | HASH Layers 10B (0-shot)
Common Sense Reasoning | ReCoRD     | EM       | 60.7  | Base Layers 10B (0-shot)
Sentence Completion    | HellaSwag  | Accuracy | 54.5  | sMLP – deterministic 9.4B (0-shot)
Sentence Completion    | HellaSwag  | Accuracy | 52.5  | Switch Transformer 9B
Sentence Completion    | HellaSwag  | Accuracy | 38    | GShard 9B
Sentence Completion    | HellaSwag  | Accuracy | 33    | HASH Layers 10B (0-shot)
Sentence Completion    | HellaSwag  | Accuracy | 30.2  | Base Layers 10B (0-shot)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)