Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui

Published: 2021-12-13
Tasks: Question Answering, Common Sense Reasoning, Language Modelling

Abstract

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
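The core idea in the abstract is that a sparsely activated mixture-of-experts (MoE) layer lets total parameter count grow with the number of experts while per-token compute stays roughly fixed, because each token is routed to only a small subset of experts (GLaM activates 2 per token). The following is a minimal illustrative sketch of top-2 routing; the dimensions, weight shapes, and single-linear experts are assumptions for brevity (GLaM's experts are full feed-forward networks inside Transformer blocks), not the paper's implementation.

```python
import math
import random

random.seed(0)

D = 4        # model (token embedding) dimension
E = 4        # number of experts
TOP_K = 2    # experts activated per token (GLaM routes each token to 2 experts)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    # (rows x cols) matrix times length-cols vector -> length-rows vector
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Total parameters scale with E (every expert owns its weights)...
experts = [rand_matrix(D, D) for _ in range(E)]   # one linear map per expert
gate = rand_matrix(E, D)                          # router producing E logits

def moe_layer(x):
    """Route token x to its TOP_K highest-scoring experts and mix their outputs."""
    logits = matvec(gate, x)                          # one score per expert
    top = sorted(range(E), key=lambda e: logits[e])[-TOP_K:]
    m = max(logits[e] for e in top)
    weights = [math.exp(logits[e] - m) for e in top]  # softmax over chosen experts only
    total = sum(weights)
    out = [0.0] * D
    for w, e in zip(weights, top):                    # ...but only TOP_K experts run per token
        y = matvec(experts[e], x)
        for i in range(D):
            out[i] += (w / total) * y[i]
    return out

y = moe_layer([0.5, -0.2, 0.1, 0.3])
```

Because only TOP_K of the E experts execute for any given token, doubling E roughly doubles parameters (capacity) without changing the per-token FLOPs, which is the mechanism behind GLaM's lower training and inference cost relative to a dense model of the same parameter count.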

Results

Task | Dataset | Metric | Value | Model
Question Answering | Natural Questions | EM | 32.5 | GLaM 62B/64E (Few-Shot)
Question Answering | Natural Questions | EM | 26.3 | GLaM 62B/64E (One-Shot)
Question Answering | Natural Questions | EM | 24.7 | GLaM 62B/64E (Zero-Shot)
Question Answering | WebQuestions | EM | 15.5 | GLaM 62B/64E (Zero-Shot)
Question Answering | TriviaQA | EM | 75.8 | GLaM 62B/64E (One-Shot)
Question Answering | TriviaQA | EM | 75.8 | GLaM 62B/64E (Few-Shot)
Question Answering | TriviaQA | EM | 71.3 | GLaM 62B/64E (Zero-Shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 50.3 | GLaM 64B/64E (Zero-Shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 48.2 | GLaM 64B/64E (One-Shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 74.8 | GLaM 64B/64E (5-Shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 68.0 | GLaM 64B/64E (Zero-Shot)
Language Modelling | LAMBADA | Accuracy | 80.9 | GLaM 62B/64E (One-Shot)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)