Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal

2023-04-03 · Question Answering · Coreference Resolution · Common Sense Reasoning · Memorization · Language Modelling

Abstract

How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.
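The abstract's count of 154 checkpoints per model follows from the release's checkpoint schedule as documented in the EleutherAI/pythia repository: initialization (step 0), log-spaced steps 1, 2, 4, …, 512, then every 1,000 steps through the final step 143,000. A minimal sketch enumerating the checkpoint revision names, assuming that schedule and the repository's `stepN` naming convention (verify against the model cards before relying on it):

```python
# Enumerate Pythia checkpoint revision names, assuming the published
# schedule: step 0, log-spaced steps 1..512, then every 1000 steps
# up to the final step 143000.
def pythia_revisions():
    steps = [0] + [2 ** i for i in range(10)]   # 0, 1, 2, 4, ..., 512
    steps += list(range(1000, 143001, 1000))    # 1000, 2000, ..., 143000
    return [f"step{s}" for s in steps]

revisions = pythia_revisions()
print(len(revisions))  # 154, matching the count stated in the abstract
```

Each revision name can then be passed as the `revision` argument to `from_pretrained` in Hugging Face `transformers` to load the corresponding intermediate checkpoint.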

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Question Answering | PIQA | Accuracy | 76.7 | Pythia 12B (5-shot)
Question Answering | PIQA | Accuracy | 76.0 | Pythia 12B (0-shot)
Question Answering | PIQA | Accuracy | 75.2 | Pythia 6.9B (0-shot)
Question Answering | PIQA | Accuracy | 70.4 | Pythia 1B (5-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 66.6 | Pythia 12B (5-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 63.9 | Pythia 12B (0-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 60.9 | Pythia 6.9B (0-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 59.4 | Pythia 2.8B (0-shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 36.8 | Pythia 12B (5-shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 31.8 | Pythia 12B (0-shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 71.5 | Pythia 12B (5-shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 70.2 | Pythia 12B (0-shot)
Language Modelling | LAMBADA | Accuracy | 70.46 | Pythia 12B (0-shot)
Language Modelling | LAMBADA | Accuracy | 67.28 | Pythia 6.9B (0-shot)
Language Modelling | LAMBADA | Perplexity | 3.92 | Pythia 12B (0-shot)
Language Modelling | LAMBADA | Perplexity | 4.45 | Pythia 6.9B (0-shot)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 54.8 | Pythia 12B (0-shot)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 38.5 | Pythia 2.8B (0-shot)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 36.5 | Pythia 6.9B (0-shot)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 36.5 | Pythia 12B (5-shot)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)