Description
Pythia is a suite of decoder-only autoregressive language models, ranging in size from 70M to 12B parameters, all trained on public data seen in the exact same order. The model architecture and hyperparameters largely follow GPT-3, with a few notable deviations based on recent advances in best practices for large-scale language modeling.
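All Pythia models, along with their intermediate training checkpoints, are published on the Hugging Face Hub under the EleutherAI organization. A minimal sketch of loading one with the transformers library is shown below; the repository name and the step-tagged revision scheme follow the public Pythia model cards, but treat the specific step number as an illustrative choice.

```python
# Minimal sketch: loading a Pythia model and one of its intermediate
# training checkpoints with Hugging Face transformers. Repository names
# follow the public EleutherAI model cards; the revision "step3000" is
# one of the published checkpoint tags, used here purely for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"

# Final model weights.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# An intermediate checkpoint: each Pythia repository tags training steps
# as branches named "step0", "step1000", ..., "step143000".
checkpoint = AutoModelForCausalLM.from_pretrained(model_name, revision="step3000")

# Quick smoke test: greedy generation from a short prompt.
inputs = tokenizer("The Pythia suite was trained on", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

Because every model in the suite saw the same data in the same order, swapping `model_name` for a larger variant (e.g. "EleutherAI/pythia-12b") at the same revision yields a directly comparable checkpoint, which is what makes the suite useful for studying training dynamics across scale.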
Papers Using This Method
- LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data (2025-06-17)
- What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers (2025-06-16)
- Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs (2025-05-28)
- Pretraining Language Models to Ponder in Continuous Space (2025-05-27)
- Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning (2025-05-16)
- Memorization or Interpolation? Detecting LLM Memorization through Input Perturbation Analysis (2025-05-05)
- An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models (2025-03-23)
- PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs (2025-03-12)
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? (2025-03-12)
- Interrogating LLM design under a fair learning doctrine (2025-02-22)
- Revisiting Privacy, Utility, and Efficiency Trade-offs when Fine-Tuning Large Language Models (2025-02-18)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models (2025-02-13)
- MemHunter: Automated and Verifiable Memorization Detection at Dataset-scale in LLMs (2024-12-10)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning (2024-11-21)
- Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM (2024-11-03)
- Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA (2024-10-28)
- Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups (2024-10-28)
- Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training (2024-10-20)
- Tending Towards Stability: Convergence Challenges in Small Language Models (2024-10-15)
- Context-Parametric Inversion: Why Instruction Finetuning Can Worsen Context Reliance (2024-10-14)