Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GPT-2

Natural Language Processing · Introduced 2019 · 768 papers
Source Paper

Description

GPT-2 is a Transformer architecture that was notable for its size (1.5 billion parameters) at the time of its release. The model is pretrained on WebText, a dataset of text scraped from 45 million web links. It largely follows the previous GPT architecture, with some modifications:

  • Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block.

  • A modified initialization accounts for the accumulation along the residual path with model depth: the weights of residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers.

  • The vocabulary is expanded to 50,257 tokens. The context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
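The pre-LN placement and the 1/√N residual scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not GPT-2's actual implementation: the function names and the toy single-projection sub-layer are hypothetical, and learned layer-norm gain/bias parameters are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned gain/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_sublock(x, sublayer):
    # GPT-2-style pre-norm: LayerNorm is applied to the *input* of the
    # sub-block, and the residual adds the un-normalized input back.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d_model = 8
n_residual_layers = 24  # N: number of residual layers in the stack

# Toy "sub-layer": a single linear projection whose weights are scaled
# by 1/sqrt(N) at initialization, per the modification above.
W = rng.standard_normal((d_model, d_model)) / np.sqrt(n_residual_layers)

x = rng.standard_normal((4, d_model))  # a batch of 4 token vectors
y = pre_ln_sublock(x, lambda h: h @ W)
print(y.shape)  # (4, 8)
```

In the real model each residual layer is a full attention or MLP sub-block rather than one projection; the point of the sketch is only where the normalization sits and how the residual weights are scaled.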

Papers Using This Method

  • The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements (2025-06-27)
  • M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models (2025-06-17)
  • Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization (2025-06-12)
  • A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
  • Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models (2025-06-08)
  • Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective (2025-06-05)
  • Rethinking the effects of data contamination in Code Intelligence (2025-06-03)
  • An Exploratory Framework for Future SETI Applications: Detecting Generative Reactivity via Language Models (2025-06-03)
  • How Neural Networks Organize Concepts: Introducing Concept Trajectory Analysis for Deep Learning Interpretability (2025-06-01)
  • Power-of-Two (PoT) Weights in Large Language Models (LLMs) (2025-05-31)
  • Matryoshka Model Learning for Improved Elastic Student Models (2025-05-29)
  • Privacy-Preserving Chest X-ray Report Generation via Multimodal Federated Learning with ViT and GPT-2 (2025-05-27)
  • Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents (2025-05-26)
  • Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language (2025-05-26)
  • ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining (2025-05-26)
  • Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models (2025-05-24)
  • The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm (2025-05-22)
  • AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training (2025-05-22)
  • Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders (2025-05-20)
  • Scaling Laws for State Dynamics in Large Language Models (2025-05-20)