Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GPT-2

Natural Language Processing · Introduced 2019 · 768 papers
Source Paper

Description

GPT-2 is a Transformer architecture that was notable for its size (1.5 billion parameters) at the time of its release. The model is pretrained on WebText, a dataset of text scraped from 45 million web links. It largely follows the previous GPT architecture, with some modifications:

  • Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization is added after the final self-attention block.

  • A modified initialization accounts for the accumulation along the residual path with model depth: the weights of residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers.

  • The vocabulary is expanded to 50,257 tokens. The context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
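The pre-LN placement and the 1/√N residual scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not GPT-2's actual implementation: the function names and the toy single-projection sub-layer are hypothetical, and learned layer-norm gain/bias parameters are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned gain/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_sublock(x, sublayer):
    # GPT-2-style pre-norm: LayerNorm is applied to the *input* of the
    # sub-block, and the residual adds the un-normalized input back.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d_model = 8
n_residual_layers = 24  # N: number of residual layers in the stack

# Toy "sub-layer": a single linear projection whose weights are scaled
# by 1/sqrt(N) at initialization, per the modification above.
W = rng.standard_normal((d_model, d_model)) / np.sqrt(n_residual_layers)

x = rng.standard_normal((4, d_model))  # a batch of 4 token vectors
y = pre_ln_sublock(x, lambda h: h @ W)
print(y.shape)  # (4, 8)
```

In the real model each residual layer is a full attention or MLP sub-block rather than one projection; the point of the sketch is only where the normalization sits and how the residual weights are scaled.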

Papers Using This Method

  • The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements (2025-06-27)
  • M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models (2025-06-17)
  • Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization (2025-06-12)
  • A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
  • Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models (2025-06-08)
  • Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective (2025-06-05)
  • Rethinking the effects of data contamination in Code Intelligence (2025-06-03)
  • An Exploratory Framework for Future SETI Applications: Detecting Generative Reactivity via Language Models (2025-06-03)
  • How Neural Networks Organize Concepts: Introducing Concept Trajectory Analysis for Deep Learning Interpretability (2025-06-01)
  • Power-of-Two (PoT) Weights in Large Language Models (LLMs) (2025-05-31)
  • Matryoshka Model Learning for Improved Elastic Student Models (2025-05-29)
  • Privacy-Preserving Chest X-ray Report Generation via Multimodal Federated Learning with ViT and GPT-2 (2025-05-27)
  • Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents (2025-05-26)
  • Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language (2025-05-26)
  • ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining (2025-05-26)
  • Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models (2025-05-24)
  • The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm (2025-05-22)
  • AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training (2025-05-22)
  • Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders (2025-05-20)
  • Scaling Laws for State Dynamics in Large Language Models (2025-05-20)