Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Reducing Transformer Depth on Demand with Structured Dropout

Angela Fan, Edouard Grave, Armand Joulin

2019-09-25 · ICLR 2020

Tasks: Machine Translation, Question Answering, Translation, Open-Domain Question Answering, Language Modelling

Abstract

Overparameterized transformer networks have obtained state of the art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality compared to training from scratch or using distillation.
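The core idea described in the abstract — skip whole layers at random during training, then prune to a sub-network at inference without finetuning — can be sketched in a few lines. This is a minimal illustration of the mechanism, not the paper's implementation: the function names, the plain-function "layers", and the every-other-layer pruning rule are assumptions for the example (the paper applies LayerDrop to transformer layers with residual connections and compares several pruning strategies).

```python
import random

def layerdrop_forward(x, layers, drop_prob=0.2, training=True):
    """Apply a stack of layers, skipping each one entirely with
    probability drop_prob during training (structured dropout over
    layers, i.e. the LayerDrop idea)."""
    for layer in layers:
        if training and random.random() < drop_prob:
            continue  # drop the whole layer, not individual units
        x = layer(x)
    return x

def prune_every_other(layers, keep_rate=0.5):
    """At inference, select a sub-network by keeping layers at a fixed
    stride -- one simple pruning rule; the paper explores several."""
    stride = max(1, round(1 / keep_rate))
    return [layer for i, layer in enumerate(layers) if i % stride == 0]
```

Because training already exposed the network to every depth, the pruned stack returned by `prune_every_other` can be used directly with `training=False`, with no finetuning step in between.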

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Question Answering | ELI5 | Rouge-1 | 29.4 | Transformer Multitask + LayerDrop |
| Question Answering | ELI5 | Rouge-2 | 5.5 | Transformer Multitask + LayerDrop |
| Question Answering | ELI5 | Rouge-L | 23.4 | Transformer Multitask + LayerDrop |
| Open-Domain Question Answering | ELI5 | Rouge-1 | 29.4 | Transformer Multitask + LayerDrop |
| Open-Domain Question Answering | ELI5 | Rouge-2 | 5.5 | Transformer Multitask + LayerDrop |
| Open-Domain Question Answering | ELI5 | Rouge-L | 23.4 | Transformer Multitask + LayerDrop |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)