Fixup Initialization: Residual Learning Without Normalization

Hongyi Zhang, Yann N. Dauphin, Tengyu Ma

2019-01-27 | ICLR 2019 | Machine Translation, Image Classification, Translation, General Classification

Abstract

Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rates, accelerate convergence, and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge these commonly held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training by properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization, even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.
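The abstract does not spell out what "properly rescaling a standard initialization" means. The sketch below is a minimal illustration under an assumption taken from how the full paper's rule is usually summarized, not from this page: weight layers inside each residual branch are drawn from a standard (He) initialization and scaled by L^(-1/(2m-2)), where L is the number of residual branches and m the number of layers per branch, and the last layer of each branch starts at zero. The function names (`fixup_scale`, `he_normal`, `init_residual_branch`) are hypothetical, and the scalar bias and multiplier parameters that the method also adds are omitted.

```python
import math
import random


def fixup_scale(num_branches, layers_per_branch):
    """Assumed Fixup factor L^(-1/(2m-2)) for weights inside a residual branch."""
    if layers_per_branch < 2:
        raise ValueError("the formula needs at least 2 layers per branch")
    return num_branches ** (-1.0 / (2 * layers_per_branch - 2))


def he_normal(fan_in, fan_out, rng):
    """Standard He-normal initialization: N(0, 2 / fan_in) for each entry."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def init_residual_branch(width, num_branches, layers_per_branch, rng):
    """Return the weight matrices of one residual branch (square layers)."""
    scale = fixup_scale(num_branches, layers_per_branch)
    weights = []
    for i in range(layers_per_branch):
        if i == layers_per_branch - 1:
            # Last layer of the branch starts at zero, so the block is
            # initially the identity mapping.
            w = [[0.0] * width for _ in range(width)]
        else:
            # Other layers: standard initialization, rescaled.
            w = [[scale * v for v in row] for row in he_normal(width, width, rng)]
        weights.append(w)
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    branch = init_residual_branch(width=4, num_branches=16, layers_per_branch=2, rng=rng)
    print("scale:", fixup_scale(16, 2))
    print("layers in branch:", len(branch))
```

With 16 branches of 2 layers each, the factor is 16^(-1/2) = 0.25, so the scaling shrinks as the network gets deeper. Because the last layer of every branch is zero, each residual block initially passes its input through unchanged.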

Results

Task	Dataset	Metric	Value	Model
Image Classification	CIFAR-10	Percentage correct	97.7	WRN + fixup init + mixup + cutout
Image Classification	SVHN	Percentage error	1.4	WRN + fixup init + mixup + cutout
