Understanding and Improving Layer Normalization

Jingjing Xu, Xu sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin

2019-11-16NeurIPS 2019 12Machine Translation Translation

Abstract

Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many of previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization by re-centering and re-scaling backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains the state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), by replacing the bias and gain with a new transformation function. Experiments show that AdaNorm demonstrates better results than LayerNorm on seven out of eight datasets.

Results

Task	Dataset	Metric	Value	Model
Machine Translation	IWSLT2015 English-Vietnamese	BLEU	31.4	Transformer+LayerNorm-simple

Related Papers

A Translation of Probabilistic Event Calculus into Markov Decision Processes2025-07-17 Function-to-Style Guidance of LLMs for Code Translation2025-07-15 Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation2025-07-09 Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings2025-07-09 Unconditional Diffusion for Generative Sequential Recommendation2025-07-08 GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation2025-07-04 TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation2025-07-01 CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation2025-06-29