Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Layer Normalization

Category: General · Introduced: 2016 · Used in 24985 papers

Source Paper

Description

Unlike batch normalization, Layer Normalization directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for RNNs and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with Transformer models.

We compute the layer normalization statistics over all the hidden units in the same layer as follows:

$$\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_{i}^{l}$$

$$\sigma^{l} = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_{i}^{l}-\mu^{l}\right)^{2}}$$

where $H$ denotes the number of hidden units in a layer. Under layer normalization, all the hidden units in a layer share the same normalization terms $\mu^{l}$ and $\sigma^{l}$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch, and it can be used in the pure online regime with batch size 1.
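The statistics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name, the `eps` stabilizer, and the optional `gain`/`bias` affine parameters (the learned scale and shift from the original paper) are chosen here for clarity; deep learning frameworks expose the same operation as a built-in layer (e.g. `LayerNorm` in PyTorch).

```python
import numpy as np

def layer_norm(a, gain=None, bias=None, eps=1e-5):
    """Normalize the summed inputs `a` (shape: [batch, H]) over the
    hidden dimension. Each row (training case) gets its own mu and
    sigma, so no dependency between training cases is introduced."""
    mu = a.mean(axis=-1, keepdims=True)      # mu^l, one per training case
    sigma = a.std(axis=-1, keepdims=True)    # sigma^l, one per training case
    a_hat = (a - mu) / (sigma + eps)         # eps guards against sigma == 0
    if gain is not None:                     # optional learned rescaling
        a_hat = a_hat * gain
    if bias is not None:                     # optional learned shift
        a_hat = a_hat + bias
    return a_hat

# Works in the pure online regime with batch size 1:
x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)
```

Because the statistics are computed per row along the hidden axis, the result is identical whether the batch contains one example or many, which is exactly the property that distinguishes this from batch normalization.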

Papers Using This Method

- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- DASViT: Differentiable Architecture Search for Vision Transformer (2025-07-17)
- Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
- DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
- Langevin Flows for Modeling Neural Latent Dynamics (2025-07-15)
- Generative Click-through Rate Prediction with Applications to Search Advertising (2025-07-15)
- Biological Processing Units: Leveraging an Insect Connectome to Pioneer Biofidelic Neural Architectures (2025-07-15)
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding (2025-07-15)
- Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
- Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI (2025-07-13)
- Learning from Synthetic Labs: Language Models as Auction Participants (2025-07-12)
- Comparative Analysis of Vision Transformers and Traditional Deep Learning Approaches for Automated Pneumonia Detection in Chest X-Rays (2025-07-11)
- Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems (2025-07-08)
- SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression (2025-07-08)
- Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving (2025-07-08)
- Geo-Registration of Terrestrial LiDAR Point Clouds with Satellite Images without GNSS (2025-07-08)
- Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification (2025-07-08)
- A Wireless Foundation Model for Multi-Task Prediction (2025-07-08)
- Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate (2025-07-08)
- SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model (2025-07-07)