Papers With Code 2



Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Weight Standardization

General · Introduced 2019 · 12 papers
Source Paper

Description

Weight Standardization (WS) is a normalization technique that smooths the loss landscape by standardizing the weights in convolutional layers. Unlike previous normalization methods, which operate on activations, WS targets the smoothing effect of the weights themselves, going beyond mere length-direction decoupling. Theoretically, WS reduces the Lipschitz constants of the loss and of the gradients; it thereby smooths the loss landscape and improves training.

In Weight Standardization, instead of directly optimizing the loss $\mathcal{L}$ on the original weights $\hat{W}$, we reparameterize the weights $\hat{W}$ as a function of $W$, i.e. $\hat{W} = \text{WS}(W)$, and optimize the loss $\mathcal{L}$ on $W$ by SGD:

$$\hat{W} = \Big[\, \hat{W}_{i,j} \;\Big|\; \hat{W}_{i,j} = \frac{W_{i,j} - \mu_{W_{i,\cdot}}}{\sigma_{W_{i,\cdot}} + \epsilon} \,\Big], \qquad y = \hat{W} * x$$

where

$$\mu_{W_{i,\cdot}} = \frac{1}{I}\sum_{j=1}^{I} W_{i,j}, \qquad \sigma_{W_{i,\cdot}} = \sqrt{\frac{1}{I}\sum_{j=1}^{I}\big(W_{i,j} - \mu_{W_{i,\cdot}}\big)^2}$$

Similar to Batch Normalization, WS controls the first and second moments of the weights of each output channel individually in convolutional layers. Many initialization methods also initialize the weights in similar ways; unlike those methods, however, WS standardizes the weights in a differentiable way, which aims to normalize gradients during back-propagation. Note that no affine transformation is applied to $\hat{W}$. This is because we assume that a normalization layer such as BN or GN will normalize the output of this convolutional layer again.
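The reparameterization above can be sketched in a few lines of plain Python (a minimal illustration, not the authors' reference implementation; it assumes each output channel's weights have been flattened into one row, and `weight_standardize` is a name chosen here for clarity):

```python
import math

def weight_standardize(W, eps=1e-5):
    """Standardize weights per output channel.

    W: list of rows, where row i holds the flattened weights of output
    channel i and plays the role of W_{i,.} in the formulas above.
    Returns W_hat with W_hat[i][j] = (W[i][j] - mu_i) / (sigma_i + eps).
    """
    standardized = []
    for row in W:
        n = len(row)
        mu = sum(row) / n                                       # mu_{W_{i,.}}
        sigma = math.sqrt(sum((w - mu) ** 2 for w in row) / n)  # sigma_{W_{i,.}}
        standardized.append([(w - mu) / (sigma + eps) for w in row])
    return standardized

# Two output channels with 4 weights each (e.g. a flattened 2x2 kernel)
W = [[1.0, 2.0, 3.0, 4.0],
     [10.0, 10.0, 20.0, 20.0]]
W_hat = weight_standardize(W)
```

After the call, each row of `W_hat` has zero mean and (up to the `eps` in the denominator) unit standard deviation, which is exactly the per-output-channel moment control described above. In an actual network this transform would be applied to the conv kernel on every forward pass, so gradients flow through the standardization.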

Papers Using This Method

Addressing Data Heterogeneity in Federated Learning with Adaptive Normalization-Free Feature Recalibration (2024-10-02)
Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks (2024-07-24)
Exploring Loss Functions for Time-based Training Strategy in Spiking Neural Networks (2023-09-21)
Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks (2023-05-26)
Domain Adaptation and Active Learning for Fine-Grained Recognition in the Field of Biodiversity (2021-10-22)
Effect of Pre-Training Scale on Intra- and Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images (2021-05-31)
ReCU: Reviving the Dead Weights in Binary Neural Networks (2021-03-23)
Characterizing signal propagation to close the performance gap in unnormalized ResNets (2021-01-21)
Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals (2021-01-08)
Gradient Centralization: A New Optimization Technique for Deep Neural Networks (2020-04-03)
Big Transfer (BiT): General Visual Representation Learning (2019-12-24)
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization (2019-03-25)