
Online Normalization

General · Introduced 2019 · 3 papers

Source Paper: Online Normalization for Training Neural Networks (2019)

Description

Online Normalization is a normalization technique for training deep neural networks. To define Online Normalization, we replace arithmetic averages over the full dataset with exponentially decaying averages of online samples. The decay factors $\alpha_{f}$ and $\alpha_{b}$ for the forward and backward passes, respectively, are hyperparameters of the technique.

We allow incoming samples $x_{t}$, such as images, to have multiple scalar components and denote the feature-wide mean and variance by $\mu\left(x_{t}\right)$ and $\sigma^{2}\left(x_{t}\right)$. The algorithm also applies to outputs of fully connected layers with only one scalar output per feature; in that case it simplifies to $\mu\left(x_{t}\right) = x_{t}$ and $\sigma\left(x_{t}\right) = 0$. Let the scalars $\mu_{t}$ and $\sigma_{t}$ denote running estimates of the mean and variance across all samples, where the subscript $t$ indexes the time steps at which new incoming samples are processed.

Online Normalization uses an ongoing process during the forward pass to estimate activation means and variances. It implements the standard online computation of mean and variance generalized to processing multi-value samples and exponential averaging of sample statistics. The resulting estimates directly lead to an affine normalization transform.

$$y_{t} = \frac{x_{t} - \mu_{t-1}}{\sigma_{t-1}}$$

$$\mu_{t} = \alpha_{f}\,\mu_{t-1} + \left(1-\alpha_{f}\right)\mu\left(x_{t}\right)$$

$$\sigma^{2}_{t} = \alpha_{f}\,\sigma^{2}_{t-1} + \left(1-\alpha_{f}\right)\sigma^{2}\left(x_{t}\right) + \alpha_{f}\left(1-\alpha_{f}\right)\left(\mu\left(x_{t}\right) - \mu_{t-1}\right)^{2}$$
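The forward-pass update above can be sketched in a few lines of NumPy. This is a minimal, per-feature illustration of the three equations, not the authors' reference implementation; the function name, argument names, and default decay value are assumptions for the example.

```python
import numpy as np

def online_normalize(x_t, mu, var, alpha_f=0.99):
    """One forward step of Online Normalization for a single feature.

    x_t : array of the scalar components of the incoming sample
          (a single scalar for a fully connected output, in which case
          mu(x_t) = x_t and sigma^2(x_t) = 0, as the text notes).
    mu, var : running estimates mu_{t-1}, sigma^2_{t-1}.
    alpha_f : forward-pass decay factor (hyperparameter).
    """
    x_t = np.asarray(x_t, dtype=float)
    # Normalize using the *previous* running statistics (t-1).
    y_t = (x_t - mu) / np.sqrt(var)
    # Per-sample statistics mu(x_t) and sigma^2(x_t).
    mu_x = x_t.mean()
    var_x = x_t.var()
    # Exponentially decaying updates of the running mean and variance.
    mu_new = alpha_f * mu + (1 - alpha_f) * mu_x
    var_new = (alpha_f * var
               + (1 - alpha_f) * var_x
               + alpha_f * (1 - alpha_f) * (mu_x - mu) ** 2)
    return y_t, mu_new, var_new
```

Fed a stream of samples, the running estimates track the sample distribution, so the normalized outputs approach zero mean and unit variance without ever averaging over the full dataset.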

Papers Using This Method

- One model to enhance them all: array geometry agnostic multi-channel personalized speech enhancement (2021-10-20)
- Pipelined Backpropagation at Scale: Training Large Models without Batches (2020-03-25)
- Online Normalization for Training Neural Networks (2019-05-15)