Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Filter Response Normalization

General · Introduced 2019 · 2 papers
Source Paper

Description

Filter Response Normalization (FRN) is a type of normalization that combines normalization and an activation function, which can be used as a replacement for other normalizations and activations. It operates on each activation channel of each batch element independently, eliminating the dependency on other batch elements.

To demonstrate, assume we are dealing with a feed-forward convolutional neural network. We follow the usual convention that the filter responses (activation maps) produced after a convolution operation form a 4D tensor $X$ with shape $[B, W, H, C]$, where $B$ is the mini-batch size, $W, H$ are the spatial extents of the map, and $C$ is the number of filters used in the convolution. $C$ is also referred to as the number of output channels. Let $x = X_{b,:,:,c} \in \mathcal{R}^{N}$, where $N = W \times H$, be the vector of filter responses for the $c^{th}$ filter of the $b^{th}$ batch element. Let $\nu^2 = \sum_i x_i^2 / N$ be the mean squared norm of $x$.

Then Filter Response Normalization is defined as the following:

$$\hat{x} = \frac{x}{\sqrt{\nu^2 + \epsilon}},$$

where $\epsilon$ is a small positive constant to prevent division by zero.
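As a sketch, the normalization step can be written in a few lines of NumPy (the function name and `eps` value are illustrative, assuming the $[B, W, H, C]$ layout above):

```python
import numpy as np

def frn_normalize(x, eps=1e-6):
    """Filter Response Normalization (normalization step only).

    x: activations of shape [B, W, H, C]. The mean squared norm nu^2 is
    computed independently per batch element and per channel, over the
    spatial dimensions only -- no dependence on other batch elements.
    """
    nu2 = np.mean(np.square(x), axis=(1, 2), keepdims=True)  # shape [B, 1, 1, C]
    return x / np.sqrt(nu2 + eps)

x = np.random.randn(2, 4, 4, 8)
x_hat = frn_normalize(x)
# After normalization, the mean squared value of each spatial map is ~1.
```

Because `nu2` is reduced only over the spatial axes, each `(batch, channel)` slice is normalized on its own, which is exactly the batch-independence property described above.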

A lack of mean centering in FRN can lead to activations having an arbitrary bias away from zero. Such a bias, in conjunction with ReLU, can have a detrimental effect on learning and lead to poor performance and dead units. To address this, the authors apply a learned per-channel affine transform $y = \gamma\hat{x} + \beta$ and augment ReLU with a learned threshold $\tau$, yielding the Thresholded Linear Unit (TLU):

$$z = \max(y, \tau)$$

Since $\max(y, \tau) = \max(y - \tau, 0) + \tau = \text{ReLU}(y - \tau) + \tau$, the effect of this activation is the same as having a shared bias before and after ReLU.
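Putting the pieces together, a minimal NumPy sketch of the full FRN layer (normalization, learned affine transform, then TLU) might look like the following; the parameter values passed in the example are illustrative initializations, not learned weights:

```python
import numpy as np

def frn_layer(x, gamma, beta, tau, eps=1e-6):
    """Filter Response Normalization followed by the TLU activation.

    x: activations of shape [B, W, H, C].
    gamma, beta, tau: learned per-channel parameters of shape [C].
    """
    nu2 = np.mean(np.square(x), axis=(1, 2), keepdims=True)
    x_hat = x / np.sqrt(nu2 + eps)   # normalization step
    y = gamma * x_hat + beta         # learned affine transform
    return np.maximum(y, tau)        # TLU: z = max(y, tau)

B, W, H, C = 2, 4, 4, 3
x = np.random.randn(B, W, H, C)
z = frn_layer(x, gamma=np.ones(C), beta=np.zeros(C), tau=np.zeros(C))
# With tau = 0 the TLU reduces to a plain ReLU, so z is non-negative.
```

The identity above also means the TLU costs only one extra per-channel parameter over ReLU: $\max(y, \tau)$ is just ReLU applied to a shifted input, with the shift added back afterwards.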

Papers Using This Method

- Deceiving computers in Reverse Turing Test through Deep Learning (2020-06-01)
- Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks (2019-11-21)