Papers With Code 2 | ML Benchmarks, SotA Results & Code

Description

Spatial Gating Unit, or SGU, is a gating unit used in the gMLP architecture to capture spatial interactions. To enable cross-token interactions, the layer $s(\cdot)$ must contain a contraction operation over the spatial dimension. The layer $s(\cdot)$ is formulated as the output of linear gating:

$$s(Z) = Z \odot f_{W, b}(Z)$$

where $\odot$ denotes element-wise multiplication. For training stability, the authors find it critical to initialize $W$ to near-zero values and $b$ to ones, so that $f_{W, b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. This initialization ensures that each gMLP block behaves like a regular FFN early in training, where each token is processed independently, and only gradually injects spatial information across tokens as learning proceeds.
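The effect of this initialization can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes that $f_{W, b}(Z) = WZ + b$ is a linear projection along the token axis, with $W$ of shape $n \times n$ for sequence length $n$ (this form is not stated in the text above). The helper names `init_spatial_params` and `spatial_gate` and the `scale` value are hypothetical choices for illustration.

```python
import numpy as np

def init_spatial_params(n, rng, scale=1e-3):
    """Return near-zero W (n x n) and all-ones b (n x 1).

    `scale` is an illustrative value, not one taken from the source.
    """
    W = rng.normal(0.0, scale, size=(n, n))
    b = np.ones((n, 1))
    return W, b

def spatial_gate(Z, W, b):
    """s(Z) = Z * f(Z), assuming f(Z) = W @ Z + b.

    Z has shape (n, d): n tokens, d channels. W @ Z contracts over the
    token (spatial) axis, which is what lets tokens interact.
    """
    return Z * (W @ Z + b)

rng = np.random.default_rng(0)
n, d = 8, 4
Z = rng.normal(size=(n, d))
W, b = init_spatial_params(n, rng)

out = spatial_gate(Z, W, b)
# At initialization the gate is close to 1, so the output stays close to Z.
max_dev = float(np.max(np.abs(out - Z)))
```

With $W$ exactly zero the gate equals one everywhere and the unit is the identity, so `max_dev` is a rough measure of how much spatial mixing the small random $W$ already introduces.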

The authors find it more effective to split $Z$ into two independent parts $\left(Z_{1}, Z_{2}\right)$ along the channel dimension, with $Z_{1}$ serving as the multiplicative bypass and $Z_{2}$ as the input to the gating function:

$$s(Z) = Z_{1} \odot f_{W, b}\left(Z_{2}\right)$$

They also normalize the input to $f_{W, b}$, which empirically improved the stability of large NLP models.
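The split variant with a normalized gate input might be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: it assumes $f_{W, b}(Z_2) = W Z_2 + b$ acting along the token axis, and it assumes the normalization is a layer norm over channels without learned scale or shift (the text only says the input is normalized). The names `layer_norm` and `split_spatial_gate` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channels (assumed form of the normalization)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def split_spatial_gate(Z, W, b):
    """s(Z) = Z1 * f(Z2), with Z split in half along the channel axis.

    Z has shape (n, d) with d even; the output has shape (n, d // 2).
    """
    Z1, Z2 = np.split(Z, 2, axis=-1)
    gate = W @ layer_norm(Z2) + b  # contraction over the token axis
    return Z1 * gate

rng = np.random.default_rng(0)
n, d = 8, 6
Z = rng.normal(size=(n, d))

# Idealized initialization: W exactly zero and b all ones.
W = np.zeros((n, n))
b = np.ones((n, 1))

out = split_spatial_gate(Z, W, b)
# With this initialization the unit simply passes Z1 through.
```

Note that the output has half as many channels as the input, since only $Z_{1}$ is passed through the multiplicative bypass.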
