
Description

T-Fixup is an initialization method for Transformers that aims to remove the need for layer normalization and warmup. The initialization procedure is as follows:

- Apply Xavier initialization to all parameters except the input embeddings. Use Gaussian initialization $\mathcal{N}\left(0, d^{-\frac{1}{2}}\right)$ for the input embeddings, where $d$ is the embedding dimension.
- Scale the $\mathbf{v}_{d}$ and $\mathbf{w}_{d}$ matrices in each decoder attention block, the weight matrices in each decoder MLP block, and the input embeddings $\mathbf{x}$ and $\mathbf{y}$ in the encoder and decoder by $(9 N)^{-\frac{1}{4}}$.
- Scale the $\mathbf{v}_{e}$ and $\mathbf{w}_{e}$ matrices in each encoder attention block and the weight matrices in each encoder MLP block by $0.67 N^{-\frac{1}{4}}$.
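
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It makes several assumptions that the description does not spell out: $N$ is taken to be the number of layers, $d^{-\frac{1}{2}}$ is treated as the standard deviation of the Gaussian, and the uniform variant of Xavier initialization is used. All helper names (`xavier_uniform`, `embedding_init`, `decoder_scale`, `encoder_scale`, `scaled`) are hypothetical.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform init: U(-b, b) with b = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


def embedding_init(vocab_size, d, rng):
    """Gaussian init for input embeddings.

    Assumption: d^(-1/2) is used as the standard deviation.
    """
    std = d ** -0.5
    return [[rng.gauss(0.0, std) for _ in range(d)]
            for _ in range(vocab_size)]


def decoder_scale(num_layers):
    """(9N)^(-1/4): decoder attention/MLP weights and the input embeddings."""
    return (9 * num_layers) ** -0.25


def encoder_scale(num_layers):
    """0.67 * N^(-1/4): encoder attention/MLP weights."""
    return 0.67 * num_layers ** -0.25


def scaled(matrix, factor):
    """Return a copy of matrix with every entry multiplied by factor."""
    return [[factor * value for value in row] for row in matrix]


rng = random.Random(0)
d, N = 8, 6  # toy embedding dimension and layer count (assumed values)

w_e = scaled(xavier_uniform(d, d, rng), encoder_scale(N))  # an encoder matrix
w_d = scaled(xavier_uniform(d, d, rng), decoder_scale(N))  # a decoder matrix
emb = scaled(embedding_init(10, d, rng), decoder_scale(N))  # input embeddings
```

In a real model the same two factors would be applied only to the parameter groups listed above, leaving the remaining parameters at their Xavier-initialized values.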

T-Fixup