Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

T-Fixup

General · Introduced 2020 · 2 papers
Source Paper

Description

T-Fixup is an initialization method for Transformers that aims to remove the need for layer normalization and warmup. The initialization procedure is as follows:

  • Apply Xavier initialization for all parameters excluding input embeddings. Use Gaussian initialization $\mathcal{N}\left(0, d^{-\frac{1}{2}}\right)$ for input embeddings, where $d$ is the embedding dimension.
  • Scale the $\mathbf{v}_{d}$ and $\mathbf{w}_{d}$ matrices in each decoder attention block, the weight matrices in each decoder MLP block, and the input embeddings $\mathbf{x}$ and $\mathbf{y}$ in the encoder and decoder by $(9N)^{-\frac{1}{4}}$, where $N$ is the number of transformer blocks.
  • Scale the $\mathbf{v}_{e}$ and $\mathbf{w}_{e}$ matrices in each encoder attention block and the weight matrices in each encoder MLP block by $0.67 N^{-\frac{1}{4}}$.
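The scaling steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it initializes one square matrix per role using the names from the bullets ($\mathbf{v}_d$, $\mathbf{w}_d$, $\mathbf{v}_e$, $\mathbf{w}_e$), and the `vocab_size` parameter and the single-matrix-per-block simplification are assumptions for the example.

```python
import numpy as np

def t_fixup_init(d_model, n_layers, vocab_size=1000, seed=0):
    """Sketch of T-Fixup initialization for a few representative parameters."""
    rng = np.random.default_rng(seed)

    # Xavier (Glorot) uniform bound for a square d_model x d_model matrix.
    limit = np.sqrt(6.0 / (d_model + d_model))
    xavier = lambda: rng.uniform(-limit, limit, size=(d_model, d_model))

    # Scaling factors from the procedure above (N = number of blocks).
    dec_scale = (9 * n_layers) ** -0.25       # (9N)^{-1/4} for decoder side
    enc_scale = 0.67 * n_layers ** -0.25      # 0.67 N^{-1/4} for encoder side

    # Input embeddings: Gaussian with variance d^{-1/2}, i.e. std d^{-1/4},
    # then scaled by (9N)^{-1/4} together with the decoder-side weights.
    embed = rng.normal(0.0, d_model ** -0.25,
                       size=(vocab_size, d_model)) * dec_scale

    return {
        "embed": embed,
        "v_d": xavier() * dec_scale,  # decoder attention value projection
        "w_d": xavier() * dec_scale,  # decoder attention output projection
        "v_e": xavier() * enc_scale,  # encoder attention value projection
        "w_e": xavier() * enc_scale,  # encoder attention output projection
    }

params = t_fixup_init(d_model=64, n_layers=6)
```

With layer normalization and warmup removed, these shrunken initial weights are what keep the early updates bounded.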

Papers Using This Method

  • Optimizing Deeper Transformers on Small Datasets (2020-12-30)
  • Improving Transformer Optimization Through Better Initialization (2020-01-01)