Apply Xavier initialization for all parameters excluding input embeddings. Use Gaussian initialization $\mathcal{N}(0, d^{-1/2})$ for input embeddings, where $d$ is the embedding dimension.
Scale $v_d$ and $w_d$ matrices in each decoder attention block, weight matrices in each decoder MLP block and input embeddings $x$ and $y$ in encoder and decoder by $(9N)^{-1/4}$.
Scale $v_e$ and $w_e$ matrices in each encoder attention block and weight matrices in each encoder MLP block by $0.67N^{-1/4}$.
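
The scaling factors in the steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes $N$ is the number of layers in the respective encoder or decoder stack, it treats the second argument of the Gaussian as a standard deviation, and all function names are hypothetical. The Xavier step is written as the usual uniform variant on nested lists so the example has no dependencies.

```python
import math
import random


def embedding_std(d):
    # Spread of the Gaussian used for input embeddings: d^(-1/2).
    # Assumption: the second argument of N(0, .) is read as a standard deviation.
    return d ** -0.5


def decoder_scale(n_layers):
    # Factor for decoder attention/MLP weights and input embeddings: (9N)^(-1/4).
    # Assumption: N is the number of layers in the stack.
    return (9 * n_layers) ** -0.25


def encoder_scale(n_layers):
    # Factor for encoder attention/MLP weights: 0.67 * N^(-1/4).
    return 0.67 * n_layers ** -0.25


def xavier_uniform(fan_in, fan_out, scale=1.0, rng=random):
    # Xavier (Glorot) uniform initialization followed by a multiplicative scale.
    # Returns a fan_out x fan_in matrix as a list of rows.
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [scale * rng.uniform(-bound, bound) for _ in range(fan_in)]
        for _ in range(fan_out)
    ]


def gaussian_embedding(vocab_size, d, scale=1.0, rng=random):
    # Gaussian-initialized embedding table, then scaled.
    std = embedding_std(d)
    return [
        [scale * rng.gauss(0.0, std) for _ in range(d)]
        for _ in range(vocab_size)
    ]
```

For example, `xavier_uniform(512, 512, scale=decoder_scale(6))` would produce a scaled decoder weight matrix for a hypothetical 6-layer decoder, and `gaussian_embedding(1000, 512, scale=decoder_scale(6))` a scaled embedding table. Note that deeper stacks get smaller factors, since both scales shrink as $N^{-1/4}$.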