GLU Variants Improve Transformer

Noam Shazeer

2020-02-12

Paper PDF Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code Code

Abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.