Zewei Sun, Shu-Jian Huang, Xin-yu Dai, Jia-Jun Chen
Recent studies show that the attention heads in the Transformer are not equal. We relate this phenomenon to the imbalanced training of multi-head attention and the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, in two variants. Experiments show that translation improvements are achieved on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Machine Translation | IWSLT2015 Vietnamese-English | BLEU | 26.85 | HeadMask (Random-18) |
| Machine Translation | IWSLT2015 Vietnamese-English | BLEU | 26.36 | HeadMask (Impt-18) |
| Machine Translation | WMT2016 Romanian-English | BLEU | 32.95 | HeadMask (Impt-18) |
| Machine Translation | WMT2016 Romanian-English | BLEU | 32.85 | HeadMask (Random-18) |
| Machine Translation | WMT2017 Turkish-English | BLEU | 17.56 | HeadMask (Random-18) |
| Machine Translation | WMT2017 Turkish-English | BLEU | 17.48 | HeadMask (Impt-18) |
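The two variants in the table correspond to masking heads at random (Random) and masking by estimated head importance (Impt); the `-18` suffix appears to be a masking hyperparameter. Below is a minimal PyTorch sketch of the random variant: the function name, tensor layout, and masking granularity are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch, assuming HeadMask zeroes out the outputs of a few
# attention heads during training so the model cannot over-rely on any
# specific head. Names and shapes here are assumptions for illustration.
import torch

def headmask_random(head_outputs: torch.Tensor, n_masked: int,
                    training: bool = True) -> torch.Tensor:
    """Zero out `n_masked` randomly chosen heads for this forward pass.

    head_outputs: (batch, n_heads, seq_len, d_head) per-head attention outputs.
    """
    if not training or n_masked == 0:
        return head_outputs
    n_heads = head_outputs.size(1)
    # Pick which heads to silence at this training step.
    chosen = torch.randperm(n_heads, device=head_outputs.device)[:n_masked]
    mask = torch.ones(n_heads, device=head_outputs.device)
    mask[chosen] = 0.0
    # Broadcast the 0/1 mask over batch, sequence, and feature dimensions.
    return head_outputs * mask.view(1, n_heads, 1, 1)

# Example: 8 heads per layer, mask 2 of them at random each training step.
x = torch.randn(4, 8, 16, 64)
y = headmask_random(x, n_masked=2)
```

The importance-based variant would presumably replace the random draw with a ranking of heads by an estimated importance score; how that score is computed is not specified in this excerpt.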