Zewei Sun, Shu-Jian Huang, Xin-yu Dai, Jia-Jun Chen
Recent studies show that the attention heads in the Transformer are not equal. We relate this phenomenon to the imbalanced training of multi-head attention and the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, in two variants. Experiments show that translation improvements are achieved on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Machine Translation | IWSLT2015 Vietnamese-English | BLEU | 26.85 | HeadMask (Random-18) |
| Machine Translation | IWSLT2015 Vietnamese-English | BLEU | 26.36 | HeadMask (Impt-18) |
| Machine Translation | WMT2016 Romanian-English | BLEU | 32.95 | HeadMask (Impt-18) |
| Machine Translation | WMT2016 Romanian-English | BLEU | 32.85 | HeadMask (Random-18) |
| Machine Translation | WMT2017 Turkish-English | BLEU | 17.56 | HeadMask (Random-18) |
| Machine Translation | WMT2017 Turkish-English | BLEU | 17.48 | HeadMask (Impt-18) |
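The two variants in the table correspond to masking heads at random (Random) and masking by estimated head importance (Impt); the `-18` suffix appears to be a masking hyperparameter. Below is a minimal PyTorch sketch of the random variant: the function name, tensor layout, and masking granularity are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch, assuming HeadMask zeroes out the outputs of a few
# attention heads during training so the model cannot over-rely on any
# specific head. Names and shapes here are assumptions for illustration.
import torch

def headmask_random(head_outputs: torch.Tensor, n_masked: int,
                    training: bool = True) -> torch.Tensor:
    """Zero out `n_masked` randomly chosen heads for this forward pass.

    head_outputs: (batch, n_heads, seq_len, d_head) per-head attention outputs.
    """
    if not training or n_masked == 0:
        return head_outputs
    n_heads = head_outputs.size(1)
    # Pick which heads to silence at this training step.
    chosen = torch.randperm(n_heads, device=head_outputs.device)[:n_masked]
    mask = torch.ones(n_heads, device=head_outputs.device)
    mask[chosen] = 0.0
    # Broadcast the 0/1 mask over batch, sequence, and feature dimensions.
    return head_outputs * mask.view(1, n_heads, 1, 1)

# Example: 8 heads per layer, mask 2 of them at random each training step.
x = torch.randn(4, 8, 16, 64)
y = headmask_random(x, n_masked=2)
```

The importance-based variant would presumably replace the random draw with a ranking of heads by an estimated importance score; how that score is computed is not specified in this excerpt.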