Papers With Code 2

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Mask Attention Networks: Rethinking and Strengthen Transformer

Zhihao Fan, Yeyun Gong, Dayiheng Liu, Zhongyu Wei, Siyuan Wang, Jian Jiao, Nan Duan, Ruofei Zhang, Xuanjing Huang

2021-03-25 · NAACL 2021
Tasks: Machine Translation, Representation Learning, Abstractive Text Summarization, Text Summarization, Translation
Paper · PDF · Code

Abstract

The Transformer is an attention-based neural network consisting of two sublayers: the Self-Attention Network (SAN) and the Feed-Forward Network (FFN). Existing research has explored enhancing the two sublayers separately to improve the Transformer's capability for text representation. In this paper, we present a novel understanding of SAN and FFN as Mask Attention Networks (MANs) and show that they are two special cases of MANs with static mask matrices. However, these static mask matrices limit the capability for localness modeling in text representation learning. We therefore introduce a new layer, the Dynamic Mask Attention Network (DMAN), with a learnable mask matrix that can model localness adaptively. To combine the advantages of DMAN, SAN, and FFN, we propose a sequential layered structure that stacks the three types of layers. Extensive experiments on various tasks, including neural machine translation and text summarization, demonstrate that our model outperforms the original Transformer.
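To make the masked-attention view concrete, below is a minimal NumPy sketch of attention gated element-wise by a mask matrix: an all-ones mask recovers SAN-style softmax attention, and an identity mask yields the FFN-like case where each position attends only to itself. The function name `mask_attention`, the multiplicative renormalized masking, and the fixed distance-decay "dynamic" mask are illustrative assumptions for exposition, not the paper's exact formulation (which learns its mask end-to-end).

```python
import numpy as np

def mask_attention(Q, K, V, M):
    """Attention whose score matrix is gated element-wise by a mask M.

    Weights are proportional to M * exp(scores), renormalized per row, so
    M = all-ones recovers ordinary softmax attention (the SAN case) and
    M = identity makes each position attend only to itself (the FFN view).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) scaled dot products
    gated = M * np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = gated / gated.sum(axis=-1, keepdims=True) # row-normalized attention
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

san_out = mask_attention(Q, K, V, np.ones((n, n)))      # static all-ones mask (SAN)
ffn_out = mask_attention(Q, K, V, np.eye(n))            # static identity mask (FFN view)

# A DMAN-style layer would *learn* its mask from the input; here a fixed
# distance-decay mask merely illustrates the "localness" idea.
pos = np.arange(n)
local = np.exp(-np.abs(pos[:, None] - pos[None, :]).astype(float))
dman_out = mask_attention(Q, K, V, local)
print(san_out.shape, ffn_out.shape, dman_out.shape)     # (6, 8) each
```

A full DMAN would predict the mask from the layer input (e.g., from positions and content) so that the locality pattern adapts per token; the paper then stacks the three layer types sequentially within each block.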

Results

Task                           | Dataset                  | Metric     | Value | Model
Machine Translation            | IWSLT2014 German-English | BLEU score | 36.3  | Mask Attention Network (small)
Machine Translation            | WMT2014 English-German   | BLEU score | 30.4  | Mask Attention Network (big)
Machine Translation            | WMT2014 English-German   | BLEU score | 29.1  | Mask Attention Network (base)
Text Summarization             | GigaWord                 | ROUGE-1    | 38.28 | Mask Attention Network
Text Summarization             | GigaWord                 | ROUGE-2    | 19.46 | Mask Attention Network
Text Summarization             | GigaWord                 | ROUGE-L    | 35.46 | Mask Attention Network
Text Summarization             | CNN / Daily Mail         | ROUGE-1    | 40.98 | Mask Attention Network
Text Summarization             | CNN / Daily Mail         | ROUGE-2    | 18.29 | Mask Attention Network
Text Summarization             | CNN / Daily Mail         | ROUGE-L    | 37.88 | Mask Attention Network
Abstractive Text Summarization | CNN / Daily Mail         | ROUGE-1    | 40.98 | Mask Attention Network
Abstractive Text Summarization | CNN / Daily Mail         | ROUGE-2    | 18.29 | Mask Attention Network
Abstractive Text Summarization | CNN / Daily Mail         | ROUGE-L    | 37.88 | Mask Attention Network

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)