Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Synthesizer: Rethinking Self-Attention in Transformer Models

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Published: 2020-05-02

Tasks: Machine Translation, Text Generation, Abstractive Text Summarization, Dialogue Generation, Document Summarization, Translation, Semantic Textual Similarity, Linguistic Acceptability, Language Modelling

Abstract

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that the simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding-only tasks.
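The abstract describes two synthetic-attention variants: a Dense Synthesizer, which predicts attention weights from each token alone (no query-key dot products), and a Random Synthesizer, whose weights are a learned matrix independent of the input. The following NumPy sketch illustrates both ideas under illustrative assumptions; the shapes, single-head setup, and initialization are simplifications, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, b1, W2, b2, V_proj):
    """Dense Synthesizer (sketch): each token's row of attention weights
    is produced by a two-layer feed-forward map d_model -> seq_len,
    so no token-token (query-key) interaction is ever computed."""
    B = softmax(np.maximum(X @ W1 + b1, 0.0) @ W2 + b2)  # (L, L) weights
    return B @ (X @ V_proj)

def random_synthesizer(X, R, V_proj):
    """Random Synthesizer (sketch): the attention matrix is a learned
    (here: fixed) parameter R shared across all inputs -- the weights
    do not depend on X at all."""
    return softmax(R) @ (X @ V_proj)

rng = np.random.default_rng(0)
L, d = 4, 8  # sequence length and model width (illustrative)
X = rng.normal(size=(L, d))

out_dense = dense_synthesizer(
    X,
    rng.normal(size=(d, d)), np.zeros(d),   # hidden layer
    rng.normal(size=(d, L)), np.zeros(L),   # projects to seq_len logits
    rng.normal(size=(d, d)),                # value projection
)
out_random = random_synthesizer(X, rng.normal(size=(L, L)),
                                rng.normal(size=(d, d)))
```

Both functions return an output of shape `(L, d)`, like standard attention; the paper's "R+V" models mix such synthetic weights with vanilla dot-product attention.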

Results

Task | Dataset | Metric | Value | Model
Dialogue | Persona-Chat | BLEU-1 | 14.7 | Synthesizer (R+V)
Dialogue | Persona-Chat | CIDEr | 19.09 | Synthesizer (R+V)
Dialogue | Persona-Chat | METEOR | 6.39 | Synthesizer (R+V)
Dialogue | Persona-Chat | ROUGE-L | 14.79 | Synthesizer (R+V)
Machine Translation | WMT2014 English-German | BLEU score | 28.47 | Synthesizer (Random + Vanilla)
Machine Translation | WMT2014 English-French | BLEU score | 41.85 | Synthesizer (Random + Vanilla)
Text Generation | Persona-Chat | BLEU-1 | 14.7 | Synthesizer (R+V)
Text Generation | Persona-Chat | CIDEr | 19.09 | Synthesizer (R+V)
Text Generation | Persona-Chat | METEOR | 6.39 | Synthesizer (R+V)
Text Generation | Persona-Chat | ROUGE-L | 14.79 | Synthesizer (R+V)
Semantic Textual Similarity | MRPC Dev | Accuracy | 91.2 | Synthesizer (R+V)
Text Summarization | CNN / Daily Mail | ROUGE-1 | 38.57 | Synthesizer (R+V)
Text Summarization | CNN / Daily Mail | ROUGE-2 | 16.24 | Synthesizer (R+V)
Text Summarization | CNN / Daily Mail | ROUGE-L | 35.95 | Synthesizer (R+V)
Linguistic Acceptability | CoLA Dev | Accuracy | 53.3 | Synthesizer (R+V)
Chatbot | Persona-Chat | BLEU-1 | 14.7 | Synthesizer (R+V)
Chatbot | Persona-Chat | CIDEr | 19.09 | Synthesizer (R+V)
Chatbot | Persona-Chat | METEOR | 6.39 | Synthesizer (R+V)
Chatbot | Persona-Chat | ROUGE-L | 14.79 | Synthesizer (R+V)
Document Summarization | CNN / Daily Mail | ROUGE-1 | 38.57 | Synthesizer (R+V)
Document Summarization | CNN / Daily Mail | ROUGE-2 | 16.24 | Synthesizer (R+V)
Document Summarization | CNN / Daily Mail | ROUGE-L | 35.95 | Synthesizer (R+V)
Dialogue Generation | Persona-Chat | BLEU-1 | 14.7 | Synthesizer (R+V)
Dialogue Generation | Persona-Chat | CIDEr | 19.09 | Synthesizer (R+V)
Dialogue Generation | Persona-Chat | METEOR | 6.39 | Synthesizer (R+V)
Dialogue Generation | Persona-Chat | ROUGE-L | 14.79 | Synthesizer (R+V)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)