Synthesizer: Rethinking Self-Attention in Transformer Models

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

2020-05-02

Tasks: Machine Translation, Text Generation, Abstractive Text Summarization, Dialogue Generation, Document Summarization, Translation, Semantic Textual Similarity, Linguistic Acceptability, Language Modelling


Abstract

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that a simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding-only tasks.
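To make the contrast in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It compares standard dot product attention, where the weights come from query-key interactions, with a Random Synthesizer-style layer, where the weights come from an alignment matrix that never looks at the tokens. The function names, the Gaussian initialisation, and the helper `make_random_alignment` are assumptions made for illustration. In a real model the alignment matrix would be a parameter, either trained or kept fixed, and the inputs would first pass through learned projections.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dot_product_attention(queries, keys, values):
    """Baseline: attention weights depend on query-key (token-token) scores."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = [[s / scale for s in row] for row in matmul(queries, keys_t)]
    return matmul([softmax(row) for row in scores], values)


def random_synthesizer_attention(alignment, values):
    """Synthetic variant: weights come from `alignment` alone, not from the tokens."""
    return matmul([softmax(row) for row in alignment], values)


def make_random_alignment(seq_len, seed=0):
    """Hypothetical helper: a seq_len x seq_len matrix of Gaussian noise.

    Assumption: stands in for a parameter that a model would train or freeze.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(seq_len)] for _ in range(seq_len)]


# Usage: three tokens with two-dimensional value vectors.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alignment = make_random_alignment(seq_len=3)
mixed = random_synthesizer_attention(alignment, values)
```

Each output row is still a convex combination of the value vectors, as in ordinary attention. The difference is that the mixing weights are the same for every input sequence of that length, which is why this variant involves no token-token interaction.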

Results

Task	Dataset	Metric	Value	Model
Dialogue	Persona-Chat	BLEU-1	14.7	Synthesizer (R+V)
Dialogue	Persona-Chat	CIDr	19.09	Synthesizer (R+V)
Dialogue	Persona-Chat	METEOR	6.39	Synthesizer (R+V)
Dialogue	Persona-Chat	ROUGE-L	14.79	Synthesizer (R+V)
Machine Translation	WMT2014 English-German	BLEU score	28.47	Synthesizer (Random + Vanilla)
Machine Translation	WMT2014 English-French	BLEU score	41.85	Synthesizer (Random + Vanilla)
Text Generation	Persona-Chat	BLEU-1	14.7	Synthesizer (R+V)
Text Generation	Persona-Chat	CIDr	19.09	Synthesizer (R+V)
Text Generation	Persona-Chat	METEOR	6.39	Synthesizer (R+V)
Text Generation	Persona-Chat	ROUGE-L	14.79	Synthesizer (R+V)
Semantic Textual Similarity	MRPC Dev	Accuracy	91.2	Synthesizer (R+V)
Text Summarization	CNN / Daily Mail	ROUGE-1	38.57	Synthesizer (R+V)
Text Summarization	CNN / Daily Mail	ROUGE-2	16.24	Synthesizer (R+V)
Text Summarization	CNN / Daily Mail	ROUGE-L	35.95	Synthesizer (R+V)
Linguistic Acceptability	CoLA Dev	Accuracy	53.3	Synthesizer (R+V)
Chatbot	Persona-Chat	BLEU-1	14.7	Synthesizer (R+V)
Chatbot	Persona-Chat	CIDr	19.09	Synthesizer (R+V)
Chatbot	Persona-Chat	METEOR	6.39	Synthesizer (R+V)
Chatbot	Persona-Chat	ROUGE-L	14.79	Synthesizer (R+V)
Document Summarization	CNN / Daily Mail	ROUGE-1	38.57	Synthesizer (R+V)
Document Summarization	CNN / Daily Mail	ROUGE-2	16.24	Synthesizer (R+V)
Document Summarization	CNN / Daily Mail	ROUGE-L	35.95	Synthesizer (R+V)
Dialogue Generation	Persona-Chat	BLEU-1	14.7	Synthesizer (R+V)
Dialogue Generation	Persona-Chat	CIDr	19.09	Synthesizer (R+V)
Dialogue Generation	Persona-Chat	METEOR	6.39	Synthesizer (R+V)
Dialogue Generation	Persona-Chat	ROUGE-L	14.79	Synthesizer (R+V)
