Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Published 2017-06-12 · NeurIPS 2017

Tasks: Machine Translation, Question Answering, Multimodal Machine Translation, Abstractive Text Summarization, Text Summarization, Coreference Resolution, Natural Language Understanding, Translation, Few-Shot 3D Point Cloud Classification, Speech Emotion Recognition, Supervised Only 3D Point Cloud Classification, LIDAR Semantic Segmentation, Image-guided Story Ending Generation, Link Prediction
Links: Paper · PDF · Code (one official implementation and several hundred community implementations)

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
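The building block the abstract refers to, scaled dot-product attention, can be sketched in a few lines of NumPy. This is a minimal illustration of the paper's Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, not the full multi-head Transformer; the toy shapes below are arbitrary choices for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # (seq_q, seq_k)
    # numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # (seq_q, d_v)

# toy example: 3 queries attending over 4 key/value pairs of width 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8)
```

In the full model this operation is applied in parallel over several learned projections of Q, K, and V ("multi-head" attention) and the results are concatenated; the 1/sqrt(d_k) scaling keeps the dot products from pushing the softmax into regions with vanishing gradients.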

Results

Task | Dataset | Metric | Value | Model
Machine Translation | IWSLT2015 English-German | BLEU score | 28.5 | Transformer
Machine Translation | IWSLT2014 German-English | BLEU score | 34.44 | Transformer
Machine Translation | WMT2014 English-German | BLEU score | 28.4 | Transformer Big
Machine Translation | WMT2014 English-German | BLEU score | 27.3 | Transformer Base
Machine Translation | WMT2014 English-French | BLEU score | 41 | Transformer Big
Machine Translation | WMT2014 English-French | BLEU score | 38.1 | Transformer Base
Machine Translation | Multi30K | BLEU (DE-EN) | 29 | Transformer
Question Answering | Mathematics Dataset | Accuracy | 0.76 | Transformer
Text Generation | LSMDC-E | BLEU-1 | 15.35 | Transformer
Text Generation | LSMDC-E | BLEU-2 | 4.49 | Transformer
Text Generation | LSMDC-E | BLEU-3 | 1.82 | Transformer
Text Generation | LSMDC-E | BLEU-4 | 0.76 | Transformer
Text Generation | LSMDC-E | CIDEr | 9.32 | Transformer
Text Generation | LSMDC-E | METEOR | 11.43 | Transformer
Text Generation | LSMDC-E | ROUGE-L | 19.16 | Transformer
Text Generation | VIST-E | BLEU-1 | 17.18 | Transformer
Text Generation | VIST-E | BLEU-2 | 6.29 | Transformer
Text Generation | VIST-E | BLEU-3 | 3.07 | Transformer
Text Generation | VIST-E | BLEU-4 | 2.01 | Transformer
Text Generation | VIST-E | CIDEr | 12.75 | Transformer
Text Generation | VIST-E | METEOR | 6.91 | Transformer
Text Generation | VIST-E | ROUGE-L | 18.23 | Transformer
Coreference Resolution | Winograd Schema Challenge | Accuracy | 54.1 | Subword-level Transformer LM
Constituency Parsing | Penn Treebank | F1 score | 92.7 | Transformer
Text Summarization | GigaWord | ROUGE-1 | 37.57 | Transformer
Text Summarization | GigaWord | ROUGE-2 | 18.9 | Transformer
Text Summarization | GigaWord | ROUGE-L | 34.69 | Transformer
Text Summarization | CNN / Daily Mail | ROUGE-1 | 39.5 | Transformer
Text Summarization | CNN / Daily Mail | ROUGE-2 | 16.06 | Transformer
Text Summarization | CNN / Daily Mail | ROUGE-L | 36.63 | Transformer
Abstractive Text Summarization | CNN / Daily Mail | ROUGE-1 | 39.5 | Transformer
Abstractive Text Summarization | CNN / Daily Mail | ROUGE-2 | 16.06 | Transformer
Abstractive Text Summarization | CNN / Daily Mail | ROUGE-L | 36.63 | Transformer
Data-to-Text Generation | LSMDC-E | BLEU-1 | 15.35 | Transformer
Data-to-Text Generation | LSMDC-E | BLEU-2 | 4.49 | Transformer
Data-to-Text Generation | LSMDC-E | BLEU-3 | 1.82 | Transformer
Data-to-Text Generation | LSMDC-E | BLEU-4 | 0.76 | Transformer
Data-to-Text Generation | LSMDC-E | CIDEr | 9.32 | Transformer
Data-to-Text Generation | LSMDC-E | METEOR | 11.43 | Transformer
Data-to-Text Generation | LSMDC-E | ROUGE-L | 19.16 | Transformer
Data-to-Text Generation | VIST-E | BLEU-1 | 17.18 | Transformer
Data-to-Text Generation | VIST-E | BLEU-2 | 6.29 | Transformer
Data-to-Text Generation | VIST-E | BLEU-3 | 3.07 | Transformer
Data-to-Text Generation | VIST-E | BLEU-4 | 2.01 | Transformer
Data-to-Text Generation | VIST-E | CIDEr | 12.75 | Transformer
Data-to-Text Generation | VIST-E | METEOR | 6.91 | Transformer
Data-to-Text Generation | VIST-E | ROUGE-L | 18.23 | Transformer
Shape Representation Of 3D Point Clouds | ScanObjectNN | GFLOPs | 4.8 | Transformer
Shape Representation Of 3D Point Clouds | ScanObjectNN | Number of params (M) | 22.1 | Transformer
Shape Representation Of 3D Point Clouds | ScanObjectNN | Overall Accuracy (PB_T50_RS) | 77.24 | Transformer
Multimodal Machine Translation | Multi30K | BLEU (DE-EN) | 29 | Transformer
3D Point Cloud Classification | ScanObjectNN | GFLOPs | 4.8 | Transformer
3D Point Cloud Classification | ScanObjectNN | Number of params (M) | 22.1 | Transformer
3D Point Cloud Classification | ScanObjectNN | Overall Accuracy (PB_T50_RS) | 77.24 | Transformer
Natural Language Understanding | PDP60 | Accuracy | 58.3 | Subword-level Transformer LM
Visual Storytelling | LSMDC-E | BLEU-1 | 15.35 | Transformer
Visual Storytelling | LSMDC-E | BLEU-2 | 4.49 | Transformer
Visual Storytelling | LSMDC-E | BLEU-3 | 1.82 | Transformer
Visual Storytelling | LSMDC-E | BLEU-4 | 0.76 | Transformer
Visual Storytelling | LSMDC-E | CIDEr | 9.32 | Transformer
Visual Storytelling | LSMDC-E | METEOR | 11.43 | Transformer
Visual Storytelling | LSMDC-E | ROUGE-L | 19.16 | Transformer
Visual Storytelling | VIST-E | BLEU-1 | 17.18 | Transformer
Visual Storytelling | VIST-E | BLEU-2 | 6.29 | Transformer
Visual Storytelling | VIST-E | BLEU-3 | 3.07 | Transformer
Visual Storytelling | VIST-E | BLEU-4 | 2.01 | Transformer
Visual Storytelling | VIST-E | CIDEr | 12.75 | Transformer
Visual Storytelling | VIST-E | METEOR | 6.91 | Transformer
Visual Storytelling | VIST-E | ROUGE-L | 18.23 | Transformer
Story Generation | LSMDC-E | BLEU-1 | 15.35 | Transformer
Story Generation | LSMDC-E | BLEU-2 | 4.49 | Transformer
Story Generation | LSMDC-E | BLEU-3 | 1.82 | Transformer
Story Generation | LSMDC-E | BLEU-4 | 0.76 | Transformer
Story Generation | LSMDC-E | CIDEr | 9.32 | Transformer
Story Generation | LSMDC-E | METEOR | 11.43 | Transformer
Story Generation | LSMDC-E | ROUGE-L | 19.16 | Transformer
Story Generation | VIST-E | BLEU-1 | 17.18 | Transformer
Story Generation | VIST-E | BLEU-2 | 6.29 | Transformer
Story Generation | VIST-E | BLEU-3 | 3.07 | Transformer
Story Generation | VIST-E | BLEU-4 | 2.01 | Transformer
Story Generation | VIST-E | CIDEr | 12.75 | Transformer
Story Generation | VIST-E | METEOR | 6.91 | Transformer
Story Generation | VIST-E | ROUGE-L | 18.23 | Transformer
3D Point Cloud Reconstruction | ScanObjectNN | GFLOPs | 4.8 | Transformer
3D Point Cloud Reconstruction | ScanObjectNN | Number of params (M) | 22.1 | Transformer
3D Point Cloud Reconstruction | ScanObjectNN | Overall Accuracy (PB_T50_RS) | 77.24 | Transformer
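Many of the values above are BLEU scores. For orientation, BLEU is a geometric mean of modified n-gram precisions multiplied by a brevity penalty; the sketch below is a hypothetical single-sentence version for illustration only, whereas reported leaderboard numbers come from corpus-level implementations with standardized tokenization (e.g. sacreBLEU or multi-bleu.perl).

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Single-pair BLEU sketch: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. Inputs are
    token lists; returns a score on the usual 0-100 scale."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # "modified" precision: clip each n-gram count by its count in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:          # unsmoothed: any zero precision zeroes the score
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return 100 * bp * geo_mean

cand = "the cat sat on the mat".split()
print(bleu(cand, cand))  # identical sentences score 100.0
```

Differences in tokenization, smoothing, and reference handling are why BLEU figures are only comparable within a single evaluation setup.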

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification (2025-07-15)