Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MASS: Masked Sequence to Sequence Pre-training for Language Generation

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

2019-05-07

Tasks: Machine Translation · Text Generation · Text Summarization · Unsupervised Machine Translation · Translation · Conversational Response Generation · Response Generation

Links: Paper · PDF · Code (official, plus community implementations)

Abstract

Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from rich-resource pre-training tasks to low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for encoder-decoder based language generation tasks. MASS adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence: its encoder takes a sentence with a randomly masked fragment (several consecutive tokens) as input, and its decoder tries to predict this masked fragment. In this way, MASS jointly trains the encoder and decoder to develop the capabilities of representation extraction and language modeling. By further fine-tuning on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization, and conversational response generation (3 tasks and 8 datasets in total), MASS achieves significant improvements over baselines without pre-training or with other pre-training methods. Specifically, we achieve state-of-the-art accuracy (37.5 BLEU) on unsupervised English-French translation, even beating the early attention-based supervised model.
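The core idea described above — mask a consecutive fragment of the input sentence on the encoder side and train the decoder to predict that fragment — can be sketched in a few lines. This is an illustrative toy, not the authors' implementation; the function name, mask token, and fragment-length fraction are assumptions for the sketch.

```python
import random

MASK = "[MASK]"

def mass_mask(tokens, frac=0.5, seed=0):
    """MASS-style span masking (illustrative sketch).

    Replaces a consecutive fragment (~frac of the sentence) with mask
    tokens to form the encoder input; the decoder is then trained to
    reconstruct the masked fragment.
    """
    rng = random.Random(seed)
    k = max(1, int(len(tokens) * frac))       # length of masked fragment
    start = rng.randint(0, len(tokens) - k)   # random fragment start
    enc_input = tokens[:start] + [MASK] * k + tokens[start + k:]
    dec_target = tokens[start:start + k]      # what the decoder must predict
    return enc_input, dec_target

enc, tgt = mass_mask(["the", "cat", "sat", "on", "the", "mat"], seed=1)
```

Because the masked tokens are consecutive, the decoder must model dependencies within the fragment (language modeling) while the encoder learns to represent the unmasked context — the joint training the abstract describes.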

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Machine Translation | WMT2014 English-French | BLEU | 37.5 | MASS (6-layer Transformer)
Machine Translation | WMT2014 French-English | BLEU | 34.9 | MASS (6-layer Transformer)
Machine Translation | WMT2016 English-German | BLEU | 28.3 | MASS (6-layer Transformer)
Machine Translation | WMT2016 Romanian-English | BLEU | 33.1 | MASS (6-layer Transformer)
Machine Translation | WMT2016 German-English | BLEU | 35.2 | MASS (6-layer Transformer)
Machine Translation | WMT2016 English-Romanian | BLEU | 35.2 | MASS (6-layer Transformer)
Text Summarization | GigaWord | ROUGE-1 | 38.73 | MASS
Text Summarization | GigaWord | ROUGE-2 | 19.71 | MASS
Text Summarization | GigaWord | ROUGE-L | 35.96 | MASS

Related Papers

Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (2025-07-15)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification (2025-07-15)
Function-to-Style Guidance of LLMs for Code Translation (2025-07-15)