Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang
Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrate attention ideas from long-input transformers (ETC) and adopt pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call *Transient Global* (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
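As a rough illustration of the TGlobal idea, the sketch below (NumPy, single head, no learned projections) lets every token attend to a local window plus one "transient global" token per block of inputs, with the global tokens recomputed on the fly from the sequence itself rather than supplied as side-inputs as in ETC. The mean-pooling of blocks is a simplification assumed here for brevity (the paper sums and normalizes block embeddings), the dense mask is for clarity only (the real implementation is blocked/sparse), and `block_size=16` / local `radius=127` follow the paper's defaults.

```python
import numpy as np

def tglobal_attention(x, block_size=16, radius=127):
    """Single-head TGlobal attention sketch (projections and multi-head omitted).

    Each token attends to neighbors within `radius` plus one "transient
    global" token per block of `block_size` inputs. For clarity this builds
    the full attention matrix; a real implementation would be blocked.
    """
    n, d = x.shape

    # Transient global tokens: mean-pool each block (an assumption here;
    # the paper sums the block's token embeddings and normalizes them).
    n_blocks = -(-n // block_size)                      # ceil division
    xp = np.pad(x, ((0, n_blocks * block_size - n), (0, 0)))
    g = xp.reshape(n_blocks, block_size, d).mean(axis=1)

    kv = np.concatenate([x, g], axis=0)                 # tokens + globals
    scores = x @ kv.T / np.sqrt(d)

    # Visibility mask: local window over tokens, all globals visible.
    pos = np.arange(n)
    local = np.abs(pos[:, None] - pos[None, :]) <= radius
    mask = np.concatenate([local, np.ones((n, n_blocks), bool)], axis=1)
    scores = np.where(mask, scores, -np.inf)

    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ kv
```

Because the global tokens are derived from the input at each layer, no extra global inputs need to be constructed or fed in, which is what makes TGlobal a drop-in change to T5's attention.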
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Long-Context Understanding | SCROLLS | Avg. | 42.53 | LongT5 XL |
| Long-Context Understanding | SCROLLS | CNLI (EM) | 88.2 | LongT5 XL |
| Long-Context Understanding | SCROLLS | Nrtv (F1) | 29.3 | LongT5 XL |
| Long-Context Understanding | SCROLLS | Qspr (F1) | 53.1 | LongT5 XL |
| Long-Context Understanding | SCROLLS | Avg. | 41.03 | LongT5 Large |
| Long-Context Understanding | SCROLLS | CNLI (EM) | 87.3 | LongT5 Large |
| Long-Context Understanding | SCROLLS | Nrtv (F1) | 27.2 | LongT5 Large |
| Long-Context Understanding | SCROLLS | Qspr (F1) | 52.3 | LongT5 Large |
| Long-Context Understanding | SCROLLS | Avg. | 38.60 | LongT5 Base |
| Long-Context Understanding | SCROLLS | CNLI (EM) | 85.6 | LongT5 Base |
| Long-Context Understanding | SCROLLS | Nrtv (F1) | 23.0 | LongT5 Base |
| Long-Context Understanding | SCROLLS | Qspr (F1) | 46.6 | LongT5 Base |
| Text Summarization | BigPatent | ROUGE-1 | 76.87 | LongT5 |
| Text Summarization | BigPatent | ROUGE-2 | 66.06 | LongT5 |
| Text Summarization | BigPatent | ROUGE-L | 70.76 | LongT5 |
| Text Summarization | arXiv | ROUGE-1 | 48.35 | LongT5 |
| Text Summarization | arXiv | ROUGE-2 | 21.92 | LongT5 |
| Text Summarization | arXiv | ROUGE-L | 44.27 | LongT5 |
| Text Summarization | PubMed | ROUGE-1 | 50.23 | LongT5 |
| Text Summarization | PubMed | ROUGE-2 | 24.76 | LongT5 |
| Text Summarization | PubMed | ROUGE-L | 46.67 | LongT5 |
| Text Summarization | CNN / Daily Mail | ROUGE-1 | 43.94 | LongT5 |
| Text Summarization | CNN / Daily Mail | ROUGE-2 | 21.40 | LongT5 |
| Text Summarization | CNN / Daily Mail | ROUGE-L | 41.28 | LongT5 |
| Text Summarization | Multi-News | ROUGE-1 | 48.17 | LongT5 |
| Text Summarization | Multi-News | ROUGE-2 | 19.43 | LongT5 |
| Text Summarization | Multi-News | ROUGE-SU4 | 24.94 | LongT5 |
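For experimentation, LongT5 checkpoints are available in Hugging Face Transformers (e.g. `google/long-t5-tglobal-base`); a minimal generation sketch, assuming that library, is below. The released checkpoints are pre-trained only, so reproducing the fine-tuned scores in the table would require task-specific fine-tuning (e.g. on PubMed or arXiv) first.

```python
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

# Pre-trained TGlobal checkpoint; fine-tune on the target dataset
# before comparing against the ROUGE numbers reported above.
name = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = LongT5ForConditionalGeneration.from_pretrained(name)

document = "..."  # placeholder: a long input document goes here
inputs = tokenizer(document, return_tensors="pt",
                   truncation=True, max_length=4096)
summary_ids = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```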