Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang
The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, we propose BP-Transformer (BPT for short), which adopts a fine-to-coarse attention mechanism over multi-scale spans via binary partitioning (BP). BPT yields $O(k\cdot n\log (n/k))$ connections, where $k$ is a hyperparameter controlling the density of attention. BPT strikes a good balance between computation complexity and model capacity. A series of experiments on text classification, machine translation and language modeling shows that BPT outperforms previous self-attention models on long text. Our code, hyperparameters and CUDA kernels for sparse attention are available in PyTorch.
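To see where the $O(k\cdot n\log (n/k))$ connection count comes from, the sketch below counts attention edges under a simplified binary-partitioning scheme: each token attends to $k$ spans per side at each scale, with span width doubling at every level, so each token touches $O(k\log (n/k))$ spans. This is an illustrative approximation, not the paper's exact graph construction; the function name and parameters are ours.

```python
import math

def bp_attention_spans(n, k, i):
    """Spans (start, end) that token i attends to under a simplified
    binary-partitioning scheme: k spans per side at each level, with
    span width doubling each level until the sequence is covered.
    Illustrative sketch only, not the paper's exact construction."""
    spans = []
    width = 1
    left, right = i, i + 1  # portion of the sequence covered so far
    while left > 0 or right < n:
        for _ in range(k):
            if left > 0:
                s = max(0, left - width)
                spans.append((s, left))
                left = s
            if right < n:
                e = min(n, right + width)
                spans.append((right, e))
                right = e
        width *= 2  # move to the next, coarser scale
    return spans

n, k = 1024, 4
total_edges = sum(len(bp_attention_spans(n, k, i)) for i in range(n))
bound = k * n * math.log2(n / k)
print(total_edges, bound)  # edge count grows like O(k*n*log(n/k)), vs O(n^2) for full attention
```

With $n = 1024$ and $k = 4$, the total edge count stays within a small constant factor of $k\cdot n\log_2(n/k)$, far below the $n^2 \approx 10^6$ edges of full self-attention.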
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Machine Translation | IWSLT2015 Chinese-English | BLEU | 19.84 | BP-Transformer |
| Language Modelling | Text8 | Bits per Character (BPC) | 1.11 | BP-Transformer (12 layers) |
| Language Modelling | enwik8 | Bits per Character (BPC) | 1.02 | BP-Transformer (12 layers) |
| Sentiment Analysis | SST-5 (fine-grained classification) | Accuracy | 52.71 | BP-Transformer + GloVe |
| Sentiment Analysis | IMDb | Accuracy | 92.12 | BP-Transformer + GloVe |