A Dot Product Attention Free Transformer
Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, Joshua M. Susskind
Abstract
We introduce the Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers \citep{transformer} that eliminates the query-key dot product in self-attention. The core idea is to construct a decomposable attention map for each dimension of the query, key and value. This compositionality enables an implementation where the attention tensor never needs to be computed or stored explicitly. A DAFT layer has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large model sizes. We also introduce DAFT-conv, a model variant that takes advantage of locality and spatial weight sharing while maintaining global connectivity. We conduct experiments on ImageNet-1K classification, as well as CIFAR10 and Enwik8, two autoregressive modeling tasks. DAFT achieves competitive performance on all benchmarks while providing excellent efficiency.
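To make the decomposability idea concrete, the following is a minimal, illustrative sketch of an attention-free layer in this spirit. It is an assumption-laden simplification, not the paper's exact formulation: it uses a sigmoid gate on the queries and a per-dimension softmax over the keys, omits causal masking and position biases, and pools the value-weighted context globally. The point it demonstrates is that no T x T attention tensor is ever materialized, so memory stays linear in both the context length T and the feature dimension d.

```python
import numpy as np

def daft_simple(Q, K, V):
    """Illustrative attention-free layer (hypothetical simplification of DAFT).

    Q, K, V: arrays of shape (T, d), where T is the context length and
    d is the feature dimension. Returns an array of shape (T, d).
    """
    # Per-dimension softmax over the context axis turns keys into weights;
    # this replaces the query-key dot-product attention map.
    weights = np.exp(K - K.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)        # columns sum to 1

    # Weighted combination of values: a single (1, d) pooled context vector.
    context = (weights * V).sum(axis=0, keepdims=True)

    # Queries act as an elementwise sigmoid gate on the pooled context.
    gate = 1.0 / (1.0 + np.exp(-Q))                      # (T, d)
    return gate * context                                # (T, d), O(T*d) memory
```

Note that every intermediate here has shape (T, d) or (1, d); the quadratic (T, T) map of standard dot-product attention never appears.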