Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Factorized Random Synthesized Attention

Natural Language Processing · Introduced 2020 · 1 paper
Source Paper

Description

Factorized Random Synthesized Attention, introduced with the Synthesizer architecture, is analogous to factorized dense synthesized attention, but for random synthesizers. Letting $R$ be a randomly initialized matrix, we factorize $R$ into low-rank matrices $R_{1}, R_{2} \in \mathbb{R}^{l \times k}$ in the attention function:

$$Y = \text{Softmax}\left(R_{1}R_{2}^{T}\right)G\left(X\right).$$

Here $G(\cdot)$ is a parameterized function that is equivalent to $V$ in Scaled Dot-Product Attention.
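The attention function above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: `W_g` is a hypothetical weight matrix standing in for the parameterized function $G(\cdot)$, and the randomly initialized factors $R_1, R_2$ would be trained in practice rather than left fixed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def factorized_random_synthesized_attention(X, R1, R2, W_g):
    """Y = Softmax(R1 @ R2^T) @ G(X).

    R1, R2: (l, k) low-rank factors of the synthesized attention matrix.
    W_g:    hypothetical linear projection implementing G (plays the role of V).
    """
    A = softmax(R1 @ R2.T)  # (l, l) attention weights, independent of X
    return A @ (X @ W_g)    # mix value projections with the synthesized weights

# Toy shapes: sequence length l, model dim d, rank k (k << l in practice)
l, d, k = 16, 32, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((l, d))
R1 = rng.standard_normal((l, k))  # randomly initialized, learned during training
R2 = rng.standard_normal((l, k))
W_g = rng.standard_normal((d, d))

Y = factorized_random_synthesized_attention(X, R1, R2, W_g)
print(Y.shape)  # (16, 32)
```

Note that, unlike dot-product attention, the attention matrix here does not depend on the input `X` at all; only the value pathway $G(X)$ does.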

For each head, the factorization reduces the parameter cost from $l^{2}$ to $2lk$, where $k \ll l$, and hence helps prevent overfitting. In practice, we use a small value of $k = 8$.
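To make the savings concrete, here is the parameter count per head for an assumed sequence length of $l = 512$ with the paper's $k = 8$ (the choice of $l$ is illustrative):

```python
# Per-head parameter cost of the synthesized attention matrix.
l, k = 512, 8            # sequence length, low rank (k << l)
full = l * l             # unfactorized random matrix R: l^2 parameters
factorized = 2 * l * k   # low-rank factors R1 and R2: 2(lk) parameters
print(full, factorized)  # 262144 8192
```

With these values the factorized form uses a factor of $l / 2k = 32$ fewer parameters than the full random matrix.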

The basic idea of a Random Synthesizer is to not rely on pairwise token interactions or on information from individual tokens, but rather to learn a task-specific alignment that works well globally across many samples.

Papers Using This Method

Synthesizer: Rethinking Self-Attention in Transformer Models (2020-05-02)