Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Factorized Dense Synthesized Attention

Natural Language Processing · Introduced 2020 · 1 paper
Source Paper

Description

Factorized Dense Synthesized Attention is a synthesized attention mechanism similar to dense synthesized attention, but the outputs are factorized to reduce the parameter count and prevent overfitting. It was proposed as part of the Synthesizer architecture. The factorized variant of the dense synthesizer can be expressed as follows:

$A, B = F_{A}\left(X_{i}\right), F_{B}\left(X_{i}\right)$

where $F_{A}(\cdot)$ projects input $X_{i}$ into $a$ dimensions, $F_{B}(\cdot)$ projects $X_{i}$ into $b$ dimensions, and $a \times b = l$, where $l$ is the sequence length. The output of the factorized module is now written as:

$Y = \text{Softmax}\left(C\right)G\left(X\right)$

where $C = H_{A}\left(A\right) * H_{B}\left(B\right)$, with $H_{A}$, $H_{B}$ being tiling functions and $C \in \mathbb{R}^{l \times l}$. A tiling function simply duplicates its input vector $k$ times, i.e., maps $\mathbb{R}^{l} \rightarrow \mathbb{R}^{lk}$. In this case, $H_{A}(\cdot)$ is a projection of $\mathbb{R}^{a} \rightarrow \mathbb{R}^{ab}$ and $H_{B}(\cdot)$ is a projection of $\mathbb{R}^{b} \rightarrow \mathbb{R}^{ba}$. To avoid having similar values within the same block, we compose the outputs of $H_{A}$ and $H_{B}$.
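The mechanism above can be sketched in NumPy. This is a minimal single-head, batch-free illustration, not the paper's implementation: it assumes $F_A$, $F_B$, and $G$ are plain linear projections (the paper parameterizes them more generally), and the weight names `Wa`, `Wb`, `Wg` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def factorized_dense_attention(X, Wa, Wb, Wg, a, b):
    """Factorized dense synthesized attention (single head, no batch).

    X:  (l, d) token representations, with l = a * b
    Wa: (d, a) weights standing in for F_A (assumption: linear)
    Wb: (d, b) weights standing in for F_B (assumption: linear)
    Wg: (d, d) weights standing in for the value projection G
    """
    l, d = X.shape
    assert a * b == l, "factorization must satisfy a * b = l"
    A = X @ Wa                   # (l, a): per-token a-dim synthesizer output
    B = X @ Wb                   # (l, b): per-token b-dim synthesizer output
    # Tiling functions: H_A duplicates each a-dim row b times (R^a -> R^{ab}),
    # H_B duplicates each b-dim row a times (R^b -> R^{ba}); both give (l, l).
    HA = np.tile(A, (1, b))
    HB = np.tile(B, (1, a))
    # Compose the two tilings elementwise to synthesize C in R^{l x l},
    # avoiding identical values repeating within a block.
    C = HA * HB
    return softmax(C, axis=-1) @ (X @ Wg)   # Y = Softmax(C) G(X)
```

Note that the synthesized alignment matrix $C$ is produced from each token independently, with no query-key dot products; the factorization stores only $a + b$ values per token instead of $l$.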

Papers Using This Method

Synthesizer: Rethinking Self-Attention in Transformer Models (2020-05-02)