Factorized Dense Synthesized Attention is a synthesized attention mechanism, similar to dense synthesized attention, but we factorize the outputs to reduce parameters and prevent overfitting. It was proposed as part of the Synthesizer architecture. The factorized variant of the dense synthesizer can be expressed as follows:
$$A, B = F_A(X_i), F_B(X_i)$$
where $F_A(\cdot)$ projects input $X_i$ into $a$ dimensions, $F_B(\cdot)$ projects $X_i$ into $b$ dimensions, and $a \times b = l$, where $l$ is the sequence length. The output of the factorized module is now written as:
$$Y = \text{Softmax}(C)G(X)$$
where $C = H_A(A) * H_B(B)$, with $H_A$, $H_B$ being tiling functions and $C \in \mathbb{R}^{l \times l}$. A tiling function simply duplicates a vector $k$ times, i.e., $\mathbb{R}^{l} \rightarrow \mathbb{R}^{lk}$. In this case, $H_A(\cdot)$ is a projection of $\mathbb{R}^{a} \rightarrow \mathbb{R}^{ab}$ and $H_B(\cdot)$ is a projection of $\mathbb{R}^{b} \rightarrow \mathbb{R}^{ba}$. To avoid having similar values within the same block, we compose the outputs of $H_A$ and $H_B$.
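The mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: batching, multiple heads, and nonlinearities in $F_A$, $F_B$ are omitted, and using `tile` for $H_A$ but `repeat` for $H_B$ is one plausible reading of "composing" the two tilings so that values within a block differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def factorized_dense_synthesizer(X, Wa, Wb, Wg, a, b):
    """Sketch of factorized dense synthesized attention (single head).

    X:  (l, d) token representations, with l = a * b.
    Wa: (d, a) and Wb: (d, b) -- the factorized projections F_A, F_B.
    Wg: (d, d) -- the value projection G.
    """
    l, d = X.shape
    assert l == a * b, "sequence length must factor as a * b"
    A = X @ Wa                      # (l, a) = F_A(X)
    B = X @ Wb                      # (l, b) = F_B(X)
    # H_A duplicates each a-vector b times -> (l, a*b) = (l, l);
    # H_B duplicates each element of the b-vector a times -> (l, l).
    HA = np.tile(A, b)              # vector-level tiling
    HB = np.repeat(B, a, axis=-1)   # element-level repetition
    C = HA * HB                     # element-wise composition, (l, l)
    return softmax(C) @ (X @ Wg)    # Y = Softmax(C) G(X)
```

Note that the synthesized attention matrix $C$ is produced from each token independently, with only $d(a + b)$ parameters for the attention map instead of the $d \cdot l$ a full dense synthesizer would need.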