Papers With Code


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Mogrifier LSTM

Sequential · Introduced 2019 · 2 papers
Source Paper

Description

The Mogrifier LSTM is an extension of the LSTM in which the input $\mathbf{x}$ is gated conditioned on the output of the previous step, $\mathbf{h}_{prev}$. The gated input is then used in a similar manner to gate the output of the previous time step. After a couple of rounds of this mutual gating, the last updated $\mathbf{x}$ and $\mathbf{h}_{prev}$ are fed to an LSTM.

In detail, the Mogrifier is an LSTM where two inputs $\mathbf{x}$ and $\mathbf{h}_{prev}$ modulate one another in an alternating fashion before the usual LSTM computation takes place. That is:

$$\text{Mogrify}\left(\mathbf{x}, \mathbf{c}_{prev}, \mathbf{h}_{prev}\right) = \text{LSTM}\left(\mathbf{x}^{\uparrow}, \mathbf{c}_{prev}, \mathbf{h}^{\uparrow}_{prev}\right)$$

where the modulated inputs $\mathbf{x}^{\uparrow}$ and $\mathbf{h}^{\uparrow}_{prev}$ are defined as the highest-indexed $\mathbf{x}^{i}$ and $\mathbf{h}^{i}_{prev}$, respectively, from the interleaved sequences:

$$\mathbf{x}^{i} = 2\sigma\left(\mathbf{Q}^{i}\mathbf{h}^{i-1}_{prev}\right) \odot \mathbf{x}^{i-2} \quad \text{for odd } i \in [1 \dots r]$$

$$\mathbf{h}^{i}_{prev} = 2\sigma\left(\mathbf{R}^{i}\mathbf{x}^{i-1}\right) \odot \mathbf{h}^{i-2}_{prev} \quad \text{for even } i \in [1 \dots r]$$

with $\mathbf{x}^{-1} = \mathbf{x}$ and $\mathbf{h}^{0}_{prev} = \mathbf{h}_{prev}$. The number of "rounds", $r \in \mathbb{N}$, is a hyperparameter; $r = 0$ recovers the LSTM. Multiplication by the constant 2 ensures that randomly initialized $\mathbf{Q}^{i}$, $\mathbf{R}^{i}$ matrices result in transformations close to the identity. To reduce the number of additional model parameters, we typically factorize the $\mathbf{Q}^{i}$, $\mathbf{R}^{i}$ matrices as products of low-rank matrices: $\mathbf{Q}^{i} = \mathbf{Q}^{i}_{left}\mathbf{Q}^{i}_{right}$ with $\mathbf{Q}^{i} \in \mathbb{R}^{m \times n}$, $\mathbf{Q}^{i}_{left} \in \mathbb{R}^{m \times k}$, $\mathbf{Q}^{i}_{right} \in \mathbb{R}^{k \times n}$, where $k < \min(m, n)$ is the rank.
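The alternating gating rounds above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `mogrify` and the `Q_mats`/`R_mats` argument layout are assumptions, and the matrices are kept full-rank for simplicity rather than factorized into the low-rank products described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h_prev, Q_mats, R_mats, rounds):
    """Mutually gate x and h_prev for `rounds` steps before the LSTM call.

    Q_mats[j] plays the role of Q^i at odd round i = 2j + 1 (shape m x n),
    R_mats[j] plays the role of R^i at even round i = 2j + 2 (shape n x m),
    where m = dim(x) and n = dim(h_prev). rounds = 0 leaves both unchanged,
    recovering the plain LSTM.
    """
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            # odd i: x^i = 2 * sigmoid(Q^i h^{i-1}_prev) ⊙ x^{i-2}
            x = 2.0 * sigmoid(Q_mats[(i - 1) // 2] @ h_prev) * x
        else:
            # even i: h^i_prev = 2 * sigmoid(R^i x^{i-1}) ⊙ h^{i-2}_prev
            h_prev = 2.0 * sigmoid(R_mats[i // 2 - 1] @ x) * h_prev
    return x, h_prev
```

Note how the factor of 2 shows up here: with zero-initialized gate matrices, `2 * sigmoid(0)` is exactly 1, so the gating starts out as the identity map.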

Papers Using This Method

- Gates Are Not What You Need in RNNs (2021-08-01)
- Mogrifier LSTM (2019-09-04)