Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Neural Cache

Natural Language Processing · Introduced 2016 · 4 papers
Source Paper: Improving Neural Language Models with a Continuous Cache

Description

A Neural Cache, or Continuous Cache, is a module for language modelling which stores previous hidden states in memory cells. These states are then used as keys to retrieve the corresponding next word. No transformation is applied to the storage during writing or reading.

More formally, it exploits the hidden representations $h_{t}$ to define a probability distribution over the words in the cache. As illustrated in the Figure, the cache stores pairs $\left(h_{i}, x_{i+1}\right)$ of a hidden representation and the word which was generated based on this representation (the vector $h_{i}$ encodes the history $x_{i}, \dots, x_{1}$). At time $t$, we then define a probability distribution over words stored in the cache, based on the stored hidden representations and the current one $h_{t}$, as:

$$p_{\text{cache}}\left(w \mid h_{1 \dots t}, x_{1 \dots t}\right) \propto \sum_{i=1}^{t-1} \mathbb{1}_{\left\{w = x_{i+1}\right\}} \exp\left(\theta\, h_{t}^{\top} h_{i}\right)$$

where the scalar $\theta$ is a parameter which controls the flatness of the distribution. When $\theta$ is equal to zero, the probability distribution over the history is uniform, and the model is equivalent to a unigram cache model.
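The cache distribution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `cache_distribution` and the use of integer word ids are assumptions for the example.

```python
import numpy as np

def cache_distribution(hidden_states, next_words, h_t, theta, vocab_size):
    """Compute p_cache(w | h_1..t, x_1..t) as a distribution over the vocabulary.

    hidden_states: (t-1, d) array of stored hidden vectors h_1 .. h_{t-1}
    next_words:    (t-1,) integer ids of the words x_2 .. x_t generated from them
    h_t:           (d,) current hidden state
    theta:         scalar controlling the flatness of the distribution
    """
    # One score per stored position: exp(theta * h_t^T h_i)
    scores = np.exp(theta * (hidden_states @ h_t))
    # Sum the scores of all positions whose stored next word is w
    probs = np.zeros(vocab_size)
    np.add.at(probs, next_words, scores)
    # Normalise (the formula above is only defined up to proportionality)
    return probs / probs.sum()

# With theta = 0 every stored position gets equal weight, so the cache
# reduces to a unigram model over the history: word 3 occurs twice out
# of five stored positions, so it gets probability 2/5.
H = np.random.randn(5, 8)
p = cache_distribution(H, np.array([3, 1, 3, 2, 0]), np.random.randn(8), 0.0, 10)
# p[3] == 0.4
```

Note that words never seen in the history get probability zero under the cache alone; in practice this distribution is interpolated with the base language model's softmax output.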

Papers Using This Method

Information-Weighted Neural Cache Language Models for ASR (2018-09-24)

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks (2018-05-09)

Regularizing and Optimizing LSTM Language Models (2017-08-07)

Improving Neural Language Models with a Continuous Cache (2016-12-13)