Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SimAdapter

General · Introduced 2021 · 1 paper
Source Paper

Description

SimAdapter is a module for explicitly learning knowledge from adapters. During fine-tuning, it learns the similarities between the source and target languages via an attention mechanism over the adapters.

The detailed composition of SimAdapter is shown in the figure. Taking the language-agnostic representations from the backbone model as the query, and the language-specific outputs from the multiple adapters as the keys and values, the final SimAdapter output over the attention is computed as (for notational simplicity, the layer index $l$ is omitted below):

$$\operatorname{SimAdapter}\left(\mathbf{z}, \mathbf{a}_{\{S_{1}, S_{2}, \ldots, S_{N}\}}\right)=\sum_{i=1}^{N} \operatorname{Attn}\left(\mathbf{z}, \mathbf{a}_{S_{i}}\right) \cdot\left(\mathbf{a}_{S_{i}} \mathbf{W}_{V}\right)$$

where $\operatorname{SimAdapter}(\cdot)$ and $\operatorname{Attn}(\cdot)$ denote the SimAdapter and attention operations, respectively. Specifically, the attention operation is computed as:

$$\operatorname{Attn}(\mathbf{z}, \mathbf{a})=\operatorname{Softmax}\left(\frac{\left(\mathbf{z} \mathbf{W}_{Q}\right)\left(\mathbf{a} \mathbf{W}_{K}\right)^{\top}}{\tau}\right)$$

where $\tau$ is the temperature coefficient and $\mathbf{W}_{Q}, \mathbf{W}_{K}, \mathbf{W}_{V}$ are attention matrices. Note that while $\mathbf{W}_{Q}$ and $\mathbf{W}_{K}$ are initialized randomly, $\mathbf{W}_{V}$ is initialized with ones on the diagonal and small weights ($10^{-6}$) elsewhere, in order to retain the adapter representations. Furthermore, a regularization term is introduced to avoid drastic feature changes:
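As an illustration, the attention and combination equations can be sketched in NumPy. This is a minimal sketch under assumed shapes (the backbone query `z` is `(T, d)` and each adapter output is `(T, d)`; the softmax over the key time axis follows standard cross-attention and is not necessarily the authors' exact implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def simadapter(z, adapters, W_Q, W_K, W_V, tau=1.0):
    """Combine adapter outputs by attending from the backbone features.

    z        : (T, d) language-agnostic backbone representations (query)
    adapters : list of (T, d) language-specific adapter outputs (keys/values)
    """
    out = np.zeros_like(z)
    for a in adapters:
        scores = (z @ W_Q) @ (a @ W_K).T / tau   # (T, T) similarity scores
        attn = softmax(scores, axis=-1)          # each row sums to 1
        out += attn @ (a @ W_V)                  # weighted adapter values
    return out
```

Because each attention row is a convex combination, the output is a similarity-weighted mixture of the adapters' value projections, summed over the $N$ source-language adapters.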

$$\mathcal{L}_{\mathrm{reg}}=\sum_{i, j}\left(\left(\mathbf{I}_{V}\right)_{i, j}-\left(\mathbf{W}_{V}\right)_{i, j}\right)^{2}$$

where $\mathbf{I}_{V}$ is the identity matrix of the same size as $\mathbf{W}_{V}$.
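The near-identity initialization of $\mathbf{W}_{V}$ and the regularization term above can be sketched as follows (the feature dimension `d` is an arbitrary example, not a value from the paper):

```python
import numpy as np

d = 64  # example feature dimension (assumption)

# Near-identity init: ones on the diagonal, 1e-6 elsewhere, so the value
# projection initially passes adapter features through almost unchanged.
W_V = np.full((d, d), 1e-6)
np.fill_diagonal(W_V, 1.0)

def reg_loss(W_V):
    # L_reg: elementwise squared deviation of W_V from the identity I_V
    I_V = np.eye(W_V.shape[0])
    return ((I_V - W_V) ** 2).sum()
```

At initialization the penalty is essentially zero, and it grows as training moves $\mathbf{W}_{V}$ away from the identity, discouraging drastic changes to the adapter features.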

Papers Using This Method

Exploiting Adapters for Cross-lingual Low-resource Speech Recognition (2021-05-18)