Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DExTra

General · Introduced 2020 · 1 paper
Source Paper

Description

DExTra, or Deep and Light-weight Expand-reduce Transformation, is an expand-reduce transformation that enables learning wider representations efficiently.

DExTra maps a $d_m$-dimensional input vector into a high-dimensional space (expansion) and then reduces it down to a $d_o$-dimensional output vector (reduction) using $N$ layers of group transformations. During these expansion and reduction phases, DExTra uses group linear transformations because they learn local representations by deriving the output from a specific part of the input and are more efficient than linear transformations. To learn global representations, DExTra shares information between different groups in the group linear transformation using feature shuffling.
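Feature shuffling between groups can be sketched as a reshape-transpose-reshape "channel shuffle", as used in grouped-convolution networks; the function name `feature_shuffle` below is illustrative, not from the paper.

```python
import numpy as np

def feature_shuffle(y, groups):
    """Shuffle features across groups (illustrative sketch).

    Reshapes the feature vector into (groups, d // groups),
    transposes, and flattens, so every new contiguous group
    contains one slice from each old group.
    """
    d = y.shape[-1]
    assert d % groups == 0
    return (y.reshape(*y.shape[:-1], groups, d // groups)
             .swapaxes(-2, -1)
             .reshape(*y.shape))

# Example: 2 groups of 3 features get interleaved
y = np.arange(6)                  # groups: [0 1 2] and [3 4 5]
print(feature_shuffle(y, 2))      # [0 3 1 4 2 5]
```

After the shuffle, a subsequent group linear transformation sees inputs drawn from every group of the previous layer, which is how DExTra propagates information globally without full linear layers.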

Formally, the DExTra transformation is controlled by five configuration parameters: (1) depth $N$, (2) width multiplier $m_w$, (3) input dimension $d_m$, (4) output dimension $d_o$, and (5) maximum groups $g_{max}$ in a group linear transformation. In the expansion phase, DExTra linearly projects the $d_m$-dimensional input to a high-dimensional space, $d_{max} = m_w d_m$, using $\text{ceil}\left(\frac{N}{2}\right)$ layers. In the reduction phase, DExTra projects the $d_{max}$-dimensional vector to a $d_o$-dimensional space using the remaining $N - \text{ceil}\left(\frac{N}{2}\right)$ layers. Mathematically, we define the output $\mathbf{Y}_l$ at each layer $l$ as:

$$\mathbf{Y}_{l} = \begin{cases} \mathcal{F}\left(\mathbf{X}, \mathbf{W}^{l}, \mathbf{b}^{l}, g^{l}\right) & \text{if } l = 1 \\ \mathcal{F}\left(\mathcal{H}\left(\mathbf{X}, \mathbf{Y}^{l-1}\right), \mathbf{W}^{l}, \mathbf{b}^{l}, g^{l}\right) & \text{otherwise} \end{cases}$$

where the number of groups at each layer $l$ is computed as:

$$g^{l} = \begin{cases} \min\left(2^{l-1}, g_{max}\right) & 1 \leq l \leq \text{ceil}\left(N/2\right) \\ g^{N-l} & \text{otherwise} \end{cases}$$

In the above equations, $\mathcal{F}$ is a group linear transformation function. The function $\mathcal{F}$ takes the input ($\mathbf{X}$ or $\mathcal{H}\left(\mathbf{X}, \mathbf{Y}^{l-1}\right)$), splits it into $g^{l}$ groups, and then applies a linear transformation with learnable parameters $\mathbf{W}^{l}$ and bias $\mathbf{b}^{l}$ to each group independently. The outputs of each group are then concatenated to produce the final output $\mathbf{Y}^{l}$. The function $\mathcal{H}$ first shuffles the output of each group in $\mathbf{Y}^{l-1}$ and then combines it with the input $\mathbf{X}$ using an input mixer connection.
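The split-transform-concatenate behaviour of $\mathcal{F}$ can be sketched as follows; `group_linear` and the tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def group_linear(x, W, b, g):
    """Group linear transformation F (illustrative sketch).

    Splits the d_in features of x into g equal groups, applies an
    independent linear map (W[i], b[i]) to each group, and
    concatenates the per-group outputs.
    x: (d_in,); W: (g, d_in // g, d_out_g); b: (g, d_out_g)
    """
    chunks = np.split(x, g)                        # g pieces of d_in // g
    outs = [chunk @ W[i] + b[i] for i, chunk in enumerate(chunks)]
    return np.concatenate(outs)                    # (g * d_out_g,)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((2, 4, 3))                 # 2 groups, each 4 -> 3
b = np.zeros((2, 3))
y = group_linear(x, W, b, g=2)
print(y.shape)  # (6,)
```

With `g=1` this reduces to an ordinary linear layer, consistent with note (i) below: the single "group" is the whole input.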

In the authors' experiments, they use $g_{max} = \text{ceil}\left(\frac{d_m}{32}\right)$ so that each group has at least 32 input elements. Note that (i) group linear transformations reduce to linear transformations when $g^{l} = 1$, and (ii) DExTra is equivalent to a multi-layer perceptron when $g_{max} = 1$.
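Putting the group schedule together, a minimal sketch, assuming the "otherwise" branch mirrors the expansion-phase counts symmetrically (read here as $g^{N-l+1}$, since a literal $g^{N-l}$ would reference an undefined $g^{0}$ at $l = N$); `group_schedule` is an illustrative name.

```python
import math

def group_schedule(N, d_m):
    """Per-layer group counts for a DExTra block (illustrative).

    g_max = ceil(d_m / 32) keeps at least 32 input elements per
    group. Groups double at each expansion layer up to g_max; the
    reduction layers mirror the expansion counts (assumed symmetric).
    """
    g_max = math.ceil(d_m / 32)
    half = math.ceil(N / 2)
    expand = [min(2 ** (l - 1), g_max) for l in range(1, half + 1)]
    mirror = [expand[N - l] for l in range(half + 1, N + 1)]
    return expand + mirror

# d_m = 128 gives g_max = 4; with N = 5 the schedule rises then falls
print(group_schedule(N=5, d_m=128))  # [1, 2, 4, 2, 1]
```

Setting $d_m \leq 32$ forces $g_{max} = 1$, so every layer uses a single group and the block degenerates into a plain multi-layer perceptron, matching note (ii) above.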

Papers Using This Method

DeLighT: Deep and Light-weight Transformer (2020-08-03)