158 machine learning methods and techniques
UNiversal Image-TExt Representation Learning
UNITER, or UNiversal Image-TExt Representation model, is a large-scale pre-trained model for joint multimodal embedding. It is pre-trained using four image-text datasets: COCO, Visual Genome, Conceptual Captions, and SBU Captions. It can power heterogeneous downstream V+L tasks with joint multimodal embeddings. UNITER takes the visual regions of the image and the textual tokens of the sentence as inputs. A Faster R-CNN is used in the Image Embedder to extract the visual features of each region, and a Text Embedder tokenizes the input sentence into WordPieces. UNITER proposes WRA via Optimal Transport to provide fine-grained alignment between word tokens and image regions, calculating the minimum cost of transporting the contextualized image embeddings to word embeddings and vice versa. Four pre-training tasks were designed for this model: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). This model differs from previous models in that it uses conditional masking on the pre-training tasks.
Reformer is a Transformer-based architecture that seeks to make efficiency improvements. Dot-product attention is replaced by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, Reformers use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers.
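As an illustration of the bucketing idea, here is a minimal NumPy sketch: queries and keys are hashed with random hyperplanes and each query attends only within its bucket. This is a single hash round with no chunking or causal masking, so it is a toy version of the mechanism rather than the full Reformer.

```python
import numpy as np

def lsh_buckets(x, n_planes=4, seed=0):
    """Hash vectors into 2**n_planes buckets via random hyperplanes (one LSH round)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((x.shape[-1], n_planes))
    bits = ((x @ planes) > 0).astype(int)        # sign pattern per token
    return bits @ (1 << np.arange(n_planes))     # bucket id per token

def bucketed_attention(q, k, v, n_planes=4):
    """Each query attends only to keys in its own LSH bucket (Reformer ties q and k)."""
    n, d = q.shape
    buckets = lsh_buckets(q, n_planes)
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = q[idx] @ k[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        out[idx] = (w / w.sum(axis=-1, keepdims=True)) @ v[idx]
    return out
```

The cost is roughly the sum of squared bucket sizes, which stays far below $L^2$ when the buckets are balanced.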
The Universal Transformer is a generalization of the Transformer architecture. Universal Transformers combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. They also utilise a dynamic per-position halting mechanism.
Parsing Incrementally for Constrained Auto-Regressive Decoding
MPNet is a pre-training method for language models that combines masked language modeling (MLM) and permuted language modeling (PLM) in one view. It takes the dependency among the predicted tokens into consideration through permuted language modeling and thus avoids the issue of BERT. On the other hand, it takes the position information of all tokens as input, so that the model sees the position information of all the tokens, alleviating the position discrepancy of XLNet. The training objective of MPNet is: $\mathbb{E}_{z \in \mathcal{Z}_n} \sum_{t=c+1}^{n} \log P\left(x_{z_t} \mid x_{z_{<t}}, M_{z_{>t}}; \theta\right)$. As can be seen, MPNet conditions on $x_{z_{<t}}$ (the tokens preceding the current predicted token $x_{z_t}$) rather than only the non-predicted tokens as in MLM; compared with PLM, MPNet takes more information (i.e., the mask symbols $M_{z_{>t}}$ in positions $z_{>t}$) as inputs. Although the objective seems simple, it is challenging to implement the model efficiently. For details, see the paper.
Linformer is a linear Transformer that utilises a linear self-attention mechanism to tackle the self-attention bottleneck with Transformer models. The original scaled dot-product attention is decomposed into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention.
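A minimal NumPy sketch of the low-rank decomposition: the $n \times n$ attention matrix never materializes because keys and values are projected down to a small fixed dimension first. The random projection matrices here stand in for the learned projections of the real model.

```python
import numpy as np

def linformer_attention(q, k, v, proj_dim=8, seed=0):
    """Scaled dot-product attention with keys/values linearly projected from
    sequence length n down to proj_dim, giving O(n * proj_dim) attention."""
    n, d = q.shape
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((proj_dim, n)) / np.sqrt(n)  # learned in the real model
    F = rng.standard_normal((proj_dim, n)) / np.sqrt(n)
    k_low, v_low = E @ k, F @ v                  # (proj_dim, d): low-rank bottleneck
    scores = q @ k_low.T / np.sqrt(d)            # (n, proj_dim) instead of (n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v_low
```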
BigBird is a Transformer with a sparse attention mechanism that reduces the quadratic dependency of self-attention to linear in the number of tokens. BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. In particular, BigBird consists of three main parts: - A set of global tokens attending on all parts of the sequence. - All tokens attending to a set of local neighboring tokens. - All tokens attending to a set of random tokens. This leads to a high-performing attention mechanism scaling to much longer sequence lengths (up to 8× longer).
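The three parts can be visualised by building a boolean attention mask. This is a small NumPy sketch with toy sizes, not BigBird's blocked implementation:

```python
import numpy as np

def bigbird_mask(n, n_global=2, window=1, n_random=2, seed=0):
    """Boolean attention mask combining global, sliding-window and random attention."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_global, :] = True        # global tokens attend everywhere...
    mask[:, :n_global] = True        # ...and are attended to by every token
    for i in range(n):               # local window of +/- `window` neighbours
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
    for i in range(n):               # a few random keys per query
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    return mask
```

Per row the number of attended positions is a constant independent of `n`, which is where the linear scaling comes from.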
Primer is a Transformer-based architecture that improves upon the Transformer architecture with two changes found through neural architecture search: squared ReLU activations in the feedforward block, and depthwise convolutions added to the attention multi-head projections, resulting in a new module called Multi-DConv-Head Attention.
Skip-gram Word2Vec is an architecture for computing word embeddings. Instead of using surrounding words to predict the center word, as with CBOW Word2Vec, Skip-gram Word2Vec uses the central word to predict the surrounding words. The skip-gram objective function sums the log probabilities of the surrounding words to the left and right of the target word to produce the following objective: $\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \neq 0} \log p\left(w_{t+j} \mid w_t\right)$, where $c$ is the size of the training context.
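A toy NumPy sketch of the skip-gram setup: generate (center, context) pairs from a window, then average the log probabilities under a full softmax. The embedding matrices `W_in` and `W_out` are hypothetical stand-ins, and the full softmax replaces the hierarchical softmax or negative sampling used in practice.

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each surrounding word."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

def skipgram_log_likelihood(pairs, W_in, W_out):
    """Average log p(context | center) under a full-softmax skip-gram model."""
    total = 0.0
    for center, context in pairs:
        logits = W_out @ W_in[center]            # score every vocabulary word
        logits -= logits.max()                   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        total += log_probs[context]
    return total / len(pairs)
```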
Routed Attention is an attention pattern proposed as part of the Routing Transformer architecture. Each attention module considers a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other words, the current time-step query is routed to a limited number of contexts through its cluster assignment. This can be contrasted with strided attention patterns and those proposed with the Sparse Transformer. In the image to the right, the rows represent the outputs while the columns represent the inputs. The different colors represent cluster memberships for the output token.
Contextual Word Vectors
CoVe, or Contextualized Word Vectors, uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors. Word embeddings are therefore a function of the entire input sequence. These word embeddings can then be used in downstream tasks by concatenating them with GloVe embeddings and feeding these in as features for the task-specific models.
E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
MobileBERT is a type of inverted-bottleneck BERT that compresses and accelerates the popular BERT model. MobileBERT is a thin version of BERT-LARGE, equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks. To train MobileBERT, a specially designed teacher model, an inverted-bottleneck BERT-LARGE model, is first trained. Then, knowledge is transferred from this teacher to MobileBERT by imitating it layer-to-layer. Like the original BERT, MobileBERT is task-agnostic; that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning.
Generative Emotion Estimator
The Levenshtein Transformer (LevT) is a type of transformer that aims to address the lack of flexibility of previous decoding models. Notably, in previous frameworks, the length of generated sequences is either fixed or monotonically increased as the decoding proceeds. The authors argue this is incompatible with human-level intelligence, where humans can revise, replace, revoke or delete any part of their generated text. Hence, LevT is proposed to bridge this gap by breaking the hitherto standardized decoding mechanism and replacing it with two basic operations — insertion and deletion. LevT is trained using imitation learning. The resulting model contains two policies that are executed in an alternating manner. The authors argue that with this model decoding becomes more flexible. For example, when the decoder is given an empty token, it falls back to a normal sequence generation model. On the other hand, the decoder acts as a refinement model when the initial state is a low-quality generated sequence. One crucial component in the LevT framework is the learning algorithm. The authors leverage the characteristics of insertion and deletion — they are complementary but also adversarial. The algorithm they propose is called “dual policy learning”. The idea is that when training one policy (insertion or deletion), the output from its adversary at the previous iteration is used as input. An expert policy, on the other hand, is drawn on to provide a correction signal.
Dialogue-Adaptive Pre-training Objective
Dialogue-Adaptive Pre-training Objective (DAPO) is a pre-training objective for dialogue adaptation, which is designed to measure qualities of dialogues from multiple important aspects, like Readability, Consistency and Fluency which have already been focused on by general LM pre-training objectives, and those also significant for assessing dialogues but ignored by general LM pre-training objectives, like Diversity and Specificity.
ProphetNet is a sequence-to-sequence pre-training model that introduces a novel self-supervised objective named future n-gram prediction and a proposed n-stream self-attention mechanism. Instead of optimizing one-step-ahead prediction as in the traditional sequence-to-sequence model, ProphetNet is optimized by n-step-ahead prediction, which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for future tokens and further helps predict multiple future tokens.
GPT-NeoX is an autoregressive transformer decoder model whose architecture largely follows that of GPT-3, with a few notable deviations. The model has 20 billion parameters with 44 layers, a hidden dimension size of 6144, and 64 heads. The main difference with GPT-3 is the change in tokenizer, the addition of Rotary Positional Embeddings, the parallel computation of attention and feed-forward layers, and a different initialization scheme and hyperparameters.
Galactica is a language model which uses a Transformer architecture in a decoder-only setup with the following modifications: - It uses GeLU activations on all model sizes - It uses a 2048 length context window for all model sizes - It does not use biases in any of the dense kernels or layer norms - It uses learned positional embeddings for the model - A vocabulary of 50k tokens was constructed using BPE. The vocabulary was generated from a randomly selected 2% subset of the training data
Edge-augmented Graph Transformer
Transformer neural networks have achieved state-of-the-art results for unstructured data such as text and images but their adoption for graph-structured data has been limited. This is partly due to the difficulty of incorporating complex structural information in the basic transformer framework. We propose a simple yet powerful extension to the transformer - residual edge channels. The resultant framework, which we call Edge-augmented Graph Transformer (EGT), can directly accept, process and output structural information as well as node information. It allows us to use global self-attention, the key element of transformers, directly for graphs and comes with the benefit of long-range interaction among nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. In addition, we introduce a generalized positional encoding scheme for graphs based on Singular Value Decomposition which can improve the performance of EGT. Our framework, which relies on global node feature aggregation, achieves better performance compared to Convolutional/Message-Passing Graph Neural Networks, which rely on local feature aggregation within a neighborhood. We verify the performance of EGT in a supervised learning setting on a wide range of experiments on benchmark datasets. Our findings indicate that convolutional aggregation is not an essential inductive bias for graphs and global self-attention can serve as a flexible and adaptive alternative.
UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.
Temporal Word Embeddings with a Compass
TWEC is an efficient method for generating temporal word embeddings, based on a simple heuristic: first train an atemporal word embedding, the compass, then use this embedding to freeze one of the layers of the CBOW architecture. The partly frozen architecture is then used to train time-specific slices that are all comparable with one another after training.
ENIGMA is an evaluation framework for dialog systems based on Pearson and Spearman's rank correlations between the estimated rewards and the true rewards. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies used to collect the experience data, which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors.
BLANC is an automatic estimation approach for document summary quality. The goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. BLANC achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document's text.
context2vec is an unsupervised model for learning generic context embeddings of wide sentential contexts, using a bidirectional LSTM. A large plain-text corpus is used to train a neural model that embeds entire sentential contexts and target words in the same low-dimensional space, which is optimized to reflect inter-dependencies between targets and their entire sentential context as a whole. In contrast to word2vec, which uses context modeling mostly internally and considers the target word embeddings its main output, the focus of context2vec is the context representation. context2vec achieves its objective by assigning similar embeddings to sentential contexts and their associated target words.
Mirror-BERT converts pretrained language models into effective universal text encoders without any supervision, in 20-30 seconds. It is an extremely simple, fast, and effective contrastive learning technique. It relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during identity fine-tuning.
Continuous Bag-of-Words Word2Vec
Continuous Bag-of-Words Word2Vec is an architecture for creating word embeddings that uses future words as well as past words to create a word embedding. The objective function for CBOW is: $\frac{1}{T}\sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}\right)$, where $c$ is the size of the context window. In the CBOW model, the distributed representations of the context are used to predict the word in the middle of the window. This contrasts with Skip-gram Word2Vec, where the distributed representation of the input word is used to predict the context.
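A toy NumPy sketch of the CBOW prediction step: the context word vectors are averaged and the average is scored against every output embedding. `W_in` and `W_out` are hypothetical embedding matrices, and the full softmax stands in for the approximations used in practice.

```python
import numpy as np

def cbow_log_prob(tokens, t, window, W_in, W_out):
    """log p(center | context): context vectors are averaged, then scored
    against all output embeddings with a full softmax."""
    ctx = [tokens[j] for j in range(t - window, t + window + 1)
           if j != t and 0 <= j < len(tokens)]
    h = W_in[ctx].mean(axis=0)                   # distributed context representation
    logits = W_out @ h
    logits -= logits.max()                       # numerical stability
    return (logits - np.log(np.exp(logits).sum()))[tokens[t]]
```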
Charformer is a type of Transformer model that learns a subword tokenization end-to-end as part of the model. Specifically it uses GBST that automatically learns latent subword representations from characters in a data-driven fashion. Following GBST, the soft subword sequence is passed through Transformer layers.
Macaw is a generative question-answering (QA) system that is built on UnifiedQA, itself built on T5. Macaw has three interesting features. First, it often produces high-quality answers to questions far outside the domain it was trained on, sometimes surprisingly so. Second, Macaw allows different permutations (“angles”) of inputs and outputs to be used. For example, we can give it a question and get an answer; or give it an answer and get a question; or give it a question and answer and get a set of multiple-choice (MC) options for that question. This multi-angle QA capability allows versatility in the way Macaw can be used, including recursively using outputs as new inputs to the system. Finally, Macaw also generates explanations as an optional output (or even input) element.
ConvBERT is a modification on the BERT architecture which uses a span-based dynamic convolution to replace self-attention heads to directly model local dependencies. Specifically a new mixed attention module replaces the self-attention modules in BERT, which leverages the advantages of convolution to better capture local dependency. Additionally, a new span-based dynamic convolution operation is used to utilize multiple input tokens to dynamically generate the convolution kernel. Lastly, ConvBERT also incorporates some new model designs including the bottleneck attention and grouped linear operator for the feed-forward module (reducing the number of parameters).
GBST
GBST, or Gradient-Based Subword Tokenization module, is a soft gradient-based subword tokenization module that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns a position-wise soft selection over them by scoring each block with a block scoring network. In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which enables easy inspection of lexical representations, and is more efficient than other byte-based models.
A Neural Cache, or a Continuous Cache, is a module for language modelling which stores previous hidden states in memory cells. They are then used as keys to retrieve their corresponding word, that is, the next word. There is no transformation applied to the storage during writing and reading. More formally, it exploits the hidden representations to define a probability distribution over the words in the cache. As illustrated in the Figure, the cache stores pairs $(h_i, x_{i+1})$ of a hidden representation and the word which was generated based on this representation (the vector $h_i$ encodes the history $x_1, \dots, x_i$). At time $t$, we then define a probability distribution over words stored in the cache based on the stored hidden representations and the current one as: $p_{\text{cache}}(w \mid h_{1..t}, x_{1..t}) \propto \sum_{i=1}^{t-1} \mathbb{1}\{w = x_{i+1}\} \exp\left(\theta\, h_t^\top h_i\right)$, where the scalar $\theta$ is a parameter which controls the flatness of the distribution. When $\theta$ is equal to zero, the probability distribution over the history is uniform, and the model is equivalent to a unigram cache model.
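A small NumPy sketch of the cache distribution described above, with illustrative names: the cache holds (hidden state, next word) pairs, and the current hidden state scores each stored entry.

```python
import numpy as np

def cache_distribution(h_t, cache, vocab_size, theta=1.0):
    """Neural cache distribution: p(w) is proportional to the sum over stored
    pairs (h_i, x_{i+1}) of 1{w == x_{i+1}} * exp(theta * h_t . h_i)."""
    p = np.zeros(vocab_size)
    scores = np.array([theta * (h_t @ h_i) for h_i, _ in cache])
    weights = np.exp(scores - scores.max())      # numerical stability
    for (_, word), w in zip(cache, weights):
        p[word] += w                             # words may repeat in the cache
    return p / p.sum()
```

With `theta=0.0` every stored pair contributes equally, so the distribution reduces to unigram counts over the cached history.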
Cross-encoder Reranking
Fastformer is a type of Transformer which uses additive attention as a building block. Instead of modeling the pair-wise interactions between tokens, additive attention is used to model global contexts, and then each token representation is further transformed based on its interaction with the global context representations.
Table Pre-training via Execution
TAPEX is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesising executable SQL queries.
DeCLUTR is an approach for learning universal sentence embeddings that utilizes a self-supervised objective that does not require labelled training data. The objective learns universal sentence embeddings by training an encoder to minimize the distance between the embeddings of textual segments randomly sampled from nearby in the same document.
Subformer is a Transformer that combines sandwich-style parameter sharing, which overcomes naive cross-layer parameter sharing in generative models, and self-attentive embedding factorization (SAFE). In SAFE, a small self-attention layer is used to reduce embedding parameter count.
Meena is a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. A seq2seq model is used with the Evolved Transformer as the main architecture. The model is trained on multi-turn conversations where the input sequence is all turns of the context and the output sequence is the response.
MixText is a semi-supervised learning method for text classification, which uses a new data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. The technique leverages advances in data augmentation to guess low-entropy labels for unlabeled data, making them as easy to use as labeled data.
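A minimal sketch of the TMix interpolation, assuming `h_a` and `h_b` are the hidden states of two samples at the chosen mixing layer; labels are mixed with the same coefficient.

```python
import numpy as np

def tmix(h_a, h_b, alpha=0.75, seed=0):
    """TMix: interpolate the hidden states of two training samples at a chosen
    layer with lambda drawn from Beta(alpha, alpha); taking max(lam, 1 - lam)
    keeps the mix closer to the first sample."""
    lam = np.random.default_rng(seed).beta(alpha, alpha)
    lam = max(lam, 1 - lam)
    return lam * h_a + (1 - lam) * h_b, lam
```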
Probabilistically Masked Language Model
Probabilistically Masked Language Model, or PMLM, is a type of language model that utilizes a probabilistic masking scheme, aiming to bridge the gap between masked and autoregressive language models. The basic idea behind the connection of the two categories of models is similar to MADE by Germain et al. (2015). PMLM is a masked language model with a probabilistic masking scheme, which defines the way sequences are masked by following a probabilistic distribution. The authors employ a simple uniform distribution of the masking ratio and name the model u-PMLM.
I-BERT is a quantized version of BERT that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, it performs an end-to-end integer-only BERT inference without any floating point calculation. In particular, GELU and Softmax are approximated with lightweight second-order polynomials, which can be evaluated with integer-only arithmetic. For LayerNorm, integer-only computation is performed by leveraging a known algorithm for integer calculation of square root.
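I-BERT's LayerNorm step requires an integer-only square root; a standard integer Newton iteration of the kind such algorithms build on looks like this (a generic sketch, not I-BERT's exact implementation):

```python
def integer_sqrt(n: int) -> int:
    """Floor square root computed with integer arithmetic only (Newton iteration)."""
    if n < 2:
        return n
    x = n
    y = (x + 1) // 2
    while y < x:                 # converges monotonically to floor(sqrt(n))
        x = y
        y = (x + n // x) // 2
    return x
```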
The Compressive Transformer is an extension to the Transformer which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory. It builds on the ideas of Transformer-XL, which maintains a memory of past activations at each layer to preserve a longer history of context. The Transformer-XL discards past activations when they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories, instead of discarding them, and store them in an additional compressed memory. At each time step, the oldest compressed memories are discarded (FIFO) and the oldest states from ordinary memory are compressed and shifted into the new slot in compressed memory. During training, the compressive memory component is optimized separately from the main language model (separate training loop).
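A simplified NumPy sketch of the memory update, using mean-pooling as the compression function (one of several explored in the paper); names and sizes are illustrative.

```python
import numpy as np

def update_memories(new_states, memory, comp_memory, mem_size, comp_size, rate=2):
    """FIFO memory update: states evicted from ordinary memory are mean-pooled
    (compression rate `rate`) into compressed memory instead of being discarded."""
    memory = np.concatenate([memory, new_states])
    overflow = len(memory) - mem_size
    if overflow > 0:
        evicted, memory = memory[:overflow], memory[overflow:]
        n = (len(evicted) // rate) * rate        # pool complete groups of `rate`
        if n > 0:
            pooled = evicted[:n].reshape(-1, rate, evicted.shape[-1]).mean(axis=1)
            comp_memory = np.concatenate([comp_memory, pooled])[-comp_size:]
    return memory, comp_memory
```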
A Gated Convolutional Network is a type of language model that combines convolutional networks with a gating mechanism. Zero padding is used to ensure future context cannot be seen. Gated convolutional layers can be stacked on top of one another hierarchically. Model predictions are then obtained with an adaptive softmax layer.
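A minimal NumPy sketch of one causal gated convolutional (GLU) layer; the left-only zero padding is what keeps future context hidden. The weight shapes are illustrative.

```python
import numpy as np

def gated_conv_layer(x, Wa, Wb, width=3):
    """Causal gated convolution: h = A * sigmoid(B), with zero-padding on the
    left only, so position t depends on inputs x_{t-width+1..t} and never on
    future positions."""
    n, d_in = x.shape
    xp = np.concatenate([np.zeros((width - 1, d_in)), x])  # left zero-padding
    out_dim = Wa.shape[-1]
    a = np.zeros((n, out_dim))
    b = np.zeros((n, out_dim))
    for t in range(n):
        window = xp[t:t + width].reshape(-1)     # flattened causal window
        a[t] = window @ Wa
        b[t] = window @ Wb
    return a * (1.0 / (1.0 + np.exp(-b)))        # GLU: linear path gated by sigmoid
```

Perturbing the last input position changes only the last output, confirming the causal structure.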
TSDAE is an unsupervised sentence embedding method. During training, TSDAE encodes corrupted sentences into fixed-sized vectors and requires the decoder to reconstruct the original sentences from this sentence embedding. For good reconstruction quality, the semantics must be captured well in the sentence embedding from the encoder. Later, at inference, only the encoder is used for creating sentence embeddings. The model architecture of TSDAE is a modified encoder-decoder Transformer where the key and value of the cross-attention are both confined to the sentence embedding only. Formally, the formulation of the modified cross-attention is: $H^{(k)} = \text{Attention}(H^{(k-1)}, [s^\top], [s^\top])$ with $\text{Attention}(Q, K, V) = \text{softmax}\left(QK^\top / \sqrt{d}\right)V$, where $H^{(k)} \in \mathbb{R}^{t \times d}$ is the decoder hidden states within $t$ decoding steps at the $k$-th layer, $d$ is the size of the sentence embedding, $[s^\top] \in \mathbb{R}^{1 \times d}$ is a one-row matrix containing the sentence embedding vector, and $Q$, $K$ and $V$ are the query, key and value, respectively. By exploring different configurations on the STS benchmark dataset, the authors discover that the best combination is: (1) adopting deletion as the input noise and setting the deletion ratio to 0.6, (2) using the output of the [CLS] token as the fixed-sized sentence representation, and (3) tying the encoder and decoder parameters during training.
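Because the key and value are confined to the single sentence embedding, the softmax over keys is trivially one and every decoding position receives the sentence embedding. A minimal NumPy sketch, with projection matrices omitted for clarity:

```python
import numpy as np

def confined_cross_attention(H, s):
    """TSDAE-style decoder cross-attention where key and value are both the
    one-row matrix [s^T]; with a single key, each position's output is s."""
    d = s.shape[0]
    K = V = s.reshape(1, d)                      # one-row key/value matrix [s^T]
    scores = H @ K.T / np.sqrt(d)                # (t, 1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum(axis=-1, keepdims=True)   # all ones: only one key
    return weights @ V
```

This degeneracy is the point: the decoder can only reconstruct the sentence through this bottleneck, forcing the encoder to pack the semantics into `s`.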
• Parallel Layers – We use a “parallel” formulation in each Transformer block (Wang & Komatsuzaki, 2021), rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as: y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))) Whereas the parallel formulation can be written as: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)) The parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.
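The two formulations can be sketched side by side; this is a toy NumPy version with stand-in attention and MLP functions, showing that the parallel block applies a single shared LayerNorm whose output feeds both sublayers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-position normalization (gain/bias omitted for brevity)."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def serialized_block(x, attn, mlp):
    """y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))"""
    x = x + attn(layer_norm(x))
    return x + mlp(layer_norm(x))

def parallel_block(x, attn, mlp):
    """y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)): one shared
    LayerNorm, so the attention and MLP input matmuls can be fused."""
    h = layer_norm(x)
    return x + mlp(h) + attn(h)
```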
OPT-IML is a version of OPT fine-tuned on a large collection of 1500+ NLP tasks divided into various task categories.
DeeBERT is a method for accelerating BERT inference. It inserts extra classification layers (which are referred to as off-ramps) between each transformer layer of BERT. All transformer layers and off-ramps are jointly fine-tuned on a given downstream dataset. At inference time, after a sample goes through a transformer layer, it is passed to the following off-ramp. If the off-ramp is confident of the prediction, the result is returned; otherwise, the sample is sent to the next transformer layer.
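A minimal sketch of the early-exit inference loop, using entropy as the confidence measure (as in the paper) with illustrative stand-in layers and off-ramps:

```python
import numpy as np

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

def deebert_inference(x, layers, off_ramps, threshold=0.3):
    """Run transformer layers in order; after each, the corresponding off-ramp
    classifier predicts. Exit as soon as prediction entropy < threshold."""
    h = x
    for i, (layer, ramp) in enumerate(zip(layers, off_ramps)):
        h = layer(h)
        logits = ramp(h)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        if entropy(p) < threshold or i == len(layers) - 1:
            return p, i                          # prediction and exit layer index
```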
The Routing Transformer is a Transformer that endows self-attention with a sparse routing module based on online k-means. Each attention module considers a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other words, the current time-step query is routed to a limited number of contexts through its cluster assignment.
CuBERT, or Code Understanding BERT, is a BERT-based model for code understanding. In order to achieve this, the authors curate a massive corpus of Python programs collected from GitHub. GitHub projects are known to contain a large amount of duplicate code. To avoid biasing the model to such duplicated code, the authors perform deduplication using the method of Allamanis (2018). The resulting corpus has 7.4 million files with a total of 9.3 billion tokens (16 million unique).