158 machine learning methods and techniques
UNiversal Image-TExt Representation Learning
UNITER, or UNiversal Image-TExt Representation model, is a large-scale pre-trained model for joint multimodal embedding. It is pre-trained using four image-text datasets: COCO, Visual Genome, Conceptual Captions, and SBU Captions. It can power heterogeneous downstream V+L tasks with joint multimodal embeddings. UNITER takes the visual regions of the image and the textual tokens of the sentence as inputs. A Faster R-CNN is used in the Image Embedder to extract the visual features of each region, and a Text Embedder tokenizes the input sentence into WordPieces. UNITER proposes WRA via Optimal Transport to provide fine-grained alignment between word tokens and image regions, calculating the minimum cost of transporting the contextualized image embeddings to word embeddings and vice versa. Four pre-training tasks were designed for this model: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). This model differs from previous models in that it uses conditional masking on the pre-training tasks.
Reformer is a Transformer-based architecture that seeks to make efficiency improvements. Dot-product attention is replaced by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, Reformers use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers.
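As an illustration of the bucketing idea, here is a minimal NumPy sketch: queries and keys are hashed with random hyperplanes and each query attends only within its bucket. This is a single hash round with no chunking or causal masking, so it is a toy version of the mechanism rather than the full Reformer.

```python
import numpy as np

def lsh_buckets(x, n_planes=4, seed=0):
    """Hash vectors into 2**n_planes buckets via random hyperplanes (one LSH round)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((x.shape[-1], n_planes))
    bits = ((x @ planes) > 0).astype(int)        # sign pattern per token
    return bits @ (1 << np.arange(n_planes))     # bucket id per token

def bucketed_attention(q, k, v, n_planes=4):
    """Each query attends only to keys in its own LSH bucket (Reformer ties q and k)."""
    n, d = q.shape
    buckets = lsh_buckets(q, n_planes)
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = q[idx] @ k[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        out[idx] = (w / w.sum(axis=-1, keepdims=True)) @ v[idx]
    return out
```

The cost is roughly the sum of squared bucket sizes, which stays far below $L^2$ when the buckets are balanced.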
The Universal Transformer is a generalization of the Transformer architecture. Universal Transformers combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. They also utilise a dynamic per-position halting mechanism.
Parsing Incrementally for Constrained Auto-Regressive Decoding
MPNet is a pre-training method for language models that combines masked language modeling (MLM) and permuted language modeling (PLM) in one view. It takes the dependency among the predicted tokens into consideration through permuted language modeling and thus avoids the issue of BERT. On the other hand, it takes the position information of all tokens as input, so that the model sees the position information of all the tokens, alleviating the position discrepancy of XLNet. The training objective of MPNet is: $\mathbb{E}_{z \in \mathcal{Z}_n} \sum_{t=c+1}^{n} \log P\left(x_{z_t} \mid x_{z_{<t}}, M_{z_{>t}}; \theta\right)$. As can be seen, MPNet conditions on $x_{z_{<t}}$ (the tokens preceding the current predicted token $x_{z_t}$) rather than only the non-predicted tokens as in MLM; compared with PLM, MPNet takes more information (i.e., the mask symbols $M_{z_{>t}}$ in positions $z_{>t}$) as inputs. Although the objective seems simple, it is challenging to implement the model efficiently. For details, see the paper.
Linformer is a linear Transformer that utilises a linear self-attention mechanism to tackle the self-attention bottleneck with Transformer models. The original scaled dot-product attention is decomposed into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention.
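A minimal NumPy sketch of the low-rank decomposition: the $n \times n$ attention matrix never materializes because keys and values are projected down to a small fixed dimension first. The random projection matrices here stand in for the learned projections of the real model.

```python
import numpy as np

def linformer_attention(q, k, v, proj_dim=8, seed=0):
    """Scaled dot-product attention with keys/values linearly projected from
    sequence length n down to proj_dim, giving O(n * proj_dim) attention."""
    n, d = q.shape
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((proj_dim, n)) / np.sqrt(n)  # learned in the real model
    F = rng.standard_normal((proj_dim, n)) / np.sqrt(n)
    k_low, v_low = E @ k, F @ v                  # (proj_dim, d): low-rank bottleneck
    scores = q @ k_low.T / np.sqrt(d)            # (n, proj_dim) instead of (n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v_low
```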
BigBird is a Transformer with a sparse attention mechanism that reduces the quadratic dependency of self-attention to linear in the number of tokens. BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. In particular, BigBird consists of three main parts: - A set of global tokens attending on all parts of the sequence. - All tokens attending to a set of local neighboring tokens. - All tokens attending to a set of random tokens. This leads to a high-performing attention mechanism scaling to much longer sequence lengths (up to 8× longer).
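The three parts can be visualised by building a boolean attention mask. This is a small NumPy sketch with toy sizes, not BigBird's blocked implementation:

```python
import numpy as np

def bigbird_mask(n, n_global=2, window=1, n_random=2, seed=0):
    """Boolean attention mask combining global, sliding-window and random attention."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_global, :] = True        # global tokens attend everywhere...
    mask[:, :n_global] = True        # ...and are attended to by every token
    for i in range(n):               # local window of +/- `window` neighbours
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
    for i in range(n):               # a few random keys per query
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    return mask
```

Per row the number of attended positions is a constant independent of `n`, which is where the linear scaling comes from.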
Primer is a Transformer-based architecture that improves upon the Transformer architecture with two changes found through neural architecture search: squared ReLU activations in the feedforward block, and depthwise convolutions added to the attention multi-head projections, resulting in a new module called Multi-DConv-Head Attention.
Skip-gram Word2Vec is an architecture for computing word embeddings. Instead of using surrounding words to predict the center word, as with CBOW Word2Vec, Skip-gram Word2Vec uses the central word to predict the surrounding words. The skip-gram objective function sums the log probabilities of the surrounding words to the left and right of the target word to produce the following objective: $\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \neq 0} \log p\left(w_{t+j} \mid w_t\right)$, where $c$ is the size of the training context.
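A toy NumPy sketch of the skip-gram setup: generate (center, context) pairs from a window, then average the log probabilities under a full softmax. The embedding matrices `W_in` and `W_out` are hypothetical stand-ins, and the full softmax replaces the hierarchical softmax or negative sampling used in practice.

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each surrounding word."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

def skipgram_log_likelihood(pairs, W_in, W_out):
    """Average log p(context | center) under a full-softmax skip-gram model."""
    total = 0.0
    for center, context in pairs:
        logits = W_out @ W_in[center]            # score every vocabulary word
        logits -= logits.max()                   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        total += log_probs[context]
    return total / len(pairs)
```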
Routed Attention is an attention pattern proposed as part of the Routing Transformer architecture. Each attention module considers a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other words, the current time-step query is routed to a limited number of contexts through its cluster assignment. This can be contrasted with strided attention patterns and those proposed with the Sparse Transformer. In the image to the right, the rows represent the outputs while the columns represent the inputs. The different colors represent cluster memberships for the output token.
Contextual Word Vectors
CoVe, or Contextualized Word Vectors, uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors. Word embeddings are therefore a function of the entire input sequence. These word embeddings can then be used in downstream tasks by concatenating them with GloVe embeddings and feeding these in as features for the task-specific models.
E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
MobileBERT is a type of inverted-bottleneck BERT that compresses and accelerates the popular BERT model. MobileBERT is a thin version of BERT-LARGE, equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks. To train MobileBERT, a specially designed teacher model, an inverted-bottleneck BERT-LARGE model, is first trained. Then, knowledge is transferred from this teacher to MobileBERT by imitating it layer-to-layer. Like the original BERT, MobileBERT is task-agnostic; that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning.
Generative Emotion Estimator
The Levenshtein Transformer (LevT) is a type of transformer that aims to address the lack of flexibility of previous decoding models. Notably, in previous frameworks, the length of generated sequences is either fixed or monotonically increased as the decoding proceeds. The authors argue this is incompatible with human-level intelligence, where humans can revise, replace, revoke or delete any part of their generated text. Hence, LevT is proposed to bridge this gap by breaking the hitherto standardized decoding mechanism and replacing it with two basic operations — insertion and deletion. LevT is trained using imitation learning. The resulting model contains two policies that are executed in an alternating manner. The authors argue that with this model decoding becomes more flexible. For example, when the decoder is given an empty token, it falls back to a normal sequence generation model. On the other hand, the decoder acts as a refinement model when the initial state is a low-quality generated sequence. One crucial component in the LevT framework is the learning algorithm. The authors leverage the characteristics of insertion and deletion — they are complementary but also adversarial. The algorithm they propose is called “dual policy learning”. The idea is that when training one policy (insertion or deletion), the output from its adversary at the previous iteration is used as input. An expert policy, on the other hand, is drawn on to provide a correction signal.
Dialogue-Adaptive Pre-training Objective
Dialogue-Adaptive Pre-training Objective (DAPO) is a pre-training objective for dialogue adaptation, which is designed to measure qualities of dialogues from multiple important aspects, like Readability, Consistency and Fluency which have already been focused on by general LM pre-training objectives, and those also significant for assessing dialogues but ignored by general LM pre-training objectives, like Diversity and Specificity.
ProphetNet is a sequence-to-sequence pre-training model that introduces a novel self-supervised objective named future n-gram prediction and a proposed n-stream self-attention mechanism. Instead of optimizing one-step-ahead prediction as in the traditional sequence-to-sequence model, ProphetNet is optimized by n-step-ahead prediction, which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for future tokens and further helps predict multiple future tokens.
GPT-NeoX is an autoregressive transformer decoder model whose architecture largely follows that of GPT-3, with a few notable deviations. The model has 20 billion parameters with 44 layers, a hidden dimension size of 6144, and 64 heads. The main difference with GPT-3 is the change in tokenizer, the addition of Rotary Positional Embeddings, the parallel computation of attention and feed-forward layers, and a different initialization scheme and hyperparameters.
Galactica is a language model which uses a Transformer architecture in a decoder-only setup with the following modifications: - It uses GeLU activations on all model sizes - It uses a 2048 length context window for all model sizes - It does not use biases in any of the dense kernels or layer norms - It uses learned positional embeddings for the model - A vocabulary of 50k tokens was constructed using BPE. The vocabulary was generated from a randomly selected 2% subset of the training data
Edge-augmented Graph Transformer
Transformer neural networks have achieved state-of-the-art results for unstructured data such as text and images but their adoption for graph-structured data has been limited. This is partly due to the difficulty of incorporating complex structural information in the basic transformer framework. We propose a simple yet powerful extension to the transformer - residual edge channels. The resultant framework, which we call Edge-augmented Graph Transformer (EGT), can directly accept, process and output structural information as well as node information. It allows us to use global self-attention, the key element of transformers, directly for graphs and comes with the benefit of long-range interaction among nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. In addition, we introduce a generalized positional encoding scheme for graphs based on Singular Value Decomposition which can improve the performance of EGT. Our framework, which relies on global node feature aggregation, achieves better performance compared to Convolutional/Message-Passing Graph Neural Networks, which rely on local feature aggregation within a neighborhood. We verify the performance of EGT in a supervised learning setting on a wide range of experiments on benchmark datasets. Our findings indicate that convolutional aggregation is not an essential inductive bias for graphs and global self-attention can serve as a flexible and adaptive alternative.
UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.
Temporal Word Embeddings with a Compass
TWEC is an efficient method for generating temporal word embeddings, based on a simple heuristic: first train an atemporal word embedding, the compass, then use this embedding to freeze one of the layers of the CBOW architecture. The partly frozen architecture is then used to train time-specific slices that are all comparable with one another after training.
ENIGMA is an evaluation framework for dialog systems based on Pearson and Spearman's rank correlations between the estimated rewards and the true rewards. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies used to collect the experience data, which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors.
BLANC is an automatic estimation approach for document summary quality. The goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. BLANC achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document's text.
context2vec is an unsupervised model for learning generic context embeddings of wide sentential contexts, using a bidirectional LSTM. A large plain-text corpus is used to train a neural model that embeds entire sentential contexts and target words in the same low-dimensional space, which is optimized to reflect inter-dependencies between targets and their entire sentential context as a whole. In contrast to word2vec, which uses context modeling mostly internally and considers the target word embeddings its main output, the focus of context2vec is the context representation. context2vec achieves its objective by assigning similar embeddings to sentential contexts and their associated target words.
Mirror-BERT converts pretrained language models into effective universal text encoders without any supervision, in 20-30 seconds. It is an extremely simple, fast, and effective contrastive learning technique. It relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during identity fine-tuning.
Continuous Bag-of-Words Word2Vec
Continuous Bag-of-Words Word2Vec is an architecture for creating word embeddings that uses future words as well as past words to create a word embedding. The objective function for CBOW is: $\frac{1}{T}\sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}\right)$, where $c$ is the size of the context window. In the CBOW model, the distributed representations of the context are used to predict the word in the middle of the window. This contrasts with Skip-gram Word2Vec, where the distributed representation of the input word is used to predict the context.
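A toy NumPy sketch of the CBOW prediction step: the context word vectors are averaged and the average is scored against every output embedding. `W_in` and `W_out` are hypothetical embedding matrices, and the full softmax stands in for the approximations used in practice.

```python
import numpy as np

def cbow_log_prob(tokens, t, window, W_in, W_out):
    """log p(center | context): context vectors are averaged, then scored
    against all output embeddings with a full softmax."""
    ctx = [tokens[j] for j in range(t - window, t + window + 1)
           if j != t and 0 <= j < len(tokens)]
    h = W_in[ctx].mean(axis=0)                   # distributed context representation
    logits = W_out @ h
    logits -= logits.max()                       # numerical stability
    return (logits - np.log(np.exp(logits).sum()))[tokens[t]]
```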
Charformer is a type of Transformer model that learns a subword tokenization end-to-end as part of the model. Specifically it uses GBST that automatically learns latent subword representations from characters in a data-driven fashion. Following GBST, the soft subword sequence is passed through Transformer layers.
Macaw is a generative question-answering (QA) system that is built on UnifiedQA, itself built on T5. Macaw has three interesting features. First, it often produces high-quality answers to questions far outside the domain it was trained on, sometimes surprisingly so. Second, Macaw allows different permutations (“angles”) of inputs and outputs to be used. For example, we can give it a question and get an answer; or give it an answer and get a question; or give it a question and answer and get a set of multiple-choice (MC) options for that question. This multi-angle QA capability allows versatility in the way Macaw can be used, including recursively using outputs as new inputs to the system. Finally, Macaw also generates explanations as an optional output (or even input) element.
ConvBERT is a modification on the BERT architecture which uses a span-based dynamic convolution to replace self-attention heads to directly model local dependencies. Specifically a new mixed attention module replaces the self-attention modules in BERT, which leverages the advantages of convolution to better capture local dependency. Additionally, a new span-based dynamic convolution operation is used to utilize multiple input tokens to dynamically generate the convolution kernel. Lastly, ConvBERT also incorporates some new model designs including the bottleneck attention and grouped linear operator for the feed-forward module (reducing the number of parameters).
GBST
GBST, or Gradient-Based Subword Tokenization module, is a soft gradient-based subword tokenization module that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns a position-wise soft selection over them by scoring each block with a block scoring network. In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which enables easy inspection of lexical representations, and is more efficient than other byte-based models.
A Neural Cache, or a Continuous Cache, is a module for language modelling which stores previous hidden states in memory cells. They are then used as keys to retrieve their corresponding word, that is, the next word. There is no transformation applied to the storage during writing and reading. More formally, it exploits the hidden representations to define a probability distribution over the words in the cache. As illustrated in the Figure, the cache stores pairs $(h_i, x_{i+1})$ of a hidden representation and the word which was generated based on this representation (the vector $h_i$ encodes the history $x_1, \dots, x_i$). At time $t$, we then define a probability distribution over words stored in the cache based on the stored hidden representations and the current one as: $p_{\text{cache}}(w \mid h_{1..t}, x_{1..t}) \propto \sum_{i=1}^{t-1} \mathbb{1}\{w = x_{i+1}\} \exp\left(\theta\, h_t^\top h_i\right)$, where the scalar $\theta$ is a parameter which controls the flatness of the distribution. When $\theta$ is equal to zero, the probability distribution over the history is uniform, and the model is equivalent to a unigram cache model.
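A small NumPy sketch of the cache distribution described above, with illustrative names: the cache holds (hidden state, next word) pairs, and the current hidden state scores each stored entry.

```python
import numpy as np

def cache_distribution(h_t, cache, vocab_size, theta=1.0):
    """Neural cache distribution: p(w) is proportional to the sum over stored
    pairs (h_i, x_{i+1}) of 1{w == x_{i+1}} * exp(theta * h_t . h_i)."""
    p = np.zeros(vocab_size)
    scores = np.array([theta * (h_t @ h_i) for h_i, _ in cache])
    weights = np.exp(scores - scores.max())      # numerical stability
    for (_, word), w in zip(cache, weights):
        p[word] += w                             # words may repeat in the cache
    return p / p.sum()
```

With `theta=0.0` every stored pair contributes equally, so the distribution reduces to unigram counts over the cached history.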
Cross-encoder Reranking
Fastformer is a type of Transformer which uses additive attention as a building block. Instead of modeling the pair-wise interactions between tokens, additive attention is used to model global contexts, and then each token representation is further transformed based on its interaction with the global context representations.
Table Pre-training via Execution
TAPEX is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesising executable SQL queries.
DeCLUTR is an approach for learning universal sentence embeddings that utilizes a self-supervised objective that does not require labelled training data. The objective learns universal sentence embeddings by training an encoder to minimize the distance between the embeddings of textual segments randomly sampled from nearby in the same document.
Subformer is a Transformer that combines sandwich-style parameter sharing, which overcomes naive cross-layer parameter sharing in generative models, and self-attentive embedding factorization (SAFE). In SAFE, a small self-attention layer is used to reduce embedding parameter count.
Meena is a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. A seq2seq model is used with the Evolved Transformer as the main architecture. The model is trained on multi-turn conversations where the input sequence is all turns of the context and the output sequence is the response.
MixText is a semi-supervised learning method for text classification, which uses a new data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. The technique leverages advances in data augmentation to guess low-entropy labels for unlabeled data, making them as easy to use as labeled data.
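A minimal sketch of the TMix interpolation, assuming `h_a` and `h_b` are the hidden states of two samples at the chosen mixing layer; labels are mixed with the same coefficient.

```python
import numpy as np

def tmix(h_a, h_b, alpha=0.75, seed=0):
    """TMix: interpolate the hidden states of two training samples at a chosen
    layer with lambda drawn from Beta(alpha, alpha); taking max(lam, 1 - lam)
    keeps the mix closer to the first sample."""
    lam = np.random.default_rng(seed).beta(alpha, alpha)
    lam = max(lam, 1 - lam)
    return lam * h_a + (1 - lam) * h_b, lam
```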
Probabilistically Masked Language Model
Probabilistically Masked Language Model, or PMLM, is a type of language model that utilizes a probabilistic masking scheme, aiming to bridge the gap between masked and autoregressive language models. The basic idea behind the connection of the two categories of models is similar to MADE by Germain et al. (2015). PMLM is a masked language model with a probabilistic masking scheme, which defines the way sequences are masked by following a probabilistic distribution. The authors employ a simple uniform distribution of the masking ratio and name the model u-PMLM.
I-BERT is a quantized version of BERT that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, it performs an end-to-end integer-only BERT inference without any floating point calculation. In particular, GELU and Softmax are approximated with lightweight second-order polynomials, which can be evaluated with integer-only arithmetic. For LayerNorm, integer-only computation is performed by leveraging a known algorithm for integer calculation of square root.
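I-BERT's LayerNorm step requires an integer-only square root; a standard integer Newton iteration of the kind such algorithms build on looks like this (a generic sketch, not I-BERT's exact implementation):

```python
def integer_sqrt(n: int) -> int:
    """Floor square root computed with integer arithmetic only (Newton iteration)."""
    if n < 2:
        return n
    x = n
    y = (x + 1) // 2
    while y < x:                 # converges monotonically to floor(sqrt(n))
        x = y
        y = (x + n // x) // 2
    return x
```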
The Compressive Transformer is an extension to the Transformer which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory. It builds on the ideas of Transformer-XL, which maintains a memory of past activations at each layer to preserve a longer history of context. The Transformer-XL discards past activations when they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories, instead of discarding them, and store them in an additional compressed memory. At each time step, the oldest compressed memories are discarded (FIFO) and the oldest states from ordinary memory are compressed and shifted into the new slot in compressed memory. During training, the compressive memory component is optimized separately from the main language model (separate training loop).
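A simplified NumPy sketch of the memory update, using mean-pooling as the compression function (one of several explored in the paper); names and sizes are illustrative.

```python
import numpy as np

def update_memories(new_states, memory, comp_memory, mem_size, comp_size, rate=2):
    """FIFO memory update: states evicted from ordinary memory are mean-pooled
    (compression rate `rate`) into compressed memory instead of being discarded."""
    memory = np.concatenate([memory, new_states])
    overflow = len(memory) - mem_size
    if overflow > 0:
        evicted, memory = memory[:overflow], memory[overflow:]
        n = (len(evicted) // rate) * rate        # pool complete groups of `rate`
        if n > 0:
            pooled = evicted[:n].reshape(-1, rate, evicted.shape[-1]).mean(axis=1)
            comp_memory = np.concatenate([comp_memory, pooled])[-comp_size:]
    return memory, comp_memory
```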
A Gated Convolutional Network is a type of language model that combines convolutional networks with a gating mechanism. Zero padding is used to ensure future context cannot be seen. Gated convolutional layers can be stacked on top of one another hierarchically. Model predictions are then obtained with an adaptive softmax layer.
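A minimal NumPy sketch of one causal gated convolutional (GLU) layer; the left-only zero padding is what keeps future context hidden. The weight shapes are illustrative.

```python
import numpy as np

def gated_conv_layer(x, Wa, Wb, width=3):
    """Causal gated convolution: h = A * sigmoid(B), with zero-padding on the
    left only, so position t depends on inputs x_{t-width+1..t} and never on
    future positions."""
    n, d_in = x.shape
    xp = np.concatenate([np.zeros((width - 1, d_in)), x])  # left zero-padding
    out_dim = Wa.shape[-1]
    a = np.zeros((n, out_dim))
    b = np.zeros((n, out_dim))
    for t in range(n):
        window = xp[t:t + width].reshape(-1)     # flattened causal window
        a[t] = window @ Wa
        b[t] = window @ Wb
    return a * (1.0 / (1.0 + np.exp(-b)))        # GLU: linear path gated by sigmoid
```

Perturbing the last input position changes only the last output, confirming the causal structure.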
TSDAE is an unsupervised sentence embedding method. During training, TSDAE encodes corrupted sentences into fixed-sized vectors and requires the decoder to reconstruct the original sentences from this sentence embedding. For good reconstruction quality, the semantics must be captured well in the sentence embedding from the encoder. Later, at inference, only the encoder is used for creating sentence embeddings. The model architecture of TSDAE is a modified encoder-decoder Transformer where the key and value of the cross-attention are both confined to the sentence embedding only. Formally, the formulation of the modified cross-attention is: $H^{(k)} = \text{Attention}(H^{(k-1)}, [s^\top], [s^\top])$ with $\text{Attention}(Q, K, V) = \text{softmax}\left(QK^\top / \sqrt{d}\right)V$, where $H^{(k)} \in \mathbb{R}^{t \times d}$ is the decoder hidden states within $t$ decoding steps at the $k$-th layer, $d$ is the size of the sentence embedding, $[s^\top] \in \mathbb{R}^{1 \times d}$ is a one-row matrix containing the sentence embedding vector, and $Q$, $K$ and $V$ are the query, key and value, respectively. By exploring different configurations on the STS benchmark dataset, the authors discover that the best combination is: (1) adopting deletion as the input noise and setting the deletion ratio to 0.6, (2) using the output of the [CLS] token as the fixed-sized sentence representation, and (3) tying the encoder and decoder parameters during training.
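Because the key and value are confined to the single sentence embedding, the softmax over keys is trivially one and every decoding position receives the sentence embedding. A minimal NumPy sketch, with projection matrices omitted for clarity:

```python
import numpy as np

def confined_cross_attention(H, s):
    """TSDAE-style decoder cross-attention where key and value are both the
    one-row matrix [s^T]; with a single key, each position's output is s."""
    d = s.shape[0]
    K = V = s.reshape(1, d)                      # one-row key/value matrix [s^T]
    scores = H @ K.T / np.sqrt(d)                # (t, 1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum(axis=-1, keepdims=True)   # all ones: only one key
    return weights @ V
```

This degeneracy is the point: the decoder can only reconstruct the sentence through this bottleneck, forcing the encoder to pack the semantics into `s`.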
• Parallel Layers – We use a “parallel” formulation in each Transformer block (Wang & Komatsuzaki, 2021), rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as: y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))) Whereas the parallel formulation can be written as: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)) The parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.
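The two formulations can be sketched side by side; this is a toy NumPy version with stand-in attention and MLP functions, showing that the parallel block applies a single shared LayerNorm whose output feeds both sublayers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-position normalization (gain/bias omitted for brevity)."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def serialized_block(x, attn, mlp):
    """y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))"""
    x = x + attn(layer_norm(x))
    return x + mlp(layer_norm(x))

def parallel_block(x, attn, mlp):
    """y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)): one shared
    LayerNorm, so the attention and MLP input matmuls can be fused."""
    h = layer_norm(x)
    return x + mlp(h) + attn(h)
```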
OPT-IML is a version of OPT fine-tuned on a large collection of 1500+ NLP tasks divided into various task categories.
DeeBERT is a method for accelerating BERT inference. It inserts extra classification layers (which are referred to as off-ramps) between each transformer layer of BERT. All transformer layers and off-ramps are jointly fine-tuned on a given downstream dataset. At inference time, after a sample goes through a transformer layer, it is passed to the following off-ramp. If the off-ramp is confident of the prediction, the result is returned; otherwise, the sample is sent to the next transformer layer.
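A minimal sketch of the early-exit inference loop, using entropy as the confidence measure (as in the paper) with illustrative stand-in layers and off-ramps:

```python
import numpy as np

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

def deebert_inference(x, layers, off_ramps, threshold=0.3):
    """Run transformer layers in order; after each, the corresponding off-ramp
    classifier predicts. Exit as soon as prediction entropy < threshold."""
    h = x
    for i, (layer, ramp) in enumerate(zip(layers, off_ramps)):
        h = layer(h)
        logits = ramp(h)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        if entropy(p) < threshold or i == len(layers) - 1:
            return p, i                          # prediction and exit layer index
```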
The Routing Transformer is a Transformer that endows self-attention with a sparse routing module based on online k-means. Each attention module considers a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other words, the current time-step query is routed to a limited number of contexts through its cluster assignment.
CuBERT, or Code Understanding BERT, is a BERT-based model for code understanding. In order to achieve this, the authors curate a massive corpus of Python programs collected from GitHub. GitHub projects are known to contain a large amount of duplicate code. To avoid biasing the model to such duplicated code, the authors perform deduplication using the method of Allamanis (2018). The resulting corpus has 7.4 million files with a total of 9.3 billion tokens (16 million unique).