8,725 machine learning methods and techniques
Self-Attention Guidance
A Neural Turing Machine is a working memory neural network model. It couples a neural network architecture with external memory resources. The whole architecture is differentiable end-to-end with gradient descent. The models can infer tasks such as copying, sorting and associative recall. A Neural Turing Machine (NTM) architecture contains two basic components: a neural network controller and a memory bank. The figure presents a high-level diagram of the NTM architecture. Like most neural networks, the controller interacts with the external world via input and output vectors. Unlike a standard network, it also interacts with a memory matrix using selective read and write operations. By analogy to the Turing machine we refer to the network outputs that parameterise these operations as “heads.” Every component of the architecture is differentiable. This is achieved by defining 'blurry' read and write operations that interact to a greater or lesser degree with all the elements in memory (rather than addressing a single element, as in a normal Turing machine or digital computer). The degree of blurriness is determined by an attentional “focus” mechanism that constrains each read and write operation to interact with a small portion of the memory, while ignoring the rest. Because interaction with the memory is highly sparse, the NTM is biased towards storing data without interference. The memory location brought into attentional focus is determined by specialised outputs emitted by the heads. These outputs define a normalised weighting over the rows in the memory matrix (referred to as memory “locations”). Each weighting, one per read or write head, defines the degree to which the head reads or writes at each location. A head can thereby attend sharply to the memory at a single location or weakly to the memory at many locations.
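A minimal NumPy sketch (not the authors' code) of the content-based addressing and "blurry" read/write described above; the shapes, key sharpness `beta`, and toy values are illustrative assumptions.

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Content-based addressing: softmax over cosine similarities between the
    head's key vector and each memory row, sharpened by beta.
    memory: (N, M) matrix of N locations, key: (M,), beta: scalar >= 0."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    return w / w.sum()

def blurry_read(memory, weights):
    """'Blurry' read: a weighted sum over all memory rows."""
    return weights @ memory

def blurry_write(memory, weights, erase, add):
    """'Blurry' write: each row is erased and added to in proportion to its weight."""
    memory = memory * (1 - np.outer(weights, erase))
    return memory + np.outer(weights, add)

# toy usage
M = np.random.randn(8, 4)                                  # 8 locations, 4-dim contents
w = content_addressing(M, key=np.array([1., 0., 0., 0.]), beta=5.0)
r = blurry_read(M, w)                                      # read vector
M = blurry_write(M, w, erase=np.full(4, 0.5), add=np.array([0., 1., 0., 0.]))
```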
ShuffleNet V2 Block is an image model block used in the ShuffleNet V2 architecture, where speed is the metric optimized for (instead of indirect ones like FLOPs). It utilizes a simple operator called channel split. At the beginning of each unit, the input of $c$ feature channels is split into two branches with $c'$ and $c - c'$ channels, respectively. Following G3, one branch remains as identity. The other branch consists of three convolutions with the same input and output channels to satisfy G1. The two $1 \times 1$ convolutions are no longer group-wise, unlike the original ShuffleNet. This is partially to follow G2, and partially because the split operation already produces two groups. After convolution, the two branches are concatenated, so the number of channels stays the same (G1). The same “channel shuffle” operation as in ShuffleNet is then used to enable information communication between the two branches. The motivation behind channel split is that alternative architectures, where pointwise group convolutions and bottleneck structures are used, lead to increased memory access cost. Additionally, network fragmentation from group convolutions reduces parallelism (less friendly for GPU), and the element-wise addition operation, while low in FLOPs, has high memory access cost. Channel split is an alternative where we can maintain a large number of equally wide channels (equal width minimizes memory access cost) without dense convolutions or too many groups.
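A minimal PyTorch sketch of a stride-1 unit following the description above; the half/half split, layer widths, and the final shape check are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class ShuffleV2Block(nn.Module):
    """Channel split -> identity branch + (1x1 conv, 3x3 depthwise conv, 1x1 conv)
    branch -> concat -> channel shuffle. `channels` is assumed even."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                     # channel split
        out = torch.cat([x1, self.branch(x2)], dim=1)  # concat keeps channel count
        n, c, h, w = out.shape                         # channel shuffle with 2 groups
        return out.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)

x = torch.randn(1, 64, 32, 32)
print(ShuffleV2Block(64)(x).shape)                     # torch.Size([1, 64, 32, 32])
```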
Neighborhood Attention is a restricted self-attention pattern in which each token's receptive field is limited to its nearest neighboring pixels. It was proposed in the Neighborhood Attention Transformer as an alternative to other local attention mechanisms used in Hierarchical Vision Transformers. NA is similar in concept to stand-alone self-attention (SASA), in that both can be implemented with a raster-scan sliding-window operation over the key-value pairs. However, NA handles corner pixels differently, which helps maintain a fixed receptive field size and increases the number of relative positions. The primary challenge in experimenting with both NA and SASA has been computation: simply extracting key-value pairs for each query is slow, takes up a large amount of memory, and is eventually intractable at scale. NA was therefore implemented through a new CUDA extension to PyTorch, NATTEN.
Slime Mould Algorithm
Slime Mould Algorithm (SMA) is a stochastic optimizer based on the oscillation mode of slime mould in nature. SMA has several new features, with a unique mathematical model that uses adaptive weights to simulate the positive and negative feedback produced by the propagation wave of slime mould (based on a bio-oscillator) as it forms the optimal path for connecting to food, giving the algorithm excellent exploratory ability and exploitation propensity. 🔗 The source code of SMA is publicly available at https://aliasgharheidari.com/SMA.html
Graph Network-based Simulators
Graph Network-Based Simulators is a type of graph neural network that represents the state of a physical system with particles, expressed as nodes in a graph, and computes dynamics via learned message-passing.
WaveGlow is a flow-based generative model that generates audio by sampling from a distribution. Specifically, samples are drawn from a zero-mean spherical Gaussian with the same number of dimensions as the desired output, and those samples are passed through a series of layers that transform this simple distribution into one with the desired audio distribution.
Deep Kernel Learning
Adaptive Feature Pooling pools features from all levels for each proposal in object detection and fuses them for the following prediction. Each proposal is mapped to all feature levels. Following the idea of Mask R-CNN, RoIAlign is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse the feature grids from different levels. The motivation for this technique is that in an FPN proposals are assigned to different feature levels based on their size, which can be suboptimal: proposals with only small size differences may be assigned to different levels, and the importance of features may not be strongly correlated with the level they belong to.
Shifted Softplus is an activation function of the form $\text{ssp}(x) = \ln\left(0.5e^{x} + 0.5\right)$, which SchNet employs as the non-linearity throughout the network in order to obtain a smooth potential energy surface. The shifting ensures that $\text{ssp}(0) = 0$ and improves the convergence of the network. This activation function shows similarity to ELUs, while having infinite order of continuity.
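A tiny sketch of the activation exactly as defined above (equivalently, softplus shifted down by $\ln 2$):

```python
import numpy as np

def shifted_softplus(x):
    """ssp(x) = ln(0.5 * e^x + 0.5); satisfies ssp(0) = 0 and is smooth."""
    return np.log(0.5 * np.exp(x) + 0.5)

print(shifted_softplus(0.0))  # 0.0
```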
Deterministic Policy Gradient
Deterministic Policy Gradient, or DPG, is a policy gradient method for reinforcement learning. Instead of modeling the policy as a probability distribution, DPG considers a deterministic policy $\mu_{\theta}(s)$ and calculates gradients of the expected return with respect to its parameters.
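For reference, the deterministic policy gradient theorem gives the gradient of the performance objective $J(\mu_{\theta})$ as

$$\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_{\theta}\,\mu_{\theta}(s)\,\nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\right]$$

i.e., the expectation over the discounted state distribution $\rho^{\mu}$ of the gradient of the action-value function with respect to actions, chained through the gradient of the deterministic policy.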
AugMix mixes augmented images through linear interpolation. It is therefore similar to Mixup, but instead mixes several augmented versions of the same image.
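A minimal NumPy sketch of AugMix-style mixing, assuming the caller supplies a list of augmentation functions; the fixed chain depth and the default hyper-parameters are illustrative, not the paper's exact settings.

```python
import numpy as np

def augmix(image, augment_ops, k=3, depth=2, alpha=1.0):
    """k augmentation chains of the same image are combined with Dirichlet-sampled
    weights, then interpolated with the original image via a Beta-sampled weight m.
    `augment_ops` is a list of functions image -> image (caller-supplied)."""
    rng = np.random.default_rng()
    w = rng.dirichlet([alpha] * k)          # weights for the k chains
    m = rng.beta(alpha, alpha)              # interpolation weight with the original
    mix = np.zeros_like(image, dtype=np.float32)
    for i in range(k):
        aug = image.astype(np.float32)
        for _ in range(depth):              # apply `depth` random ops in sequence
            op = augment_ops[rng.integers(len(augment_ops))]
            aug = op(aug)
        mix += w[i] * aug
    return m * image.astype(np.float32) + (1 - m) * mix
```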
A Switch FFN is a sparse layer that operates independently on tokens within an input sequence. It is shown in the blue block in the figure. We diagram two tokens ($x_1$ = “More” and $x_2$ = “Parameters” below) being routed (solid lines) across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value (dotted line).
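A minimal PyTorch sketch of this top-1 routing, looping over experts for clarity rather than speed; the layer sizes and expert MLP shape are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Router picks the top-1 expert per token; the chosen expert's output is
    scaled by the router gate value."""
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)   # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 512)                   # e.g. "More", "Parameters"
print(SwitchFFN(512, 2048, n_experts=4)(tokens).shape)
```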
Reformer is a Transformer-based architecture that seeks to make efficiency improvements. Dot-product attention is replaced by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, Reformer uses reversible residual layers instead of the standard residuals, which allows storing activations only once during training instead of $N$ times, where $N$ is the number of layers.
Schrödinger Network
SchNet is an end-to-end deep neural network architecture based on continuous-filter convolutions. It follows the deep tensor neural network framework, i.e. atom-wise representations are constructed by starting from embedding vectors that characterize the atom type before introducing the configuration of the system by a series of interaction blocks.
Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from the sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating state-of-the-art performance compared to recent methods for set-structured data.
MetaFormer is a general architecture abstracted from Transformers by not specifying the token mixer.
FastSpeech 2 is a text-to-speech model that aims to improve upon FastSpeech by better solving the one-to-many mapping problem in TTS, i.e., multiple speech variations corresponding to the same text. It attempts to solve this problem by 1) directly training the model with the ground-truth target instead of the simplified output from a teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, in FastSpeech 2, we extract duration, pitch and energy from the speech waveform and directly take them as conditional inputs in training, and use predicted values in inference. The encoder converts the phoneme embedding sequence into the phoneme hidden sequence, then the variance adaptor adds different variance information such as duration, pitch and energy into the hidden sequence, and finally the mel-spectrogram decoder converts the adapted hidden sequence into the mel-spectrogram sequence in parallel. FastSpeech 2 uses a feed-forward Transformer block, which is a stack of self-attention and 1D-convolution as in FastSpeech, as the basic structure for the encoder and mel-spectrogram decoder.
Attention with Linear Biases
ALiBi, or Attention with Linear Biases, is a positioning method that allows Transformer language models to consume, at inference time, sequences which are longer than the ones they were trained on. ALiBi does this without using actual position embeddings. Instead, when computing the attention between a certain key and query, ALiBi penalizes the attention value that the query can assign to the key depending on how far apart the key and query are. So when a key and query are close by, the penalty is very low, and when they are far away, the penalty is very high. This method was motivated by the simple reasoning that words that are close by matter much more than ones that are far away. This method is as fast as the sinusoidal or absolute embedding methods (the fastest positioning methods there are). It outperforms those methods and rotary embeddings when evaluating sequences that are longer than the ones the model was trained on (this is known as extrapolation).
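A small sketch of how the linear bias can be built, assuming causal (decoder-style) attention and a head count that is a power of two so the paper's geometric slope sequence applies directly:

```python
import torch

def alibi_bias(n_heads, seq_len):
    """For head h with slope m_h, add -m_h * (i - j) to the attention score of
    query i attending to an earlier key j; distant pairs get larger penalties."""
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)   # distance i - j for j <= i
    return -slopes[:, None, None] * dist                # (n_heads, seq_len, seq_len)

# usage: scores = q @ k.transpose(-1, -2) / d**0.5 + alibi_bias(n_heads, L),
# then apply the causal mask and softmax as usual.
```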
Reinforcement Learning from AI Feedback
Meta Face Recognition
Meta Face Recognition (MFR) is a meta-learning face recognition method. MFR synthesizes the source/target domain shift with a meta-optimization objective, which requires the model to learn effective representations not only on synthesized source domains but also on synthesized target domains. Specifically, domain-shift batches are built through a domain-level sampling strategy and back-propagated gradients/meta-gradients are obtained on synthesized source/target domains by optimizing multi-domain distributions. The gradients and meta-gradients are further combined to update the model to improve generalization.
Model Soups
Model soups compress an ensemble of models into a single model by averaging their weights (under certain preconditions, such as the models being fine-tuned from the same pre-trained initialization).
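A minimal sketch of a uniform soup, assuming the checkpoints share an architecture; averaging every state-dict entry (including buffers) is a simplification.

```python
import torch

def uniform_soup(state_dicts):
    """Average the weights of several fine-tuned models and return a new state
    dict loadable into the same architecture."""
    soup = {}
    for key in state_dicts[0]:
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# usage: model.load_state_dict(uniform_soup([torch.load(p) for p in checkpoint_paths]))
```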
Random Erasing is a data augmentation method for training convolutional neural networks (CNNs) which randomly selects a rectangular region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing requires no parameter learning, is easy to implement, and can be integrated with most CNN-based recognition models. Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and can be applied to various vision tasks, such as image classification, object detection, and semantic segmentation. In the Albumentations library, there is a generalization of RandomErasing called CoarseDropout, which allows masking an arbitrary number of rectangular regions. It can be applied to images, segmentation masks, and key points. See the documentation for CoarseDropout.
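A minimal NumPy sketch of the operation; the probability, area, and aspect-ratio ranges are commonly used defaults and are assumptions here, not fixed by the description above.

```python
import numpy as np

def random_erasing(image, p=0.5, area=(0.02, 0.4), aspect=(0.3, 3.3)):
    """With probability p, pick a rectangle whose area and aspect ratio are sampled
    from the given ranges and fill it with random pixel values. `image` is HxWxC."""
    rng = np.random.default_rng()
    if rng.random() > p:
        return image
    h, w, c = image.shape
    for _ in range(10):                              # retry until the rectangle fits
        target_area = rng.uniform(*area) * h * w
        ratio = rng.uniform(*aspect)
        eh = int(round(np.sqrt(target_area * ratio)))
        ew = int(round(np.sqrt(target_area / ratio)))
        if 0 < eh < h and 0 < ew < w:
            top = rng.integers(0, h - eh)
            left = rng.integers(0, w - ew)
            image = image.copy()
            image[top:top + eh, left:left + ew] = rng.integers(0, 256, (eh, ew, c))
            return image
    return image
```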
Dual Path Network
A Dual Path Network (DPN) is a convolutional neural network which presents a new topology of connection paths internally. The intuition is that ResNets enable feature re-usage while DenseNets enable new feature exploration, and both are important for learning good representations. To enjoy the benefits from both path topologies, Dual Path Networks share common features while maintaining the flexibility to explore new features through dual path architectures. Such a dual path architecture can be formulated as:

$$x^{k} = \sum_{t=1}^{k-1} f_{t}^{k}\left(h^{t}\right)$$

$$y^{k} = \sum_{t=1}^{k-1} v_{t}\left(h^{t}\right) = y^{k-1} + \varphi^{k-1}\left(y^{k-1}\right)$$

$$r^{k} = x^{k} + y^{k}$$

$$h^{k} = g^{k}\left(r^{k}\right)$$

where $x^{k}$ and $y^{k}$ denote the extracted information at the $k$-th step from the individual paths, $h^{t}$ is the hidden state at step $t$, $f_{t}^{k}(\cdot)$ and $v_{t}(\cdot)$ are feature learning functions, and $g^{k}(\cdot)$ is a transformation function. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables common feature re-usage. The third equation defines the dual path that integrates them and feeds them to the last transformation function in the last equation.
RealNVP is a generative model that utilises real-valued non-volume preserving (real NVP) transformations for density estimation. The model can perform efficient and exact inference, sampling and log-density estimation of data points.
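A minimal PyTorch sketch of the affine coupling layer at the heart of RealNVP-style flows; the MLP that produces the scale and translation, the tanh bounding of the log-scale, and the even input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Half the dimensions pass through unchanged and parameterize a scale and
    translation applied to the other half. The Jacobian is triangular, so the
    log-determinant is simply the sum of the log-scales."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        # dim assumed even; the net outputs (log_s, t) for the second half
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                      # keep scales bounded for stability
        y2 = x2 * torch.exp(log_s) + t
        return torch.cat([x1, y2], dim=-1), log_s.sum(dim=-1)   # output, log|det J|

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        log_s, t = self.net(y1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=-1)
```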
A Dynamic Memory Network is a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN consists of a number of modules:

- Input Module: The input module encodes raw text inputs from the task into distributed vector representations. The input takes forms like a sentence, a long story, a movie review and so on.
- Question Module: The question module encodes the question of the task into a distributed vector representation. For question answering, the question may be a sentence such as "Where did the author first fly?". The representation is fed into the episodic memory module, and forms the basis, or initial state, upon which the episodic memory module iterates.
- Episodic Memory Module: Given a collection of input representations, the episodic memory module chooses which parts of the inputs to focus on through the attention mechanism. It then produces a "memory" vector representation taking into account the question as well as the previous memory. Each iteration provides the module with newly relevant information about the input. In other words, the module has the ability to retrieve new information, in the form of input representations, which were thought to be irrelevant in previous iterations.
- Answer Module: The answer module generates an answer from the final memory vector of the memory module.
Ensemble clustering, also called consensus clustering, has been attracting much attention in recent years, aiming to combine multiple base clusterings into a better and more robust consensus clustering. Due to its good performance, ensemble clustering plays a vital role in many research areas, such as community detection and bioinformatics.
The Universal Transformer is a generalization of the Transformer architecture. Universal Transformers combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. They also utilise a dynamic per-position halting mechanism.
Class activation guide
Class activation guide is a module which uses weak localization information from the instrument activation maps to guide the verb and target recognition. Image source: Nwoye et al.
LSGAN, or Least Squares GAN, is a type of generative adversarial network that adopts the least squares loss function for the discriminator. Minimizing the objective function of LSGAN yields minimizing the Pearson $\chi^2$ divergence. The objective functions can be defined as:

$$\min_{D} V_{\text{LSGAN}}(D) = \frac{1}{2}\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\left[\left(D(\mathbf{x}) - b\right)^{2}\right] + \frac{1}{2}\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\left(D(G(\mathbf{z})) - a\right)^{2}\right]$$

$$\min_{G} V_{\text{LSGAN}}(G) = \frac{1}{2}\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\left(D(G(\mathbf{z})) - c\right)^{2}\right]$$

where $a$ and $b$ are the labels for fake data and real data, and $c$ denotes the value that $G$ wants $D$ to believe for fake data.
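A small sketch of the two losses above, assuming the discriminator outputs raw (unbounded) values and the common choice $a=0$, $b=c=1$:

```python
import torch

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Least-squares discriminator loss: push D(x) toward the real label b and
    D(G(z)) toward the fake label a."""
    return 0.5 * ((d_real - b) ** 2).mean() + 0.5 * ((d_fake - a) ** 2).mean()

def lsgan_g_loss(d_fake, c=1.0):
    """Least-squares generator loss: push D(G(z)) toward the value c that the
    generator wants the discriminator to believe for fake data."""
    return 0.5 * ((d_fake - c) ** 2).mean()
```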
CheXNet is a 121-layer DenseNet trained on ChestX-ray14 for pneumonia detection.
Self-Cure Network
Self-Cure Network, or SCN, is a method for suppressing uncertainties in large-scale facial expression recognition, preventing deep networks from over-fitting uncertain facial images. Specifically, SCN suppresses the uncertainty from two different aspects: 1) a self-attention mechanism over the mini-batch to weight each training sample with a ranking regularization, and 2) a careful relabeling mechanism to modify the labels of the samples in the lowest-ranked group.
Path Aggregation Network, or PANet, aims to boost information flow in a proposal-based instance segmentation framework. Specifically, the feature hierarchy is enhanced with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and topmost feature. Additionally, adaptive feature pooling is employed, which links feature grid and all feature levels to make useful information in each feature level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction.
TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://www.paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings which are fed to a Transformer encoder, analogously to token features computed from words in NLP.
Population Based Training, or PBT, is an optimization method for finding parameters and hyperparameters, and extends upon parallel search methods and sequential optimisation methods. It leverages information sharing across a population of concurrently running optimisation processes, and allows for online propagation/transfer of parameters and hyperparameters between members of the population based on their performance. Furthermore, unlike most other adaptation schemes, the method is capable of performing online adaptation of hyperparameters -- which can be particularly important in problems with highly non-stationary learning dynamics, such as reinforcement learning settings. PBT is decentralised and asynchronous, although it could also be executed semi-serially or with partial synchrony if there is a binding budget constraint.
Parsing Incrementally for Constrained Auto-Regressive Decoding
Inception-ResNet-v2-B is an image model block for a 17 x 17 grid used in the Inception-ResNet-v2 architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections.
Inception-ResNet-v2 Reduction-B is an image model block used in the Inception-ResNet-v2 architecture.
Enhanced-Multimodal Fuzzy Framework
A BCI MI framework to classify brain signals using a multimodal decision-making phase, with an additional differentiation of the signal.
Inception-ResNet-v2-C is an image model block for an 8 x 8 grid used in the Inception-ResNet-v2 architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections.
Gait Emotion Recognition
We present a novel classifier network called STEP to classify perceived human emotion from gaits, based on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits the gait features to classify the perceived emotion of the human into one of four emotions: happy, sad, angry, or neutral. We train STEP on annotated real-world gait videos, augmented with annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP. We also release a novel dataset (E-Gait), which consists of 4,227 human gaits annotated with perceived emotions along with thousands of synthetic gaits. In practice, STEP can learn the affective features and exhibits classification accuracy of 88% on E-Gait, which is 14-30% more accurate than prior methods.
Neural Additive Model
Neural Additive Models (NAMs) impose restrictions on the structure of neural networks, which yields a family of models that are inherently interpretable while suffering little loss in prediction accuracy when applied to tabular data. Methodologically, NAMs belong to a larger model family called Generalized Additive Models (GAMs). NAMs learn a linear combination of networks that each attend to a single input feature: each shape function $f_i$ in the traditional GAM formulation is parametrized by a neural network. These networks are trained jointly using backpropagation and can learn arbitrarily complex shape functions. Interpreting NAMs is easy, as the impact of a feature on the prediction does not rely on the other features and can be understood by visualizing its corresponding shape function (e.g., plotting $f_i(x_i)$ vs. $x_i$).
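A minimal PyTorch sketch of the additive structure, using plain ReLU MLPs as the per-feature shape functions (the paper's specialized units and regularization are omitted):

```python
import torch
import torch.nn as nn

class NeuralAdditiveModel(nn.Module):
    """One small network (shape function) per input feature; outputs are summed with
    a bias to form the prediction. Plotting each f_i(x_i) against x_i gives the
    model's interpretation for that feature."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.feature_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_features)
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                              # x: (batch, n_features)
        contributions = [net(x[:, i:i + 1]) for i, net in enumerate(self.feature_nets)]
        return torch.cat(contributions, dim=-1).sum(dim=-1) + self.bias

print(NeuralAdditiveModel(5)(torch.randn(8, 5)).shape)   # torch.Size([8])
```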
A (2+1)D Convolution is a type of convolution used for action recognition convolutional neural networks, with a spatiotemporal volume. As opposed to applying a 3D Convolution over the entire volume, which can be computationally expensive and lead to overfitting, a (2+1)D convolution splits computation into two convolutions: a spatial 2D convolution followed by a temporal 1D convolution.
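A small PyTorch sketch of the factorization, assuming 5D video tensors of shape (batch, channels, frames, height, width); in the original formulation the intermediate width is chosen so the parameter count matches the full 3D convolution, whereas here it is left as a plain argument.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A 3D conv with kernel (t, k, k) factorized into a spatial conv with kernel
    (1, k, k) followed by a temporal conv with kernel (t, 1, 1), with a
    non-linearity in between."""
    def __init__(self, in_channels, mid_channels, out_channels, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, mid_channels, (1, k, k), padding=(0, k // 2, k // 2))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels, (t, 1, 1), padding=(t // 2, 0, 0))

    def forward(self, x):                      # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.spatial(x)))

print(Conv2Plus1D(3, 45, 64)(torch.randn(1, 3, 8, 32, 32)).shape)  # (1, 64, 8, 32, 32)
```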
MPNet is a pre-training method for language models that combines masked language modeling (MLM) and permuted language modeling (PLM) in one view. It takes the dependency among the predicted tokens into consideration through permuted language modeling and thus avoids the issue of BERT. On the other hand, it takes the position information of all tokens as input to make the model see the position information of all the tokens, and thus alleviates the position discrepancy of XLNet. The training objective of MPNet is:

$$\mathbb{E}_{z \sim \mathcal{Z}_{n}} \sum_{t=c+1}^{n} \log P\left(x_{z_{t}} \mid x_{z_{<t}}, M_{z_{>c}}; \theta\right)$$

As can be seen, MPNet conditions on $x_{z_{<t}}$ (the tokens preceding the current predicted token $x_{z_{t}}$) rather than only the non-predicted tokens as in MLM; compared with PLM, MPNet takes more information (i.e., the mask symbols $M_{z_{>c}}$ in positions $z_{>c}$) as inputs. Although the objective seems simple, it is challenging to implement the model efficiently. For details, see the paper.
Linformer is a linear Transformer that utilises a linear self-attention mechanism to tackle the self-attention bottleneck with Transformer models. The original scaled dot-product attention is decomposed into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention.
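A single-head sketch of the idea, assuming a fixed maximum sequence length and a projected length k; the parameter names E and F mirror the usual description of the key/value projections but the initialization and sizes here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinformerSelfAttention(nn.Module):
    """Keys and values are projected from length n down to a fixed length k before
    the scaled dot-product, so the attention map is n x k instead of n x n."""
    def __init__(self, d_model, seq_len, k=64):
        super().__init__()
        self.q, self.key, self.v = (nn.Linear(d_model, d_model) for _ in range(3))
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # projects keys
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # projects values

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        q, key, v = self.q(x), self.key(x), self.v(x)
        key = self.E @ key                     # (batch, k, d_model)
        v = self.F @ v                         # (batch, k, d_model)
        attn = F.softmax(q @ key.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                        # (batch, seq_len, d_model)
```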
ProxylessNAS directly learns neural network architectures on the target task and target hardware without any proxy task. Additional contributions include:

- A new path-level pruning perspective for neural architecture search, showing a close connection between NAS and model compression. Memory consumption is reduced by one order of magnitude through path-level binarization.
- A novel gradient-based approach (latency regularization loss) for handling hardware objectives (e.g., latency). Given different hardware platforms (CPU/GPU/mobile), ProxylessNAS enables hardware-aware neural network specialization that is exactly optimized for the target hardware.
PrIme Sample Attention
PrIme Sample Attention (PISA) directs the training of object detection frameworks towards prime samples. These are samples that play a key role in driving the detection performance. The authors define Hierarchical Local Rank (HLR) as a metric of importance. Specifically, they use IoU-HLR to rank positive samples and ScoreHLR to rank negative samples in each mini-batch. This ranking strategy places the positive samples with highest IoUs around each object and the negative samples with highest scores in each cluster to the top of the ranked list and directs the focus of the training process to them via a simple re-weighting scheme. The authors also devise a classification-aware regression loss to jointly optimize the classification and regression branches. Particularly, this loss would suppress those samples with large regression loss, thus reinforcing the attention to prime samples.