Principal Components Analysis
Principal Components Analysis (PCA) is an unsupervised method primarily used for dimensionality reduction within machine learning. PCA is calculated via a singular value decomposition (SVD) of the design matrix, or alternatively, by computing the covariance matrix of the data and performing an eigenvalue decomposition on it. The results of PCA provide a low-dimensional picture of the structure of the data and the leading (uncorrelated) latent factors determining variation in the data. Image Source: Wikipedia
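As an illustration, the SVD route described above can be sketched in NumPy; the toy data and the choice of one component are arbitrary:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its leading principal components via SVD.

    X: (n_samples, n_features) design matrix.
    Returns the projected data and the component directions.
    """
    # Center the data: PCA assumes zero-mean features.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered design matrix; rows of Vt are the
    # principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    return projected, components

# Toy data: points stretched mostly along the direction (3, 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0]]) + 0.1 * rng.normal(size=(200, 2))
Z, comps = pca(X, n_components=1)
```

The recovered first component is (up to sign) the dominant direction of variation in the toy data.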
Depthwise Convolution is a type of convolution where we apply a single convolutional filter for each input channel. In the regular 2D convolution performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. In contrast, depthwise convolutions keep each channel separate. To summarize the steps, we: 1. Split the input and filter into channels. 2. We convolve each input with the respective filter. 3. We stack the convolved outputs together. Image Credit: Chi-Feng Wang
Pointwise Convolution is a type of convolution that uses a 1x1 kernel: a kernel that iterates through every single point. This kernel has a depth of however many channels the input image has. It can be used in conjunction with depthwise convolutions to produce an efficient class of convolutions known as depthwise-separable convolutions. Image Credit: Chi-Feng Wang
Retrieval-Augmented Generation, or RAG, is a type of language generation model that combines pre-trained parametric and non-parametric memory for language generation. Specifically, the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. For query $x$, Maximum Inner Product Search (MIPS) is used to find the top-K documents $z_i$. For the final prediction $y$, we treat $z$ as a latent variable and marginalize over seq2seq predictions given different documents.
GPT is a Transformer-based architecture and training procedure for natural language processing tasks. Training follows a two-stage procedure. First, a language modeling objective is used on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, these parameters are adapted to a target task using the corresponding supervised objective.
While standard convolution performs the channel-wise and spatial-wise computation in one step, Depthwise Separable Convolution splits the computation into two steps: a depthwise convolution applies a single convolutional filter to each input channel, and a pointwise convolution is then used to create a linear combination of the depthwise convolution's outputs. The comparison of standard convolution and depthwise separable convolution is shown to the right. Credit: Depthwise Convolution Is All You Need for Learning Multiple Visual Domains
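The efficiency of the two-step factorization can be seen by counting weights. A minimal sketch (the layer sizes are illustrative, not taken from any particular network):

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution (one filter per input
    channel) followed by a 1x1 pointwise convolution (bias ignored)."""
    return c_in * k * k + c_in * c_out

standard = conv_params(128, 256, 3)
separable = depthwise_separable_params(128, 256, 3)
ratio = standard / separable  # roughly k^2-fold fewer weights when c_out is large
```

For a 3x3 kernel with 128 input and 256 output channels, the separable version needs 33,920 weights against 294,912 for the standard convolution.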
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. It is based on the transformer architecture with various improvements that were subsequently proposed. The main differences from the original architecture are listed below. - RMSNorm normalizing function is used to improve the training stability, by normalizing the input of each transformer sub-layer, instead of normalizing the output. - The ReLU non-linearity is replaced by the SwiGLU activation function to improve performance. - Absolute positional embeddings are removed and instead rotary positional embeddings (RoPE) are added at each layer of the network.
Region Proposal Network
A Region Proposal Network, or RPN, is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals. RPN and algorithms like Fast R-CNN can be merged into a single network by sharing their convolutional features - using the recently popular terminology of neural networks with attention mechanisms, the RPN component tells the unified network where to look. RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. RPNs use anchor boxes that serve as references at multiple scales and aspect ratios. The scheme can be thought of as a pyramid of regression references, which avoids enumerating images or filters of multiple scales or aspect ratios.
Train a convolutional neural network to generate the contents of an arbitrary image region conditioned on its surroundings.
Graph Convolutional Network
A Graph Convolutional Network, or GCN, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of convolutional neural networks which operate directly on graphs. The choice of convolutional architecture is motivated via a localized first-order approximation of spectral graph convolutions. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes.
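A single GCN propagation step can be sketched in NumPy; the graph, features, and weight shapes below are illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: (n, n) adjacency matrix, H: (n, f_in) node features,
    W: (f_in, f_out) learnable weights.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Tiny 3-node path graph with 2-d features and random (untrained) weights.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))
H_next = gcn_layer(A, H, W)
```

Each output row mixes a node's own features with those of its neighbors, which is how local graph structure enters the hidden representations.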
Proximal Policy Optimization
Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization. Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so $r(\theta_{old}) = 1$. TRPO maximizes a “surrogate” objective: $L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]$, where $CPI$ refers to conservative policy iteration. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update; hence, PPO modifies the objective to penalize changes to the policy that move $r_t(\theta)$ away from 1: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$, where $\epsilon$ is a hyperparameter, say, $\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. The second term, $\text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the interval $[1-\epsilon, 1+\epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse. One detail to note is that when we apply PPO for a network where we have shared parameters for actor and critic functions, we typically add to the objective function an error term on value estimation and an entropy term to encourage exploration.
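The clipped surrogate can be sketched numerically; the ratios and advantages below are toy values:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum makes the surrogate a pessimistic bound:
    # the clipped term only wins when it makes the objective worse.
    return np.minimum(unclipped, clipped)

ratio = np.array([0.5, 1.0, 1.5])   # probability ratios r_t
adv = np.array([1.0, 1.0, 1.0])     # advantage estimates
obj = ppo_clip_objective(ratio, adv)
```

With a positive advantage, ratios above 1 + eps are capped (1.5 contributes only 1.2), while ratios below 1 are passed through unchanged.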
Segment Anything Model
Experience Replay is a replay memory technique used in reinforcement learning where we store the agent’s experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step in a dataset $D = e_1, \ldots, e_N$, pooled over many episodes into a replay memory. We then usually sample the memory randomly for a minibatch of experience, and use this to learn off-policy, as with Deep Q-Networks. This tackles the problem of autocorrelation leading to unstable training, by making the problem more like a supervised learning problem. Image Credit: Hands-On Reinforcement Learning with Python, Sudharsan Ravichandiran
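A minimal replay memory can be sketched in Python; the capacity and batch size are illustrative, and a real implementation would store full transition tuples produced by an environment:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest experiences are evicted

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=100)
for t in range(150):                 # push 150 dummy transitions
    buffer.push(t, 0, 1.0, t + 1)
batch = buffer.sample(32)
```

Because the deque is bounded, only the most recent 100 transitions remain available for sampling.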
ADaptive gradient method with the OPTimal convergence rate
A Gated Linear Unit, or GLU, computes: $\mathrm{GLU}(X) = (XW + b) \otimes \sigma(XV + c)$. It is used in natural language processing architectures, for example the Gated CNN, because here $\sigma(XV + c)$ is the gate that controls what information from $XW + b$ is passed up to the following layer. Intuitively, for a language modeling task, the gating mechanism allows selection of words or features that are important for predicting the next word. The GLU also has non-linear capabilities, but has a linear path for the gradient, so diminishes the vanishing gradient problem.
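A sketch of the GLU computation in NumPy, with arbitrary toy shapes and randomly initialized (untrained) weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(X, W, b, V, c):
    """GLU(X) = (XW + b) * sigmoid(XV + c); the sigmoid branch gates
    how much of the linear branch is passed upward."""
    return (X @ W + b) * sigmoid(X @ V + c)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 inputs of dimension 8
W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
b, c = np.zeros(4), np.zeros(4)
out = glu(X, W, b, V, c)
```

Since the gate lies in (0, 1), each output is an attenuated copy of the corresponding linear-branch value, which is exactly the "linear path for the gradient" the entry describes.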
k-Means Clustering is a clustering algorithm that divides a training set into $k$ different clusters of examples that are near each other. It works by initializing $k$ different centroids $\{\mu^{(1)}, \ldots, \mu^{(k)}\}$ to different values, then alternating between two steps until convergence: (i) each training example $x^{(i)}$ is assigned to cluster $j$, where $j$ is the index of the nearest centroid $\mu^{(j)}$; (ii) each centroid $\mu^{(j)}$ is updated to the mean of all training examples $x^{(i)}$ assigned to cluster $j$. Text Source: Deep Learning, Goodfellow et al Image Source: scikit-learn
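The two alternating steps can be sketched in NumPy; note the deterministic centroid initialization here is a simplification of the usual random choice:

```python
import numpy as np

def kmeans(X, k, n_iters=50):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid updates for a fixed number of iterations."""
    # Initialize centroids to k evenly spaced training examples
    # (a simple deterministic stand-in for random initialization).
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iters):
        # (i) assign each example to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (ii) move each centroid to the mean of its assigned examples
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated 2-d blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
```

On this toy data the algorithm recovers the two blobs exactly, with each centroid landing on a blob mean.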
Normalizing Flows are a method for constructing complex distributions by transforming a probability density through a series of invertible mappings. By repeatedly applying the rule for change of variables, the initial density ‘flows’ through the sequence of invertible mappings. At the end of this sequence we obtain a valid probability distribution and hence this type of flow is referred to as a normalizing flow. In the case of finite flows, the basic rule for the transformation of densities considers an invertible, smooth mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^d$ with inverse $f^{-1} = g$, i.e. the composition $g \circ f(z) = z$. If we use this mapping to transform a random variable $z$ with distribution $q(z)$, the resulting random variable $z' = f(z)$ has a distribution: $q(z') = q(z)\left|\det\frac{\partial f^{-1}}{\partial z'}\right| = q(z)\left|\det\frac{\partial f}{\partial z}\right|^{-1}$, where the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying the above equation. The density $q_K(z_K)$ obtained by successively transforming a random variable $z_0$ with distribution $q_0$ through a chain of $K$ transformations $f_k$ is: $z_K = f_K \circ \cdots \circ f_2 \circ f_1(z_0)$, $\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K}\ln\left|\det\frac{\partial f_k}{\partial z_{k-1}}\right|$. The path traversed by the random variables $z_k = f_k(z_{k-1})$ with initial distribution $q_0(z_0)$ is called the flow and the path formed by the successive distributions $q_k$ is a normalizing flow.
GPT-2 is a Transformer architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on a WebText dataset - text from 45 million website links. It largely follows the previous GPT architecture with some modifications: - Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization was added after the final self-attention block. - A modified initialization which accounts for the accumulation on the residual path with model depth is used. Weights of residual layers are scaled at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers. - The vocabulary is expanded to 50,257. The context size is expanded from 512 to 1024 tokens and a larger batch size of 512 is used.
Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved through maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \times m$ matrix, this reduces the memory requirements from $O(nm)$ to $O(n+m)$. Instead of defining the optimization algorithm in terms of absolute step sizes $\{\alpha_t\}_{t=1}^T$, the authors define the optimization algorithm in terms of relative step sizes $\{\rho_t\}_{t=1}^T$, which get multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_2$. The reason for this lower bound is to allow zero-initialized parameters to escape 0. Proposed hyperparameters are: $\epsilon_1 = 10^{-30}$, $\epsilon_2 = 10^{-3}$, $d = 1$, $\hat{\beta}_{2t} = 1 - t^{-0.8}$, $\rho_t = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right)$.
Masked autoencoder
T5, or Text-to-Text Transfer Transformer, is a Transformer based architecture that uses a text-to-text approach. Every task – including translation, question answering, and classification – is cast as feeding the model text as input and training it to generate some target text. This allows for the use of the same model, loss function, hyperparameters, etc. across our diverse set of tasks. The changes compared to BERT include: - adding a causal decoder to the bidirectional architecture. - replacing the fill-in-the-blank cloze task with a mix of alternative pre-training tasks.
Greedy Policy Search
Greedy Policy Search (GPS) is a simple algorithm that learns a policy for test-time data augmentation based on the predictive performance on a validation set. GPS starts with an empty policy and builds it in an iterative fashion. Each step selects a sub-policy that provides the largest improvement in calibrated log-likelihood of ensemble predictions and adds it to the current policy.
Inverse Square Root is a learning rate schedule $1/\sqrt{\max(n, k)}$ where $n$ is the current training iteration and $k$ is the number of warm-up steps. This sets a constant learning rate for the first $k$ steps, then decays the learning rate in proportion to the inverse square root of the step number until pre-training is over.
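A sketch of the schedule as a Python function, assuming a base learning rate scale of 1:

```python
def inverse_sqrt_lr(step, warmup_steps, base_lr=1.0):
    """lr = base_lr / sqrt(max(step, warmup_steps)): constant during
    warm-up, then inverse-square-root decay."""
    return base_lr / max(step, warmup_steps) ** 0.5

# With 100 warm-up steps the rate holds at 0.1, then decays.
lrs = [inverse_sqrt_lr(t, warmup_steps=100) for t in range(1, 1001)]
```

During warm-up the `max` keeps the denominator fixed at `sqrt(warmup_steps)`, after which the step count takes over and the rate falls monotonically.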
Sequence to Sequence
Seq2Seq, or Sequence To Sequence, is a model used in sequence prediction tasks, such as language modelling and machine translation. The idea is to use one LSTM, the encoder, to read the input sequence one timestep at a time, to obtain a large fixed dimensional vector representation (a context vector), and then to use another LSTM, the decoder, to extract the output sequence from that vector. The second LSTM is essentially a recurrent neural network language model except that it is conditioned on the input sequence. (Note that this page refers to the original seq2seq not general sequence-to-sequence models)
Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect. The main difference between causal inference and inference of association is that the former analyzes the response of the effect variable when the cause is changed.
Gated Recurrent Unit
A Gated Recurrent Unit, or GRU, is a type of recurrent neural network. It is similar to an LSTM, but only has two gates - a reset gate and an update gate - and notably lacks an output gate. Fewer parameters means GRUs are generally easier/faster to train than their LSTM counterparts. Image Source: here
Bidirectional LSTM
A Bidirectional LSTM, or biLSTM, is a sequence processing model that consists of two LSTMs: one taking the input in a forward direction, and the other in a backwards direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm (e.g. knowing what words immediately follow and precede a word in a sentence). Image Source: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks, Cornegruta et al
Mixup is a data augmentation technique that generates a weighted combination of random image pairs from the training data. Given two images and their ground truth labels: $(x_i, y_i), (x_j, y_j)$, a synthetic training example $(\hat{x}, \hat{y})$ is generated as: $\hat{x} = \lambda x_i + (1-\lambda)x_j$, $\hat{y} = \lambda y_i + (1-\lambda)y_j$, where $\lambda \sim \text{Beta}(\alpha, \alpha)$ is independently sampled for each augmented example.
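A sketch of mixup for one pair in NumPy; the "images" here are constant arrays so the blend is easy to inspect:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend one pair of examples and their labels with a weight
    lam drawn from Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2   # pixel-wise blend of the two images
    y = lam * y1 + (1.0 - lam) * y2   # same blend applied to one-hot labels
    return x, y, lam

# Constant "images" and one-hot labels make the result transparent.
x1, y1 = np.full((4, 4), 1.0), np.array([1.0, 0.0])
x2, y2 = np.full((4, 4), 0.0), np.array([0.0, 1.0])
x, y, lam = mixup(x1, y1, x2, y2, rng=np.random.default_rng(0))
```

The blended label is soft rather than one-hot, which is why mixup is usually trained with a cross-entropy loss that accepts label distributions.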
Region of Interest Align, or RoIAlign, is an operation for extracting a small feature map from each RoI in detection and segmentation based tasks. It removes the harsh quantization of RoI Pool, properly aligning the extracted features with the input. To avoid any quantization of the RoI boundaries or bins (using $x/16$ instead of $[x/16]$), RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the result is then aggregated (using max or average).
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers’ computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pre-training and downstream evaluation.
A Grouped Convolution uses a group of convolutions - multiple kernels per layer - resulting in multiple channel outputs per layer. This leads to wider networks, helping a network learn a varied set of low-level and high-level features. The original motivation of using Grouped Convolutions in AlexNet was to distribute the model over multiple GPUs as an engineering compromise. But later, with models such as ResNeXt, it was shown this module could be used to improve classification accuracy. Specifically, by exposing a new dimension through grouped convolutions, cardinality (the size of the set of transformations), accuracy can be improved by increasing the cardinality.
Instance Normalization (also known as contrast normalization) is a normalization layer where: $y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}, \quad \mu_{ti} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \quad \sigma_{ti}^2 = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H}\left(x_{tilm} - \mu_{ti}\right)^2$. This prevents instance-specific mean and covariance shift, simplifying the learning process. Intuitively, the normalization process removes instance-specific contrast information from the content image in a task like image stylization, which simplifies generation.
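A sketch of the normalization in NumPy, without the learnable affine parameters some implementations add:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (instance, channel) pair over its spatial dims.

    x: (batch, channels, height, width).
    """
    mean = x.mean(axis=(2, 3), keepdims=True)  # per-instance, per-channel mean
    var = x.var(axis=(2, 3), keepdims=True)    # per-instance, per-channel variance
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 3, 8, 8))
y = instance_norm(x)
```

Unlike batch normalization, the statistics here never mix information across examples, so each image's own contrast is removed independently.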
The Squeeze-and-Excitation Block is an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. The process is: - The block has a convolutional block as an input. - Each channel is "squeezed" into a single numeric value using average pooling. - A dense layer followed by a ReLU adds non-linearity and output channel complexity is reduced by a ratio. - Another dense layer followed by a sigmoid gives each channel a smooth gating function. - Finally, we weight each feature map of the convolutional block based on the side network; the "excitation".
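The five steps above can be sketched in NumPy for a single input; the weights are random stand-ins for the two trained dense layers, and the reduction ratio is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation for one input x of shape (channels, H, W).

    w1: (channels // r, channels) reduction weights;
    w2: (channels, channels // r) restoration weights.
    """
    z = x.mean(axis=(1, 2))        # squeeze: global average pool per channel
    s = np.maximum(0, w1 @ z)      # dense + ReLU bottleneck (ratio r)
    s = sigmoid(w2 @ s)            # dense + sigmoid: per-channel gates in (0, 1)
    return x * s[:, None, None]    # excitation: reweight each feature map

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))
w1 = rng.normal(size=(2, 8))       # reduction ratio r = 4
w2 = rng.normal(size=(8, 2))
y = se_block(x, w1, w2)
```

Because the gates lie strictly between 0 and 1, the block can only attenuate channels, never amplify them; the network learns which channels to keep.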
RMSProp is an unpublished adaptive learning rate optimizer proposed by Geoff Hinton. The motivation is that the magnitude of gradients can differ for different weights, and can change during learning, making it hard to choose a single global learning rate. RMSProp tackles this by keeping a moving average of the squared gradient and adjusting the weight updates by this magnitude. The gradient updates are performed as: $E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$. Hinton suggests $\gamma = 0.9$, with a good default for $\eta$ as $0.001$. Image: Alec Radford
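A sketch of the update in NumPy, applied to the toy objective f(w) = w^2:

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: refresh the moving average of squared
    gradients, then divide the step by its square root."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad ** 2
    w = w - lr * grad / np.sqrt(avg_sq + eps)
    return w, avg_sq

# Minimize the toy objective f(w) = w^2, whose gradient is 2w.
w, avg_sq = 5.0, 0.0
for _ in range(2000):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq, lr=0.01)
```

Because the step is divided by the root of the running squared-gradient average, the effective step size stays near the learning rate regardless of the raw gradient magnitude.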
Deep Q-Network
A DQN, or Deep Q-Network, approximates a state-value function in a Q-Learning framework with a neural network. In the Atari Games case, they take in several frames of the game as an input and output state values for each action as an output. It is usually used in conjunction with Experience Replay, for storing the episode steps in memory for off-policy learning, where samples are drawn from the replay memory at random. Additionally, the Q-Network is usually optimized towards a frozen target network that is periodically updated with the latest weights every $C$ steps (where $C$ is a hyperparameter). The latter makes training more stable by preventing short-term oscillations from a moving target. The former tackles autocorrelation that would occur from on-line learning, and having a replay memory makes the problem more like a supervised learning problem. Image Source: here
$R_1$ Regularization is a regularization technique and gradient penalty for training generative adversarial networks. It penalizes the discriminator for deviating from the Nash equilibrium via penalizing the gradient on real data alone: when the generator distribution produces the true data distribution and the discriminator is equal to 0 on the data manifold, the gradient penalty ensures that the discriminator cannot create a non-zero gradient orthogonal to the data manifold without suffering a loss in the GAN game. This leads to the following regularization term: $R_1(\psi) = \frac{\gamma}{2}\mathbb{E}_{p_{\mathcal{D}}(x)}\left[\left\|\nabla D_\psi(x)\right\|^2\right]$
Spectral clustering has attracted increasing attention due to its promising ability to deal with nonlinearly separable datasets [15], [16]. In spectral clustering, the spectrum of the graph Laplacian is used to reveal the cluster structure. The spectral clustering algorithm mainly consists of two steps: 1) construct the low-dimensional embedded representation of the data based on the eigenvectors of the graph Laplacian; 2) apply k-means on the constructed low-dimensional data to obtain the clustering result.
High-Order Consensuses
Faster R-CNN is an object detection model that improves on Fast R-CNN by utilising a region proposal network (RPN) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. RPN and Fast R-CNN are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look. As a whole, Faster R-CNN consists of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.
A Dense Block is a module used in convolutional neural networks that connects all layers (with matching feature-map sizes) directly with each other. It was originally proposed as part of the DenseNet architecture. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. In contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the $\ell^{th}$ layer has $\ell$ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all $L - \ell$ subsequent layers. This introduces $\frac{L(L+1)}{2}$ connections in an $L$-layer network, instead of just $L$, as in traditional architectures: "dense connectivity".
Early Stopping is a regularization technique for deep neural networks that stops training when parameter updates no longer yield improvements on a validation set. In essence, we store and update the current best parameters during training, and when parameter updates no longer yield an improvement (after a set number of iterations) we stop training and use the last best parameters. It works as a regularizer by restricting the optimization procedure to a smaller volume of parameter space. Image Source: Ramazan Gençay
Conditional Random Field
Conditional Random Fields, or CRFs, are a type of probabilistic graphical model that takes neighboring sample context into account for tasks like classification. Prediction is modeled as a graphical model, which implements dependencies between the predictions. The graph choice depends on the application; for example, linear chain CRFs are popular in natural language processing, whereas in image-based tasks, the graph would connect to neighboring locations in an image to enforce that they have similar predictions. Image Credit: Charles Sutton and Andrew McCallum, An Introduction to Conditional Random Fields
Stochastic Depth aims to shrink the depth of a network during training, while keeping it unchanged during testing. This is achieved by randomly dropping entire ResBlocks during training and bypassing their transformations through skip connections. Let $b_\ell \in \{0, 1\}$ denote a Bernoulli random variable, which indicates whether the $\ell^{th}$ ResBlock is active ($b_\ell = 1$) or inactive ($b_\ell = 0$). Further, let us denote the “survival” probability of ResBlock $\ell$ as $p_\ell = \Pr(b_\ell = 1)$. With this definition we can bypass the $\ell^{th}$ ResBlock by multiplying its function $f_\ell$ with $b_\ell$ and we extend the update rule to: $H_\ell = \text{ReLU}\left(b_\ell f_\ell(H_{\ell-1}) + \text{id}(H_{\ell-1})\right)$. If $b_\ell = 1$, this reduces to the original ResNet update and this ResBlock remains unchanged. If $b_\ell = 0$, the ResBlock reduces to the identity function, $H_\ell = \text{id}(H_{\ell-1})$.
A Focal Loss function addresses class imbalance during training in tasks like object detection. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Formally, the Focal Loss adds a factor $(1 - p_t)^\gamma$ to the standard cross entropy criterion. Setting $\gamma > 0$ reduces the relative loss for well-classified examples ($p_t > 0.5$), putting more focus on hard, misclassified examples. Here there is a tunable focusing parameter $\gamma \ge 0$.
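A sketch of the binary form of the loss in NumPy, showing the down-weighting of an easy example relative to a hard one (gamma = 2 is a common setting):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: -(1 - p_t)^gamma * log(p_t), where p_t is the
    model's predicted probability of the true class."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# Compare an easy, well-classified example with a hard, misclassified one.
easy = focal_loss(np.array([0.95]), np.array([1]))[0]
hard = focal_loss(np.array([0.10]), np.array([1]))[0]
ce_easy = -np.log(0.95)  # plain cross entropy for the easy example
```

The easy example's loss is suppressed by the factor (1 - 0.95)^2 = 0.0025 relative to cross entropy, while the hard example keeps most of its loss.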
Linear Discriminant Analysis
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification. Extracted from Wikipedia. Source: Linear Discriminant Analysis: A Detailed Tutorial
Cycle Consistency Loss is a type of loss used for generative adversarial networks that perform unpaired image-to-image translation. It was introduced with the CycleGAN architecture. For two domains $X$ and $Y$, we want to learn a mapping $G: X \rightarrow Y$ and $F: Y \rightarrow X$. We want to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. Cycle Consistency Loss encourages $F(G(x)) \approx x$ and $G(F(y)) \approx y$. It reduces the space of possible mapping functions by enforcing forward and backwards consistency: $\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\left\|F(G(x)) - x\right\|_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\left\|G(F(y)) - y\right\|_1\right]$