The Truncation Trick is a latent sampling procedure for generative adversarial networks, where we sample from a truncated normal distribution: values that fall outside a range are resampled to fall inside that range. The original implementation was in Megapixel Size Image Creation with GAN. In BigGAN, the authors find this provides a boost to the Inception Score and FID.
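A minimal sketch of the resampling procedure (the threshold of 0.5 below is illustrative, not a value from the BigGAN paper):

```python
import numpy as np

def truncated_normal(shape, threshold, rng=None):
    """Sample z ~ N(0, 1); resample any value whose magnitude exceeds
    `threshold` until every value falls inside [-threshold, threshold]."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(shape)
    outside = np.abs(z) > threshold
    while outside.any():
        z[outside] = rng.standard_normal(outside.sum())
        outside = np.abs(z) > threshold
    return z

z = truncated_normal((4, 128), threshold=0.5)
```

Smaller thresholds trade sample variety for individual sample fidelity, which is why BigGAN sweeps the truncation value at evaluation time.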
A Projection Discriminator is a type of discriminator for generative adversarial networks. It is motivated by a probabilistic model in which the distribution of the conditional variable $y$ given $x$ is discrete or a uni-modal continuous distribution. If we look at the optimal solution for the loss function in vanilla GANs, we can decompose it into the sum of two log-likelihood ratios: $f^{*}(x, y) = \log\frac{q(y|x)}{p(y|x)} + \log\frac{q(x)}{p(x)} =: r(y|x) + r(x)$. We can model the log-likelihood ratios $r(y|x)$ and $r(x)$ by some parametric functions $f_{1}$ and $f_{2}$ respectively. If we make a standing assumption that $p(y|x)$ and $q(y|x)$ are simple distributions, such as Gaussian or discrete log-linear on the feature space, then a parametrization of the following form becomes natural: $f(x, y; \theta) = y^{T}V\phi(x; \theta_{\Phi}) + \psi(\phi(x; \theta_{\Phi}); \theta_{\Psi})$, where $V$ is the embedding matrix of $y$, $\phi(\cdot; \theta_{\Phi})$ is a vector output function of $x$, and $\psi(\cdot; \theta_{\Psi})$ is a scalar function of the same $\phi(x; \theta_{\Phi})$ that appears in the first term. The learned parameters $\theta = \{V, \theta_{\Phi}, \theta_{\Psi}\}$ are trained to optimize the adversarial loss. This model of the discriminator is the projection.
Two Time-scale Update Rule
The Two Time-scale Update Rule (TTUR) is an update rule for generative adversarial networks trained with stochastic gradient descent. TTUR uses an individual learning rate for the discriminator and the generator. The main premise is that the discriminator converges to a local minimum when the generator is fixed. If the generator changes slowly enough, then the discriminator still converges, since the generator perturbations are small. Besides ensuring convergence, performance may also improve, since the discriminator must first learn new patterns before they are transferred to the generator. In contrast, an overly fast generator drives the discriminator steadily into new regions without capturing its gathered information.
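A contrived toy sketch of the two time scales (a scalar quadratic game standing in for a real GAN; the step sizes and targets are illustrative): the fast "discriminator" tracks the slowly drifting "generator" because it uses the larger learning rate.

```python
# Toy illustration of TTUR: the discriminator gets a larger learning
# rate than the generator, so it stays near its optimum (here, d = g)
# while the generator drifts slowly toward its own target (here, 0).
lr_d, lr_g = 0.4, 0.01
d, g = 0.0, 5.0
for _ in range(500):
    d -= lr_d * 2.0 * (d - g)   # discriminator descends (d - g)^2
    g -= lr_g * 2.0 * g         # generator slowly descends g^2
# d remains close to g throughout, since it adapts on the faster time scale
```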
GraphSAGE is a general inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Image from: Inductive Representation Learning on Large Graphs
DropBlock is a structured form of dropout directed at regularizing convolutional networks. In DropBlock, units in a contiguous region of a feature map are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data.
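A simplified single-channel sketch of the idea (real implementations compute the seed probability gamma from a target drop rate and apply the mask per feature map during training only):

```python
import numpy as np

def dropblock(x, block_size=3, gamma=0.05, rng=None):
    """Simplified DropBlock on a single (H, W) feature map: sample block
    centres with probability `gamma`, zero out a block_size x block_size
    square around each centre, then rescale the surviving activations."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = x.shape
    half = block_size // 2
    mask = np.ones((h, w))
    for i in range(half, h - half):      # keep block centres inside the map
        for j in range(half, w - half):
            if rng.random() < gamma:
                mask[i - half:i + half + 1, j - half:j + half + 1] = 0.0
    keep = mask.mean()
    return x * mask / keep if keep > 0 else x * mask
```

Unlike standard dropout, the zeros come in contiguous squares, so spatially correlated evidence is removed together.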
A ResNeXt Block is a type of residual block used as part of the ResNeXt CNN architecture. It uses a "split-transform-merge" strategy (branched paths within a single module) similar to an Inception module, i.e. it aggregates a set of transformations. Compared to a Residual Block, it exposes a new dimension, cardinality (the size of the set of transformations) $C$, as an essential factor in addition to depth and width. Formally, a set of aggregated transformations can be represented as $\mathcal{F}(x) = \sum_{i=1}^{C}\mathcal{T}_{i}(x)$, where $\mathcal{T}_{i}(x)$ can be an arbitrary function. Analogous to a simple neuron, $\mathcal{T}_{i}$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.
A ResNeXt repeats a building block that aggregates a set of transformations with the same topology. Compared to a ResNet, it exposes a new dimension, cardinality (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width. Formally, a set of aggregated transformations can be represented as $\mathcal{F}(x) = \sum_{i=1}^{C}\mathcal{T}_{i}(x)$, where $\mathcal{T}_{i}(x)$ can be an arbitrary function. Analogous to a simple neuron, $\mathcal{T}_{i}$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.
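A minimal sketch of the aggregated-transformation idea (the branch functions below are arbitrary toy placeholders, not the grouped convolutions used in practice):

```python
import numpy as np

def resnext_block(x, transforms):
    """y = x + sum_{i=1}^{C} T_i(x): aggregate C branch transformations
    (cardinality C = len(transforms)) and add the identity shortcut."""
    return x + sum(t(x) for t in transforms)

x = np.array([1.0, 2.0])
branches = [lambda v: 0.5 * v, lambda v: -0.25 * v]   # C = 2 toy branches
y = resnext_block(x, branches)
```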
Off-Diagonal Orthogonal Regularization is a modified form of orthogonal regularization originally used in BigGAN. The original orthogonal regularization is known to be limiting, so the authors explore several variants designed to relax the constraint while still imparting the desired smoothness to the models. They opt for a modification where they remove the diagonal terms from the regularization, which aims to minimize the pairwise cosine similarity between filters without constraining their norm: $R_{\beta}(W) = \beta\left\|W^{T}W \odot (\mathbf{1} - I)\right\|_{F}^{2}$, where $\mathbf{1}$ denotes a matrix with all elements set to 1 and $I$ the identity. The authors sweep $\beta$ values and select $10^{-4}$.
CSPDarknet53 is a convolutional neural network and backbone for object detection that uses DarkNet-53. It employs a CSPNet strategy to partition the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network. This CNN is used as the backbone for YOLOv4.
Approximate Bayesian Computation
Class of methods in Bayesian statistics where the posterior distribution is approximated via a rejection scheme on simulations, used when the likelihood function is intractable. Different parameter values are sampled and simulated. A distance function is then calculated to measure the quality of each simulation compared to data from real observations. Only simulations whose distance falls below a certain threshold get accepted. Image source: Kulkarni et al.
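A minimal rejection-ABC sketch (the Gaussian model, uniform prior, sample-mean summary statistic, and threshold are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(loc=2.0, scale=1.0, size=500)        # "real" observations

def simulate(mu):
    """Simulator for a model with unknown mean mu and known scale."""
    return rng.normal(loc=mu, scale=1.0, size=500)

accepted = []
for _ in range(5000):
    mu = rng.uniform(-5.0, 5.0)                            # sample from the prior
    distance = abs(simulate(mu).mean() - observed.mean())  # summary-statistic distance
    if distance < 0.1:                                     # rejection threshold
        accepted.append(mu)

posterior_mean = float(np.mean(accepted))                  # approximate posterior
```

The accepted parameter values form a sample from the approximate posterior; tightening the threshold improves the approximation at the cost of more rejections.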
energy-based model
Correlation Alignment for Deep Domain Adaptation
PAFPN is a feature pyramid module used in Path Aggregation Networks (PANet) that combines FPNs with bottom-up path augmentation, which shortens the information path between lower layers and the topmost features.
Feedback Alignment
Given a pattern that is more complicated than patterns whose exact counts are known, we fragment it into simpler patterns whose exact counts are known. In the subgraph GNN proposed earlier, we look into subgraphs of the host graph. We have seen that this technique is scalable to large graphs, and that subgraph GNNs are more expressive and efficient than traditional GNNs. So we explore how expressive the approach remains when the pattern is fragmented into smaller subpatterns.
GoogLeNet is a type of convolutional neural network based on the Inception architecture. It utilises Inception modules, which allow the network to choose between multiple convolutional filter sizes in each block. An Inception network stacks these modules on top of each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.
Xavier Initialization, or Glorot Initialization, is an initialization scheme for neural networks. Biases are initialized to be 0 and the weights $W_{j}$ at each layer are initialized as: $W_{j} \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{j}+n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_{j}+n_{j+1}}}\right]$, where $U$ is a uniform distribution, $n_{j}$ is the size of the previous layer (the number of columns in $W_{j}$), and $n_{j+1}$ is the size of the current layer.
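A minimal sketch of the scheme (layer sizes are arbitrary examples):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(256, 128, rng=np.random.default_rng(0))
b = np.zeros(128)                        # biases are initialized to 0
```

This choice of limit gives the weights variance 2/(n_in + n_out), which keeps activation and gradient magnitudes roughly constant across layers.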
Parameterized ReLU
A Parametric Rectified Linear Unit, or PReLU, is an activation function that generalizes the traditional rectified unit with a slope for negative values. Formally: $f(y_{i}) = y_{i}$ if $y_{i} > 0$, and $f(y_{i}) = a_{i}y_{i}$ if $y_{i} \leq 0$, where $a_{i}$ is a learned coefficient. The intuition is that different layers may require different types of nonlinearity. Indeed, the authors find in experiments with convolutional neural networks that PReLUs for the initial layers have more positive slopes, i.e. closer to linear. Since the filters of the first layers are Gabor-like filters such as edge or texture detectors, this shows a circumstance where both positive and negative responses of filters are respected. In contrast, the authors find deeper layers have smaller coefficients, suggesting the model becomes more discriminative at later layers (while it wants to retain more information at earlier layers).
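A minimal sketch of the activation (a fixed slope is used here for illustration; in the actual method the coefficient is learned, typically per channel):

```python
import numpy as np

def prelu(x, a=0.25):
    """f(x) = x for x > 0, a * x otherwise; `a` is learned in practice."""
    return np.where(x > 0, x, a * x)

y = prelu(np.array([-2.0, 0.0, 3.0]), a=0.25)
```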
VOS is a type of video object segmentation model consisting of two network components. The target appearance model consists of a light-weight module, which is learned during the inference stage using fast optimization techniques to predict a coarse but robust target segmentation. The segmentation model is exclusively trained offline, designed to process the coarse scores into high quality segmentation masks.
Twin Delayed Deep Deterministic
TD3 builds on the DDPG algorithm for reinforcement learning, with a couple of modifications aimed at tackling overestimation bias in the value function. In particular, it utilises clipped double Q-learning, delayed updates of the target and policy networks, and target policy smoothing (which is similar to a SARSA-based update; a safer update, as it assigns higher value to actions that are resistant to perturbations).
BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).
Target Policy Smoothing is a regularization strategy for the value function in reinforcement learning. Deterministic policies can overfit to narrow peaks in the value estimate, making them highly susceptible to functional approximation error and increasing the variance of the target. To reduce this variance, target policy smoothing adds a small amount of random noise to the target policy and averages over mini-batches, approximating a SARSA-like expectation/integral. The modified target update is: $y = r + \gamma Q_{\theta'}\left(s', \pi_{\phi'}(s') + \epsilon\right)$, $\epsilon \sim \operatorname{clip}\left(\mathcal{N}(0, \sigma), -c, c\right)$, where the added noise is clipped to keep the target close to the original action. The outcome is an algorithm reminiscent of Expected SARSA, where the value estimate is instead learned off-policy and the noise added to the target policy is chosen independently of the exploration policy. The value estimate learned is with respect to a noisy policy defined by the parameter $\sigma$.
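A minimal sketch of the smoothed target computation (the toy policy and critic below stand in for the target networks; sigma and c are the usual TD3 defaults but are illustrative here):

```python
import numpy as np

def smoothed_target(r, s_next, target_policy, target_q,
                    gamma=0.99, sigma=0.2, c=0.5, rng=None):
    """TD3-style smoothed target: add clipped Gaussian noise to the
    target policy's action before bootstrapping from the target critic."""
    rng = np.random.default_rng() if rng is None else rng
    a = target_policy(s_next)
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(a)), -c, c)
    return r + gamma * target_q(s_next, a + eps)

# toy policy / critic standing in for the target networks
policy = lambda s: np.zeros(2)
critic = lambda s, a: 1.0 - float(np.sum(a ** 2))   # value peaked at a = 0
y = smoothed_target(1.0, np.zeros(3), policy, critic)
```

Because the critic is evaluated at a perturbed action, narrow spikes in the value estimate contribute less to the bootstrapped target.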
Hierarchical Information Threading
An unsupervised approach for identifying Hierarchical Information Threads by analysing the network of related articles in a collection. In particular, HINT leverages article timestamps and the 5W1H questions to identify related articles about an event or discussion. HINT then constructs a network representation of the articles and identifies threads as strongly connected hierarchical network communities.
PnP, or Poll and Pool, is a sampling module extension for DETR-type architectures that adaptively allocates its computation spatially to be more efficient. Concretely, the PnP module abstracts the image feature map into fine foreground object feature vectors and a small number of coarse background contextual feature vectors. The transformer then models information interaction within the fine-coarse feature space and translates the features into the detection result.
AlphaZero is a reinforcement learning agent for playing board games such as Go, chess, and shogi.
Contrastive Predictive Coding (CPC) learns self-supervised representations by predicting the future in latent space using powerful autoregressive models. The model uses a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful for predicting future samples. First, a non-linear encoder $g_{enc}$ maps the input sequence of observations $x_{t}$ to a sequence of latent representations $z_{t} = g_{enc}(x_{t})$, potentially with a lower temporal resolution. Next, an autoregressive model $g_{ar}$ summarizes all $z_{\leq t}$ in the latent space and produces a context latent representation $c_{t} = g_{ar}(z_{\leq t})$. A density ratio is modelled which preserves the mutual information between $x_{t+k}$ and $c_{t}$ as follows: $f_{k}(x_{t+k}, c_{t}) \propto \frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}$, where $\propto$ stands for 'proportional to' (i.e. up to a multiplicative constant). Note that the density ratio can be unnormalized (it does not have to integrate to 1). The authors use a simple log-bilinear model: $f_{k}(x_{t+k}, c_{t}) = \exp\left(z_{t+k}^{T}W_{k}c_{t}\right)$. Any type of encoder and autoregressive model can be used. An example the authors opt for is strided convolutional layers with residual blocks for the encoder and GRUs for the autoregressive model. The encoder and autoregressive model are trained to minimize an InfoNCE loss (see components).
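A minimal sketch of the log-bilinear critic and the resulting InfoNCE-style loss for one positive and a set of negatives (the 2-D latents and identity matrix for $W_k$ are toy choices):

```python
import numpy as np

def info_nce(z_pos, negatives, c, W):
    """f(z, c) = exp(z^T W c); the loss asks the critic to identify the
    true future latent z_pos among the negative samples."""
    scores = np.array([z_pos @ W @ c] + [z @ W @ c for z in negatives])
    scores -= scores.max()                       # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[0])

c = np.array([1.0, 0.0])                         # context representation
z_pos = np.array([1.0, 0.0])                     # aligned with the context
negs = [np.array([0.0, 1.0]), np.array([0.0, -1.0])]
loss = info_nce(z_pos, negs, c, W=np.eye(2))
```

Because the positive scores higher than both negatives here, the loss falls below the chance level of log(3).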
Double Q-learning is an off-policy reinforcement learning algorithm that utilises double estimation to counteract overestimation problems with traditional Q-learning. The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation, which is the idea behind Double Q-learning. The Double Q-learning target can then be written as: $Y_{t}^{DoubleQ} = R_{t+1} + \gamma Q\left(S_{t+1}, \operatorname{argmax}_{a}Q(S_{t+1}, a; \theta_{t}); \theta_{t}'\right)$. Here the selection of the action in the $\operatorname{argmax}$ is still due to the online weights $\theta_{t}$, but we use a second set of weights $\theta_{t}'$ to fairly evaluate the value of this policy. Source: Deep Reinforcement Learning with Double Q-learning
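A tabular sketch of the decoupled target (the two value tables are toy stand-ins for the online and target networks):

```python
import numpy as np

def double_q_target(r, s_next, q_online, q_target, gamma=0.99):
    """Select the greedy action with the online values, evaluate it
    with the second (target) set of values."""
    a_star = int(np.argmax(q_online[s_next]))     # selection: online weights
    return r + gamma * q_target[s_next, a_star]   # evaluation: target weights

q_online = np.array([[1.0, 2.0]])   # online values prefer action 1
q_target = np.array([[5.0, 3.0]])   # target values give action 1 a 3.0
y = double_q_target(0.0, 0, q_online, q_target)
```

Note the target uses 3.0 (the target table's estimate of the online-selected action), not the 5.0 a single-estimator max would have picked.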
Exponential Decay is a learning rate schedule where we decay the learning rate over iterations using an exponential function: $\alpha_{t} = \alpha_{0}e^{-kt}$, where $\alpha_{0}$ is the initial learning rate, $k$ is a decay hyperparameter, and $t$ is the iteration number. Image Credit: Suki Lau
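A minimal sketch of the schedule (the initial rate and decay constant are example values):

```python
import math

def exponential_decay(lr0, k, t):
    """alpha_t = alpha_0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

schedule = [exponential_decay(0.1, 0.05, t) for t in range(5)]
```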
\begin{equation} DiceLoss\left( y, \overline{p} \right) = 1 - \dfrac{2y\overline{p} + 1}{y + \overline{p} + 1} \end{equation}
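The smoothed form above (the +1 terms guard against division by zero on empty masks) can be sketched for array-valued targets and predictions as:

```python
import numpy as np

def dice_loss(y, p):
    """1 - (2 * sum(y * p) + 1) / (sum(y) + sum(p) + 1)."""
    intersection = np.sum(y * p)
    return 1.0 - (2.0 * intersection + 1.0) / (np.sum(y) + np.sum(p) + 1.0)

perfect = dice_loss(np.array([1.0, 1.0, 0.0]), np.array([1.0, 1.0, 0.0]))
disjoint = dice_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```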
Flan-T5 is the instruction fine-tuned version of T5 or Text-to-Text Transfer Transformer Language Model.
Matching The Statements
Self-Organizing Map
The Self-Organizing Map (SOM), commonly also known as a Kohonen network (Kohonen 1982, Kohonen 2001), is a computational method for the visualization and analysis of high-dimensional data, especially experimentally acquired information. Extracted from Scholarpedia. Sources: Image: Scholarpedia. Paper: Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982). Book: Self-Organizing Maps.
Bidirectional GRU
A Bidirectional GRU, or BiGRU, is a sequence processing model that consists of two GRUs: one taking the input in a forward direction, and the other in a backwards direction. It is a bidirectional recurrent neural network with only the input and forget gates. Image Source: Rana R (2016). Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech.
Pointer Networks tackle problems where input and output data are sequential data, but can't be solved by seq2seq type models because discrete categories of output elements depend on the variable input size (and are not decided in advance). A Pointer Network learns the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. They solve the problem of variable size output dictionaries using additive attention. But instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, Pointer Networks use attention as a pointer to select a member of the input sequence as the output. Pointer-Nets can be used to learn approximate solutions to challenging geometric problems such as finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem.
node2vec is a framework for learning graph embeddings for nodes in graphs. node2vec maximizes a likelihood objective over mappings which preserve neighbourhood distances in higher dimensional spaces. From an algorithm design perspective, node2vec exploits the freedom to define neighbourhoods for nodes and provides an explanation for the effect of the choice of neighbourhood on the learned representations. For each node, node2vec simulates biased random walks based on an efficient network-aware search strategy, and the nodes appearing in the random walks define neighbourhoods. The search strategy accounts for the relative influence nodes exert in a network. It also generalizes prior work alluding to naive search strategies by providing flexibility in exploring neighbourhoods.
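A minimal sketch of the second-order biased walk that generates the neighbourhoods (the adjacency-list graph and p, q values below are toy examples; real implementations precompute alias tables for speed):

```python
import numpy as np

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Second-order biased random walk: unnormalized weight 1/p to return
    to the previous node, 1 to a mutual neighbour, 1/q to move farther."""
    rng = np.random.default_rng() if rng is None else rng
    walk, prev = [start], None
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break
        if prev is None:
            nxt = nbrs[rng.integers(len(nbrs))]
        else:
            w = np.array([1.0 / p if n == prev
                          else 1.0 if n in adj[prev]
                          else 1.0 / q for n in nbrs])
            nxt = nbrs[rng.choice(len(nbrs), p=w / w.sum())]
        prev = cur
        walk.append(nxt)
    return walk

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}           # a toy triangle graph
walk = node2vec_walk(adj, 0, 6, p=0.5, q=2.0, rng=np.random.default_rng(0))
```

Low p biases the walk toward backtracking (BFS-like, local neighbourhoods); low q biases it outward (DFS-like, exploratory neighbourhoods).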
Performer is a Transformer architecture which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. To approximate softmax attention-kernels, Performers use a Fast Attention Via positive Orthogonal Random features approach (FAVOR+), leveraging new methods for approximating softmax and Gaussian kernels.
BigGAN is a type of generative adversarial network that was designed for scaling generation to high-resolution, high-fidelity images. It includes a number of incremental changes and innovations. The baseline and incremental changes are: - Using SAGAN as a baseline with spectral norm for G and D, and using TTUR. - Using a Hinge Loss GAN objective. - Using class-conditional batch normalization to provide class information to G (but with a linear projection rather than an MLP). - Using a projection discriminator for D to provide class information to D. - Evaluating with an EWMA of G's weights, similar to ProGANs. The innovations are: - Increasing batch sizes, which has a big effect on the Inception Score of the model. - Increasing the width in each layer, which leads to a further Inception Score improvement. - Adding skip connections from the latent variable to further layers, which helps performance. - A new variant of Orthogonal Regularization.
Grid Sensitive is a trick for object detection introduced by YOLOv4. In the original YOLOv3, the coordinates of the bounding box center $x$ and $y$ are decoded as $x = s \cdot (g_{x} + \sigma(p_{x}))$ and $y = s \cdot (g_{y} + \sigma(p_{y}))$, where $\sigma$ is the sigmoid function, $g_{x}$ and $g_{y}$ are integers and $s$ is a scale factor. Obviously, $x$ and $y$ cannot be exactly equal to $s \cdot g_{x}$ or $s \cdot (g_{x} + 1)$, since the sigmoid never reaches 0 or 1. This makes it difficult to predict the centers of bounding boxes located exactly on the grid boundary. We can address this problem by changing the equation to $x = s \cdot (g_{x} + \alpha \cdot \sigma(p_{x}) - (\alpha - 1)/2)$ and $y = s \cdot (g_{y} + \alpha \cdot \sigma(p_{y}) - (\alpha - 1)/2)$, where $\alpha > 1$. This makes it easier for the model to predict a bounding box center located exactly on the grid boundary. The FLOPs added by Grid Sensitive are really small and can be totally ignored.
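A minimal sketch of the decoding (alpha = 1.05 is the value reported in the PP-YOLO follow-up; the logits and grid values below are illustrative):

```python
import numpy as np

def decode_center(p, g, s, alpha=1.0):
    """x = s * (g + alpha * sigmoid(p) - (alpha - 1) / 2); alpha = 1.0
    recovers the original YOLOv3 decoding, alpha > 1 is Grid Sensitive."""
    return s * (g + alpha / (1.0 + np.exp(-p)) - (alpha - 1.0) / 2.0)

# with alpha > 1, a large logit can push the center onto the grid boundary
x_v3 = decode_center(10.0, 3, 1.0, alpha=1.0)     # always strictly below 4.0
x_gs = decode_center(10.0, 3, 1.0, alpha=1.05)    # can reach 4.0
```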
Activation Patching
Activation patching studies a model's computation by altering its latent representations (such as the token embeddings in transformer-based language models) during the inference process.
Conditional Variational Auto Encoder
YOLOv4 is a one-stage object detection model that improves on YOLOv3 with several bags of tricks and modules introduced in the literature. The components section below details the tricks and modules used.
Inception-v3 is a convolutional neural network architecture from the Inception family that makes several improvements, including using Label Smoothing, Factorized 7 x 7 convolutions, and the use of an auxiliary classifier to propagate label information lower down the network (along with the use of batch normalization for layers in the side head).
Minimum Description Length
Minimum Description Length provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution. Extracted from Scholarpedia. Sources: Paper: J. Rissanen (1978) Modeling by the shortest data description. Automatica 14, 465-471. Book: P. D. Grünwald (2007) The Minimum Description Length Principle, MIT Press, June 2007, 570 pages.
SqueezeNet is a convolutional neural network that employs design strategies to reduce the number of parameters, notably with the use of fire modules that "squeeze" parameters using 1x1 convolutions.
Inception-v3 Module is an image block used in the Inception-v3 architecture. This architecture is used on the coarsest (8 × 8) grids to promote high dimensional representations.
ReLU6 is a modification of the rectified linear unit where we limit the activation to a maximum size of 6, i.e. $f(x) = \min(\max(0, x), 6)$. This is due to increased robustness when used with low-precision computation. Image Credit: PyTorch
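A minimal sketch of the activation:

```python
import numpy as np

def relu6(x):
    """f(x) = min(max(0, x), 6)."""
    return np.minimum(np.maximum(0.0, x), 6.0)

y = relu6(np.array([-1.0, 3.0, 10.0]))
```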
Wasserstein GAN
Wasserstein GAN, or WGAN, is a type of generative adversarial network that minimizes an approximation of the Earth-Mover's distance (EM) rather than the Jensen-Shannon divergence as in the original GAN formulation. It leads to more stable training than original GANs with less evidence of mode collapse, as well as meaningful curves that can be used for debugging and searching hyperparameters.
Additive Angular Margin Loss
ArcFace, or Additive Angular Margin Loss, is a loss function used in face recognition tasks. The softmax loss is traditionally used in these tasks. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations. The ArcFace loss transforms the logits as $W_{j}^{T}x_{i} = \|W_{j}\|\|x_{i}\|\cos\theta_{j}$, where $\theta_{j}$ is the angle between the weight $W_{j}$ and the feature $x_{i}$. The individual weight is fixed to $\|W_{j}\| = 1$ by $l_{2}$ normalization. The embedding feature $\|x_{i}\|$ is fixed by $l_{2}$ normalization and re-scaled to $s$. The normalisation step on features and weights makes the predictions depend only on the angle between the feature and the weight. The learned embedding features are thus distributed on a hypersphere with a radius of $s$. Finally, an additive angular margin penalty $m$ is added between $x_{i}$ and $W_{y_{i}}$ to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty on the normalised hypersphere, the method is named ArcFace. The authors select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively. As the Figure shows, the softmax loss provides roughly separable feature embeddings but produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss enforces a more evident gap between the nearest classes. Other alternatives to enforce intra-class compactness and inter-class distance include Supervised Contrastive Learning.
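A minimal sketch of the logit transformation (s and m default to commonly reported values; the 2-D feature and identity weight matrix are toy inputs, and no handling of the margin pushing an angle past pi is included):

```python
import numpy as np

def arcface_logits(x, W, target, s=64.0, m=0.5):
    """l2-normalize the feature and class weights, add the angular margin
    m to the target-class angle, then re-scale the cosines by s."""
    x = x / np.linalg.norm(x)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = W.T @ x                                   # cos(theta_j) per class
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    theta[target] += m                              # additive angular margin
    return s * np.cos(theta)

logits = arcface_logits(np.array([2.0, 0.0]), np.eye(2), target=0, s=1.0, m=0.5)
```

The margin shrinks the target-class logit relative to plain cosine similarity, so the network must learn a correspondingly larger angular gap to classify correctly.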
BLIP: Bootstrapping Language-Image Pre-training
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.