2,776 machine learning methods and techniques
Track objects as points
Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That's it. CenterTrack is simple, online (no peeking into the future), and real-time.
TrOCR is an end-to-end Transformer-based OCR model for text recognition with pre-trained CV and NLP models. It leverages the Transformer architecture for both image understanding and wordpiece-level text generation. It first resizes the input text image into and then the image is split into a sequence of 16 patches which are used as the input to image Transformers. Standard Transformer architecture with the self-attention mechanism is leveraged on both encoder and decoder parts, where wordpiece units are generated as the recognized text from the input image.
Bilateral grid is a new data structure that enables fast edge-aware image processing. It enables edge-aware image manipulations such as local tone mapping on high resolution images in real time. Source: Chen et al. Image source: Chen et al.
Wasserstein GAN (Gradient Penalty)
Wasserstein GAN + Gradient Penalty, or WGAN-GP, is a generative adversarial network that uses the Wasserstein loss formulation plus a gradient norm penalty to achieve Lipschitz continuity. The original WGAN uses weight clipping to achieve 1-Lipschitz functions, but this can lead to undesirable behaviour by creating pathological value surfaces and capacity underuse, as well as gradient explosion/vanishing without careful tuning of the weight clipping parameter . A Gradient Penalty is a soft version of the Lipschitz constraint, which follows from the fact that functions are 1-Lipschitz iff the gradients are of norm at most 1 everywhere. The squared difference from norm 1 is used as the gradient penalty.
NAS-FPN is a Feature Pyramid Network that is discovered via Neural Architecture Search in a novel scalable search space covering all cross-scale connections. The discovered architecture consists of a combination of top-down and bottom-up connections to fuse features across scales
Self-Attention Network
Self-Attention Network (SANet) proposes two variations of self-attention used for image recognition: 1) pairwise self-attention which generalizes standard dot-product attention and is fundamentally a set operator, and 2) patchwise self-attention which is strictly more powerful than convolution.
RepPoints is a representation for object detection that consists of a set of points which indicate the spatial extent of an object and semantically significant local areas. This representation is learned via weak localization supervision from rectangular ground-truth boxes and implicit recognition feedback. Based on the richer RepPoints representation, the authors develop an anchor-free object detector that yields improved performance compared to using bounding boxes.
BigGAN-deep is a deeper version (4x) of BigGAN. The main difference is a slightly differently designed residual block. Here the vector is concatenated with the conditional vector without splitting it into chunks. It is also based on residual blocks with bottlenecks. BigGAN-deep uses a different strategy than BigGAN aimed at preserving identity throughout the skip connections. In G, where the number of channels needs to be reduced, BigGAN-deep simply retains the first group of channels and drop the rest to produce the required number of channels. In D, where the number of channels should be increased, BigGAN-deep passes the input channels unperturbed, and concatenates them with the remaining channels produced by a 1 × 1 convolution. As far as the network configuration is concerned, the discriminator is an exact reflection of the generator. There are two blocks at each resolution (BigGAN uses one), and as a result BigGAN-deep is four times deeper than BigGAN. Despite their increased depth, the BigGAN-deep models have significantly fewer parameters mainly due to the bottleneck structure of their residual blocks.
A Global Context Network, or GCNet, utilises global context blocks to model long-range dependencies in images. It is based on the Non-Local Network, but it modifies the architecture so less computation is required. Global context blocks are applied to multiple layers in a backbone network to construct the GCNet.
FixRes is an image scaling strategy that seeks to optimize classifier performance. It is motivated by the observation that data augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: in fact, a lower train resolution improves the classification at test time! FixRes is a simple strategy to optimize the classifier performance, that employs different train and test resolutions. The calibrations are: (a) calibrating the object sizes by adjusting the crop size and (b) adjusting statistics before spatial pooling.
Patch AutoAugment
Patch AutoAugment is a patch-level automatic data augmentation algorithm that automatically searches for the optimal augmentation policies for the patches of an image. Specifically, PAA allows each patch DA operation to be controlled by an agent and models it as a Multi-Agent Reinforcement Learning (MARL) problem. At each step, PAA samples the most effective operation for each patch based on its content and the semantics of the whole image. The agents cooperate as a team and share a unified team reward for achieving the joint optimal DA policy of the whole image. PAA is co-trained with a target network through adversarial training. At each step, the policy network samples the most effective operation for each patch based on its content and the semantics of the image.
MoCo v3 aims to stabilize training of self-supervised ViTs. MoCo v3 is an incremental improvement of MoCo v1/2. Two crops are used for each image under random data augmentation. They are encoded by two encoders and with output vectors and . behaves like a "query", where the goal of learning is to retrieve the corresponding "key". The objective is to minimize a contrastive loss function of the following form: This approach aims to train the Transformer in the contrastive/Siamese paradigm. The encoder consists of a backbone (e.g., ResNet and ViT), a projection head, and an extra prediction head. The encoder has the back the backbone and projection head but not the prediction head. is updated by the moving average of , excluding the prediction head.
Test-time Local Converter
TLC convert the global operation to a local one so that it extract representations based on local spatial region of features as in training phase.
ResNet-D is a modification on the ResNet architecture that utilises an average pooling tweak for downsampling. The motivation is that in the unmodified ResNet, the 1 × 1 convolution for the downsampling block ignores 3/4 of input feature maps, so this is modified so no information will be ignored
ZCA Whitening is an image preprocessing method that leads to a transformation of data such that the covariance matrix is the identity matrix, leading to decorrelated features. Image Source: Alex Krizhevsky
Residual Multi-Layer Perceptrons
Residual Multi-Layer Perceptrons, or ResMLP, is an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. At the end of the network, the patch representations are average pooled, and fed to a linear classifier. Layer normalization is replaced with a simpler affine transformation, thanks to the absence of self-attention layers which makes training more stable. The affine operator is applied at the beginning ("pre-normalization") and end ("post-normalization") of each residual block. As a pre-normalization, Aff replaces LayerNorm without using channel-wise statistics. Initialization is achieved as , and . As a post-normalization, Aff is similar to LayerScale and is initialized with the same small value.
Negative Face Recognition
Negative Face Recognition, or NFR, is a face recognition approach that enhances the soft-biometric privacy on the template-level by representing face templates in a complementary (negative) domain. While ordinary templates characterize facial properties of an individual, negative templates describe facial properties that does not exist for this individual. This suppresses privacy-sensitive information from stored templates. Experiments are conducted on two publicly available datasets captured under controlled and uncontrolled scenarios on three privacy-sensitive attributes.
PoolFormer is instantiated from MetaFormer by specifying the token mixer as extremely simple operator, pooling. PoolFormer is utilized as a tool to verify MetaFormer hypothesis "MetaFormer is actually what you need" (vs "Attention is all you need").
Multiscale Vision Transformer
Multiscale Vision Transformer, or MViT, is a transformer architecture for modeling visual data such as images and videos. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features.
A Reversible Residual Network, or RevNet, is a variant of a ResNet where each layer’s activations can be reconstructed exactly from the next layer’s. Therefore, the activations for most layers need not be stored in memory during backpropagation. The result is a network architecture whose activation storage requirements are independent of depth, and typically at least an order of magnitude smaller compared with equally sized ResNets. RevNets are composed of a series of reversible blocks. Units in each layer are partitioned into two groups, denoted and ; the authors find what works best is partitioning the channels. Each reversible block takes inputs and produces outputs according to the following additive coupling rules – inspired the transformation in NICE (nonlinear independent components estimation) – and residual functions and analogous to those in standard ResNets: Each layer’s activations can be reconstructed from the next layer’s activations as follows: Note that unlike residual blocks, reversible blocks must have a stride of 1 because otherwise the layer discards information, and therefore cannot be reversible. Standard ResNet architectures typically have a handful of layers with a larger stride. If we define a RevNet architecture analogously, the activations must be stored explicitly for all non-reversible layers.
Image Scale Augmentation is an augmentation technique where we randomly pick the short size of a image within a dimension range. One use case of this augmentation technique is in object detectiont asks.
VarifocalNet
VarifocalNet is a method aimed at accurately ranking a huge number of candidate detections in object detection. It consists of a new loss function, named Varifocal Loss, for training a dense object detector to predict the IACS, and a new efficient star-shaped bounding box feature representation for estimating the IACS and refining coarse bounding boxes. Combining these two new components and a bounding box refinement branch, results in a dense object detector on the FCOS architecture, what the authors call VarifocalNet or VFNet for short.
A Pyramidal Residual Unit is a type of residual unit where the number of channels gradually increases as a function of the depth at which the layer occurs, which is similar to a pyramid structure of which the shape gradually widens from the top downwards. It was introduced as part of the PyramidNet architecture.
Temporaral Difference Network
TDN, or Temporaral Difference Network, is an action recognition model that aims to capture multi-scale temporal information. To fully capture temporal information over the entire video, the TDN is established with a two-level difference modeling paradigm. Specifically, for local motion modeling, temporal difference over consecutive frames is used to supply 2D CNNs with finer motion pattern, while for global motion modeling, temporal difference across segments is incorporated to capture long-range structure for motion feature excitation.
Context Enhancement Module (CEM) is a feature extraction module used in object detection (specifically, ThunderNet) which aims to to enlarge the receptive field. The key idea of CEM is to aggregate multi-scale local context information and global context information to generate more discriminative features. In CEM, the feature maps from three scales are merged: , and . is the global context feature vector by applying a global average pooling on . We then apply a 1 × 1 convolution on each feature map to squeeze the number of channels to . Afterwards, is upsampled by 2× and is broadcast so that the spatial dimensions of the three feature maps are equal. At last, the three generated feature maps are aggregated. By leveraging both local and global context, CEM effectively enlarges the receptive field and refines the representation ability of the thin feature map. Compared with prior FPN structures, CEM involves only two 1×1 convolutions and a fc layer.
Spatial Feature Transform, or SFT, is a layer that generates affine transformation parameters for spatial-wise feature modulation, and was originally proposed within the context of image super-resolution. A Spatial Feature Transform (SFT) layer learns a mapping function that outputs a modulation parameter pair based on some prior condition . The learned parameter pair adaptively influences the outputs by applying an affine transformation spatially to each intermediate feature maps in an SR network. During testing, only a single forward pass is needed to generate the HR image given the LR input and segmentation probability maps. More precisely, the prior is modeled by a pair of affine transformation parameters through a mapping function . Consequently, After obtaining from conditions, the transformation is carried out by scaling and shifting feature maps of a specific layer: where denotes the feature maps, whose dimension is the same as and , and is referred to element-wise multiplication, i.e., Hadamard product. Since the spatial dimensions are preserved, the SFT layer not only performs feature-wise manipulation but also spatial-wise transformation.
Receptive Field Block
Receptive Field Block (RFB) is a module for strengthening the deep features learned from lightweight CNN models so that they can contribute to fast and accurate detectors. Specifically, RFB makes use of multi-branch pooling with varying kernels corresponding to RFs of different sizes, applies dilated convolution layers to control their eccentricities, and reshapes them to generate final representation.
A Pyramidal Bottleneck Residual Unit is a type of residual unit where the number of channels gradually increases as a function of the depth at which the layer occurs, which is similar to a pyramid structure of which the shape gradually widens from the top downwards. It also consists of a bottleneck using 1x1 convolutions. It was introduced as part of the PyramidNet architecture.
Keypoint Pose Encoding
MagFace is a category of losses for face recognition that learn a universal feature embedding whose magnitude can measure the quality of a given face. Under the new loss, it can be proven that the magnitude of the feature embedding monotonically increases if the subject is more likely to be recognized. In addition, MagFace introduces an adaptive mechanism to learn a well-structured within-class feature distributions by pulling easy samples to class centers while pushing hard samples away. For face recognition, MagFace helps prevent model overfitting on noisy and low-quality samples by an adaptive mechanism to learn well-structured within-class feature distributions -- by pulling easy samples to class centers while pushing hard samples away.
MLP-Mixer Layer
A Mixer layer is a layer used in the MLP-Mixer architecture proposed by Tolstikhin et. al (2021) for computer vision. Mixer layers consist purely of MLPs, without convolutions or attention. It takes an input of embedded image patches (tokens), with its output having the same shape as its input, similar to that of a Vision Transformer encoder. As suggested by its name, Mixer layers "mix" tokens and channels through its "token mixing" and "channel mixing" MLPs contained the layer. It utilizes previous techniques by other architectures, such as layer normalization, skip-connections, and regularization methods. Image credit: Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., ... & Dosovitskiy, A. (2021). Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34, 24261-24272.
gMLP is an MLP-based alternative to Transformers without self-attention, which simply consists of channel projections and spatial projections with static parameterization. It is built out of basic MLP layers with gating. The model consists of a stack of blocks with identical size and structure. Let be the token representations with sequence length and dimension . Each block is defined as: where is an activation function such as GeLU. and define linear projections along the channel dimension - the same as those in the FFNs of Transformers (e.g., their shapes are and for ). A key ingredient is , a layer which captures spatial interactions. When is an identity mapping, the above transformation degenerates to a regular FFN, where individual tokens are processed independently without any cross-token communication. One of the major focuses is therefore to design a good capable of capturing complex spatial interactions across tokens. This leads to the use of a Spatial Gating Unit which involves a modified linear gating. The overall block layout is inspired by inverted bottlenecks, which define as a spatial depthwise convolution. Note, unlike Transformers, gMLP does not require position embeddings because such information will be captured in .
Fast AutoAugment is an image data augmentation algorithm that finds effective augmentation policies via a search strategy based on density matching, motivated by Bayesian DA. The strategy is to improve the generalization performance of a given network by learning the augmentation policies which treat augmented data as missing data points of training data. However, different from Bayesian DA, the proposed method recovers those missing data points by the exploitation-and-exploration of a family of inference-time augmentations via Bayesian optimization in the policy search phase. This is realized by using an efficient density matching algorithm that does not require any back-propagation for network training for each policy evaluation.
FSAF, or Feature Selective Anchor-Free, is a building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work jointly with anchor-based branches by outputting predictions in parallel. We instantiate this concept with simple implementations of anchor-free branches and online feature selection strategy The general concept is presented in the Figure to the right. An anchor-free branch is built per level of feature pyramid, independent to the anchor-based branch. Similar to the anchor-based branch, it consists of a classification subnet and a regression subnet (not shown in figure). An instance can be assigned to arbitrary level of the anchor-free branch. During training, we dynamically select the most suitable level of feature for each instance based on the instance content instead of just the size of instance box. The selected level of feature then learns to detect the assigned instances. At inference, the FSAF module can run independently or jointly with anchor-based branches. The FSAF module is agnostic to the backbone network and can be applied to single-shot detectors with a structure of feature pyramid. Additionally, the instantiation of anchor-free branches and online feature selection can be various.
ScaleNet, or a Scale Aggregation Network, is a type of convolutional neural network which learns a neuron allocation for aggregating multi-scale information in different building blocks of a deep network. The most informative output neurons in each block are preserved while others are discarded, and thus neurons for multiple scales are competitively and adaptively allocated. The scale aggregation (SA) block concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, convolution and upsampling operations.
Submanifold Convolution (SC) is a spatially sparse convolution operation used for tasks with sparse data like semantic segmentation of 3D point clouds. An SC convolution computes the set of active sites in the same way as a regular convolution: it looks for the presence of any active sites in its receptive field of size . If the input has size then the output will have size . Unlike a regular convolution, an SC convolution discards the ground state for non-active sites by assuming that the input from those sites is zero. For more details see the paper, or the official code here.
A Scale Aggregation Block concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, convolution and upsampling operations. The proposed scale aggregation block is a standard computational module which readily replaces any given transformation , where , with and being the input and output channel number respectively. is any operator such as a convolution layer or a series of convolution layers. Assume we have scales. Each scale is generated by sequentially conducting a downsampling , a transformation and an unsampling operator : where , , and . Notably, has the similar structure as . We can concatenate all scales together, getting where indicates concatenating feature maps along the channel dimension, and is the final output feature maps of the scale aggregation block. In the reference implementation, the downsampling with factor is implemented by a max pool layer with kernel size and stride. The upsampling is implemented by resizing with the nearest neighbor interpolation.
VQ-VAE-2 is a type of variational autoencoder that combines a a two-level hierarchical VQ-VAE with a self-attention autoregressive model (PixelCNN) as a prior. The encoder and decoder architectures are kept simple and light-weight as in the original VQ-VAE, with the only difference that hierarchical multi-scale latent maps are used for increased resolution.
FairMOT is a model for multi-object tracking which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks is used to achieve high levels of detection and tracking accuracy. The detection branch is implemented in an anchor-free style which estimates object centers and sizes represented as position-aware measurement maps. Similarly, the re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at the pixel. Note that the two branches are completely homogeneous which essentially differs from the previous methods which perform detection and re-ID in a cascaded style. It is also worth noting that FairMOT operates on high-resolution feature maps of strides four while the previous anchor-based methods operate on feature maps of stride 32. The elimination of anchors as well as the use of high-resolution feature maps better aligns re-ID features to object centers which significantly improves the tracking accuracy.
CornerNet is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. It also utilises corner pooling, a new type of pooling layer than helps the network better localize corners.
Grab is a sensor processing system for cashier-free shopping. Grab needs to accurately identify and track customers, and associate each shopper with items he or she retrieves from shelves. To do this, it uses a keypoint-based pose tracker as a building block for identification and tracking, develops robust feature-based face trackers, and algorithms for associating and tracking arm movements. It also uses a probabilistic framework to fuse readings from camera, weight and RFID sensors in order to accurately assess which shopper picks up which item.
Convolutional LSTM based Residual Network
FractalNet is a type of convolutional neural network that eschews residual connections in favour of a "fractal" design. They involve repeated application of a simple expansion rule to generate deep networks whose structural layouts are precisely truncated fractals. These networks contain interacting subpaths of different lengths, but do not include any pass-through or residual connections; every internal signal is transformed by a filter and nonlinearity before being seen by subsequent layers.
CR-NET is a YOLO-based model proposed for license plate character detection and recognition
K-Net is a framework for unified semantic and instance segmentation that segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. It begins with a set of kernels that are randomly initialized, and learns the kernels in accordance to the segmentation targets at hand, namely, semantic kernels for semantic categories and instance kernels for instance identities. A simple combination of semantic kernels and instance kernels allows panoptic segmentation naturally. In the forward pass, the kernels perform convolution on the image features to obtain the corresponding segmentation predictions. K-Net is formulated so that it dynamically updates the kernels to make them conditional to their activations on the image. Such a content-aware mechanism is crucial to ensure that each kernel, especially an instance kernel, responds accurately to varying objects in an image. Through applying this adaptive kernel update strategy iteratively, K-Net significantly improves the discriminative ability of the kernels and boosts the final segmentation performance. It is noteworthy that this strategy universally applies to kernels for all the segmentation tasks. It also utilises a bipartite matching strategy to assign learning targets for each kernel. This training approach is advantageous to conventional training strategies as it builds a one-to-one mapping between kernels and instances in an image. It thus resolves the problem of dealing with a varying number of instances in an image. In addition, it is purely mask-driven without involving boxes. Hence, K-Net is naturally NMS-free and box-free, which is appealing to real-time applications.
VirText, or Visual representations from Textual annotations is a pretraining approach using semantically dense captions to learn visual representations. First a ConvNet and Transformer are jointly trained from scratch to generate natural language captions for images. Then, the learned features are transferred to downstream visual recognition tasks.
PCA Whitening is a processing step for image based data that makes input less redundant. Adjacent pixel or feature values can be highly correlated, and whitening through the use of PCA reduces this degree of correlation. Image Source: Wikipedia
ZFNet is a classic convolutional neural network. The design was motivated by visualizing intermediate feature layers and the operation of the classifier. Compared to AlexNet, the filter sizes are reduced and the stride of the convolutions are reduced.