YOLOv1 is a single-stage object detection model. Object detection is framed as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. The network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means the network reasons globally about the full image and all the objects in the image.
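The single-evaluation encoding described above can be made concrete with a small sketch. This is not the original implementation; it only illustrates how YOLOv1's output tensor packs, for an S × S grid, B boxes of 5 numbers each plus C shared class scores per cell (S = 7, B = 2, C = 20 in the paper):

```python
# Illustrative sketch of YOLOv1's output encoding (not the original code).
# For an S x S grid, each cell predicts B boxes (x, y, w, h, confidence)
# plus C class probabilities, all produced in one forward pass.

S, B, C = 7, 2, 20          # values used in the YOLOv1 paper (PASCAL VOC)
preds_per_cell = B * 5 + C  # 5 numbers per box + shared class scores
output_size = S * S * preds_per_cell

def decode_cell(cell, B, C):
    """Split one grid cell's raw prediction vector into boxes and class scores."""
    boxes = [cell[i * 5:(i + 1) * 5] for i in range(B)]
    class_scores = cell[B * 5:]
    return boxes, class_scores

cell = list(range(preds_per_cell))      # dummy activations for one cell
boxes, class_scores = decode_cell(cell, B, C)
```

With the paper's settings this yields the familiar 7 × 7 × 30 = 1470-dimensional output.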
SNet is a convolutional neural network architecture and object detection backbone used for the ThunderNet two-stage object detector. SNet uses ShuffleNetV2 basic blocks but replaces all 3×3 depthwise convolutions with 5×5 depthwise convolutions.
LOGAN is a generative adversarial network that uses a latent optimization approach based on natural gradient descent (NGD). For the Fisher matrix in NGD, the authors use the empirical Fisher with Tikhonov damping, and they apply Euclidean norm regularization to the optimization step. For LOGAN's base architecture, BigGAN-deep is used with a few modifications: 1) increasing the size of the latent source z, to compensate for the randomness of the source lost when optimising z; 2) using the uniform distribution instead of the standard normal distribution for z, to be consistent with the clipping operation; 3) using leaky ReLU (with a slope of 0.2 for the negative part) instead of ReLU as the non-linearity, for smoother gradient flow with respect to z.
Generalized Mean Pooling (GeM) computes the generalized mean of each channel in a tensor. Formally, for channel c with activations x_{cu} over spatial locations u ∈ Ω: e_c = ( (1/|Ω|) Σ_{u∈Ω} x_{cu}^p )^{1/p}, where p is a parameter. Setting this exponent p > 1 increases the contrast of the pooled feature map and focuses on the salient features of the image. GeM is a generalization of the average pooling commonly used in classification networks (p = 1) and of the spatial max-pooling layer (p → ∞). Source: MultiGrain. Image source: Eva Mohedano.
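The behaviour of the exponent p can be checked with a minimal pure-Python sketch of GeM over one channel (real implementations operate on whole tensors):

```python
# Minimal sketch of Generalized Mean (GeM) pooling over one channel.
# p = 1 recovers average pooling; a large p approaches max pooling.

def gem_pool(values, p):
    """Generalized mean of a list of non-negative activations."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

acts = [0.1, 0.2, 0.9, 0.4]
avg = gem_pool(acts, p=1)        # equals the plain average
near_max = gem_pool(acts, p=50)  # close to max(acts) = 0.9
```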
A Fractal Block is an image model block that utilizes an expansion rule that yields a structural layout of truncated fractals. With the base case f_1(z) = conv(z), where conv is a convolutional layer, we then have recursive fractals of the form: f_{C+1}(z) = [(f_C ∘ f_C)(z)] joined with [conv(z)], where C is the number of columns. For the join layer (green in the Figure), the element-wise mean is used rather than concatenation or addition.
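The expansion rule can be sketched in a few lines. This toy version replaces the convolution with a counter (a dummy `z + 1` transformation) so we can verify how the truncated fractal grows; with C columns the block contains 2^C − 1 "conv" applications:

```python
# Sketch of the FractalNet expansion rule: f_1 = conv and
# f_{C+1}(z) = mean((f_C o f_C)(z), conv(z)).
# "conv" is a stand-in that counts its applications.

calls = {"conv": 0}

def conv(z):
    calls["conv"] += 1
    return z + 1                               # dummy transformation

def fractal(C, z):
    if C == 1:
        return conv(z)
    deep = fractal(C - 1, fractal(C - 1, z))   # two stacked copies of f_{C-1}
    shallow = conv(z)
    return (deep + shallow) / 2                # join layer: element-wise mean

out = fractal(3, 0.0)
num_convs = calls["conv"]                      # 2**3 - 1 = 7 conv applications
```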
IFBlock is a video model block used in the IFNet architecture for video frame interpolation. IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 convolution and deconvolution as building blocks. Each IFBlock has a feed-forward structure consisting of several convolutional layers and an upsampling operator. Except for the layer that outputs the optical flow residuals and the fusion map, PReLU activations are used.
Strip Pooling is a pooling strategy for scene parsing which considers a long but narrow kernel, i.e., 1 × N or N × 1. As an alternative to global pooling, strip pooling offers two advantages. First, it deploys a long kernel shape along one spatial dimension and hence enables capturing long-range relations of isolated regions. Second, it keeps a narrow kernel shape along the other spatial dimension, which facilitates capturing local context and prevents irrelevant regions from interfering with the label prediction. Integrating such long but narrow pooling kernels enables scene parsing networks to simultaneously aggregate both global and local context. This is essentially different from traditional spatial pooling, which collects context from a fixed square region.
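A minimal sketch of the pooling step on a toy feature map: each row is averaged by a 1 × W strip and each column by an H × 1 strip (the surrounding 1D convolutions and fusion of the real module are omitted):

```python
# Sketch of strip pooling on a small 2D feature map:
# one 1 x W horizontal strip average per row and
# one H x 1 vertical strip average per column.

def strip_pool(fmap):
    H, W = len(fmap), len(fmap[0])
    row_strips = [sum(row) / W for row in fmap]                              # H values
    col_strips = [sum(fmap[i][j] for i in range(H)) / H for j in range(W)]   # W values
    return row_strips, col_strips

fmap = [[1, 2, 3],
        [4, 5, 6]]
rows, cols = strip_pool(fmap)
```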
A Global Convolutional Network, or GCN, is a semantic segmentation building block that utilizes a large kernel to help perform classification and localization tasks simultaneously. It can be used in an FCN-like structure, where the GCN is used to generate semantic score maps. Instead of directly using larger kernels or global convolution, the GCN module employs a combination of 1 × k + k × 1 and k × 1 + 1 × k convolutions, which enables dense connections within a large k × k region in the feature map.
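Why the decomposition is cheap can be checked with simple parameter arithmetic (the layer sizes below are illustrative, not taken from the paper): a dense k × k convolution costs O(k²) weights while the two symmetric 1 × k / k × 1 branches cost O(k), yet still densely connect a k × k region.

```python
# Parameter-count comparison: dense k x k convolution vs. the GCN
# decomposition into two symmetric 1 x k + k x 1 branches (biases ignored).

def dense_kxk_params(k, c_in, c_out):
    return k * k * c_in * c_out

def gcn_branch_params(k, c_in, c_out):
    # one branch: a 1 x k convolution followed by a k x 1 convolution
    return k * c_in * c_out + k * c_out * c_out

k, c_in, c_out = 15, 256, 21                  # illustrative sizes
dense = dense_kxk_params(k, c_in, c_out)
gcn = 2 * gcn_branch_params(k, c_in, c_out)   # two symmetric branches
```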
DeepMask is an object proposal algorithm based on a convolutional neural network. Given an input image patch, DeepMask generates a class-agnostic mask and an associated score which estimates the likelihood of the patch fully containing a centered object (without any notion of an object category). The core of the model is a ConvNet which jointly predicts the mask and the object score. A large part of the network is shared between those two tasks: only the last few network layers are specialized for separately outputting a mask and score prediction.
Criss-Cross Network
Criss-Cross Network (CCNet) aims to obtain full-image contextual information in an effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. Through a further recurrent operation, each pixel can finally capture the full-image dependencies. CCNet has the following merits: 1) GPU memory friendly: compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. 2) High computational efficiency: the recurrent criss-cross attention reduces the FLOPs of the non-local block by about 85%. 3) State-of-the-art performance.
Large convolutional kernels
Usage of larger-than-typical convolutional kernel sizes, as seen in 'Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs'.
IFNet is an architecture for video frame interpolation that adopts a coarse-to-fine strategy with progressively increased resolutions: it iteratively updates intermediate flows and soft fusion mask via successive IFBlocks. Conceptually, according to the iteratively updated flow fields, we can move corresponding pixels from two input frames to the same location in a latent intermediate frame and use a fusion mask to combine pixels from two input frames. Unlike most previous optical flow models, IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 convolution and deconvolution as building blocks.
Content-Aware ReAssembly of FEatures (CARAFE) is an operator for feature upsampling in convolutional neural networks. CARAFE has several appealing properties: (1) Large field of view. Unlike previous works (e.g. bilinear interpolation) that only exploit subpixel neighborhood, CARAFE can aggregate contextual information within a large receptive field. (2) Content-aware handling. Instead of using a fixed kernel for all samples (e.g. deconvolution), CARAFE enables instance-specific content-aware handling, which generates adaptive kernels on-the-fly. (3) Lightweight and fast to compute.
Dilated convolution with learnable spacings
Dilated convolution with learnable spacings (DCLS) is a type of convolution in which the spacings between the non-zero elements of the kernel are learned during training. This makes it possible to increase the receptive field of the convolution without increasing the number of parameters, which can improve the performance of the network on tasks that require long-range dependencies. A standard dilated convolution inserts a fixed number of zeros between the non-zero elements of the kernel, so the kernel skips over some input positions; the effect is again a larger receptive field at no extra parameter cost. DCLS takes this idea one step further by making those spacings learnable, so the positions of the kernel elements can adapt to the task at hand. This is particularly helpful for tasks that require long-range dependencies, such as image segmentation and object detection. DCLS has been shown to be effective for a variety of tasks, including image classification, object detection, and semantic segmentation.
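The fixed-spacing baseline that DCLS generalizes can be sketched directly: inserting d − 1 zeros between elements of a size-k kernel yields an effective extent of d·(k − 1) + 1 input positions. (DCLS itself, not shown here, would make each element's offset a learnable parameter instead of this uniform spacing.)

```python
# Sketch of standard dilation as zero insertion: a kernel [a, b, c] with
# dilation d covers d*(k-1)+1 input positions without adding parameters.

def dilate_kernel(kernel, d):
    """Insert d-1 zeros between consecutive kernel elements."""
    out = []
    for i, v in enumerate(kernel):
        out.append(v)
        if i < len(kernel) - 1:
            out.extend([0] * (d - 1))
    return out

k = [1, 2, 3]
dilated = dilate_kernel(k, d=3)
effective_size = len(dilated)          # d*(k-1)+1 = 3*2+1 = 7
```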
A Relativistic GAN is a type of generative adversarial network. It has a relativistic discriminator which estimates the probability that the given real data is more realistic than a randomly sampled fake data. The idea is to endow GANs with the property that the probability of real data being real, D(x_r), should decrease as the probability of fake data being real, D(x_f), increases. With a standard GAN, we can achieve this as follows. The standard GAN discriminator can be defined, in terms of the non-transformed layer C(x), as D(x) = sigmoid(C(x)). A simple way to make the discriminator relativistic, i.e., having its output depend on both real and fake data, is to sample from real/fake data pairs (x_r, x_f) and define it as D(x_r, x_f) = sigmoid(C(x_r) − C(x_f)). The modification can be interpreted as: the discriminator estimates the probability that the given real data is more realistic than a randomly sampled fake data. More generally, a Relativistic GAN can be interpreted as having a discriminator of the form a(C(x_r) − C(x_f)), where a is the activation function.
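The relativistic output can be sketched with a stand-in critic (the doubling function `C` below is a dummy, not part of the method): equal critic scores give exactly 0.5, and a higher real score pushes the probability above 0.5.

```python
import math

# Sketch of a relativistic discriminator: D(x_r, x_f) = sigmoid(C(x_r) - C(x_f)),
# the estimated probability that real data is more realistic than fake data.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def C(x):                 # dummy critic: higher output means "more realistic"
    return 2.0 * x

def relativistic_D(x_real, x_fake):
    return sigmoid(C(x_real) - C(x_fake))

p = relativistic_D(1.0, 0.0)      # real scores higher -> p > 0.5
tie = relativistic_D(0.5, 0.5)    # equal scores -> exactly 0.5
```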
PolarNet is an improved grid representation for online, single-scan LiDAR point clouds. Instead of using common spherical or bird's-eye-view projection, the polar bird's-eye-view representation balances the points across grid cells in a polar coordinate system, indirectly aligning a segmentation network's attention with the long-tailed distribution of the points along the radial axis.
Fast Feedforward Networks
A log-time alternative to feedforward layers, outperforming both the vanilla feedforward and mixture-of-experts approaches.
Vision-and-Language Transformer
ViLT is a minimal vision-and-language pre-training transformer model in which the processing of visual inputs is simplified to the same convolution-free manner in which text inputs are processed. The model-specific components of ViLT require less computation than the transformer component for multimodal interactions. The model is pre-trained on the following objectives: image-text matching, masked language modeling, and word patch alignment.
Sample Consistency Network (SCNet) is a method for instance segmentation which ensures that the IoU distribution of the samples at training time is as close as possible to that at inference time. To this end, only the outputs of the last box stage are used for mask predictions at both training and inference. The Figure shows the IoU distribution of the samples going to the mask branch at training time with/without sample consistency, compared to that at inference time.
Visual Parsing is a vision-and-language pretrained model that adopts self-attention for visual feature learning, where each visual token is an approximate weighted mixture of all tokens. Visual parsing thus provides the dependencies of each visual token pair, which helps better learn visual relations with language and promotes inter-modal alignment. The model is composed of a vision Transformer that takes an image as input and outputs the visual tokens, and a multimodal Transformer. It applies a linear layer and Layer Normalization to embed the vision tokens, and follows BERT to get word embeddings. Vision and language tokens are concatenated to form the input sequences, and the multimodal Transformer fuses the two modalities. A metric named Inter-Modality Flow (IMF) is used to quantify the interactions between the two modalities. Three pretraining tasks are adopted: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Feature Regression (MFR). MFR is a novel task included in this framework to mask visual tokens with similar or correlated semantics.
SimpleNet is a convolutional neural network with 13 layers. The network employs a homogeneous design utilizing 3 × 3 kernels for convolutional layers and 2 × 2 kernels for pooling operations. The only layers which do not use 3 × 3 kernels are the 11th and 12th layers; these layers utilize 1 × 1 convolutional kernels. Feature-map down-sampling is carried out using non-overlapping 2 × 2 max-pooling. In order to cope with the problems of vanishing gradients and over-fitting, SimpleNet also uses batch normalization with a moving average fraction of 0.95 before any ReLU non-linearity.
Perturbed-Attention Guidance
Deep Extreme Cut
DEXTR, or Deep Extreme Cut, obtains an object segmentation from its four extreme points: the left-most, right-most, top, and bottom pixels. The annotated extreme points are given as a guiding signal to the input of the network. To this end, we create a heatmap with activations in the regions of extreme points. We center a 2D Gaussian around each of the points, in order to create a single heatmap. The heatmap is concatenated with the RGB channels of the input image, to form a 4-channel input for the CNN. In order to focus on the object of interest, the input is cropped by the bounding box, formed from the extreme point annotations. To include context on the resulting crop, we relax the tight bounding box by several pixels. After the pre-processing step that comes exclusively from the extreme clicks, the input consists of an RGB crop including an object, plus its extreme points. ResNet-101 is chosen as backbone of the architecture. We remove the fully connected layers as well as the max pooling layers in the last two stages to preserve acceptable output resolution for dense prediction, and we introduce atrous convolutions in the last two stages to maintain the same receptive field. After the last ResNet-101 stage, we introduce a pyramid scene parsing module to aggregate global context to the final feature map. The output of the CNN is a probability map representing whether a pixel belongs to the object that we want to segment or not. The CNN is trained to minimize the standard cross entropy loss, which takes into account that different classes occur with different frequency in a dataset.
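The heatmap construction described above can be sketched in a few lines: a 2D Gaussian is centered on each extreme point and the per-point maps are merged into a single heatmap (via an element-wise max here; the sigma value is an arbitrary choice for illustration), ready to be stacked as a 4th channel next to RGB.

```python
import math

# Sketch of DEXTR's guiding-signal encoding: one heatmap with a 2D Gaussian
# centered at each of the four extreme points (top, bottom, left, right).

def extreme_point_heatmap(h, w, points, sigma=1.5):
    hm = [[0.0] * w for _ in range(h)]
    for (py, px) in points:
        for y in range(h):
            for x in range(w):
                g = math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
                hm[y][x] = max(hm[y][x], g)   # merge Gaussians into one map
    return hm

points = [(0, 2), (4, 2), (2, 0), (2, 4)]     # top, bottom, left, right pixels
hm = extreme_point_heatmap(5, 5, points)      # 4th input channel next to RGB
```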
In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a number of tasks including ImageNet-CN, Flickr30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.
Involution is an atomic operation for deep neural networks that inverts the design principles of convolution. Involution kernels are distinct in the spatial extent but shared across channels. If involution kernels are parameterized as fixed-sized matrices like convolution kernels and updated using the back-propagation algorithm, the learned involution kernels are impeded from transferring between input images with variable resolutions. The authors argue for two benefits of involution over convolution: (i) involution can summarize the context in a wider spatial arrangement, thus overcoming the difficulty of modeling long-range interactions; (ii) involution can adaptively allocate the weights over different positions, so as to prioritize the most informative visual elements in the spatial domain.
DetNet is a backbone convolutional neural network for object detection. Different from traditional pre-trained models for ImageNet classification, DetNet maintains the spatial resolution of the features even though extra stages are included. DetNet attempts to stay efficient by employing a low complexity dilated bottleneck structure.
Optimal Transport Modeling
VL-T5 is a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation. The model learns to generate labels in text based on the visual and textual inputs. In contrast to other existing methods, the framework unifies tasks as generating text labels conditioned on multimodal inputs. This allows the model to tackle vision-and-language tasks with unified text generation objective. The models use text prefixes to adapt to different tasks.
RTMDet: An Empirical Study of Designing Real-Time Object Detectors
Libra R-CNN is an object detection model that seeks to achieve a balanced training procedure. The authors' motivation is that training in past detectors has suffered from imbalance, which generally occurs at three levels: sample level, feature level, and objective level. To mitigate the adverse effects, Libra R-CNN integrates three novel components: IoU-balanced sampling, balanced feature pyramid, and balanced L1 loss, respectively reducing the imbalance at the sample, feature, and objective levels.
CondConv, or Conditionally Parameterized Convolutions, are a type of convolution which learns specialized convolutional kernels for each example. In particular, the convolutional kernels in a CondConv layer are parameterized as a linear combination of n experts, (α_1 W_1 + … + α_n W_n) ∗ x, where the weights α_i are functions of the input learned through gradient descent. To efficiently increase the capacity of a CondConv layer, developers can increase the number of experts. This can be more computationally efficient than increasing the size of the convolutional kernel itself, because the convolutional kernel is applied at many different positions within the input, while the experts are combined only once per input.
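The "combine kernels once, convolve once" trick rests on the linearity of convolution, which a small 1D sketch can verify: convolving with a weighted sum of expert kernels gives the same result as a weighted sum of the experts' outputs (the routing weights here are fixed for illustration; in CondConv they are input-dependent).

```python
# Sketch of the CondConv identity for a 1D correlation with valid padding:
# conv(x, sum_i a_i * W_i) == sum_i a_i * conv(x, W_i).

def conv1d(x, k):
    n = len(k)
    return [sum(x[i + j] * k[j] for j in range(n)) for i in range(len(x) - n + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
experts = [[1.0, 0.0], [0.5, -0.5]]     # two expert kernels W_1, W_2
alphas = [0.3, 0.7]                     # routing weights (input-dependent in CondConv)

combined_kernel = [sum(a * k[j] for a, k in zip(alphas, experts)) for j in range(2)]
out_combined = conv1d(x, combined_kernel)
out_mixture = [sum(a * y for a, y in zip(alphas, ys))
               for ys in zip(*(conv1d(x, k) for k in experts))]
```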
A Contractive Autoencoder is an autoencoder that adds a penalty term to the classical reconstruction cost function. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. This penalty term results in a localized space contraction which in turn yields robust features on the activation layer. The penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold.
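For a one-layer sigmoid encoder the penalty has a closed form that makes the idea concrete: with h = sigmoid(Wx), the Jacobian entry is J[i][j] = h_i (1 − h_i) W[i][j], and the penalty is its squared Frobenius norm. The sketch below uses arbitrary illustrative weights:

```python
import math

# Sketch of the contractive penalty for a one-layer sigmoid encoder
# h = sigmoid(W x): penalty = ||J||_F^2 with J[i][j] = h_i*(1-h_i)*W[i][j].

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def encoder(W, x):
    return [sigmoid(sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def contractive_penalty(W, x):
    h = encoder(W, x)
    return sum((h[i] * (1 - h[i]) * W[i][j]) ** 2
               for i in range(len(W)) for j in range(len(W[0])))

W = [[0.5, -0.2], [0.1, 0.4]]   # illustrative encoder weights
x = [1.0, 2.0]
penalty = contractive_penalty(W, x)   # added to the reconstruction loss
```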
SNIPER is a multi-scale training approach for instance-level recognition tasks like object detection and instance-level segmentation. Instead of processing all pixels in an image pyramid, SNIPER selectively processes context regions around the ground-truth objects (a.k.a chips). This can help to speed up multi-scale training as it operates on low-resolution chips. Due to its memory-efficient design, SNIPER can benefit from Batch Normalization during training and it makes larger batch-sizes possible for instance-level recognition tasks on a single GPU.
Temporal Jittering is a method used in deep learning for video, where we sample multiple training clips from each video with random start times at every epoch.
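A minimal sketch of the sampling step (the clip length and counts below are arbitrary illustrative values): each epoch draws fresh random start frames, so the network sees different temporal crops of the same video over training.

```python
import random

# Sketch of temporal jittering: at every epoch, sample clips with fresh
# random start frames from a video of num_frames frames.

def sample_clips(num_frames, clip_len, clips_per_video, rng):
    max_start = num_frames - clip_len          # last valid start frame
    return [rng.randint(0, max_start) for _ in range(clips_per_video)]

rng = random.Random(0)
starts_epoch1 = sample_clips(300, 16, 4, rng)
starts_epoch2 = sample_clips(300, 16, 4, rng)  # re-sampled next epoch
```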
Fawkes is an image cloaking system that helps individuals inoculate their images against unauthorized facial recognition models. Fawkes achieves this by helping users add imperceptible pixel-level changes ("cloaks") to their own photos before releasing them. When used to train facial recognition models, these "cloaked" images produce functional models that consistently cause normal images of the user to be misidentified.
Feature Pyramid Grid
Feature Pyramid Grids, or FPG, is a deep multi-pathway feature pyramid that represents the feature scale-space as a regular grid of parallel bottom-up pathways which are fused by multi-directional lateral connections. It connects the backbone features of a ConvNet with a regular structure of parallel top-down pyramid pathways which are fused by multi-directional lateral connections: AcrossSame, AcrossUp, AcrossDown, and AcrossSkip. AcrossSkip connections are direct connections, while all other types use convolutional and ReLU layers. On a high level, FPG is a deep generalization of FPN from one to multiple pathways under a dense lateral connectivity structure.
A LAPGAN, or Laplacian Generative Adversarial Network, is a type of generative adversarial network that has a Laplacian pyramid representation. In the sampling procedure following training, we have a set of generative convnet models {G_0, …, G_K}, each of which captures the distribution of coefficients h_k for natural images at a different level of the Laplacian pyramid. Sampling an image is akin to a reconstruction procedure, except that the generative models are used to produce the h_k's via the recurrence: I_k = u(I_{k+1}) + h_k = u(I_{k+1}) + G_k(z_k, u(I_{k+1})). The recurrence starts by setting I_{K+1} = 0 and using the model at the final level to generate a residual image from a noise vector z_K: h_K = G_K(z_K). Models at all levels except the final are conditional generative models that take an upsampled version of the current image, u(I_{k+1}), as a conditioning variable, in addition to the noise vector z_k. The generative models {G_0, …, G_K} are trained using the CGAN approach at each level of the pyramid. Specifically, we construct a Laplacian pyramid from each training image I. At each level we make a stochastic choice (with equal probability) to either (i) construct the coefficients h_k using the standard Laplacian pyramid coefficient generation procedure, or (ii) generate them using G_k: h_k = G_k(z_k, l_k), where l_k = u(I_{k+1}) is the upsampled low-pass image. The discriminator D_k takes the real or generated coefficients, along with l_k as a conditioning input, and predicts whether they are real or generated; at the final level, h_K = G_K(z_K) and D_K takes h_K or the generated coefficients as its only input. Breaking the generation into successive refinements is the key idea. We give up any "global" notion of fidelity; an attempt is never made to train a network to discriminate between the output of a cascade and a real image, and instead the focus is on making each step plausible.
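The pyramid recurrence LAPGAN builds on can be checked with toy 1D signals. Here d(.) halves resolution by pair averaging and u(.) doubles it by duplication (simplified stand-ins for the image operators); the band-pass coefficients are I minus the upsampled low-pass, and coarse-to-fine reconstruction recovers the input exactly. In LAPGAN, a generator G_k would produce these coefficients instead of computing them from a real image.

```python
# Sketch of the Laplacian pyramid: build band-pass coefficients h_k and a
# final low-pass signal, then reconstruct coarse-to-fine via u(.) + h_k.

def d(sig):   # downsample: average adjacent pairs
    return [(sig[i] + sig[i + 1]) / 2 for i in range(0, len(sig), 2)]

def u(sig):   # upsample: duplicate each sample
    return [v for x in sig for v in (x, x)]

def build_pyramid(sig, levels):
    coeffs, cur = [], sig
    for _ in range(levels):
        nxt = d(cur)
        coeffs.append([a - b for a, b in zip(cur, u(nxt))])   # h_k = I_k - u(I_{k+1})
        cur = nxt
    return coeffs, cur            # band-pass coefficients + final low-pass signal

def reconstruct(coeffs, low):
    cur = low
    for h in reversed(coeffs):
        cur = [a + b for a, b in zip(u(cur), h)]              # I_k = u(I_{k+1}) + h_k
    return cur

sig = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
coeffs, low = build_pyramid(sig, levels=2)
recon = reconstruct(coeffs, low)
```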
SpineNet is a convolutional neural network backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
Height-driven Attention Network
Height-driven Attention Network, or HANet, is a general add-on module for improving semantic segmentation of urban-scene images. It emphasizes informative features or classes selectively according to the vertical position of a pixel. In urban-scene images, the pixel-wise class distributions differ significantly among horizontally segmented sections; urban-scene images thus have their own distinct characteristics, yet most semantic segmentation networks do not reflect such unique attributes in their architecture. The proposed network architecture incorporates the capability of exploiting these attributes to handle urban-scene datasets effectively.
PULSE is a self-supervised photo upsampling algorithm. Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the downscaling loss, which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, the authors aim to restrict the search space to guarantee realistic outputs.
Voxel RoI Pooling is an RoI feature extractor that extracts RoI features directly from voxel features for further refinement. It starts by dividing a region proposal into regular sub-voxels; the center point is taken as the grid point of the corresponding sub-voxel. Since the feature volumes are extremely sparse (non-empty voxels account for only a small fraction of the space), max pooling over the features of each sub-voxel cannot be applied directly. Instead, features from neighboring voxels are integrated into the grid points for feature extraction. Specifically, given a grid point g_i, voxel query is first exploited to group a set of neighboring voxels {v_1, …, v_K}. Then the neighboring voxel features are aggregated with a PointNet module: η_i = max_{k=1,…,K} Φ(v_k − g_i; φ_k), where v_k − g_i represents the relative coordinates, φ_k is the voxel feature of v_k, and Φ indicates an MLP. The max pooling operation is performed along the channels to obtain the aggregated feature vector η_i. In particular, Voxel RoI Pooling is exploited to extract voxel features from the 3D feature volumes of the last two stages in the backbone network, and for each stage, two Manhattan distance thresholds are set to group voxels at multiple scales. The aggregated features pooled from different stages and scales are then concatenated to obtain the RoI features.
Channel-wise Cross Attention is a module for semantic segmentation used in the UCTransNet architecture. It is used to fuse features of inconsistent semantics between the Channel Transformer and the U-Net decoder: it guides the channel and information filtration of the Transformer features and eliminates the ambiguity with the decoder features. Mathematically, we take the i-th level Transformer output O_i and the i-th level decoder feature map D_i as the inputs of Channel-wise Cross Attention. Spatial squeeze is performed by a global average pooling (GAP) layer, producing a vector G(X) whose k-th channel is the spatial average of the k-th channel of X. This operation embeds the global spatial information; the attention mask is then generated from the two squeezed vectors using two Linear layers and the ReLU operator, which encodes the channel-wise dependencies. Following ECA-Net, which empirically showed that avoiding dimensionality reduction is important for learning channel attention, the authors use a single Linear layer and a sigmoid function to build the channel attention map. The resultant vector is used to recalibrate, or excite, O_i, where the activation indicates the importance of each channel. Finally, the masked O_i is concatenated with the up-sampled features of the i-th level decoder.
Matrix Non-Maximum Suppression
Matrix NMS, or Matrix Non-Maximum Suppression, performs non-maximum suppression with parallel matrix operations in one shot. It is motivated by Soft-NMS, which decays the other detection scores as a monotonic decreasing function of their overlaps: by decaying the scores according to IoUs recursively, higher-IoU detections are eliminated with a minimum score threshold. However, such a process is sequential, like traditional Greedy NMS, and cannot be implemented in parallel. Matrix NMS views this process from another perspective by considering how a predicted mask m_j is suppressed. For m_j, its decay factor is affected by: (a) the penalty of each prediction m_i with s_i > s_j on m_j, where s_i and s_j are the confidence scores; and (b) the probability of m_i being suppressed. For (a), the penalty of each prediction m_i on m_j can be easily computed from f(iou_{i,j}). For (b), the probability of m_i being suppressed is not so elegant to compute; however, it usually has a positive correlation with the IoUs, so it is directly approximated by the most overlapped prediction on m_i: f(iou_{·,i}) = min over s_k > s_i of f(iou_{k,i}). To this end, the final decay factor becomes decay_j = min over s_i > s_j of f(iou_{i,j}) / f(iou_{·,i}), and the updated score is computed by s_j = s_j · decay_j. The authors consider the two most simple decremented functions: linear, f(iou_{i,j}) = 1 − iou_{i,j}, and Gaussian, f(iou_{i,j}) = exp(−iou_{i,j}² / σ).
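A small sketch with the linear decrement function makes the decay rule concrete (a loop-based reference; the real method evaluates the same quantities as matrix operations in one shot). Predictions are assumed sorted by descending score; the heavily overlapping second mask is decayed while the distinct third mask keeps most of its score.

```python
# Reference sketch of Matrix NMS decay factors with the linear decrement
# f(iou) = 1 - iou, for predictions sorted by descending score.

def matrix_nms_decay(ious):
    """ious[i][j]: IoU between predictions i and j (score order: descending)."""
    f = lambda iou: 1.0 - iou
    n = len(ious)
    decay = [1.0]                  # highest-scored prediction is never decayed
    for j in range(1, n):
        candidates = []
        for i in range(j):
            # suppression of i approximated via its most-overlapping
            # higher-scored prediction: f(iou_.i) = min_k f(iou_ki)
            comp = min((f(ious[k][i]) for k in range(i)), default=1.0)
            candidates.append(f(ious[i][j]) / comp)
        decay.append(min(candidates))
    return decay

ious = [[1.0, 0.8, 0.1],
        [0.8, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
scores = [0.9, 0.8, 0.7]
decay = matrix_nms_decay(ious)
new_scores = [s * dcy for s, dcy in zip(scores, decay)]
```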
MobileViTv2 is a vision transformer tuned for mobile devices. It introduces a separable self-attention method to reduce cost compared with MobileViT.
A Multiscale Dilated Convolution Block is an Inception-style convolutional block motivated by the ideas that image features naturally occur at multiple scales, that a network’s expressivity is proportional to the range of functions it can represent divided by its total number of parameters, and by the desire to efficiently expand a network’s receptive field. The Multiscale Dilated Convolution (MDC) block applies a single filter at multiple dilation factors, then performs a weighted elementwise sum of each dilated filter’s output, allowing the network to simultaneously learn a set of features and the relevant scales at which those features occur with a minimal increase in parameters. This also rapidly expands the network’s receptive field without requiring an increase in depth or the number of parameters.
Minibatch Discrimination is a discriminative technique for generative adversarial networks where we discriminate between whole minibatches of samples rather than between individual samples. This is intended to avoid collapse of the generator.
Pathology Language and Image Pre-Training
Pathology Language and Image Pre-Training (PLIP) is a vision-and-language foundation model created by fine-tuning CLIP on pathology images.
Co-Scale Conv-attentional Image Transformer
Co-Scale Conv-Attentional Image Transformer (CoaT) is a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities.
Scattering Transform
A wavelet scattering transform computes a translation-invariant representation, stable to deformations, using a deep convolutional network architecture. It computes non-linear invariants with modulus and averaging pooling functions, thereby eliminating image variability due to translation while remaining stable to deformations. Image source: Bruna and Mallat.