YOLOv1 is a single-stage object detection model. Object detection is framed as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. The network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means the network reasons globally about the full image and all the objects in the image.
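The single-evaluation encoding described above can be made concrete with a small sketch. This is not the original implementation; it only illustrates how YOLOv1's output tensor packs, for an S × S grid, B boxes of 5 numbers each plus C shared class scores per cell (S = 7, B = 2, C = 20 in the paper):

```python
# Illustrative sketch of YOLOv1's output encoding (not the original code).
# For an S x S grid, each cell predicts B boxes (x, y, w, h, confidence)
# plus C class probabilities, all produced in one forward pass.

S, B, C = 7, 2, 20          # values used in the YOLOv1 paper (PASCAL VOC)
preds_per_cell = B * 5 + C  # 5 numbers per box + shared class scores
output_size = S * S * preds_per_cell

def decode_cell(cell, B, C):
    """Split one grid cell's raw prediction vector into boxes and class scores."""
    boxes = [cell[i * 5:(i + 1) * 5] for i in range(B)]
    class_scores = cell[B * 5:]
    return boxes, class_scores

cell = list(range(preds_per_cell))      # dummy activations for one cell
boxes, class_scores = decode_cell(cell, B, C)
```

With the paper's settings this yields the familiar 7 × 7 × 30 = 1470-dimensional output.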
SNet is a convolutional neural network architecture and object detection backbone used for the ThunderNet two-stage object detector. SNet uses ShuffleNetV2 basic blocks but replaces all 3×3 depthwise convolutions with 5×5 depthwise convolutions.
LOGAN is a generative adversarial network that uses a latent optimization approach based on natural gradient descent (NGD). For the Fisher matrix in NGD, the authors use the empirical Fisher with Tikhonov damping, and they apply Euclidean norm regularization to the optimization step. For LOGAN's base architecture, BigGAN-deep is used with a few modifications: 1) increasing the size of the latent source z, to compensate for the randomness of the source lost when optimising z; 2) using the uniform distribution instead of the standard normal distribution for z, to be consistent with the clipping operation; 3) using leaky ReLU (with a slope of 0.2 for the negative part) instead of ReLU as the non-linearity, for smoother gradient flow with respect to z.
Generalized Mean Pooling (GeM) computes the generalized mean of each channel in a tensor. Formally, for channel c with activations x_{cu} over spatial locations u ∈ Ω: e_c = ( (1/|Ω|) Σ_{u∈Ω} x_{cu}^p )^{1/p}, where p is a parameter. Setting this exponent p > 1 increases the contrast of the pooled feature map and focuses on the salient features of the image. GeM is a generalization of the average pooling commonly used in classification networks (p = 1) and of the spatial max-pooling layer (p → ∞). Source: MultiGrain. Image source: Eva Mohedano.
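The behaviour of the exponent p can be checked with a minimal pure-Python sketch of GeM over one channel (real implementations operate on whole tensors):

```python
# Minimal sketch of Generalized Mean (GeM) pooling over one channel.
# p = 1 recovers average pooling; a large p approaches max pooling.

def gem_pool(values, p):
    """Generalized mean of a list of non-negative activations."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

acts = [0.1, 0.2, 0.9, 0.4]
avg = gem_pool(acts, p=1)        # equals the plain average
near_max = gem_pool(acts, p=50)  # close to max(acts) = 0.9
```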
A Fractal Block is an image model block that utilizes an expansion rule that yields a structural layout of truncated fractals. With the base case f_1(z) = conv(z), where conv is a convolutional layer, we then have recursive fractals of the form: f_{C+1}(z) = [(f_C ∘ f_C)(z)] joined with [conv(z)], where C is the number of columns. For the join layer (green in the Figure), the element-wise mean is used rather than concatenation or addition.
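The expansion rule can be sketched in a few lines. This toy version replaces the convolution with a counter (a dummy `z + 1` transformation) so we can verify how the truncated fractal grows; with C columns the block contains 2^C − 1 "conv" applications:

```python
# Sketch of the FractalNet expansion rule: f_1 = conv and
# f_{C+1}(z) = mean((f_C o f_C)(z), conv(z)).
# "conv" is a stand-in that counts its applications.

calls = {"conv": 0}

def conv(z):
    calls["conv"] += 1
    return z + 1                               # dummy transformation

def fractal(C, z):
    if C == 1:
        return conv(z)
    deep = fractal(C - 1, fractal(C - 1, z))   # two stacked copies of f_{C-1}
    shallow = conv(z)
    return (deep + shallow) / 2                # join layer: element-wise mean

out = fractal(3, 0.0)
num_convs = calls["conv"]                      # 2**3 - 1 = 7 conv applications
```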
IFBlock is a video model block used in the IFNet architecture for video frame interpolation. IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 convolution and deconvolution as building blocks. Each IFBlock has a feed-forward structure consisting of several convolutional layers and an upsampling operator. Except for the layer that outputs the optical flow residuals and the fusion map, PReLU activations are used.
Strip Pooling is a pooling strategy for scene parsing which considers a long but narrow kernel, i.e., 1 × N or N × 1. As an alternative to global pooling, strip pooling offers two advantages. First, it deploys a long kernel shape along one spatial dimension and hence enables capturing long-range relations of isolated regions. Second, it keeps a narrow kernel shape along the other spatial dimension, which facilitates capturing local context and prevents irrelevant regions from interfering with the label prediction. Integrating such long but narrow pooling kernels enables scene parsing networks to simultaneously aggregate both global and local context. This is essentially different from traditional spatial pooling, which collects context from a fixed square region.
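A minimal sketch of the pooling step on a toy feature map: each row is averaged by a 1 × W strip and each column by an H × 1 strip (the surrounding 1D convolutions and fusion of the real module are omitted):

```python
# Sketch of strip pooling on a small 2D feature map:
# one 1 x W horizontal strip average per row and
# one H x 1 vertical strip average per column.

def strip_pool(fmap):
    H, W = len(fmap), len(fmap[0])
    row_strips = [sum(row) / W for row in fmap]                              # H values
    col_strips = [sum(fmap[i][j] for i in range(H)) / H for j in range(W)]   # W values
    return row_strips, col_strips

fmap = [[1, 2, 3],
        [4, 5, 6]]
rows, cols = strip_pool(fmap)
```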
A Global Convolutional Network, or GCN, is a semantic segmentation building block that utilizes a large kernel to help perform classification and localization tasks simultaneously. It can be used in an FCN-like structure, where the GCN is used to generate semantic score maps. Instead of directly using larger kernels or global convolution, the GCN module employs a combination of 1 × k + k × 1 and k × 1 + 1 × k convolutions, which enables dense connections within a large k × k region in the feature map.
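Why the decomposition is cheap can be checked with simple parameter arithmetic (the layer sizes below are illustrative, not taken from the paper): a dense k × k convolution costs O(k²) weights while the two symmetric 1 × k / k × 1 branches cost O(k), yet still densely connect a k × k region.

```python
# Parameter-count comparison: dense k x k convolution vs. the GCN
# decomposition into two symmetric 1 x k + k x 1 branches (biases ignored).

def dense_kxk_params(k, c_in, c_out):
    return k * k * c_in * c_out

def gcn_branch_params(k, c_in, c_out):
    # one branch: a 1 x k convolution followed by a k x 1 convolution
    return k * c_in * c_out + k * c_out * c_out

k, c_in, c_out = 15, 256, 21                  # illustrative sizes
dense = dense_kxk_params(k, c_in, c_out)
gcn = 2 * gcn_branch_params(k, c_in, c_out)   # two symmetric branches
```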
DeepMask is an object proposal algorithm based on a convolutional neural network. Given an input image patch, DeepMask generates a class-agnostic mask and an associated score which estimates the likelihood of the patch fully containing a centered object (without any notion of an object category). The core of the model is a ConvNet which jointly predicts the mask and the object score. A large part of the network is shared between those two tasks: only the last few network layers are specialized for separately outputting a mask and score prediction.
Criss-Cross Network
Criss-Cross Network (CCNet) aims to obtain full-image contextual information in an effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. Through a further recurrent operation, each pixel can finally capture the full-image dependencies. CCNet has the following merits: 1) GPU memory friendly: compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. 2) High computational efficiency: the recurrent criss-cross attention reduces the FLOPs of the non-local block by about 85%. 3) State-of-the-art performance.
Large convolutional kernels
Usage of larger-than-typical convolutional kernel sizes, as seen in 'Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs'.
IFNet is an architecture for video frame interpolation that adopts a coarse-to-fine strategy with progressively increased resolutions: it iteratively updates intermediate flows and soft fusion mask via successive IFBlocks. Conceptually, according to the iteratively updated flow fields, we can move corresponding pixels from two input frames to the same location in a latent intermediate frame and use a fusion mask to combine pixels from two input frames. Unlike most previous optical flow models, IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 convolution and deconvolution as building blocks.
Content-Aware ReAssembly of FEatures (CARAFE) is an operator for feature upsampling in convolutional neural networks. CARAFE has several appealing properties: (1) Large field of view. Unlike previous works (e.g. bilinear interpolation) that only exploit subpixel neighborhood, CARAFE can aggregate contextual information within a large receptive field. (2) Content-aware handling. Instead of using a fixed kernel for all samples (e.g. deconvolution), CARAFE enables instance-specific content-aware handling, which generates adaptive kernels on-the-fly. (3) Lightweight and fast to compute.
Dilated convolution with learnable spacings
Dilated convolution with learnable spacings (DCLS) is a type of convolution in which the spacings between the non-zero elements of the kernel are learned during training. This makes it possible to increase the receptive field of the convolution without increasing the number of parameters, which can improve the performance of the network on tasks that require long-range dependencies. A standard dilated convolution inserts a fixed number of zeros between the non-zero elements of the kernel, so the kernel skips over some input positions; the effect is again a larger receptive field at no extra parameter cost. DCLS takes this idea one step further by making those spacings learnable, so the positions of the kernel elements can adapt to the task at hand. This is particularly helpful for tasks that require long-range dependencies, such as image segmentation and object detection. DCLS has been shown to be effective for a variety of tasks, including image classification, object detection, and semantic segmentation.
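The fixed-spacing baseline that DCLS generalizes can be sketched directly: inserting d − 1 zeros between elements of a size-k kernel yields an effective extent of d·(k − 1) + 1 input positions. (DCLS itself, not shown here, would make each element's offset a learnable parameter instead of this uniform spacing.)

```python
# Sketch of standard dilation as zero insertion: a kernel [a, b, c] with
# dilation d covers d*(k-1)+1 input positions without adding parameters.

def dilate_kernel(kernel, d):
    """Insert d-1 zeros between consecutive kernel elements."""
    out = []
    for i, v in enumerate(kernel):
        out.append(v)
        if i < len(kernel) - 1:
            out.extend([0] * (d - 1))
    return out

k = [1, 2, 3]
dilated = dilate_kernel(k, d=3)
effective_size = len(dilated)          # d*(k-1)+1 = 3*2+1 = 7
```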
A Relativistic GAN is a type of generative adversarial network. It has a relativistic discriminator which estimates the probability that the given real data is more realistic than a randomly sampled fake data. The idea is to endow GANs with the property that the probability of real data being real, D(x_r), should decrease as the probability of fake data being real, D(x_f), increases. With a standard GAN, we can achieve this as follows. The standard GAN discriminator can be defined, in terms of the non-transformed layer C(x), as D(x) = sigmoid(C(x)). A simple way to make the discriminator relativistic, i.e., having its output depend on both real and fake data, is to sample from real/fake data pairs (x_r, x_f) and define it as D(x_r, x_f) = sigmoid(C(x_r) − C(x_f)). The modification can be interpreted as: the discriminator estimates the probability that the given real data is more realistic than a randomly sampled fake data. More generally, a Relativistic GAN can be interpreted as having a discriminator of the form a(C(x_r) − C(x_f)), where a is the activation function.
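The relativistic output can be sketched with a stand-in critic (the doubling function `C` below is a dummy, not part of the method): equal critic scores give exactly 0.5, and a higher real score pushes the probability above 0.5.

```python
import math

# Sketch of a relativistic discriminator: D(x_r, x_f) = sigmoid(C(x_r) - C(x_f)),
# the estimated probability that real data is more realistic than fake data.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def C(x):                 # dummy critic: higher output means "more realistic"
    return 2.0 * x

def relativistic_D(x_real, x_fake):
    return sigmoid(C(x_real) - C(x_fake))

p = relativistic_D(1.0, 0.0)      # real scores higher -> p > 0.5
tie = relativistic_D(0.5, 0.5)    # equal scores -> exactly 0.5
```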
PolarNet is an improved grid representation for online, single-scan LiDAR point clouds. Instead of using common spherical or bird's-eye-view projection, the polar bird's-eye-view representation balances the points across grid cells in a polar coordinate system, indirectly aligning a segmentation network's attention with the long-tailed distribution of the points along the radial axis.
Fast Feedforward Networks
A log-time alternative to feedforward layers, outperforming both the vanilla feedforward and mixture-of-experts approaches.
Vision-and-Language Transformer
ViLT is a minimal vision-and-language pre-training transformer model in which the processing of visual inputs is simplified to the same convolution-free manner in which text inputs are processed. The model-specific components of ViLT require less computation than the transformer component for multimodal interactions. The model is pre-trained on the following objectives: image-text matching, masked language modeling, and word patch alignment.
Sample Consistency Network (SCNet) is a method for instance segmentation which ensures that the IoU distribution of the samples at training time is as close as possible to that at inference time. To this end, only the outputs of the last box stage are used for mask predictions at both training and inference. The Figure shows the IoU distribution of the samples going to the mask branch at training time with/without sample consistency, compared to that at inference time.
Visual Parsing is a vision-and-language pretrained model that adopts self-attention for visual feature learning, where each visual token is an approximate weighted mixture of all tokens. Visual parsing thus provides the dependencies of each visual token pair, which helps better learn visual relations with language and promotes inter-modal alignment. The model is composed of a vision Transformer that takes an image as input and outputs the visual tokens, and a multimodal Transformer. It applies a linear layer and Layer Normalization to embed the vision tokens, and follows BERT to get word embeddings. Vision and language tokens are concatenated to form the input sequences, and the multimodal Transformer fuses the two modalities. A metric named Inter-Modality Flow (IMF) is used to quantify the interactions between the two modalities. Three pretraining tasks are adopted: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Feature Regression (MFR). MFR is a novel task included in this framework to mask visual tokens with similar or correlated semantics.
SimpleNet is a convolutional neural network with 13 layers. The network employs a homogeneous design utilizing 3 × 3 kernels for convolutional layers and 2 × 2 kernels for pooling operations. The only layers which do not use 3 × 3 kernels are the 11th and 12th layers; these layers utilize 1 × 1 convolutional kernels. Feature-map down-sampling is carried out using non-overlapping 2 × 2 max-pooling. In order to cope with the problems of vanishing gradients and over-fitting, SimpleNet also uses batch normalization with a moving average fraction of 0.95 before any ReLU non-linearity.
Perturbed-Attention Guidance
Deep Extreme Cut
DEXTR, or Deep Extreme Cut, obtains an object segmentation from its four extreme points: the left-most, right-most, top, and bottom pixels. The annotated extreme points are given as a guiding signal to the input of the network. To this end, we create a heatmap with activations in the regions of extreme points. We center a 2D Gaussian around each of the points, in order to create a single heatmap. The heatmap is concatenated with the RGB channels of the input image, to form a 4-channel input for the CNN. In order to focus on the object of interest, the input is cropped by the bounding box, formed from the extreme point annotations. To include context on the resulting crop, we relax the tight bounding box by several pixels. After the pre-processing step that comes exclusively from the extreme clicks, the input consists of an RGB crop including an object, plus its extreme points. ResNet-101 is chosen as backbone of the architecture. We remove the fully connected layers as well as the max pooling layers in the last two stages to preserve acceptable output resolution for dense prediction, and we introduce atrous convolutions in the last two stages to maintain the same receptive field. After the last ResNet-101 stage, we introduce a pyramid scene parsing module to aggregate global context to the final feature map. The output of the CNN is a probability map representing whether a pixel belongs to the object that we want to segment or not. The CNN is trained to minimize the standard cross entropy loss, which takes into account that different classes occur with different frequency in a dataset.
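The heatmap construction described above can be sketched in a few lines: a 2D Gaussian is centered on each extreme point and the per-point maps are merged into a single heatmap (via an element-wise max here; the sigma value is an arbitrary choice for illustration), ready to be stacked as a 4th channel next to RGB.

```python
import math

# Sketch of DEXTR's guiding-signal encoding: one heatmap with a 2D Gaussian
# centered at each of the four extreme points (top, bottom, left, right).

def extreme_point_heatmap(h, w, points, sigma=1.5):
    hm = [[0.0] * w for _ in range(h)]
    for (py, px) in points:
        for y in range(h):
            for x in range(w):
                g = math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
                hm[y][x] = max(hm[y][x], g)   # merge Gaussians into one map
    return hm

points = [(0, 2), (4, 2), (2, 0), (2, 4)]     # top, bottom, left, right pixels
hm = extreme_point_heatmap(5, 5, points)      # 4th input channel next to RGB
```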
In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a number of tasks including ImageNet-CN, Flickr30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.
Involution is an atomic operation for deep neural networks that inverts the design principles of convolution. Involution kernels are distinct in the spatial extent but shared across channels. If involution kernels are parameterized as fixed-sized matrices like convolution kernels and updated using the back-propagation algorithm, the learned involution kernels are impeded from transferring between input images with variable resolutions. The authors argue for two benefits of involution over convolution: (i) involution can summarize the context in a wider spatial arrangement, thus overcoming the difficulty of modeling long-range interactions; (ii) involution can adaptively allocate the weights over different positions, so as to prioritize the most informative visual elements in the spatial domain.
DetNet is a backbone convolutional neural network for object detection. Different from traditional pre-trained models for ImageNet classification, DetNet maintains the spatial resolution of the features even though extra stages are included. DetNet attempts to stay efficient by employing a low complexity dilated bottleneck structure.
Optimal Transport Modeling
VL-T5 is a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation. The model learns to generate labels in text based on the visual and textual inputs. In contrast to other existing methods, the framework unifies tasks as generating text labels conditioned on multimodal inputs. This allows the model to tackle vision-and-language tasks with unified text generation objective. The models use text prefixes to adapt to different tasks.
RTMDet: An Empirical Study of Designing Real-Time Object Detectors
Libra R-CNN is an object detection model that seeks to achieve a balanced training procedure. The authors' motivation is that training in past detectors has suffered from imbalance, which generally occurs at three levels: sample level, feature level, and objective level. To mitigate the adverse effects, Libra R-CNN integrates three novel components: IoU-balanced sampling, balanced feature pyramid, and balanced L1 loss, respectively reducing the imbalance at the sample, feature, and objective levels.
CondConv, or Conditionally Parameterized Convolutions, are a type of convolution which learns specialized convolutional kernels for each example. In particular, the convolutional kernels in a CondConv layer are parameterized as a linear combination of n experts, (α_1 W_1 + … + α_n W_n) ∗ x, where the weights α_i are functions of the input learned through gradient descent. To efficiently increase the capacity of a CondConv layer, developers can increase the number of experts. This can be more computationally efficient than increasing the size of the convolutional kernel itself, because the convolutional kernel is applied at many different positions within the input, while the experts are combined only once per input.
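The "combine kernels once, convolve once" trick rests on the linearity of convolution, which a small 1D sketch can verify: convolving with a weighted sum of expert kernels gives the same result as a weighted sum of the experts' outputs (the routing weights here are fixed for illustration; in CondConv they are input-dependent).

```python
# Sketch of the CondConv identity for a 1D correlation with valid padding:
# conv(x, sum_i a_i * W_i) == sum_i a_i * conv(x, W_i).

def conv1d(x, k):
    n = len(k)
    return [sum(x[i + j] * k[j] for j in range(n)) for i in range(len(x) - n + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
experts = [[1.0, 0.0], [0.5, -0.5]]     # two expert kernels W_1, W_2
alphas = [0.3, 0.7]                     # routing weights (input-dependent in CondConv)

combined_kernel = [sum(a * k[j] for a, k in zip(alphas, experts)) for j in range(2)]
out_combined = conv1d(x, combined_kernel)
out_mixture = [sum(a * y for a, y in zip(alphas, ys))
               for ys in zip(*(conv1d(x, k) for k in experts))]
```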
A Contractive Autoencoder is an autoencoder that adds a penalty term to the classical reconstruction cost function. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. This penalty term results in a localized space contraction which in turn yields robust features on the activation layer. The penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold.
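For a one-layer sigmoid encoder the penalty has a closed form that makes the idea concrete: with h = sigmoid(Wx), the Jacobian entry is J[i][j] = h_i (1 − h_i) W[i][j], and the penalty is its squared Frobenius norm. The sketch below uses arbitrary illustrative weights:

```python
import math

# Sketch of the contractive penalty for a one-layer sigmoid encoder
# h = sigmoid(W x): penalty = ||J||_F^2 with J[i][j] = h_i*(1-h_i)*W[i][j].

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def encoder(W, x):
    return [sigmoid(sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def contractive_penalty(W, x):
    h = encoder(W, x)
    return sum((h[i] * (1 - h[i]) * W[i][j]) ** 2
               for i in range(len(W)) for j in range(len(W[0])))

W = [[0.5, -0.2], [0.1, 0.4]]   # illustrative encoder weights
x = [1.0, 2.0]
penalty = contractive_penalty(W, x)   # added to the reconstruction loss
```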
SNIPER is a multi-scale training approach for instance-level recognition tasks like object detection and instance-level segmentation. Instead of processing all pixels in an image pyramid, SNIPER selectively processes context regions around the ground-truth objects (a.k.a chips). This can help to speed up multi-scale training as it operates on low-resolution chips. Due to its memory-efficient design, SNIPER can benefit from Batch Normalization during training and it makes larger batch-sizes possible for instance-level recognition tasks on a single GPU.
Temporal Jittering is a method used in deep learning for video, where we sample multiple training clips from each video with random start times at every epoch.
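A minimal sketch of the sampling step (the clip length and counts below are arbitrary illustrative values): each epoch draws fresh random start frames, so the network sees different temporal crops of the same video over training.

```python
import random

# Sketch of temporal jittering: at every epoch, sample clips with fresh
# random start frames from a video of num_frames frames.

def sample_clips(num_frames, clip_len, clips_per_video, rng):
    max_start = num_frames - clip_len          # last valid start frame
    return [rng.randint(0, max_start) for _ in range(clips_per_video)]

rng = random.Random(0)
starts_epoch1 = sample_clips(300, 16, 4, rng)
starts_epoch2 = sample_clips(300, 16, 4, rng)  # re-sampled next epoch
```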
Fawkes is an image cloaking system that helps individuals inoculate their images against unauthorized facial recognition models. Fawkes achieves this by helping users add imperceptible pixel-level changes ("cloaks") to their own photos before releasing them. When used to train facial recognition models, these "cloaked" images produce functional models that consistently cause normal images of the user to be misidentified.
Feature Pyramid Grid
Feature Pyramid Grids, or FPG, is a deep multi-pathway feature pyramid that represents the feature scale-space as a regular grid of parallel bottom-up pathways which are fused by multi-directional lateral connections. It connects the backbone features of a ConvNet with a regular structure of parallel top-down pyramid pathways which are fused by multi-directional lateral connections: AcrossSame, AcrossUp, AcrossDown, and AcrossSkip. AcrossSkip connections are direct connections, while all other types use convolutional and ReLU layers. On a high level, FPG is a deep generalization of FPN from one to multiple pathways under a dense lateral connectivity structure.
A LAPGAN, or Laplacian Generative Adversarial Network, is a type of generative adversarial network that has a Laplacian pyramid representation. In the sampling procedure following training, we have a set of generative convnet models {G_0, …, G_K}, each of which captures the distribution of coefficients h_k for natural images at a different level of the Laplacian pyramid. Sampling an image is akin to a reconstruction procedure, except that the generative models are used to produce the h_k's via the recurrence: I_k = u(I_{k+1}) + h_k = u(I_{k+1}) + G_k(z_k, u(I_{k+1})). The recurrence starts by setting I_{K+1} = 0 and using the model at the final level to generate a residual image from a noise vector z_K: h_K = G_K(z_K). Models at all levels except the final are conditional generative models that take an upsampled version of the current image, u(I_{k+1}), as a conditioning variable, in addition to the noise vector z_k. The generative models {G_0, …, G_K} are trained using the CGAN approach at each level of the pyramid. Specifically, we construct a Laplacian pyramid from each training image I. At each level we make a stochastic choice (with equal probability) to either (i) construct the coefficients h_k using the standard Laplacian pyramid coefficient generation procedure, or (ii) generate them using G_k: h_k = G_k(z_k, l_k), where l_k = u(I_{k+1}) is the upsampled low-pass image. The discriminator D_k takes the real or generated coefficients, along with l_k as a conditioning input, and predicts whether they are real or generated; at the final level, h_K = G_K(z_K) and D_K takes h_K or the generated coefficients as its only input. Breaking the generation into successive refinements is the key idea. We give up any "global" notion of fidelity; an attempt is never made to train a network to discriminate between the output of a cascade and a real image, and instead the focus is on making each step plausible.
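The pyramid recurrence LAPGAN builds on can be checked with toy 1D signals. Here d(.) halves resolution by pair averaging and u(.) doubles it by duplication (simplified stand-ins for the image operators); the band-pass coefficients are I minus the upsampled low-pass, and coarse-to-fine reconstruction recovers the input exactly. In LAPGAN, a generator G_k would produce these coefficients instead of computing them from a real image.

```python
# Sketch of the Laplacian pyramid: build band-pass coefficients h_k and a
# final low-pass signal, then reconstruct coarse-to-fine via u(.) + h_k.

def d(sig):   # downsample: average adjacent pairs
    return [(sig[i] + sig[i + 1]) / 2 for i in range(0, len(sig), 2)]

def u(sig):   # upsample: duplicate each sample
    return [v for x in sig for v in (x, x)]

def build_pyramid(sig, levels):
    coeffs, cur = [], sig
    for _ in range(levels):
        nxt = d(cur)
        coeffs.append([a - b for a, b in zip(cur, u(nxt))])   # h_k = I_k - u(I_{k+1})
        cur = nxt
    return coeffs, cur            # band-pass coefficients + final low-pass signal

def reconstruct(coeffs, low):
    cur = low
    for h in reversed(coeffs):
        cur = [a + b for a, b in zip(u(cur), h)]              # I_k = u(I_{k+1}) + h_k
    return cur

sig = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
coeffs, low = build_pyramid(sig, levels=2)
recon = reconstruct(coeffs, low)
```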
SpineNet is a convolutional neural network backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
Height-driven Attention Network
Height-driven Attention Network, or HANet, is a general add-on module for improving semantic segmentation of urban-scene images. It emphasizes informative features or classes selectively according to the vertical position of a pixel. In urban-scene images, the pixel-wise class distributions differ significantly among horizontally segmented sections; urban-scene images thus have their own distinct characteristics, yet most semantic segmentation networks do not reflect such unique attributes in their architecture. The proposed network architecture incorporates the capability of exploiting these attributes to handle urban-scene datasets effectively.
PULSE is a self-supervised photo upsampling algorithm. Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the downscaling loss, which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, the authors aim to restrict the search space to guarantee realistic outputs.
Voxel RoI Pooling is an RoI feature extractor that extracts RoI features directly from voxel features for further refinement. It starts by dividing a region proposal into regular sub-voxels; the center point is taken as the grid point of the corresponding sub-voxel. Since the feature volumes are extremely sparse (non-empty voxels account for only a small fraction of the space), max pooling over the features of each sub-voxel cannot be applied directly. Instead, features from neighboring voxels are integrated into the grid points for feature extraction. Specifically, given a grid point g_i, voxel query is first exploited to group a set of neighboring voxels {v_1, …, v_K}. Then the neighboring voxel features are aggregated with a PointNet module: η_i = max_{k=1,…,K} Φ(v_k − g_i; φ_k), where v_k − g_i represents the relative coordinates, φ_k is the voxel feature of v_k, and Φ indicates an MLP. The max pooling operation is performed along the channels to obtain the aggregated feature vector η_i. In particular, Voxel RoI Pooling is exploited to extract voxel features from the 3D feature volumes of the last two stages in the backbone network, and for each stage, two Manhattan distance thresholds are set to group voxels at multiple scales. The aggregated features pooled from different stages and scales are then concatenated to obtain the RoI features.
Channel-wise Cross Attention is a module for semantic segmentation used in the UCTransNet architecture. It is used to fuse features of inconsistent semantics between the Channel Transformer and the U-Net decoder: it guides the channel and information filtration of the Transformer features and eliminates the ambiguity with the decoder features. Mathematically, we take the i-th level Transformer output O_i and the i-th level decoder feature map D_i as the inputs of Channel-wise Cross Attention. Spatial squeeze is performed by a global average pooling (GAP) layer, producing a vector G(X) whose k-th channel is the spatial average of the k-th channel of X. This operation embeds the global spatial information; the attention mask is then generated from the two squeezed vectors using two Linear layers and the ReLU operator, which encodes the channel-wise dependencies. Following ECA-Net, which empirically showed that avoiding dimensionality reduction is important for learning channel attention, the authors use a single Linear layer and a sigmoid function to build the channel attention map. The resultant vector is used to recalibrate, or excite, O_i, where the activation indicates the importance of each channel. Finally, the masked O_i is concatenated with the up-sampled features of the i-th level decoder.
Matrix Non-Maximum Suppression
Matrix NMS, or Matrix Non-Maximum Suppression, performs non-maximum suppression with parallel matrix operations in one shot. It is motivated by Soft-NMS, which decays the other detection scores as a monotonic decreasing function of their overlaps: by decaying the scores according to IoUs recursively, higher-IoU detections are eliminated with a minimum score threshold. However, such a process is sequential, like traditional Greedy NMS, and cannot be implemented in parallel. Matrix NMS views this process from another perspective by considering how a predicted mask m_j is suppressed. For m_j, its decay factor is affected by: (a) the penalty of each prediction m_i with s_i > s_j on m_j, where s_i and s_j are the confidence scores; and (b) the probability of m_i being suppressed. For (a), the penalty of each prediction m_i on m_j can be easily computed from f(iou_{i,j}). For (b), the probability of m_i being suppressed is not so elegant to compute; however, it usually has a positive correlation with the IoUs, so it is directly approximated by the most overlapped prediction on m_i: f(iou_{·,i}) = min over s_k > s_i of f(iou_{k,i}). To this end, the final decay factor becomes decay_j = min over s_i > s_j of f(iou_{i,j}) / f(iou_{·,i}), and the updated score is computed by s_j = s_j · decay_j. The authors consider the two most simple decremented functions: linear, f(iou_{i,j}) = 1 − iou_{i,j}, and Gaussian, f(iou_{i,j}) = exp(−iou_{i,j}² / σ).
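A small sketch with the linear decrement function makes the decay rule concrete (a loop-based reference; the real method evaluates the same quantities as matrix operations in one shot). Predictions are assumed sorted by descending score; the heavily overlapping second mask is decayed while the distinct third mask keeps most of its score.

```python
# Reference sketch of Matrix NMS decay factors with the linear decrement
# f(iou) = 1 - iou, for predictions sorted by descending score.

def matrix_nms_decay(ious):
    """ious[i][j]: IoU between predictions i and j (score order: descending)."""
    f = lambda iou: 1.0 - iou
    n = len(ious)
    decay = [1.0]                  # highest-scored prediction is never decayed
    for j in range(1, n):
        candidates = []
        for i in range(j):
            # suppression of i approximated via its most-overlapping
            # higher-scored prediction: f(iou_.i) = min_k f(iou_ki)
            comp = min((f(ious[k][i]) for k in range(i)), default=1.0)
            candidates.append(f(ious[i][j]) / comp)
        decay.append(min(candidates))
    return decay

ious = [[1.0, 0.8, 0.1],
        [0.8, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
scores = [0.9, 0.8, 0.7]
decay = matrix_nms_decay(ious)
new_scores = [s * dcy for s, dcy in zip(scores, decay)]
```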
MobileViTv2 is a vision transformer tuned for mobile devices. It introduces a separable self-attention method to reduce cost compared with MobileViT.
A Multiscale Dilated Convolution Block is an Inception-style convolutional block motivated by the ideas that image features naturally occur at multiple scales, that a network’s expressivity is proportional to the range of functions it can represent divided by its total number of parameters, and by the desire to efficiently expand a network’s receptive field. The Multiscale Dilated Convolution (MDC) block applies a single filter at multiple dilation factors, then performs a weighted elementwise sum of each dilated filter’s output, allowing the network to simultaneously learn a set of features and the relevant scales at which those features occur with a minimal increase in parameters. This also rapidly expands the network’s receptive field without requiring an increase in depth or the number of parameters.
Minibatch Discrimination is a discriminative technique for generative adversarial networks where we discriminate between whole minibatches of samples rather than between individual samples. This is intended to avoid collapse of the generator.
Pathology Language and Image Pre-Training
Pathology Language and Image Pre-Training (PLIP) is a vision-and-language foundation model created by fine-tuning CLIP on pathology images.
Co-Scale Conv-attentional Image Transformer
Co-Scale Conv-Attentional Image Transformer (CoaT) is a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities.
Scattering Transform
A wavelet scattering transform computes a translation-invariant representation, stable to deformations, using a deep convolutional network architecture. It computes non-linear invariants with modulus and averaging pooling functions, thereby eliminating image variability due to translation while remaining stable to deformations. Image source: Bruna and Mallat.