Conditional Relation Network
Conditional Relation Network, or CRN, is a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning.
EfficientNetV2 is a type of convolutional neural network that has faster training speed and better parameter efficiency than previous models. To develop these models, the authors use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from a search space enriched with new ops such as Fused-MBConv. Architecturally, the main differences are:
- EfficientNetV2 extensively uses both MBConv and the newly added Fused-MBConv in the early layers.
- EfficientNetV2 prefers a smaller expansion ratio for MBConv, since smaller expansion ratios tend to have less memory access overhead.
- EfficientNetV2 prefers smaller 3x3 kernel sizes, but adds more layers to compensate for the reduced receptive field resulting from the smaller kernel size.
- EfficientNetV2 completely removes the last stride-1 stage in the original EfficientNet, perhaps due to its large parameter size and memory access overhead.
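To make the Fused-MBConv op concrete, here is a minimal PyTorch sketch of a Fused-MBConv block: the expansion 1x1 conv and depthwise 3x3 conv of MBConv are fused into a single regular 3x3 conv. This is a sketch assuming SiLU activations and omitting squeeze-and-excitation and stochastic depth; layer sizes are illustrative, not a searched configuration.

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Sketch of Fused-MBConv: one regular 3x3 conv expands the channels
    (replacing MBConv's 1x1 expansion + depthwise 3x3), then a 1x1 conv
    projects back down. SE and stochastic depth are omitted."""
    def __init__(self, in_ch, out_ch, expansion=4, stride=1):
        super().__init__()
        mid = in_ch * expansion
        self.fused = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.SiLU(),
        )
        self.project = nn.Sequential(
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.project(self.fused(x))
        return x + out if self.use_residual else out

x = torch.randn(1, 24, 56, 56)
print(FusedMBConv(24, 24)(x).shape)  # torch.Size([1, 24, 56, 56])
```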
Learning Cross-Modality Encoder Representations from Transformers
LXMERT is a model for learning vision-and-language cross-modality representations. It is a Transformer model consisting of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: an image and its related sentence. Each image is represented as a sequence of objects, and each sentence is represented as a sequence of words. By combining the self-attention and cross-attention layers, the model is able to generate language representations, image representations, and cross-modality representations from the input. The model is pre-trained on image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction (both feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships.
EfficientDet is a type of object detection model, which utilizes several optimization and backbone tweaks, such as the use of a BiFPN, and a compound scaling method that uniformly scales the resolution, depth, and width for all backbones, feature networks, and box/class prediction networks at the same time.
Circular Smooth Label
Circular Smooth Label (CSL) is a classification-based rotation detection technique for arbitrary-oriented object detection. It casts angle prediction as circularly distributed angle classification, which addresses the periodicity of the angle and increases the error tolerance to adjacent angles.
Fast R-CNN is an object detection model that improves on its predecessor R-CNN in a number of ways. Instead of extracting CNN features independently for each region of interest, Fast R-CNN aggregates them in a single forward pass over the image; i.e., regions of interest from the same image share computation and memory in the forward and backward passes.
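A sketch of this shared-computation idea using torchvision's RoI pooling operator: one backbone pass produces a feature map, and every region is pooled from that same map. The feature map, stride, and box coordinates below are hypothetical.

```python
import torch
from torchvision.ops import roi_pool

# One forward pass computes a shared feature map for the whole image;
# every region of interest is then pooled from that same map.
features = torch.randn(1, 256, 50, 50)              # hypothetical CNN output, stride 16
rois = torch.tensor([[0.,  48.,  48., 320., 320.],  # (batch_idx, x1, y1, x2, y2)
                     [0., 160.,  96., 640., 480.]]) # in input-image coordinates
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- one fixed-size feature per RoI
```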
NesT stacks canonical transformer layers to conduct local self-attention on every image block independently, and then "nests" them hierarchically. Coupling of processed information between spatially adjacent blocks is achieved through a proposed block aggregation between every two hierarchies. The overall hierarchical structure can be determined by two key hyper-parameters: patch size and number of block hierarchies. All blocks inside each hierarchy share one set of parameters. Given an input image, it is linearly projected to an embedding; all embeddings are partitioned into blocks and flattened to generate the final input. Each transformer layer is composed of a multi-head self-attention (MSA) layer followed by a feed-forward fully-connected network (FFN) with skip connections and layer normalization. Positional embeddings are added to encode spatial information before feeding into the block. Lastly, a nested hierarchy with block aggregation is built: every four spatially connected blocks are merged into one.
OSCAR is a new learning method that uses object tags detected in images as anchor points to ease the learning of image-text alignment. The model takes a word-tag-region triple as input and is pre-trained with two losses (a masked token loss over words and tags, and a contrastive loss between tags and others). OSCAR represents an image-text pair in a semantic space via dictionary lookup. Object tags are used as anchor points to align image regions with the word embeddings of pre-trained language models. The model is then fine-tuned for understanding and generation tasks.
Deformable DETR is an object detection method that aims to mitigate the slow convergence and high complexity issues of DETR. It combines the best of the sparse spatial sampling of deformable convolution and the relation modeling capability of Transformers. Specifically, it introduces a deformable attention module, which attends to a small set of sampling locations as a pre-filter for prominent key elements out of all the feature map pixels. The module can be naturally extended to aggregating multi-scale features without the help of FPN.
InfoGAN is a type of generative adversarial network that modifies the GAN objective to encourage it to learn interpretable and meaningful representations. This is done by maximizing the mutual information between a fixed small subset of the GAN's noise variables and the observations. Formally, InfoGAN is defined as a minimax game with a variational regularization of mutual information and a hyperparameter $\lambda$:

$$\min_{G, Q}\max_{D} V_{\text{InfoGAN}}(D, G, Q) = V(D, G) - \lambda L_I(G, Q)$$

where $Q(c \mid x)$ is an auxiliary distribution that approximates the posterior $P(c \mid x)$ - the probability of the latent code $c$ given the data $x$ - and $L_I(G, Q)$ is the variational lower bound of the mutual information between the latent code and the observations. In the practical implementation, $Q$ is realized as one additional fully-connected layer on top of the discriminator (negligible computation on top of regular GAN structures). $Q$ is represented with a softmax non-linearity for a categorical latent code. For a continuous latent code, the authors assume a factored Gaussian.
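A minimal sketch of the categorical case: the auxiliary head $Q$ and the mutual-information term, assuming a discriminator that exposes its penultimate features. All names and sizes here are illustrative, not the paper's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QHead(nn.Module):
    """Hypothetical Q head: one extra fully-connected layer on top of the
    discriminator's penultimate features, as described above."""
    def __init__(self, feat_dim=128, n_categories=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_categories)  # softmax over categorical code

    def forward(self, disc_features):
        return self.fc(disc_features)  # logits for Q(c|x)

q_head = QHead()
features = torch.randn(16, 128)       # discriminator features for G(z, c)
c_true = torch.randint(0, 10, (16,))  # sampled categorical latent codes
# Up to a constant H(c), the variational MI lower bound reduces to the
# cross-entropy between Q(c|x) and the code that generated each sample.
mi_loss = F.cross_entropy(q_head(features), c_true)
print(mi_loss.item())
```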
R-CNN, or Regions with CNN Features, is an object detection model that applies high-capacity CNNs to bottom-up region proposals in order to localize and segment objects. It uses selective search to identify a number of bounding-box object region candidates (“regions of interest”), and then extracts features from each region independently for classification.
SRGAN Residual Block is a residual block used in the SRGAN generator for image super-resolution. It is similar to standard residual blocks, although it uses a PReLU activation function to help training (preventing sparse gradients during GAN training).
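A minimal PyTorch sketch of the block as described: conv-BN-PReLU-conv-BN plus an identity skip. The channel count is illustrative.

```python
import torch
import torch.nn as nn

class SRGANResidualBlock(nn.Module):
    """Sketch of the SRGAN generator residual block: two 3x3 convs with
    batch norm, a PReLU after the first conv, and an identity skip."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

x = torch.randn(1, 64, 24, 24)
print(SRGANResidualBlock()(x).shape)  # torch.Size([1, 64, 24, 24])
```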
Cascade R-CNN is an object detection architecture that seeks to address the problem of degrading detection performance at increased IoU thresholds (due to overfitting during training, and an inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses). It is a multi-stage extension of the R-CNN, where detector stages deeper into the cascade are sequentially more selective against close false positives. The cascade of R-CNN stages is trained sequentially, using the output of one stage to train the next. This is motivated by the observation that the output IoU of a regressor is almost invariably better than the input IoU. Cascade R-CNN does not aim to mine hard negatives. Instead, by adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage. When operating in this manner, a sequence of detectors adapted to increasingly higher IoUs can beat the overfitting problem and thus be effectively trained. At inference, the same cascade procedure is applied: the progressively improved hypotheses are better matched to the increasing detector quality at each stage.
HiFi-GAN is a generative adversarial network for speech synthesis. HiFi-GAN consists of one generator and two discriminators: multi-scale and multi-period discriminators. The generator and discriminators are trained adversarially, along with two additional losses for improving training stability and model performance. The generator is a fully convolutional neural network. It uses a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms. Every transposed convolution is followed by a multi-receptive field fusion (MRF) module. For the discriminator, a multi-period discriminator (MPD) is used consisting of several sub-discriminators each handling a portion of periodic signals of input audio. Additionally, to capture consecutive patterns and long-term dependencies, the multi-scale discriminator (MSD) proposed in MelGAN is used, which consecutively evaluates audio samples at different levels.
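To make the period idea concrete, here is a sketch (not the paper's code) of how a waveform can be reshaped so that a 2D sub-discriminator with period $p$ only relates samples that are $p$ steps apart; the sample rate and period list are taken from the description above.

```python
import torch
import torch.nn.functional as F

def reshape_for_period(wav, period):
    """Pad a (batch, 1, time) waveform to a multiple of `period`, then
    reshape it to (batch, 1, time // period, period) so 2D convolutions
    only mix samples that are `period` steps apart."""
    b, c, t = wav.shape
    if t % period:
        wav = F.pad(wav, (0, period - t % period), mode="reflect")
        t = wav.shape[-1]
    return wav.view(b, c, t // period, period)

wav = torch.randn(1, 1, 22050)  # one second of 22.05 kHz audio
for p in (2, 3, 5, 7, 11):      # one MPD sub-discriminator per period
    print(p, reshape_for_period(wav, p).shape)
```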
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
Region-based Fully Convolutional Network
Region-based Fully Convolutional Networks, or R-FCNs, are a type of region-based object detector. In contrast to previous region-based object detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, R-FCN is fully convolutional with almost all computation shared on the entire image. To achieve this, R-FCN utilises position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.
Vision-and-Language BERT
Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
A Ghost Module is an image block for convolutional neural networks that aims to generate more features by using fewer parameters. Specifically, an ordinary convolutional layer in a deep neural network is split into two parts. The first part involves ordinary convolutions, but their total number is controlled. Given the intrinsic feature maps from the first part, a series of simple linear operations is applied to generate more feature maps. Given the redundancy that widely exists in the intermediate feature maps calculated by mainstream CNNs, Ghost modules aim to reduce it.

In practice, given the input data $X \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input data, respectively, the operation of an arbitrary convolutional layer for producing $n$ feature maps can be formulated as

$$Y = X * f + b$$

where $*$ is the convolution operation, $b$ is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map with $n$ channels, and $f \in \mathbb{R}^{c \times k \times k \times n}$ denotes the convolution filters in this layer. In addition, $h'$ and $w'$ are the height and width of the output data, and $k \times k$ is the kernel size of the convolution filters $f$. During this convolution procedure, the required number of FLOPs can be calculated as $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$, which is often as large as hundreds of thousands since the number of filters $n$ and the channel number $c$ are generally very large (e.g. 256 or 512). Here, the number of parameters (in $f$ and $b$) to be optimized is explicitly determined by the dimensions of the input and output feature maps.

The output feature maps of convolutional layers often contain much redundancy, and some of them could be similar to each other. It is unnecessary to generate these redundant feature maps one by one with a large number of FLOPs and parameters. Suppose instead that the output feature maps are "ghosts" of a handful of intrinsic feature maps obtained with some cheap transformations. These intrinsic feature maps are often of smaller size and produced by ordinary convolution filters. Specifically, $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h' \times w' \times m}$ are generated using a primary convolution:

$$Y' = X * f'$$

where $f' \in \mathbb{R}^{c \times k \times k \times m}$ ($m \leq n$) denotes the utilized filters, and the bias term is omitted for simplicity. The hyper-parameters such as filter size, stride, and padding are the same as those in the ordinary convolution to keep the spatial size (i.e. $h'$ and $w'$) of the output feature maps consistent. To further obtain the desired $n$ feature maps, a series of cheap linear operations is applied on each intrinsic feature in $Y'$ to generate $s$ ghost features according to the following function:

$$y_{ij} = \Phi_{i,j}(y'_i), \quad \forall i = 1, \dots, m, \; j = 1, \dots, s$$

where $y'_i$ is the $i$-th intrinsic feature map in $Y'$ and $\Phi_{i,j}$ is the $j$-th (except the last one) linear operation for generating the $j$-th ghost feature map $y_{ij}$; that is to say, $y'_i$ can have one or more ghost feature maps $\{y_{ij}\}_{j=1}^{s}$. The last $\Phi_{i,s}$ is the identity mapping for preserving the intrinsic feature maps. We can thus obtain $n = m \cdot s$ feature maps $Y = [y_{11}, y_{12}, \dots, y_{ms}]$ as the output data of a Ghost module. Note that the linear operations $\Phi$ operate on each channel, so their computational cost is much less than that of ordinary convolution. In practice, there could be several different linear operations in a Ghost module, e.g. $3 \times 3$ and $5 \times 5$ linear kernels.
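A minimal PyTorch sketch of a Ghost module, with a depthwise convolution standing in for the cheap linear operations $\Phi$; the ratio and kernel sizes are illustrative.

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost module: a primary convolution produces m intrinsic
    feature maps, a cheap depthwise convolution generates the remaining
    "ghost" maps, and the two sets are concatenated."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        init_ch = math.ceil(out_ch / ratio)   # m intrinsic maps
        cheap_ch = init_ch * (ratio - 1)      # ghost maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(           # depthwise conv = cheap linear op
            nn.Conv2d(init_ch, cheap_ch, cheap_kernel,
                      padding=cheap_kernel // 2, groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))
        self.out_ch = out_ch

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)[:, :self.out_ch]

print(GhostModule(16, 32)(torch.randn(1, 16, 32, 32)).shape)  # (1, 32, 32, 32)
```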
Beta-VAE is a type of variational autoencoder that seeks to discover disentangled latent factors. It modifies VAEs with an adjustable hyperparameter $\beta$ that balances latent channel capacity and independence constraints with reconstruction accuracy. The idea is to maximize the probability of generating the real data while keeping the distance between the real and estimated distributions small, under a threshold $\epsilon$:

$$\max_{\phi, \theta} \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)]\right] \quad \text{subject to} \quad D_{KL}\left(q_{\phi}(z \mid x) \,\|\, p(z)\right) < \epsilon$$

We can use the Karush-Kuhn-Tucker conditions to write this as a single equation:

$$\mathcal{F}(\theta, \phi, \beta; x, z) = \mathbb{E}_{q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)] - \beta \left(D_{KL}\left(q_{\phi}(z \mid x) \,\|\, p(z)\right) - \epsilon\right)$$

where the KKT multiplier $\beta$ is the regularization coefficient that constrains the capacity of the latent channel $z$ and puts implicit independence pressure on the learnt posterior due to the isotropic nature of the Gaussian prior $p(z)$. Writing this again using the complementary slackness assumption gives the Beta-VAE formulation:

$$\mathcal{F}(\theta, \phi, \beta; x, z) \geq \mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)] - \beta D_{KL}\left(q_{\phi}(z \mid x) \,\|\, p(z)\right)$$
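A sketch of the resulting training objective for a diagonal-Gaussian posterior and a Bernoulli likelihood; tensor shapes and the value of $\beta$ are illustrative.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Beta-VAE objective sketch: reconstruction term plus beta-weighted KL.
    beta > 1 strengthens the independence pressure on the posterior."""
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")  # -E[log p(x|z)]
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

x = torch.rand(8, 784)        # e.g. flattened binary images
x_recon = torch.rand(8, 784)  # decoder output in (0, 1)
mu, logvar = torch.randn(8, 10), torch.randn(8, 10)
print(beta_vae_loss(x, x_recon, mu, logvar).item())
```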
Capsule Network
A Capsule Network is a type of artificial neural network that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization.
A Selective Kernel Convolution is a convolution that enables neurons to adaptively adjust their receptive field sizes among multiple kernels with different kernel sizes. Specifically, the SK convolution has three operators: Split, Fuse, and Select. Multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer.
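A minimal two-branch SK convolution sketch following the Split-Fuse-Select structure described above, with a dilated 3x3 standing in for the 5x5 kernel; the reduction ratio and branch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SKConv(nn.Module):
    """Sketch of Selective Kernel convolution with two branches."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.branches = nn.ModuleList([                              # Split
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),             # 3x3
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2, bias=False), # "5x5"
        ])
        d = max(ch // reduction, 8)
        self.fc_z = nn.Linear(ch, d)        # Fuse: compact feature z
        self.fc_attn = nn.Linear(d, ch * 2) # Select: per-branch channel logits

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B,2,C,H,W)
        u = feats.sum(dim=1)                # fuse branches by summation
        s = u.mean(dim=(2, 3))              # global average pooling
        z = torch.relu(self.fc_z(s))
        attn = self.fc_attn(z).view(-1, 2, feats.shape[2])
        attn = torch.softmax(attn, dim=1)   # softmax across branches
        return (feats * attn[..., None, None]).sum(dim=1)

print(SKConv(32)(torch.randn(1, 32, 28, 28)).shape)  # (1, 32, 28, 28)
```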
Guided Language to Image Diffusion for Generation and Editing
GLIDE is a generative model based on text-guided diffusion models for more photorealistic image generation. Guided diffusion is applied to text-conditional image synthesis and the model is able to handle free-form prompts. The diffusion model uses a text encoder to condition on natural language descriptions. The model is provided with editing capabilities in addition to zero-shot generation, allowing for iterative improvement of model samples to match more complex prompts. The model is fine-tuned to perform image inpainting.
Stacked Hourglass Networks are a type of convolutional neural network for pose estimation. They are based on successive steps of pooling and upsampling that produce a final set of predictions.
Reduction-A is an image model block used in the Inception-v4 architecture.
A Masked Convolution is a type of convolution which masks certain pixels so that the model can only predict based on pixels already seen. This type of convolution was introduced with PixelRNN generative models, where an image is generated pixel by pixel, to ensure that the model is conditioned only on pixels already visited.
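A sketch of a PixelCNN-style masked convolution, following the common convention that mask type "A" (first layer) also hides the center pixel while type "B" (later layers) allows it:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Sketch of a masked convolution: weights right of and below the
    center of the kernel are zeroed so each output depends only on
    already-generated pixels (raster-scan order)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        mask = torch.ones_like(self.weight)
        _, _, kh, kw = self.weight.shape
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0  # right of center
        mask[:, :, kh // 2 + 1:] = 0                            # rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # zero out "future" pixels
        return super().forward(x)

conv = MaskedConv2d("A", 1, 16, kernel_size=7, padding=3)
print(conv(torch.randn(1, 1, 28, 28)).shape)  # (1, 16, 28, 28)
```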
Regularized Autoencoders
This method introduces several regularization schemes that can be applied to an autoencoder. To make the model generative, ex-post density estimation is proposed: a mixture of Gaussians is fitted to the training data embeddings after the model is trained.
Hierarchical Variational Autoencoder
Adaptive Pseudo Augmentation
Res2Net is an image model that employs a variation on bottleneck residual blocks. The motivation is to be able to represent features at multiple scales. This is achieved through a novel building block for CNNs that constructs hierarchical residual-like connections within one single residual block. This represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
Dense Prediction Transformer
Dense Prediction Transformers (DPT) are a type of vision transformer for dense prediction tasks. The input image is transformed into tokens either by extracting non-overlapping patches followed by a linear projection of their flattened representation (DPT-Base and DPT-Large) or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding, and a patch-independent readout token is added. The tokens are passed through multiple transformer stages. Tokens from different stages are then reassembled into image-like representations at multiple resolutions, and fusion modules progressively fuse and upsample these representations to generate a fine-grained prediction.
A Res2Net Block is an image model block that constructs hierarchical residual-like connections within one single residual block. It was proposed as part of the Res2Net CNN architecture. The block represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The $3 \times 3$ filters of $n$ channels are replaced with a set of $s$ smaller filter groups, each with $w$ channels (so that $n = s \times w$). These smaller filter groups are connected in a hierarchical residual-like style to increase the number of scales that the output features can represent. Specifically, the input feature maps are divided into several groups. A group of filters first extracts features from a group of input feature maps. Output features of the previous group are then sent to the next group of filters along with another group of input feature maps. This process repeats several times until all input feature maps are processed. Finally, feature maps from all groups are concatenated and sent to another group of $1 \times 1$ filters to fuse information altogether. Along any possible path in which input features are transformed to output features, the equivalent receptive field increases whenever it passes a $3 \times 3$ filter, resulting in many equivalent feature scales due to combination effects. One way of thinking of these blocks is that they expose a new dimension, scale, alongside the existing dimensions of depth, width, and cardinality.
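A sketch of the multi-scale split at the heart of the block for scale $s = 4$; the surrounding $1 \times 1$ convolutions and widths are simplifications of the full block.

```python
import torch
import torch.nn as nn

class Res2NetSplit(nn.Module):
    """Sketch of the multi-scale 3x3 stage inside a Res2Net block: the
    input is split into s groups; each group after the first is summed
    with the previous group's output before its own 3x3 convolution."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        w = channels // scale
        self.convs = nn.ModuleList(
            [nn.Conv2d(w, w, 3, padding=1, bias=False) for _ in range(scale - 1)])

    def forward(self, x):
        xs = torch.chunk(x, self.scale, dim=1)
        out = [xs[0]]                  # first group passes through untouched
        y = None
        for i, conv in enumerate(self.convs):
            y = xs[i + 1] if y is None else xs[i + 1] + y  # hierarchical add
            y = conv(y)
            out.append(y)
        return torch.cat(out, dim=1)   # concatenated multi-scale features

print(Res2NetSplit()(torch.randn(1, 64, 14, 14)).shape)  # (1, 64, 14, 14)
```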
VisualBERT aims to reuse self-attention to implicitly align elements of the input text and regions in the input image. Visual embeddings are used to model images, where each input element corresponds to a bounding region in the image obtained from an object detector. These visual embeddings are constructed by summing three embeddings: 1) a visual feature representation of the bounding region, 2) a segment embedding indicating that it is an image embedding (as opposed to a text embedding), and 3) a position embedding. Essentially, image regions and language are combined with a Transformer to allow self-attention to discover implicit alignments between language and vision. VisualBERT is trained on COCO, which consists of images paired with captions. It is pre-trained using two objectives: a masked language modeling objective and a sentence-image prediction task. It can then be fine-tuned on different downstream tasks.
ENet Bottleneck is an image model block used in the ENet semantic segmentation architecture. Each block consists of three convolutional layers: a 1 × 1 projection that reduces the dimensionality, a main convolutional layer, and a 1 × 1 expansion. We place Batch Normalization and PReLU between all convolutions. If the bottleneck is downsampling, a max pooling layer is added to the main branch. Also, the first 1 × 1 projection is replaced with a 2 × 2 convolution with stride 2 in both dimensions. We zero pad the activations, to match the number of feature maps.
ENet is a semantic segmentation architecture which utilises a compact encoder-decoder architecture. Some design choices include:
1. Using the SegNet approach to downsampling, by saving indices of elements chosen in max pooling layers and using them to produce sparse upsampled maps in the decoder.
2. Early downsampling to optimize the early stages of the network and reduce the cost of processing large input frames. The first two blocks of ENet heavily reduce the input size and use only a small set of feature maps.
3. Using PReLUs as the activation function.
4. Using dilated convolutions.
5. Using Spatial Dropout.
ENet Dilated Bottleneck is an image model block used in the ENet semantic segmentation architecture. It is the same as a regular ENet Bottleneck but employs dilated convolutions instead.
RegNetY is a convolutional network design space with simple, regular models with parameters: depth $d$, initial width $w_0 > 0$, and slope $w_a > 0$, and generates a different block width $u_j$ for each block $j < d$. The key restriction for the RegNet types of model is that there is a linear parameterisation of block widths (the design space only contains models with this linear structure):

$$u_j = w_0 + w_a \cdot j$$

For RegNetX we have additional restrictions: we set $b = 1$ (the bottleneck ratio), $12 \leq d \leq 28$, and $w_m \geq 2$ (the width multiplier). For RegNetY we make one change, which is to include Squeeze-and-Excitation blocks.
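A sketch of how the linear widths are quantized into per-block widths: each $u_j$ is snapped to the nearest power of the width multiplier $w_m$ and rounded to a multiple of 8. The numeric defaults below are illustrative, not a published RegNetY configuration.

```python
import numpy as np

def regnet_widths(d=13, w0=24, wa=36.44, wm=2.5):
    """Quantize the linear width parameterisation u_j = w0 + wa * j."""
    u = w0 + wa * np.arange(d)                 # linear widths u_j
    s = np.round(np.log(u / w0) / np.log(wm))  # per-block quantization exponent
    w = w0 * np.power(wm, s)                   # snap to w0 * wm^s
    return (np.round(w / 8) * 8).astype(int)   # widths divisible by 8

print(regnet_widths())  # piecewise-constant widths, one value per block
```

Because adjacent blocks snap to the same quantized width, this produces a small number of stages with constant width each, which is what makes the resulting models simple and regular.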
The ENet Initial Block is an image model block used in the ENet semantic segmentation architecture. Max Pooling is performed with non-overlapping 2 × 2 windows, and the convolution has 13 filters, which sums up to 16 feature maps after concatenation. This is heavily inspired by Inception Modules.
A Ghost BottleNeck is a skip connection block, similar to the basic residual block in ResNet in which several convolutional layers and shortcuts are integrated, but stacks Ghost Modules instead (two stacked Ghost modules). It was proposed as part of the GhostNet CNN architecture. The first Ghost module acts as an expansion layer increasing the number of channels. The ratio between the number of the output channels and that of the input is referred to as the expansion ratio. The second Ghost module reduces the number of channels to match the shortcut path. Then the shortcut is connected between the inputs and the outputs of these two Ghost modules. The batch normalization (BN) and ReLU nonlinearity are applied after each layer, except that ReLU is not used after the second Ghost module as suggested by MobileNetV2. The Ghost bottleneck described above is for stride=1. As for the case where stride=2, the shortcut path is implemented by a downsampling layer and a depthwise convolution with stride=2 is inserted between the two Ghost modules. In practice, the primary convolution in Ghost module here is pointwise convolution for its efficiency.
Context Aggregated Bi-lateral Network for Semantic Segmentation
With the increasing demand of autonomous systems, pixelwise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for potential real-time applications. In this paper, we propose Context Aggregation Network, a dual branch convolutional neural network, with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy. Building upon the existing dual branch architectures for high-speed semantic segmentation, we design a high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. We evaluate our method on two semantic segmentation datasets, namely Cityscapes dataset and UAVid dataset. For Cityscapes test set, our model achieves state-of-the-art results with mIOU of 75.9%, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. With regards to UAVid dataset, our proposed network achieves mIOU score of 63.5% with high execution speed (15 FPS).
Selective Search is a region proposal algorithm for object detection tasks. It starts by over-segmenting the image based on the intensity of the pixels using the graph-based segmentation method by Felzenszwalb and Huttenlocher. Selective Search then takes these oversegments as initial input and performs the following steps:
1. Add all bounding boxes corresponding to segmented parts to the list of region proposals.
2. Group adjacent segments based on similarity.
3. Go to step 1.
At each iteration, larger segments are formed and added to the list of region proposals. Hence we create region proposals from smaller segments to larger segments in a bottom-up approach. This is what we mean by computing “hierarchical” segmentations using Felzenszwalb and Huttenlocher’s oversegments.
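A usage sketch with the Selective Search implementation shipped in opencv-contrib-python; the image path is a placeholder.

```python
import cv2

# Selective Search via OpenCV's contrib module (opencv-contrib-python).
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
img = cv2.imread("image.jpg")      # placeholder path
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # or switchToSelectiveSearchQuality()
rects = ss.process()               # (x, y, w, h) region proposals
print(len(rects), rects[:3])
```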
ESPNet is a convolutional neural network for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power.
Animatable Reconstruction of Clothed Humans
Animatable Reconstruction of Clothed Humans is an end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features.
Cascade Mask R-CNN extends Cascade R-CNN to instance segmentation, by adding a mask head to the cascade. In the Mask R-CNN, the segmentation branch is inserted in parallel to the detection branch. However, the Cascade R-CNN has multiple detection branches. This raises the questions of 1) where to add the segmentation branch and 2) how many segmentation branches to add. The authors consider three strategies for mask prediction in the Cascade R-CNN. The first two strategies address the first question, adding a single mask prediction head at either the first or last stage of the Cascade R-CNN. Since the instances used to train the segmentation branch are the positives of the detection branch, their number varies in these two strategies. Placing the segmentation head later on the cascade leads to more examples. However, because segmentation is a pixel-wise operation, a large number of highly overlapping instances is not necessarily as helpful as for object detection, which is a patch-based operation. The third strategy addresses the second question, adding a segmentation branch to each cascade stage. This maximizes the diversity of samples used to learn the mask prediction task. At inference time, all three strategies predict the segmentation masks on the patches produced by the final object detection stage, irrespective of the cascade stage on which the segmentation mask is implemented and how many segmentation branches there are.
MobileViT is a light-weight, general-purpose vision transformer designed for mobile devices. It combines the strengths of convolutional networks and vision transformers to model both local and global information.
Non-linear Independent Component Estimation
NICE, or Non-Linear Independent Components Estimation, is a framework for modeling complex high-dimensional densities. It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. The transformation is parameterised so that computing the determinant of the Jacobian and the inverse Jacobian is trivial, yet it maintains the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood. The transformation used in NICE is the affine coupling layer without the scale term, known as the additive coupling layer:

$$y_{1} = x_{1}, \qquad y_{2} = x_{2} + m(x_{1})$$

with the trivial inverse $x_{1} = y_{1}$, $x_{2} = y_{2} - m(y_{1})$, where $m$ can be an arbitrarily complex function such as a deep neural network, and the Jacobian determinant is exactly 1.
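A minimal additive coupling layer sketch, with a small MLP standing in for $m$; the exact inverse follows directly from the equations above, and the Jacobian determinant is 1 by construction.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Sketch of a NICE additive coupling layer: one half of the input
    passes through unchanged and shifts the other half via a network m;
    the inverse is exact and the Jacobian determinant is 1."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.m = nn.Sequential(nn.Linear(half, 64), nn.ReLU(), nn.Linear(64, half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        return torch.cat([x1, x2 + self.m(x1)], dim=1)  # y1 = x1, y2 = x2 + m(x1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        return torch.cat([y1, y2 - self.m(y1)], dim=1)  # x2 = y2 - m(y1)

layer = AdditiveCoupling(4)
x = torch.randn(2, 4)
print(torch.allclose(layer.inverse(layer(x)), x))  # True: exact invertibility
```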