Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

2,776 machine learning methods and techniques


GhostNet

GhostNet is a convolutional neural network built from Ghost modules, which aim to generate more feature maps from fewer parameters, allowing for greater efficiency. GhostNet mainly consists of a stack of Ghost bottlenecks with Ghost modules as the building block. The first layer is a standard convolutional layer with 16 filters, followed by a series of Ghost bottlenecks with gradually increasing channels. These Ghost bottlenecks are grouped into stages according to the sizes of their input feature maps. All Ghost bottlenecks use stride 1, except the last one in each stage, which uses stride 2. Finally, global average pooling and a convolutional layer transform the feature maps into a 1280-dimensional feature vector for final classification. The squeeze-and-excite (SE) module is also applied to the residual layer in some Ghost bottlenecks. In contrast to MobileNetV3, GhostNet does not use the hard-swish nonlinearity due to its large latency.

Computer Vision · 22 papers

Soft-NMS

Non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes by score. The box M with the maximum score is selected, and all other detection boxes with a significant overlap with M (above a pre-defined threshold) are suppressed. This process is applied recursively to the remaining boxes. By design, if an object lies within the predefined overlap threshold of another detection, it leads to a miss. Soft-NMS solves this problem by decaying the detection scores of all other objects as a continuous function of their overlap with M; hence, no object is eliminated in this process.
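The score-decay idea can be sketched in dependency-free Python. This is an illustrative implementation (the linear and Gaussian decay variants from the paper), not the authors' code; box format and default thresholds are assumptions.

```python
import math

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def soft_nms(boxes, scores, iou_thresh=0.3, sigma=0.5, method="linear"):
    """Instead of suppressing boxes that overlap the current top box M,
    decay their scores as a continuous function of overlap with M."""
    boxes, scores = [list(b) for b in boxes], list(scores)
    kept = []
    while boxes:
        m = max(range(len(scores)), key=scores.__getitem__)
        M, s_M = boxes.pop(m), scores.pop(m)
        kept.append((M, s_M))
        for i, b in enumerate(boxes):
            ov = iou(M, b)
            if method == "linear":
                if ov > iou_thresh:
                    scores[i] *= 1.0 - ov               # linear decay
            else:
                scores[i] *= math.exp(-ov * ov / sigma)  # Gaussian decay
    return kept  # every box survives, possibly with a decayed score
```

Unlike hard NMS, the overlapping box is returned with a reduced score rather than removed outright.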

Computer Vision · 22 papers

SAG

Self-Attention Guidance

Computer Vision · 22 papers

ShuffleNet V2 Block

ShuffleNet V2 Block is an image model block used in the ShuffleNet V2 architecture, where speed is the metric optimized for (instead of indirect metrics like FLOPs). It utilizes a simple operator called channel split. At the beginning of each unit, the input of $c$ feature channels is split into two branches with $c'$ and $c - c'$ channels, respectively. Following G3, one branch remains as identity. The other branch consists of three convolutions with the same input and output channels to satisfy G1. The two $1 \times 1$ convolutions are no longer group-wise, unlike the original ShuffleNet. This is partially to follow G2, and partially because the split operation already produces two groups. After convolution, the two branches are concatenated, so the number of channels stays the same (G1). The same "channel shuffle" operation as in ShuffleNet is then used to enable information communication between the two branches. The motivation behind channel split is that alternative architectures, where pointwise group convolutions and bottleneck structures are used, lead to increased memory access cost. Additionally, network fragmentation from group convolutions reduces parallelism (less GPU-friendly), and element-wise addition operations, while low in FLOPs, have high memory access cost. Channel split is an alternative in which a large number of equally wide channels can be maintained (equal width minimizes memory access cost) without dense convolutions or too many groups.
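The channel split and the subsequent channel shuffle are pure index bookkeeping, so they can be sketched in dependency-free Python on a list of per-channel feature maps. This is an illustration of the two operators, not the paper's implementation:

```python
def channel_split(x, c_prime):
    """Split channels into an identity branch (c') and a transform
    branch (c - c'), as at the start of each ShuffleNet V2 unit."""
    return x[:c_prime], x[c_prime:]

def channel_shuffle(x, groups):
    """Interleave channels across `groups` so information can flow
    between the two concatenated branches. `x` is a list of channels."""
    n = len(x)
    assert n % groups == 0
    per = n // groups
    # Reshape to (groups, per), transpose to (per, groups), flatten.
    return [x[g * per + i] for i in range(per) for g in range(groups)]
```

After the two branches are concatenated, `channel_shuffle(concat, 2)` mixes channels originating from the identity branch with those from the convolutional branch.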

Computer Vision · 21 papers

DKL

Deep Kernel Learning

Computer Vision · 20 papers

Adaptive Feature Pooling

Adaptive Feature Pooling pools features from all levels for each proposal in object detection and fuses them for the following prediction. For each proposal, we map it to the different feature levels. Following the idea of Mask R-CNN, RoIAlign is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse the feature grids from different levels. The motivation for this technique is that in an FPN proposals are assigned to feature levels based on their size, which can be suboptimal: two proposals with small size differences may be assigned to different levels, and the importance of features may not be strongly correlated with the level to which they belong.
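The fusion step itself is simple: given equally shaped RoI-aligned grids from several levels, combine them element-wise. A minimal sketch on plain nested lists (the real operation runs on tensors inside the proposal subnetwork):

```python
def fuse_levels(grids, mode="max"):
    """Fuse RoI-aligned feature grids pooled from several FPN levels.
    `grids` is a list of equally shaped 2D grids (lists of lists);
    `mode` selects element-wise max or element-wise sum fusion."""
    rows, cols = len(grids[0]), len(grids[0][0])
    op = max if mode == "max" else sum
    return [[op(g[r][c] for g in grids) for c in range(cols)]
            for r in range(rows)]
```

With max fusion, each output cell keeps the strongest response across levels, so useful information from any level can reach the following prediction head.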

Computer Vision · 20 papers

AugMix

AugMix mixes augmented images through linear interpolation. It is therefore similar to Mixup, but it mixes augmented versions of the same image rather than different images.
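The mixing structure can be sketched as follows. Note this is a simplification: the paper samples chain weights from a Dirichlet distribution and the final interpolation weight from a Beta distribution, and each chain composes several operations; here plain normalized random weights and single operations stand in, on an image represented as a flat list of floats.

```python
import random

def augmix(image, augment_ops, width=3, m=0.5):
    """Mix `width` augmented versions of the same image, then linearly
    interpolate the mixture with the clean image (weight m)."""
    # Convex combination weights over the augmentation chains
    # (Dirichlet-sampled in the paper; normalized randoms here).
    ws = [random.random() + 1e-6 for _ in range(width)]
    total = sum(ws)
    ws = [w / total for w in ws]
    mixed = [0.0] * len(image)
    for w in ws:
        op = random.choice(augment_ops)   # one op stands in for a chain
        aug = op(image)
        mixed = [mv + w * av for mv, av in zip(mixed, aug)]
    # Skip connection to the clean image (Beta-sampled weight in the paper).
    return [(1 - m) * iv + m * mv for iv, mv in zip(image, mixed)]
```

Because the weights form a convex combination, mixing identical augmented copies returns the original image unchanged.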

Computer Vision · 20 papers

MetaFormer

MetaFormer is a general architecture abstracted from Transformers by not specifying the token mixer.

Computer Vision · 20 papers

MFR

Meta Face Recognition

Meta Face Recognition (MFR) is a meta-learning face recognition method. MFR synthesizes the source/target domain shift with a meta-optimization objective, which requires the model to learn effective representations not only on synthesized source domains but also on synthesized target domains. Specifically, domain-shift batches are built through a domain-level sampling strategy and back-propagated gradients/meta-gradients are obtained on synthesized source/target domains by optimizing multi-domain distributions. The gradients and meta-gradients are further combined to update the model to improve generalization.

Computer Vision · 19 papers

Random Erasing

Random Erasing is a data augmentation method for training convolutional neural networks (CNNs) which randomly selects a rectangular region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is parameter-learning free, easy to implement, and can be integrated with most CNN-based recognition models. It is complementary to commonly used data augmentation techniques such as random cropping and flipping, and can be applied in various vision tasks, such as image classification, object detection, and semantic segmentation. In the Albumentations library, there is a generalization of RandomErasing called CoarseDropout, which allows masking an arbitrary number of rectangular regions and can be applied to images, segmentation masks, and key points.
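A minimal sketch of the core operation on a single-channel image stored as a 2D list, assuming hypothetical size bounds rather than the paper's area/aspect-ratio sampling scheme:

```python
import random

def random_erase(img, min_size=2, max_size=4):
    """Erase a random rectangle of a 2D image with random values,
    returning a new image and leaving the input untouched."""
    h, w = len(img), len(img[0])
    eh = random.randint(min_size, min(max_size, h))   # rectangle height
    ew = random.randint(min_size, min(max_size, w))   # rectangle width
    top = random.randint(0, h - eh)
    left = random.randint(0, w - ew)
    out = [row[:] for row in img]
    for r in range(top, top + eh):
        for c in range(left, left + ew):
            out[r][c] = random.random()               # random pixel value
    return out
```

The paper additionally samples the rectangle's area and aspect ratio within configured ranges and applies the transform with a fixed probability; those details are omitted here.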

Computer Vision · 19 papers

DPN

Dual Path Network

A Dual Path Network (DPN) is a convolutional neural network which presents a new topology of connection paths internally. The intuition is that ResNets enable feature re-usage while DenseNets enable new feature exploration, and both are important for learning good representations. To enjoy the benefits of both path topologies, Dual Path Networks share common features while maintaining the flexibility to explore new features through dual path architectures. We formulate such a dual path architecture as follows: $x^{k} = \sum_{t=1}^{k-1} f_{t}^{k}\left(h^{t}\right)$, $y^{k} = \sum_{t=1}^{k-1} v_{t}\left(h^{t}\right)$, $r^{k} = x^{k} + y^{k}$, $h^{k} = g^{k}\left(r^{k}\right)$, where $h^{t}$ denotes the extracted information at the $t$-th step from an individual path and $f_{t}^{k}(\cdot)$ is a feature learning function. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables common feature re-usage. The third equation defines the dual path that integrates them and feeds them to the last transformation function $g^{k}(\cdot)$ in the last equation.

Computer Vision · 19 papers

RealNVP

RealNVP is a generative model that utilises real-valued non-volume preserving (real NVP) transformations for density estimation. The model can perform efficient and exact inference, sampling and log-density estimation of data points.

Computer Vision · 19 papers

CAG

Class activation guide

Class activation guide is a module which uses weak localization information from the instrument activation maps to guide the verb and target recognition. Image source: Nwoye et al.

Computer Vision · 18 papers

LSGAN

LSGAN, or Least Squares GAN, is a type of generative adversarial network that adopts the least squares loss function for the discriminator. Minimizing the objective function of LSGAN yields minimizing the Pearson $\chi^{2}$ divergence. The objective functions can be defined as: $\min_{D} V(D) = \frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}}\left[\left(D(x) - b\right)^{2}\right] + \frac{1}{2}\mathbb{E}_{z \sim p_{z}}\left[\left(D(G(z)) - a\right)^{2}\right]$ and $\min_{G} V(G) = \frac{1}{2}\mathbb{E}_{z \sim p_{z}}\left[\left(D(G(z)) - c\right)^{2}\right]$, where $a$ and $b$ are the labels for fake data and real data, and $c$ denotes the value that $G$ wants $D$ to believe for fake data.
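The least-squares losses are straightforward to compute from raw discriminator outputs. A dependency-free sketch with the common label choice a=0, b=c=1 (the label values and function names here are illustrative, not from a specific library):

```python
def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Least-squares discriminator loss: push D(real) toward label b
    and D(fake) toward label a. Inputs are lists of D outputs."""
    loss_real = sum((d - b) ** 2 for d in d_real) / (2 * len(d_real))
    loss_fake = sum((d - a) ** 2 for d in d_fake) / (2 * len(d_fake))
    return loss_real + loss_fake

def lsgan_g_loss(d_fake, c=1.0):
    """Least-squares generator loss: push D(G(z)) toward the value c
    that the generator wants the discriminator to believe."""
    return sum((d - c) ** 2 for d in d_fake) / (2 * len(d_fake))
```

Because the penalty grows quadratically with distance from the target label, confidently wrong samples receive large gradients even when a sigmoid-based loss would have saturated.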

Computer Vision · 18 papers

CheXNet

CheXNet is a 121-layer DenseNet trained on ChestX-ray14 for pneumonia detection.

Computer Vision · 18 papers

PANet

Path Aggregation Network, or PANet, aims to boost information flow in a proposal-based instance segmentation framework. Specifically, the feature hierarchy is enhanced with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and topmost feature. Additionally, adaptive feature pooling is employed, which links feature grid and all feature levels to make useful information in each feature level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction.

Computer Vision · 18 papers

TimeSformer

TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://www.paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings that can be fed to a Transformer encoder, analogous to word embeddings in NLP.

Computer Vision · 18 papers

Inception-ResNet-v2 Reduction-B

Inception-ResNet-v2 Reduction-B is an image model block used in the Inception-ResNet-v2 architecture.

Computer Vision · 18 papers

(2+1)D Convolution

A (2+1)D Convolution is a type of convolution used in convolutional neural networks for action recognition on spatiotemporal volumes. As opposed to applying a full 3D convolution over the entire volume, which can be computationally expensive and lead to overfitting, a (2+1)D convolution splits the computation into two convolutions: a spatial 2D convolution followed by a temporal 1D convolution.
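The parameter accounting makes the factorization concrete. Following the R(2+1)D paper, the number of intermediate channels $M$ is chosen so the (2+1)D block has roughly the same parameter count as the full 3D convolution it replaces, while adding an extra nonlinearity between the two convolutions:

```python
def conv3d_params(n_in, n_out, t, d):
    """Full 3D convolution: one t x d x d spatiotemporal kernel
    per (input channel, output filter) pair."""
    return n_in * n_out * t * d * d

def conv2plus1d_params(n_in, n_out, t, d, m=None):
    """(2+1)D factorization: a 1 x d x d spatial conv into m channels,
    then a t x 1 x 1 temporal conv. By default m is set so the total
    parameter count approximately matches the 3D convolution."""
    if m is None:
        m = (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)
    return n_in * m * d * d + m * n_out * t
```

For example, with 64 input and output channels and a 3x3x3 kernel, both variants come out to 110,592 parameters, so the gain is the extra ReLU and easier optimization rather than a smaller model.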

Computer Vision · 17 papers

SNIP

SNIP, or Scale Normalization for Image Pyramids, is a multi-scale training scheme that selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. SNIP is a modified version of multi-scale training (MST) in which only the object instances that have a resolution close to that of the pre-training dataset, typically 224x224, are used for training the detector. In MST, each image is observed at different resolutions; therefore, at a high resolution (like 1400x2000) large objects are hard to classify, and at a low resolution (like 480x800) small objects are hard to classify. Fortunately, each object instance appears at several different scales, and some of those appearances fall in the desired scale range. In order to eliminate extreme-scale objects, either too large or too small, training is only performed on objects that fall in the desired scale range, and the remainder are simply ignored during back-propagation. Effectively, SNIP uses all the object instances during training, which helps capture all the variations in appearance and pose, while reducing the domain shift in the scale-space for the pre-trained network.

Computer Vision · 17 papers

ALBEF

ALBEF introduces a contrastive loss to align the image and text representations before fusing them through cross-modal attention. This enables more grounded vision and language representation learning. ALBEF also doesn't require bounding box annotations. The model consists of an image encoder, a text encoder, and a multimodal encoder. The image-text contrastive loss helps to align the unimodal representations of an image-text pair before fusion. The image-text matching loss and a masked language modeling loss are applied to learn multimodal interactions between image and text. In addition, momentum distillation is used to generate pseudo-targets. This improves learning with noisy data.

Computer Vision · 17 papers

CoordConv

A CoordConv layer is a simple extension to the standard convolutional layer. It has the same functional signature as a convolutional layer, but accomplishes the mapping by first concatenating extra channels to the incoming representation. These channels contain hard-coded coordinates, the most basic version of which is one channel for the $i$ coordinate and one for the $j$ coordinate. The CoordConv layer keeps the properties of few parameters and efficient computation from convolutions, but allows the network to learn to keep or to discard translation invariance as is needed for the task being learned. This is useful for coordinate transform based tasks where regular convolutions can fail.
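The coordinate channels themselves are trivial to build. A sketch in plain Python on a list-of-channels representation, using the paper's convention of rescaling coordinates to [-1, 1] (the helper name is illustrative):

```python
def add_coord_channels(x):
    """Append normalized i- and j-coordinate channels to a feature map.
    `x` is a list of channels, each an h x w 2D list; the result has
    two extra channels with values linearly spaced in [-1, 1]."""
    h, w = len(x[0]), len(x[0][0])
    i_ch = [[(2 * r / (h - 1) - 1) if h > 1 else 0.0
             for _ in range(w)] for r in range(h)]
    j_ch = [[(2 * c / (w - 1) - 1) if w > 1 else 0.0
             for c in range(w)] for _ in range(h)]
    return x + [i_ch, j_ch]
```

A regular convolution applied after this concatenation can then condition its output on absolute position, which is what lets the network break translation invariance when the task requires it.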

Computer Vision · 16 papers

Spatial Propagation

Surface Normal-based Spatial Propagation

Inspired by the spatial propagation mechanism used in the depth completion task \cite{NLSPN}, we introduce a normal-incorporated non-local disparity propagation module, in which NDP is used to generate non-local affinities and offsets for spatial propagation at the disparity level. The motivation is that the sampled pixels should be selected for edges and occluded regions. The propagation process aggregates disparities via plane affinity relations, which alleviates disparity blurring at object edges caused by fronto-parallel windows. Disparities in occluded areas are also optimized at the same time by being propagated from non-occluded areas, where the predicted disparities have high confidence.

Computer Vision · 16 papers

Inception v2

Inception v2 is the second generation of Inception convolutional neural network architectures which notably uses batch normalization. Other changes include dropping dropout and removing local response normalization, due to the benefits of batch normalization.

Computer Vision · 15 papers

Spatial Broadcast Decoder

Spatial Broadcast Decoder is an architecture that aims to improve disentangling, reconstruction accuracy, and generalization to held-out regions in data space. It provides a particularly dramatic benefit when applied to datasets with small objects. Image source: Watters et al.

Computer Vision · 14 papers

PDC

Prime Dilated Convolution

Computer Vision · 14 papers

R(2+1)D

An R(2+1)D convolutional neural network is a network for action recognition that employs (2+1)D convolutions in a ResNet-inspired architecture. The use of these convolutions over regular 3D convolutions reduces computational complexity, prevents overfitting, and introduces more non-linearities that allow a better functional relationship to be modeled.

Computer Vision · 14 papers

CPN

Contour Proposal Network

The Contour Proposal Network (CPN) detects possibly overlapping objects in an image while simultaneously fitting pixel-precise closed object contours. The CPN can incorporate state of the art object detection architectures as backbone networks into a fast single-stage instance segmentation model that can be trained end-to-end.

Computer Vision · 14 papers

ShuffleNet V2 Downsampling Block

ShuffleNet V2 Downsampling Block is a block for spatial downsampling used in the ShuffleNet V2 architecture. Unlike the regular ShuffleNet V2 block, the channel split operator is removed so the number of output channels is doubled.

Computer Vision · 14 papers

Inception-A

Inception-A is an image model block used in the Inception-v4 architecture.

Computer Vision · 13 papers

RFE

Rank Flow Embedding

Computer Vision · 13 papers

Inception-B

Inception-B is an image model block used in the Inception-v4 architecture.

Computer Vision · 13 papers

Inception-C

Inception-C is an image model block used in the Inception-v4 architecture.

Computer Vision · 13 papers

MDETR

MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. It utilizes a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. The network is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation.

Computer Vision · 13 papers

ResNeSt

ResNeSt is a variant of ResNet which instead stacks Split-Attention blocks. The cardinal group representations are then concatenated along the channel dimension: $V = \text{Concat}\{V^{1}, V^{2}, \dots, V^{K}\}$. As in standard residual blocks, the final output $Y$ of our Split-Attention block is produced using a shortcut connection: $Y = V + X$, if the input and output feature maps share the same shape. For blocks with a stride, an appropriate transformation $\mathcal{T}$ is applied to the shortcut connection to align the output shapes: $Y = V + \mathcal{T}(X)$. For example, $\mathcal{T}$ can be strided convolution or combined convolution-with-pooling.

Computer Vision · 13 papers

Reduction-B

Reduction-B is an image model block used in the Inception-v4 architecture.

Computer Vision · 13 papers

Inception-v4

Inception-v4 is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than Inception-v3.

Computer Vision · 13 papers

Random Scaling

Random Scaling is a type of image data augmentation in which we randomly change the scale of the image within a specified range. The Albumentations library has a generalization of RandomScaling called Affine. The Affine transform allows random scaling like RandomScaling, but also random rotation, translation, and shearing.

Computer Vision · 13 papers

RepVGG

RepVGG is a VGG-style convolutional architecture. It has the following advantages:
- The model has a VGG-like plain (a.k.a. feed-forward) topology without any branches, i.e., every layer takes the output of its only preceding layer as input and feeds its output into its only following layer.
- The model's body uses only 3 × 3 conv and ReLU.
- The concrete architecture (including the specific depth and layer widths) is instantiated with no automatic search, manual refinement, compound scaling, or other heavy designs.

Computer Vision · 13 papers

SNGAN

Spectrally Normalised GAN

SNGAN, or Spectrally Normalised GAN, is a type of generative adversarial network that uses spectral normalization, a type of weight normalization, to stabilise the training of the discriminator.

Computer Vision · 12 papers

MixConv

Mixed Depthwise Convolution

MixConv, or Mixed Depthwise Convolution, is a type of depthwise convolution that naturally mixes up multiple kernel sizes in a single convolution. It is based on the insight that depthwise convolution applies a single kernel size to all channels, which MixConv overcomes by combining the benefits of multiple kernel sizes. It does this by partitioning channels into groups and applying a different kernel size to each group.
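The channel-partitioning step can be sketched in plain Python. The equal-split-with-remainder scheme below mirrors the approach in the reference implementation, though the helper name and exact remainder handling here are assumptions:

```python
def split_channels(total, num_groups):
    """Partition `total` channels into `num_groups` nearly equal groups;
    each group is then convolved with its own kernel size
    (e.g. 3x3, 5x5, 7x7 for three groups)."""
    base = total // num_groups
    split = [base] * num_groups
    split[0] += total - base * num_groups  # first group absorbs remainder
    return split
```

Each group then runs an ordinary depthwise convolution with its assigned kernel size, and the outputs are concatenated back along the channel dimension.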

Computer Vision · 12 papers

CvT

Convolutional Vision Transformer

The Convolutional vision Transformer (CvT) is an architecture which incorporates convolutions into the Transformer. The CvT design introduces convolutions to two core sections of the ViT architecture. First, the Transformers are partitioned into multiple stages that form a hierarchical structure of Transformers. The beginning of each stage consists of a convolutional token embedding that performs an overlapping convolution operation with stride on a 2D-reshaped token map (i.e., reshaping flattened token sequences back to the spatial grid), followed by layer normalization. This allows the model to not only capture local information, but also progressively decrease the sequence length while simultaneously increasing the dimension of token features across stages, achieving spatial downsampling while concurrently increasing the number of feature maps, as is performed in CNNs. Second, the linear projection prior to every self-attention block in the Transformer module is replaced with a proposed convolutional projection, which employs an s × s depth-wise separable convolution operation on a 2D-reshaped token map. This allows the model to further capture local spatial context and reduce semantic ambiguity in the attention mechanism. It also permits management of computational complexity, as the stride of convolution can be used to subsample the key and value matrices to improve efficiency by 4× or more, with minimal degradation of performance.

Computer Vision · 12 papers

Spatially Separable Convolution

A Spatially Separable Convolution decomposes a convolution into two separate operations. In regular convolution, if we have a 3 x 3 kernel then we directly convolve this with the image. In spatially separable convolution, we instead divide the 3 x 3 kernel into a 3 x 1 kernel and a 1 x 3 kernel, then convolve with the 3 x 1 kernel first and the 1 x 3 kernel second. This requires 6 instead of 9 parameters, so it is more parameter-efficient than regular convolution (and fewer multiplications are required). Image Source: Kunlun Bai
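The equivalence only holds when the 3 x 3 kernel is the outer product of the two 1D kernels (a rank-1 kernel, such as the Sobel kernel). A small pure-Python check of that identity, using a naive "valid" correlation:

```python
def conv2d_valid(img, k):
    """Naive 'valid' 2D correlation of a 2D list `img` with kernel `k`."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[r + i][c + j] * k[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)] for r in range(oh)]

def outer(col, row):
    """Rank-1 kernel as the outer product of a 3x1 and a 1x3 kernel."""
    return [[cv * rv for rv in row] for cv in col]
```

Applying the 3 x 1 kernel and then the 1 x 3 kernel gives exactly the same output as one pass with the 3 x 3 outer-product kernel, while storing only 6 weights instead of 9.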

Computer Vision · 12 papers

FBNet

FBNet is a family of convolutional neural architectures discovered through DNAS (differentiable neural architecture search). It utilises a basic image model block inspired by MobileNetV2 that uses depthwise convolutions and an inverted residual structure (see components).

Computer Vision · 12 papers

Style Transfer Module

Modules used in GAN-based style transfer.

Computer Vision · 12 papers

Florence

Florence is a computer vision foundation model that aims to learn universal visual-language representations that can be adapted to various computer vision tasks, such as visual question answering, image captioning, and video retrieval. Florence's workflow consists of data curation, unified learning, Transformer architectures, and adaptation. Florence is pre-trained in an image-label-description space using unified image-text contrastive learning. It involves a two-tower architecture: a 12-layer Transformer for the language encoder and a Vision Transformer for the image encoder. Two linear projection layers are added on top of the image and language encoders to match the dimensions of image and language features. Compared to previous methods for cross-modal shared representations, Florence expands beyond simple classification and retrieval capabilities to advanced representations that support object-level, multi-modality, and video tasks.

Computer Vision · 12 papers

TNT

Transformer in Transformer

Transformer is a type of self-attention-based neural network originally applied to NLP tasks. Recently, pure transformer-based models have been proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches, ignoring the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of the patch embedding by a linear transformation layer and then added to the patch. By stacking TNT blocks, we build the TNT model for image recognition. Image source: Han et al.

Computer Vision · 12 papers

MixNet

MixNet is a type of convolutional neural network discovered via AutoML that utilises MixConvs instead of regular depthwise convolutions.

Computer Vision · 12 papers

FLAVA

FLAVA aims at building a single holistic universal model that targets all modalities at once. FLAVA is a language-vision alignment model that learns strong representations from multimodal data (image-text pairs) and unimodal data (unpaired images and text). The model consists of an image encoder transformer to capture unimodal image representations, a text encoder transformer to process unimodal text information, and a multimodal encoder transformer that takes as input the encoded unimodal image and text and integrates their representations for multimodal reasoning. During pretraining, masked image modeling (MIM) and masked language modeling (MLM) losses are applied to the image and text encoders over a single image or text piece, respectively, while contrastive, masked multimodal modeling (MMM), and image-text matching (ITM) losses are used over paired image-text data. For downstream tasks, classification heads are applied to the outputs from the image, text, and multimodal encoders respectively for visual recognition, language understanding, and multimodal reasoning tasks. It can be applied to a broad scope of tasks from three domains (visual recognition, language understanding, and multimodal reasoning) under a common transformer model architecture.

Computer Vision · 11 papers

One-Shot Aggregation

One-Shot Aggregation is an image model block that is an alternative to Dense Blocks, aggregating intermediate features only once. It is proposed as part of the VoVNet architecture. Each convolution layer has two-way connections: one connects to the subsequent layer to produce a feature with a larger receptive field, while the other is aggregated only once into the final output feature map. The difference from DenseNet is that the output of each layer is not routed to all subsequent intermediate layers, which keeps the input size of intermediate layers constant.

Computer Vision · 11 papers
Page 4 of 56