Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

2,776 machine learning methods and techniques


GhostNet

GhostNet is a convolutional neural network built from Ghost modules, which aim to generate more feature maps from fewer parameters, allowing for greater efficiency. GhostNet mainly consists of a stack of Ghost bottlenecks with Ghost modules as the building block. The first layer is a standard convolutional layer with 16 filters, followed by a series of Ghost bottlenecks with gradually increasing channels. These Ghost bottlenecks are grouped into stages according to the sizes of their input feature maps. All Ghost bottlenecks use stride 1, except the last one in each stage, which uses stride 2. Finally, global average pooling and a convolutional layer transform the feature maps into a 1280-dimensional feature vector for final classification. The squeeze-and-excite (SE) module is also applied to the residual layer in some Ghost bottlenecks. In contrast to MobileNetV3, GhostNet does not use the hard-swish nonlinearity due to its large latency.

Computer Vision · 22 papers

Soft-NMS

Non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes by score. The box M with the maximum score is selected, and all other detection boxes with a significant overlap with M (above a pre-defined threshold) are suppressed. This process is applied recursively to the remaining boxes. By design, if an object lies within the predefined overlap threshold of another detection, it leads to a miss. Soft-NMS solves this problem by decaying the detection scores of all other objects as a continuous function of their overlap with M; hence, no object is eliminated in this process.
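The score-decay idea can be sketched in dependency-free Python. This is an illustrative implementation (the linear and Gaussian decay variants from the paper), not the authors' code; box format and default thresholds are assumptions.

```python
import math

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def soft_nms(boxes, scores, iou_thresh=0.3, sigma=0.5, method="linear"):
    """Instead of suppressing boxes that overlap the current top box M,
    decay their scores as a continuous function of overlap with M."""
    boxes, scores = [list(b) for b in boxes], list(scores)
    kept = []
    while boxes:
        m = max(range(len(scores)), key=scores.__getitem__)
        M, s_M = boxes.pop(m), scores.pop(m)
        kept.append((M, s_M))
        for i, b in enumerate(boxes):
            ov = iou(M, b)
            if method == "linear":
                if ov > iou_thresh:
                    scores[i] *= 1.0 - ov               # linear decay
            else:
                scores[i] *= math.exp(-ov * ov / sigma)  # Gaussian decay
    return kept  # every box survives, possibly with a decayed score
```

Unlike hard NMS, the overlapping box is returned with a reduced score rather than removed outright.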

Computer Vision · 22 papers

SAG

Self-Attention Guidance

Computer Vision · 22 papers

ShuffleNet V2 Block

ShuffleNet V2 Block is an image model block used in the ShuffleNet V2 architecture, where speed is the metric optimized for (instead of indirect metrics like FLOPs). It utilizes a simple operator called channel split. At the beginning of each unit, the input of $c$ feature channels is split into two branches with $c'$ and $c - c'$ channels, respectively. Following G3, one branch remains as identity. The other branch consists of three convolutions with the same input and output channels to satisfy G1. The two $1 \times 1$ convolutions are no longer group-wise, unlike the original ShuffleNet. This is partially to follow G2, and partially because the split operation already produces two groups. After convolution, the two branches are concatenated, so the number of channels stays the same (G1). The same "channel shuffle" operation as in ShuffleNet is then used to enable information communication between the two branches. The motivation behind channel split is that alternative architectures, where pointwise group convolutions and bottleneck structures are used, lead to increased memory access cost. Additionally, network fragmentation from group convolutions reduces parallelism (less GPU-friendly), and element-wise addition operations, while low in FLOPs, have high memory access cost. Channel split is an alternative in which a large number of equally wide channels can be maintained (equal width minimizes memory access cost) without dense convolutions or too many groups.
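The channel split and the subsequent channel shuffle are pure index bookkeeping, so they can be sketched in dependency-free Python on a list of per-channel feature maps. This is an illustration of the two operators, not the paper's implementation:

```python
def channel_split(x, c_prime):
    """Split channels into an identity branch (c') and a transform
    branch (c - c'), as at the start of each ShuffleNet V2 unit."""
    return x[:c_prime], x[c_prime:]

def channel_shuffle(x, groups):
    """Interleave channels across `groups` so information can flow
    between the two concatenated branches. `x` is a list of channels."""
    n = len(x)
    assert n % groups == 0
    per = n // groups
    # Reshape to (groups, per), transpose to (per, groups), flatten.
    return [x[g * per + i] for i in range(per) for g in range(groups)]
```

After the two branches are concatenated, `channel_shuffle(concat, 2)` mixes channels originating from the identity branch with those from the convolutional branch.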

Computer Vision · 21 papers

DKL

Deep Kernel Learning

Computer Vision · 20 papers

Adaptive Feature Pooling

Adaptive Feature Pooling pools features from all levels for each proposal in object detection and fuses them for the following prediction. For each proposal, we map it to the different feature levels. Following the idea of Mask R-CNN, RoIAlign is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse the feature grids from different levels. The motivation for this technique is that in an FPN proposals are assigned to feature levels based on their size, which can be suboptimal: two proposals with small size differences may be assigned to different levels, and the importance of features may not be strongly correlated with the level to which they belong.
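The fusion step itself is simple: given equally shaped RoI-aligned grids from several levels, combine them element-wise. A minimal sketch on plain nested lists (the real operation runs on tensors inside the proposal subnetwork):

```python
def fuse_levels(grids, mode="max"):
    """Fuse RoI-aligned feature grids pooled from several FPN levels.
    `grids` is a list of equally shaped 2D grids (lists of lists);
    `mode` selects element-wise max or element-wise sum fusion."""
    rows, cols = len(grids[0]), len(grids[0][0])
    op = max if mode == "max" else sum
    return [[op(g[r][c] for g in grids) for c in range(cols)]
            for r in range(rows)]
```

With max fusion, each output cell keeps the strongest response across levels, so useful information from any level can reach the following prediction head.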

Computer Vision · 20 papers

AugMix

AugMix mixes augmented images through linear interpolation. It is therefore similar to Mixup, but it mixes augmented versions of the same image rather than different images.
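The mixing structure can be sketched as follows. Note this is a simplification: the paper samples chain weights from a Dirichlet distribution and the final interpolation weight from a Beta distribution, and each chain composes several operations; here plain normalized random weights and single operations stand in, on an image represented as a flat list of floats.

```python
import random

def augmix(image, augment_ops, width=3, m=0.5):
    """Mix `width` augmented versions of the same image, then linearly
    interpolate the mixture with the clean image (weight m)."""
    # Convex combination weights over the augmentation chains
    # (Dirichlet-sampled in the paper; normalized randoms here).
    ws = [random.random() + 1e-6 for _ in range(width)]
    total = sum(ws)
    ws = [w / total for w in ws]
    mixed = [0.0] * len(image)
    for w in ws:
        op = random.choice(augment_ops)   # one op stands in for a chain
        aug = op(image)
        mixed = [mv + w * av for mv, av in zip(mixed, aug)]
    # Skip connection to the clean image (Beta-sampled weight in the paper).
    return [(1 - m) * iv + m * mv for iv, mv in zip(image, mixed)]
```

Because the weights form a convex combination, mixing identical augmented copies returns the original image unchanged.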

Computer Vision · 20 papers

MetaFormer

MetaFormer is a general architecture abstracted from Transformers by not specifying the token mixer.

Computer Vision · 20 papers

MFR

Meta Face Recognition

Meta Face Recognition (MFR) is a meta-learning face recognition method. MFR synthesizes the source/target domain shift with a meta-optimization objective, which requires the model to learn effective representations not only on synthesized source domains but also on synthesized target domains. Specifically, domain-shift batches are built through a domain-level sampling strategy and back-propagated gradients/meta-gradients are obtained on synthesized source/target domains by optimizing multi-domain distributions. The gradients and meta-gradients are further combined to update the model to improve generalization.

Computer Vision · 19 papers

Random Erasing

Random Erasing is a data augmentation method for training convolutional neural networks (CNNs) which randomly selects a rectangular region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is parameter-learning free, easy to implement, and can be integrated with most CNN-based recognition models. It is complementary to commonly used data augmentation techniques such as random cropping and flipping, and can be applied in various vision tasks, such as image classification, object detection, and semantic segmentation. In the Albumentations library, there is a generalization of RandomErasing called CoarseDropout, which allows masking an arbitrary number of rectangular regions and can be applied to images, segmentation masks, and key points.
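A minimal sketch of the core operation on a single-channel image stored as a 2D list, assuming hypothetical size bounds rather than the paper's area/aspect-ratio sampling scheme:

```python
import random

def random_erase(img, min_size=2, max_size=4):
    """Erase a random rectangle of a 2D image with random values,
    returning a new image and leaving the input untouched."""
    h, w = len(img), len(img[0])
    eh = random.randint(min_size, min(max_size, h))   # rectangle height
    ew = random.randint(min_size, min(max_size, w))   # rectangle width
    top = random.randint(0, h - eh)
    left = random.randint(0, w - ew)
    out = [row[:] for row in img]
    for r in range(top, top + eh):
        for c in range(left, left + ew):
            out[r][c] = random.random()               # random pixel value
    return out
```

The paper additionally samples the rectangle's area and aspect ratio within configured ranges and applies the transform with a fixed probability; those details are omitted here.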

Computer Vision · 19 papers

DPN

Dual Path Network

A Dual Path Network (DPN) is a convolutional neural network which presents a new topology of connection paths internally. The intuition is that ResNets enable feature re-usage while DenseNets enable new feature exploration, and both are important for learning good representations. To enjoy the benefits of both path topologies, Dual Path Networks share common features while maintaining the flexibility to explore new features through dual path architectures. We formulate such a dual path architecture as follows: $x^{k} = \sum_{t=1}^{k-1} f_{t}^{k}\left(h^{t}\right)$, $y^{k} = \sum_{t=1}^{k-1} v_{t}\left(h^{t}\right)$, $r^{k} = x^{k} + y^{k}$, $h^{k} = g^{k}\left(r^{k}\right)$, where $h^{t}$ denotes the extracted information at the $t$-th step from an individual path and $f_{t}^{k}(\cdot)$ is a feature learning function. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables common feature re-usage. The third equation defines the dual path that integrates them and feeds them to the last transformation function $g^{k}(\cdot)$ in the last equation.

Computer Vision · 19 papers

RealNVP

RealNVP is a generative model that utilises real-valued non-volume preserving (real NVP) transformations for density estimation. The model can perform efficient and exact inference, sampling and log-density estimation of data points.

Computer Vision · 19 papers

CAG

Class activation guide

Class activation guide is a module which uses weak localization information from the instrument activation maps to guide the verb and target recognition. Image source: Nwoye et al.

Computer Vision · 18 papers

LSGAN

LSGAN, or Least Squares GAN, is a type of generative adversarial network that adopts the least squares loss function for the discriminator. Minimizing the objective function of LSGAN yields minimizing the Pearson $\chi^{2}$ divergence. The objective functions can be defined as: $\min_{D} V(D) = \frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}}\left[\left(D(x) - b\right)^{2}\right] + \frac{1}{2}\mathbb{E}_{z \sim p_{z}}\left[\left(D(G(z)) - a\right)^{2}\right]$ and $\min_{G} V(G) = \frac{1}{2}\mathbb{E}_{z \sim p_{z}}\left[\left(D(G(z)) - c\right)^{2}\right]$, where $a$ and $b$ are the labels for fake data and real data, and $c$ denotes the value that $G$ wants $D$ to believe for fake data.
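The least-squares losses are straightforward to compute from raw discriminator outputs. A dependency-free sketch with the common label choice a=0, b=c=1 (the label values and function names here are illustrative, not from a specific library):

```python
def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Least-squares discriminator loss: push D(real) toward label b
    and D(fake) toward label a. Inputs are lists of D outputs."""
    loss_real = sum((d - b) ** 2 for d in d_real) / (2 * len(d_real))
    loss_fake = sum((d - a) ** 2 for d in d_fake) / (2 * len(d_fake))
    return loss_real + loss_fake

def lsgan_g_loss(d_fake, c=1.0):
    """Least-squares generator loss: push D(G(z)) toward the value c
    that the generator wants the discriminator to believe."""
    return sum((d - c) ** 2 for d in d_fake) / (2 * len(d_fake))
```

Because the penalty grows quadratically with distance from the target label, confidently wrong samples receive large gradients even when a sigmoid-based loss would have saturated.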

Computer Vision · 18 papers

CheXNet

CheXNet is a 121-layer DenseNet trained on ChestX-ray14 for pneumonia detection.

Computer Vision · 18 papers

PANet

Path Aggregation Network, or PANet, aims to boost information flow in a proposal-based instance segmentation framework. Specifically, the feature hierarchy is enhanced with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and topmost feature. Additionally, adaptive feature pooling is employed, which links feature grid and all feature levels to make useful information in each feature level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction.

Computer Vision · 18 papers

TimeSformer

TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://www.paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings that can be fed to a Transformer encoder, analogous to word embeddings in NLP.

Computer Vision · 18 papers

Inception-ResNet-v2 Reduction-B

Inception-ResNet-v2 Reduction-B is an image model block used in the Inception-ResNet-v2 architecture.

Computer Vision · 18 papers

(2+1)D Convolution

A (2+1)D Convolution is a type of convolution used in convolutional neural networks for action recognition on spatiotemporal volumes. As opposed to applying a full 3D convolution over the entire volume, which can be computationally expensive and lead to overfitting, a (2+1)D convolution splits the computation into two convolutions: a spatial 2D convolution followed by a temporal 1D convolution.
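The parameter accounting makes the factorization concrete. Following the R(2+1)D paper, the number of intermediate channels $M$ is chosen so the (2+1)D block has roughly the same parameter count as the full 3D convolution it replaces, while adding an extra nonlinearity between the two convolutions:

```python
def conv3d_params(n_in, n_out, t, d):
    """Full 3D convolution: one t x d x d spatiotemporal kernel
    per (input channel, output filter) pair."""
    return n_in * n_out * t * d * d

def conv2plus1d_params(n_in, n_out, t, d, m=None):
    """(2+1)D factorization: a 1 x d x d spatial conv into m channels,
    then a t x 1 x 1 temporal conv. By default m is set so the total
    parameter count approximately matches the 3D convolution."""
    if m is None:
        m = (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)
    return n_in * m * d * d + m * n_out * t
```

For example, with 64 input and output channels and a 3x3x3 kernel, both variants come out to 110,592 parameters, so the gain is the extra ReLU and easier optimization rather than a smaller model.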

Computer Vision · 17 papers

SNIP

SNIP, or Scale Normalization for Image Pyramids, is a multi-scale training scheme that selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. SNIP is a modified version of multi-scale training (MST) in which only the object instances that have a resolution close to that of the pre-training dataset, typically 224x224, are used for training the detector. In MST, each image is observed at different resolutions; therefore, at a high resolution (like 1400x2000) large objects are hard to classify, and at a low resolution (like 480x800) small objects are hard to classify. Fortunately, each object instance appears at several different scales, and some of those appearances fall in the desired scale range. In order to eliminate extreme-scale objects, either too large or too small, training is only performed on objects that fall in the desired scale range, and the remainder are simply ignored during back-propagation. Effectively, SNIP uses all the object instances during training, which helps capture all the variations in appearance and pose, while reducing the domain shift in the scale-space for the pre-trained network.

Computer Vision · 17 papers

ALBEF

ALBEF introduces a contrastive loss to align the image and text representations before fusing them through cross-modal attention. This enables more grounded vision and language representation learning. ALBEF also doesn't require bounding box annotations. The model consists of an image encoder, a text encoder, and a multimodal encoder. The image-text contrastive loss helps to align the unimodal representations of an image-text pair before fusion. The image-text matching loss and a masked language modeling loss are applied to learn multimodal interactions between image and text. In addition, momentum distillation is used to generate pseudo-targets. This improves learning with noisy data.

Computer Vision · 17 papers

CoordConv

A CoordConv layer is a simple extension to the standard convolutional layer. It has the same functional signature as a convolutional layer, but accomplishes the mapping by first concatenating extra channels to the incoming representation. These channels contain hard-coded coordinates, the most basic version of which is one channel for the $i$ coordinate and one for the $j$ coordinate. The CoordConv layer keeps the properties of few parameters and efficient computation from convolutions, but allows the network to learn to keep or to discard translation invariance as is needed for the task being learned. This is useful for coordinate transform based tasks where regular convolutions can fail.
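The coordinate channels themselves are trivial to build. A sketch in plain Python on a list-of-channels representation, using the paper's convention of rescaling coordinates to [-1, 1] (the helper name is illustrative):

```python
def add_coord_channels(x):
    """Append normalized i- and j-coordinate channels to a feature map.
    `x` is a list of channels, each an h x w 2D list; the result has
    two extra channels with values linearly spaced in [-1, 1]."""
    h, w = len(x[0]), len(x[0][0])
    i_ch = [[(2 * r / (h - 1) - 1) if h > 1 else 0.0
             for _ in range(w)] for r in range(h)]
    j_ch = [[(2 * c / (w - 1) - 1) if w > 1 else 0.0
             for c in range(w)] for _ in range(h)]
    return x + [i_ch, j_ch]
```

A regular convolution applied after this concatenation can then condition its output on absolute position, which is what lets the network break translation invariance when the task requires it.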

Computer Vision · 16 papers

Spatial Propagation

Surface Normal-based Spatial Propagation

Inspired by the spatial propagation mechanism used in the depth completion task \cite{NLSPN}, we introduce a normal-incorporated non-local disparity propagation module, in which NDP is used to generate non-local affinities and offsets for spatial propagation at the disparity level. The motivation is that the sampled pixels should be selected for edges and occluded regions. The propagation process aggregates disparities via plane affinity relations, which alleviates disparity blurring at object edges caused by fronto-parallel windows. Disparities in occluded areas are also optimized at the same time by being propagated from non-occluded areas, where the predicted disparities have high confidence.

Computer Vision · 16 papers

Inception v2

Inception v2 is the second generation of Inception convolutional neural network architectures which notably uses batch normalization. Other changes include dropping dropout and removing local response normalization, due to the benefits of batch normalization.

Computer Vision · 15 papers

Spatial Broadcast Decoder

Spatial Broadcast Decoder is an architecture that aims to improve disentangling, reconstruction accuracy, and generalization to held-out regions in data space. It provides a particularly dramatic benefit when applied to datasets with small objects. Image source: Watters et al.

Computer Vision · 14 papers

PDC

Prime Dilated Convolution

Computer Vision · 14 papers

R(2+1)D

An R(2+1)D convolutional neural network is a network for action recognition that employs (2+1)D convolutions in a ResNet-inspired architecture. The use of these convolutions over regular 3D convolutions reduces computational complexity, prevents overfitting, and introduces more non-linearities that allow a better functional relationship to be modeled.

Computer Vision · 14 papers

CPN

Contour Proposal Network

The Contour Proposal Network (CPN) detects possibly overlapping objects in an image while simultaneously fitting pixel-precise closed object contours. The CPN can incorporate state of the art object detection architectures as backbone networks into a fast single-stage instance segmentation model that can be trained end-to-end.

Computer Vision · 14 papers

ShuffleNet V2 Downsampling Block

ShuffleNet V2 Downsampling Block is a block for spatial downsampling used in the ShuffleNet V2 architecture. Unlike the regular ShuffleNet V2 block, the channel split operator is removed so the number of output channels is doubled.

Computer Vision · 14 papers

Inception-A

Inception-A is an image model block used in the Inception-v4 architecture.

Computer Vision · 13 papers

RFE

Rank Flow Embedding

Computer Vision · 13 papers

Inception-B

Inception-B is an image model block used in the Inception-v4 architecture.

Computer Vision · 13 papers

Inception-C

Inception-C is an image model block used in the Inception-v4 architecture.

Computer Vision · 13 papers

MDETR

MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. It utilizes a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. The network is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation.

Computer Vision · 13 papers

ResNeSt

ResNeSt is a variant of ResNet which instead stacks Split-Attention blocks. The cardinal group representations are then concatenated along the channel dimension: $V = \text{Concat}\{V^{1}, V^{2}, \dots, V^{K}\}$. As in standard residual blocks, the final output $Y$ of our Split-Attention block is produced using a shortcut connection: $Y = V + X$, if the input and output feature maps share the same shape. For blocks with a stride, an appropriate transformation $\mathcal{T}$ is applied to the shortcut connection to align the output shapes: $Y = V + \mathcal{T}(X)$. For example, $\mathcal{T}$ can be strided convolution or combined convolution-with-pooling.

Computer Vision · 13 papers

Reduction-B

Reduction-B is an image model block used in the Inception-v4 architecture.

Computer Vision · 13 papers

Inception-v4

Inception-v4 is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than Inception-v3.

Computer Vision · 13 papers

Random Scaling

Random Scaling is a type of image data augmentation in which we randomly change the scale of the image within a specified range. The Albumentations library has a generalization of RandomScaling called Affine. The Affine transform allows random scaling like RandomScaling, but also random rotation, translation, and shearing.

Computer Vision · 13 papers

RepVGG

RepVGG is a VGG-style convolutional architecture. It has the following advantages:
- The model has a VGG-like plain (a.k.a. feed-forward) topology without any branches, i.e., every layer takes the output of its only preceding layer as input and feeds its output into its only following layer.
- The model's body uses only 3 × 3 conv and ReLU.
- The concrete architecture (including the specific depth and layer widths) is instantiated with no automatic search, manual refinement, compound scaling, or other heavy designs.

Computer Vision · 13 papers

SNGAN

Spectrally Normalised GAN

SNGAN, or Spectrally Normalised GAN, is a type of generative adversarial network that uses spectral normalization, a type of weight normalization, to stabilise the training of the discriminator.

Computer Vision · 12 papers

MixConv

Mixed Depthwise Convolution

MixConv, or Mixed Depthwise Convolution, is a type of depthwise convolution that naturally mixes up multiple kernel sizes in a single convolution. It is based on the insight that depthwise convolution applies a single kernel size to all channels, which MixConv overcomes by combining the benefits of multiple kernel sizes. It does this by partitioning channels into groups and applying a different kernel size to each group.
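The channel-partitioning step can be sketched in plain Python. The equal-split-with-remainder scheme below mirrors the approach in the reference implementation, though the helper name and exact remainder handling here are assumptions:

```python
def split_channels(total, num_groups):
    """Partition `total` channels into `num_groups` nearly equal groups;
    each group is then convolved with its own kernel size
    (e.g. 3x3, 5x5, 7x7 for three groups)."""
    base = total // num_groups
    split = [base] * num_groups
    split[0] += total - base * num_groups  # first group absorbs remainder
    return split
```

Each group then runs an ordinary depthwise convolution with its assigned kernel size, and the outputs are concatenated back along the channel dimension.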

Computer Vision · 12 papers

CvT

Convolutional Vision Transformer

The Convolutional vision Transformer (CvT) is an architecture which incorporates convolutions into the Transformer. The CvT design introduces convolutions to two core sections of the ViT architecture. First, the Transformers are partitioned into multiple stages that form a hierarchical structure of Transformers. The beginning of each stage consists of a convolutional token embedding that performs an overlapping convolution operation with stride on a 2D-reshaped token map (i.e., reshaping flattened token sequences back to the spatial grid), followed by layer normalization. This allows the model to not only capture local information, but also progressively decrease the sequence length while simultaneously increasing the dimension of token features across stages, achieving spatial downsampling while concurrently increasing the number of feature maps, as is performed in CNNs. Second, the linear projection prior to every self-attention block in the Transformer module is replaced with a proposed convolutional projection, which employs an s × s depth-wise separable convolution operation on a 2D-reshaped token map. This allows the model to further capture local spatial context and reduce semantic ambiguity in the attention mechanism. It also permits management of computational complexity, as the stride of convolution can be used to subsample the key and value matrices to improve efficiency by 4× or more, with minimal degradation of performance.

Computer Vision · 12 papers

Spatially Separable Convolution

A Spatially Separable Convolution decomposes a convolution into two separate operations. In regular convolution, if we have a 3 x 3 kernel then we directly convolve this with the image. In spatially separable convolution, we instead divide the 3 x 3 kernel into a 3 x 1 kernel and a 1 x 3 kernel, then convolve with the 3 x 1 kernel first and the 1 x 3 kernel second. This requires 6 instead of 9 parameters, so it is more parameter-efficient than regular convolution (and fewer multiplications are required). Image Source: Kunlun Bai
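The equivalence only holds when the 3 x 3 kernel is the outer product of the two 1D kernels (a rank-1 kernel, such as the Sobel kernel). A small pure-Python check of that identity, using a naive "valid" correlation:

```python
def conv2d_valid(img, k):
    """Naive 'valid' 2D correlation of a 2D list `img` with kernel `k`."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[r + i][c + j] * k[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)] for r in range(oh)]

def outer(col, row):
    """Rank-1 kernel as the outer product of a 3x1 and a 1x3 kernel."""
    return [[cv * rv for rv in row] for cv in col]
```

Applying the 3 x 1 kernel and then the 1 x 3 kernel gives exactly the same output as one pass with the 3 x 3 outer-product kernel, while storing only 6 weights instead of 9.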

Computer Vision · 12 papers

FBNet

FBNet is a family of convolutional neural architectures discovered through DNAS (differentiable neural architecture search). It utilises a basic image model block inspired by MobileNetV2 that uses depthwise convolutions and an inverted residual structure (see components).

Computer Vision · 12 papers

Style Transfer Module

Modules used in GAN-based style transfer.

Computer Vision · 12 papers

Florence

Florence is a computer vision foundation model that aims to learn universal visual-language representations that can be adapted to various computer vision tasks, such as visual question answering, image captioning, and video retrieval. Florence's workflow consists of data curation, unified learning, Transformer architectures, and adaptation. Florence is pre-trained in an image-label-description space using unified image-text contrastive learning. It involves a two-tower architecture: a 12-layer Transformer for the language encoder and a Vision Transformer for the image encoder. Two linear projection layers are added on top of the image and language encoders to match the dimensions of image and language features. Compared to previous methods for cross-modal shared representations, Florence expands beyond simple classification and retrieval capabilities to advanced representations that support object-level, multi-modality, and video tasks.

Computer Vision · 12 papers

TNT

Transformer in Transformer

Transformer is a type of self-attention-based neural network originally applied to NLP tasks. Recently, pure transformer-based models have been proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches, ignoring the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of the patch embedding by a linear transformation layer and then added to the patch. By stacking TNT blocks, we build the TNT model for image recognition. Image source: Han et al.

Computer Vision · 12 papers

MixNet

MixNet is a type of convolutional neural network discovered via AutoML that utilises MixConvs instead of regular depthwise convolutions.

Computer Vision · 12 papers

FLAVA

FLAVA aims at building a single holistic universal model that targets all modalities at once. FLAVA is a language-vision alignment model that learns strong representations from multimodal data (image-text pairs) and unimodal data (unpaired images and text). The model consists of an image encoder transformer to capture unimodal image representations, a text encoder transformer to process unimodal text information, and a multimodal encoder transformer that takes as input the encoded unimodal image and text and integrates their representations for multimodal reasoning. During pretraining, masked image modeling (MIM) and masked language modeling (MLM) losses are applied to the image and text encoders over a single image or text piece, respectively, while contrastive, masked multimodal modeling (MMM), and image-text matching (ITM) losses are used over paired image-text data. For downstream tasks, classification heads are applied to the outputs from the image, text, and multimodal encoders respectively for visual recognition, language understanding, and multimodal reasoning tasks. It can be applied to a broad scope of tasks from three domains (visual recognition, language understanding, and multimodal reasoning) under a common transformer model architecture.

Computer Vision · 11 papers

One-Shot Aggregation

One-Shot Aggregation is an image model block that is an alternative to Dense Blocks, aggregating intermediate features only once. It is proposed as part of the VoVNet architecture. Each convolution layer has two-way connections: one connects to the subsequent layer to produce a feature with a larger receptive field, while the other is aggregated only once into the final output feature map. The difference from DenseNet is that the output of each layer is not routed to all subsequent intermediate layers, which keeps the input size of intermediate layers constant.

Computer Vision · 11 papers
Page 4 of 56