Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

2,776 machine learning methods and techniques

Guided Anchoring

Guided Anchoring is an anchoring scheme for object detection which leverages semantic features to guide the anchoring. The method is motivated by the observation that objects are not distributed evenly over the image, and that the scale of an object is closely related to the imagery content, its location, and the geometry of the scene. Following this intuition, the method generates sparse anchors in two steps: first identifying sub-regions that may contain objects, and then determining the shapes at different locations.

Computer Vision · Introduced 2000 · 3 papers

CrossTransformers

CrossTransformers is a Transformer-based neural network architecture which can take a small number of labeled images and an unlabeled query, find coarse spatial correspondence between the query and the labeled images, and then infer class membership by computing distances between spatially-corresponding features.

Computer Vision · Introduced 2000 · 3 papers

Two-Way Dense Layer

Two-Way Dense Layer is an image model block used in the PeleeNet architecture. Motivated by GoogLeNet, the 2-way dense layer is used to get different scales of receptive fields. One way of the layer uses a 3x3 kernel size. The other way of the layer uses two stacked 3x3 convolutions to learn visual patterns for large objects.

Computer Vision · Introduced 2000 · 3 papers

PeleeNet

PeleeNet is a convolutional neural network and object detection backbone that is a variation of DenseNet with optimizations to meet a memory and computational budget. Unlike competing networks, it does not use depthwise convolutions and instead relies on regular convolutions.

Computer Vision · Introduced 2000 · 3 papers

U2-Net

U2-Net is a two-level nested U-structure architecture that is designed for salient object detection (SOD). The architecture allows the network to go deeper and attain high resolution without significantly increasing the memory and computation cost. This is achieved by a nested U-structure: on the bottom level, a novel ReSidual U-block (RSU) module extracts intra-stage multi-scale features without degrading the feature map resolution; on the top level, there is a U-Net-like structure in which each stage is filled by an RSU block.

Computer Vision · Introduced 2000 · 3 papers

Random Grayscale

Random Grayscale is an image data augmentation that converts an image to grayscale with probability $p$.
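A minimal NumPy sketch of this augmentation (the function name and the BT.601 luminance weights are illustrative assumptions; libraries such as torchvision expose this as RandomGrayscale):

```python
import numpy as np

def random_grayscale(img, p=0.1, rng=None):
    """Convert an HxWx3 image to 3-channel grayscale with probability p."""
    rng = rng or np.random.default_rng()
    if rng.random() >= p:
        return img  # leave the image unchanged
    # ITU-R BT.601 luminance weights
    gray = img @ np.array([0.299, 0.587, 0.114])
    return np.repeat(gray[..., None], 3, axis=2)
```

The output keeps three identical channels so downstream layers expecting RGB input still work.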

Computer Vision · Introduced 2000 · 3 papers

RDNet

Computer Vision · Introduced 2000 · 3 papers

SqueezeNeXt Block

A SqueezeNeXt Block is a two-stage bottleneck module used in the SqueezeNeXt architecture to reduce the number of input channels to the 3 × 3 convolution. The 3 × 3 convolution is itself decomposed into separable convolutions to further reduce the number of parameters, followed by a 1 × 1 expansion module.

Computer Vision · Introduced 2000 · 3 papers

CrossViT

CrossViT is a type of vision transformer that uses a dual-branch architecture to extract multi-scale feature representations for image classification. The architecture combines image patches (i.e. tokens in a transformer) of different sizes to produce stronger visual features for image classification. It processes small and large patch tokens with two separate branches of different computational complexities and these tokens are fused together multiple times to complement each other. Fusion is achieved by an efficient cross-attention module, in which each transformer branch creates a non-patch token as an agent to exchange information with the other branch by attention. This allows for linear-time generation of the attention map in fusion instead of quadratic time otherwise.

Computer Vision · Introduced 2000 · 3 papers

ISPL

Implicit Subspace Prior Learning

Implicit Subspace Prior Learning, or ISPL, is a framework to approach dual-blind face restoration, with two major distinctions from previous restoration methods: 1) instead of assuming an explicit degradation function between the LQ and HQ domains, it establishes an implicit correspondence between both domains via a mutual embedding space, thus avoiding solving the pathological inverse problem directly; 2) it uses a subspace prior decomposition and fusion mechanism to dynamically handle inputs at varying degradation levels with consistently high-quality restoration results.

Computer Vision · Introduced 2000 · 3 papers

PolarMask

PolarMask is an anchor-box free and single-shot instance segmentation method. Specifically, PolarMask takes an image as input, predicts the distance from a sampled positive location (i.e. a candidate object's center) to the object's contour at each angle, and then assembles the predicted points to produce the final mask. There are several benefits to the system: (1) the polar representation unifies instance segmentation (masks) and object detection (bounding boxes) into a single framework; (2) two modules (i.e. soft polar centerness and polar IoU loss) are designed to sample high-quality center examples and optimize polar contour regression, so that the performance of PolarMask does not depend on the bounding box prediction results and training is more efficient; (3) PolarMask is fully convolutional and can be embedded into most off-the-shelf detection methods.
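The mask-assembly step above can be sketched as follows; the function name and the uniform angle sampling are an illustrative assumption (PolarMask samples rays at fixed angular intervals around the predicted center), not the authors' implementation:

```python
import math

def polar_to_contour(center, distances):
    """Assemble contour points from per-angle distances predicted at a
    candidate object center; angles are sampled uniformly over 2*pi."""
    cx, cy = center
    n = len(distances)
    pts = []
    for k, d in enumerate(distances):
        theta = 2 * math.pi * k / n           # k-th sampled ray angle
        pts.append((cx + d * math.cos(theta), cy + d * math.sin(theta)))
    return pts
```

Connecting the returned points in order yields the predicted polygonal mask.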

Computer Vision · Introduced 2000 · 3 papers

DiffAugment

Differentiable Augmentation (DiffAugment) is a set of differentiable image transformations used to augment data during GAN training. The transformations are applied to the real and generated images. It enables the gradients to be propagated through the augmentation back to the generator, regularizes the discriminator without manipulating the target distribution, and maintains the balance of training dynamics. Three choices of transformation are preferred by the authors in their experiments: Translation, CutOut, and Color.
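An illustrative NumPy sketch of the Color transformation (random brightness and contrast shifts). In practice DiffAugment is implemented with differentiable framework ops so gradients reach the generator; here NumPy only illustrates the math, and the same random policy would be applied to both real and generated batches:

```python
import numpy as np

def diffaugment_color(x, rng):
    """Apply per-sample random brightness and contrast shifts to a batch
    of images with shape (N, H, W, C); all ops are differentiable w.r.t. x."""
    # brightness: add a random per-sample offset in [-0.5, 0.5)
    x = x + (rng.random((x.shape[0], 1, 1, 1)) - 0.5)
    # contrast: rescale around each sample's mean by a factor in [0, 2)
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    x = (x - mean) * (rng.random((x.shape[0], 1, 1, 1)) * 2) + mean
    return x
```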

Computer Vision · Introduced 2000 · 3 papers

ControlVAE

ControlVAE is a variational autoencoder (VAE) framework that combines automatic control theory with the basic VAE to stabilize the KL-divergence of VAE models to a specified value. It leverages a non-linear PI controller, a variant of the proportional-integral-derivative (PID) controller, to dynamically tune the weight of the KL-divergence term in the evidence lower bound (ELBO), using the output KL-divergence as feedback. This allows for control of the KL-divergence to a desired value (set point), which is effective in avoiding posterior collapse and learning disentangled representations.
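A minimal sketch of one non-linear PI update; the sigmoidal P term and the gains kp, ki loosely follow the paper's formulation, and the specific values and clamps are illustrative assumptions:

```python
import math

def pi_controller(kl, set_point, integral, kp=0.01, ki=0.0001,
                  beta_min=0.0, beta_max=1.0):
    """One PI update for the KL weight beta, using the measured KL as
    feedback; returns the new beta and the accumulated integral term."""
    e = max(min(set_point - kl, 50.0), -50.0)  # feedback error, clamped for exp()
    p = kp / (1.0 + math.exp(e))               # non-linear (sigmoidal) P term
    integral = integral - ki * e               # accumulate the I term
    beta = min(max(p + integral, beta_min), beta_max)
    return beta, integral
```

During training, beta would multiply the KL term of the ELBO at each step, with the measured KL fed back as `kl`.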

Computer Vision · Introduced 2000 · 3 papers

ZoomNet

ZoomNet is a 2D human whole-body pose estimation technique. It aims to localize dense landmarks on the entire human body including face, hands, body, and feet. ZoomNet follows the top-down paradigm. Given a human bounding box of each person, ZoomNet first localizes the easy-to-detect body keypoints and estimates the rough position of hands and face. Then it zooms in to focus on the hand/face areas and predicts keypoints using features with higher resolution for accurate localization. Unlike previous approaches which usually assemble multiple networks, ZoomNet has a single network that is end-to-end trainable. It unifies five network heads including the human body pose estimator, hand and face detectors, and hand and face pose estimators into a single network with shared low-level features.

Computer Vision · Introduced 2000 · 3 papers

BCA-Segmentation

Segmentation of patchy areas in biomedical images based on local edge density estimation

An effective approach to the quantification of patchiness in biomedical images according to their local edge densities.

Computer Vision · Introduced 2000 · 3 papers

GridMask

GridMask is a data augmentation method that randomly removes some pixels of an input image. Unlike other methods, the region that the algorithm removes is neither a continuous region nor random pixels as in dropout. Instead, the algorithm removes a region with disconnected pixel sets, as shown in the Figure. We express the setting as $\tilde{x} = x \times M$, where $x$ represents the input image, $M$ is the binary mask that stores pixels to be removed, and $\tilde{x}$ is the result produced by the algorithm. For the binary mask $M$, if $M_{i,j} = 1$ we keep pixel $(i, j)$ in the input image; otherwise we remove it. GridMask is applied after the image normalization operation. The shape of $M$ looks like a grid, as shown in the Figure. Four numbers $(r, d, \delta_x, \delta_y)$ are used to represent a unique $M$. Every mask is formed by tiling the units. $r$ is the ratio of the shorter gray edge in a unit, $d$ is the length of one unit, and $\delta_x$ and $\delta_y$ are the distances between the first intact unit and the boundary of the image.
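A minimal NumPy sketch of the mask generation: units of length d tile the image, a square is removed in each unit, and r controls the kept fraction. The parameter names, the kept/removed convention, and the simplified offset handling are assumptions based on the standard GridMask formulation:

```python
import numpy as np

def gridmask(img, d=8, r=0.5, dx=0, dy=0):
    """Remove a square of side d*(1-r) in every d x d unit, offset by
    (dx, dy); returns img * M, where M is the binary grid mask."""
    h, w = img.shape[:2]
    l = int(d * (1 - r))                 # side of the removed square per unit
    mask = np.ones((h, w), dtype=img.dtype)
    for i in range(dy, h, d):
        for j in range(dx, w, d):
            mask[i:i + l, j:j + l] = 0   # disconnected removed regions
    return img * mask if img.ndim == 2 else img * mask[..., None]
```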

Computer Vision · Introduced 2000 · 3 papers

BezierAlign

BezierAlign is a feature sampling method for arbitrarily-shaped scene text recognition that exploits the parameterization of a compact Bezier curve bounding box. Unlike RoIAlign, the sampling grid of BezierAlign is not rectangular. Instead, each column of the arbitrarily-shaped grid is orthogonal to the Bezier curve boundary of the text. The sampling points have equidistant intervals in width and height, respectively, and are bilinearly interpolated with respect to the coordinates. Formally, given an input feature map and Bezier curve control points, all the output pixels of the rectangular output feature map of size $h_{out} \times w_{out}$ are processed concurrently. Taking a pixel $g_i$ with position $(g_{iw}, g_{ih})$ from the output feature map as an example, we calculate $t$ by $t = g_{iw} / (w_{out} - 1)$. We then calculate the point $tp$ on the upper Bezier curve boundary and the point $bp$ on the lower Bezier curve boundary at parameter $t$. Using $tp$ and $bp$, we can linearly index the sampling point $op$ by $op = bp \cdot \frac{g_{ih}}{h_{out}} + tp \cdot \left(1 - \frac{g_{ih}}{h_{out}}\right)$. With the position of $op$, we can easily apply bilinear interpolation to calculate the result. Comparisons among previous sampling methods and BezierAlign are shown in the Figure.
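The boundary points used for the linear indexing come from evaluating a cubic Bezier curve (ABCNet represents each boundary with four control points). A minimal sketch using the Bernstein basis; the function name and tuple layout are illustrative:

```python
def bezier_point(control_points, t):
    """Evaluate a cubic Bezier curve given four (x, y) control points at
    parameter t in [0, 1], via the Bernstein polynomial basis."""
    p0, p1, p2, p3 = control_points
    # Bernstein basis weights for a cubic curve
    b = [(1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t ** 2 * (1 - t), t ** 3]
    x = sum(w * p[0] for w, p in zip(b, (p0, p1, p2, p3)))
    y = sum(w * p[1] for w, p in zip(b, (p0, p1, p2, p3)))
    return x, y
```

Evaluating the upper and lower boundary curves at the same parameter gives the pair of points between which the sampling position is interpolated.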

Computer Vision · Introduced 2000 · 3 papers

Style-based Recalibration Module

A Style-based Recalibration Module (SRM) is a module for convolutional neural networks that adaptively recalibrates intermediate feature maps by exploiting their styles. SRM first extracts the style information from each channel of the feature maps by style pooling, then estimates per-channel recalibration weight via channel-independent style integration. By incorporating the relative importance of individual styles into feature maps, SRM is aimed at enhancing the representational ability of a CNN. The overall structure of SRM is illustrated in the Figure to the right. It consists of two main components: style pooling and style integration. The style pooling operator extracts style features from each channel by summarizing feature responses across spatial dimensions. It is followed by the style integration operator, which produces example-specific style weights by utilizing the style features via channel-wise operation. The style weights finally recalibrate the feature maps to either emphasize or suppress their information.

Computer Vision · Introduced 2000 · 3 papers

SimVLM

Simple Visual Language Model

SimVLM is a minimalist pretraining framework to reduce training complexity by exploiting large-scale weak supervision. It is trained end-to-end with a single prefix language modeling (PrefixLM) objective. PrefixLM enables bidirectional attention within the prefix sequence, and thus it is applicable for both decoder-only and encoder-decoder sequence-to-sequence language models.

Computer Vision · Introduced 2000 · 3 papers

CSPResNeXt Block

CSPResNeXt Block is an extended ResNeXt Block where we partition the feature map of the base layer into two parts and then merge them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.

Computer Vision · Introduced 2000 · 3 papers

Dynamic R-CNN

Dynamic R-CNN is an object detection method that adjusts the label assignment criteria (IoU threshold) and the shape of the regression loss function (parameters of Smooth L1 Loss) automatically based on the statistics of proposals during training. The motivation is that in previous two-stage object detectors, there is an inconsistency between the fixed network settings and the dynamic training procedure. For example, the fixed label assignment strategy and regression loss function cannot fit the distribution change of proposals and are thus harmful to training high-quality detectors. It consists of two components: Dynamic Label Assignment and Dynamic Smooth L1 Loss, which are designed for the classification and regression branches, respectively. For Dynamic Label Assignment, we want the model to be discriminative for high-IoU proposals, so we gradually adjust the IoU threshold for positive/negative samples based on the proposal distribution during training. Specifically, we set the threshold to the IoU of the proposal at a certain percentage, since this reflects the quality of the overall distribution. For Dynamic Smooth L1 Loss, we want to change the shape of the regression loss function to adaptively fit the distribution change of the error and ensure the contribution of high-quality samples to training. This is achieved by adjusting the $\beta$ in Smooth L1 Loss based on the error distribution of the regression targets, where $\beta$ controls the magnitude of the gradient of small errors.
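A sketch of the regression side: Smooth L1 with its beta parameter, plus a dynamic choice of beta from the observed error distribution. The rank-based selection and the names here illustrate the idea rather than the paper's exact schedule:

```python
import numpy as np

def smooth_l1(x, beta):
    """Smooth L1: quadratic for |x| < beta, linear beyond; beta sets where
    the switch happens and hence the gradient magnitude for small errors."""
    a = np.abs(x)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def dynamic_beta(errors, k=10):
    """Pick beta as the k-th smallest absolute regression error, so the
    loss shape tracks the error distribution as training progresses."""
    return float(np.sort(np.abs(errors))[k - 1])
```

As proposals improve during training, the selected errors shrink, beta decreases, and small errors contribute larger gradients.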

Computer Vision · Introduced 2000 · 3 papers

Panoptic FPN

A Panoptic FPN is an extension of an FPN that can generate both instance and semantic segmentations via FPN. The approach starts with an FPN backbone and adds a branch for performing semantic segmentation in parallel with the existing region-based branch for instance segmentation. No changes are made to the FPN backbone when adding the dense-prediction branch, making it compatible with existing instance segmentation methods. The new semantic segmentation branch achieves its goal as follows. Starting from the deepest FPN level (at 1/32 scale), we perform three upsampling stages to yield a feature map at 1/4 scale, where each upsampling stage consists of 3×3 convolution, group norm, ReLU, and 2× bilinear upsampling. This strategy is repeated for FPN scales 1/16, 1/8, and 1/4 (with progressively fewer upsampling stages). The result is a set of feature maps at the same 1/4 scale, which are then element-wise summed. A final 1×1 convolution, 4× bilinear upsampling, and softmax are used to generate the per-pixel class labels at the original image resolution. In addition to stuff classes, this branch also outputs a special ‘other’ class for all pixels belonging to objects (to avoid predicting stuff classes for such pixels).

Computer Vision · Introduced 2000 · 3 papers

Flow Alignment Module

Flow Alignment Module, or FAM, is a flow-based alignment module for scene parsing that learns Semantic Flow between feature maps of adjacent levels and broadcasts high-level features to high-resolution features effectively and efficiently. The concept of Semantic Flow is inspired by optical flow, which is widely used in video processing tasks to represent the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by relative motion. The authors postulate that the relationship between two feature maps of arbitrary resolutions from the same image can also be represented with the "motion" of every pixel from one feature map to the other. Once precise Semantic Flow is obtained, the network is able to propagate semantic features with minimal information loss. In the FAM module, the transformed high-resolution feature map is combined with the low-resolution feature map to generate the semantic flow field, which is utilized to warp the low-resolution feature map to the high-resolution feature map.

Computer Vision · Introduced 2000 · 3 papers

k-Sparse Autoencoder

k-Sparse Autoencoders are autoencoders with a linear activation function, where in the hidden layers only the $k$ highest activities are kept. This achieves exact sparsity in the hidden representation. Backpropagation only goes through the top $k$ activated units. This can be achieved with a ReLU layer with an adjustable threshold.
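The top-k selection can be sketched as follows (NumPy, forward pass only; in a real framework the backward pass would route gradients only through the kept units):

```python
import numpy as np

def k_sparse(h, k):
    """Keep only the k largest activations per sample (row); zero the rest,
    giving exactly k non-zero hidden units per example."""
    out = np.zeros_like(h)
    idx = np.argsort(h, axis=1)[:, -k:]           # indices of top-k per row
    rows = np.arange(h.shape[0])[:, None]
    out[rows, idx] = h[rows, idx]                 # copy only the kept units
    return out
```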

Computer Vision · Introduced 2000 · 3 papers

VisTR

VisTR is a Transformer-based video instance segmentation model. It views video instance segmentation as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR directly outputs the sequence of masks for each instance in the video, in order. At its core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments instances at the sequence level as a whole. VisTR frames instance segmentation and tracking in the same perspective of similarity learning, considerably simplifying the overall pipeline, and is significantly different from existing approaches.

Computer Vision · Introduced 2000 · 3 papers

MDTVSFA

Computer Vision · Introduced 2000 · 3 papers

CS-GAN

CS-GAN is a type of generative adversarial network that uses a form of deep compressed sensing, and latent optimisation, to improve the quality of generated samples.

Computer Vision · Introduced 2000 · 3 papers

XGrad-CAM

XGrad-CAM, or Axiom-based Grad-CAM, is a class-discriminative visualization method that is able to highlight the regions belonging to the objects of interest. Two axiomatic properties are introduced in its derivation: Sensitivity and Conservation. In particular, XGrad-CAM is still a linear combination of feature maps, but is able to meet the constraints of those two axioms.
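The linear combination can be sketched as follows, with each channel weighted by the normalized sum of gradient × activation (a simplified reading of the axiom-derived weighting; activations and gradients are assumed to be a channel-first stack of 2D maps):

```python
import numpy as np

def xgrad_cam(activations, gradients):
    """Combine feature maps (K, H, W) linearly, weighting each channel by
    sum(grad * act) / sum(act), then apply ReLU to keep positive evidence."""
    eps = 1e-8
    w = (gradients * activations).sum(axis=(1, 2)) \
        / (activations.sum(axis=(1, 2)) + eps)     # per-channel weights
    cam = np.tensordot(w, activations, axes=1)     # linear combination of maps
    return np.maximum(cam, 0)                      # ReLU
```

Upsampling the resulting map to the input resolution gives the class-discriminative heatmap.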

Computer Vision · Introduced 2000 · 3 papers

CSPResNeXt

CSPResNeXt is a convolutional neural network where we apply the Cross Stage Partial Network (CSPNet) approach to ResNeXt. The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.

Computer Vision · Introduced 2000 · 3 papers

3D ResNet-RS

3D ResNet-RS is an architecture and scaling strategy for 3D ResNets for video recognition. The key additions are:

- 3D ResNet-D stem: The ResNet-D stem is adapted to 3D inputs by using three consecutive 3D convolutional layers. The first convolutional layer employs a temporal kernel size of 5 while the remaining two convolutional layers employ a temporal kernel size of 1.
- 3D Squeeze-and-Excitation: Squeeze-and-Excite is adapted to spatio-temporal inputs by using a 3D global average pooling operation for the squeeze operation. An SE ratio of 0.25 is applied in each 3D bottleneck block for all experiments.
- Self-gating: A self-gating module is used in each 3D bottleneck block after the SE module.

Computer Vision · Introduced 2000 · 3 papers

VATT

Video-Audio-Text Transformer, or VATT, is a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, it takes raw signals as inputs and extracts multidimensional representations that are rich enough to benefit a variety of downstream tasks. VATT borrows the exact architecture from BERT and ViT except for the tokenization layer and linear projection, which are reserved for each modality separately. The design follows the same spirit as ViT in making minimal changes to the architecture, so that the learned model can transfer its weights to various frameworks and tasks. VATT linearly projects each modality into a feature vector and feeds it into a Transformer encoder. A semantically hierarchical common space is defined to account for the granularity of different modalities, and noise contrastive estimation is employed to train the model.

Computer Vision · Introduced 2000 · 3 papers

Active Convolution

An Active Convolution is a type of convolution which does not have a fixed shape of receptive field and can thus take more diverse forms of receptive field. Its shape can be learned through backpropagation during training. It can be seen as a generalization of convolution: it can define not only all conventional convolutions, but also convolutions with fractional pixel coordinates. First, we can freely change the shape of the convolution, which provides greater freedom to form CNN structures. Second, the shape of the convolution is learned while training, so there is no need to tune it by hand.

Computer Vision · Introduced 2000 · 3 papers

MPRNet

MPRNet is a multi-stage progressive image restoration architecture that progressively learns restoration functions for the degraded inputs, thereby breaking down the overall recovery process into more manageable steps. Specifically, the model first learns the contextualized features using encoder-decoder architectures and later combines them with a high-resolution branch that retains local information. At each stage, a per-pixel adaptive design is introduced that leverages in-situ supervised attention to reweight the local features.

Computer Vision · Introduced 2000 · 3 papers

ConViT

ConViT is a type of vision transformer that uses a gated positional self-attention module (GPSA), a form of positional self-attention which can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers, then each attention head is given the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information.

Computer Vision · Introduced 2000 · 3 papers

Adaptive NMS

Adaptive Non-Maximum Suppression is a non-maximum suppression algorithm that applies a dynamic suppression threshold to an instance according to the target density. The motivation is to find an NMS algorithm that works well for pedestrian detection in a crowd. Intuitively, a high NMS threshold keeps more crowded instances while a low NMS threshold wipes out more false positives. The adaptive-NMS thus applies a dynamic suppression strategy, where the threshold rises as instances gather and occlude each other and decays when instances appear separately. To this end, an auxiliary and learnable sub-network is designed to predict the adaptive NMS threshold for each instance.
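A greedy sketch of the algorithm, where `densities` stands in for the per-instance threshold predicted by the auxiliary sub-network; the max(base threshold, density) rule follows the paper's description, and the rest is simplified:

```python
import numpy as np

def iou(a, b):
    """IoU of box a (x1, y1, x2, y2) against an array of boxes b."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda z: (z[..., 2] - z[..., 0]) * (z[..., 3] - z[..., 1])
    return inter / (area(a) + area(b) - inter)

def adaptive_nms(boxes, scores, densities, base_thresh=0.5):
    """Greedy NMS where the suppression threshold for each kept box rises
    to its predicted crowd density when that exceeds the base threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        thresh = max(base_thresh, densities[i])   # dynamic threshold
        ious = iou(boxes[i], boxes[order[1:]])
        order = order[1:][ious <= thresh]         # drop overlapping boxes
    return keep
```

In sparse regions the density is low and the base threshold applies; in crowds the higher threshold preserves heavily overlapping true positives.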

Computer Vision · Introduced 2000 · 3 papers

Differential Diffusion

Differential Diffusion is an enhancement of image-to-image diffusion models that adds the ability to control the amount of change applied to each image fragment via a change map.

Computer Vision · Introduced 2000 · 3 papers

SqueezeNeXt

SqueezeNeXt is a type of convolutional neural network that uses the SqueezeNet architecture as a baseline, but makes a number of changes. First, a more aggressive channel reduction is used by incorporating a two-stage squeeze module. This significantly reduces the total number of parameters used with the 3×3 convolutions. Secondly, it uses separable 3×3 convolutions to further reduce the model size, and removes the additional 1×1 branch after the squeeze module. Thirdly, the network uses an element-wise addition skip connection similar to that of the ResNet architecture.

Computer Vision · Introduced 2000 · 3 papers

pGAN

Parallel GAN

Computer Vision · Introduced 2000 · 3 papers

CBNet

Composite Backbone Network

CBNet is a backbone architecture that consists of multiple identical backbones (termed Assistant Backbones and a Lead Backbone) and composite connections between neighboring backbones. From left to right, the output of each stage in an Assistant Backbone, namely its higher-level features, flows to the parallel stage of the succeeding backbone as part of its inputs through composite connections. Finally, the feature maps of the last backbone, named the Lead Backbone, are used for object detection. The features extracted by CBNet for object detection fuse the high-level and low-level features of multiple backbones, hence improving detection performance.

Computer Vision · Introduced 2000 · 3 papers

Population Based Augmentation

Population Based Augmentation, or PBA, is a data augmentation strategy that generates nonstationary augmentation policy schedules instead of a fixed augmentation policy. PBA treats the augmentation policy search problem as a special case of hyperparameter schedule learning. It leverages Population Based Training (PBT), a hyperparameter search algorithm which optimizes the parameters of a network jointly with their hyperparameters to maximize performance. The output of PBT is not an optimal hyperparameter configuration but rather a trained model and a schedule of hyperparameters. In PBA, we are only interested in the learned schedule and discard the child model result (similar to AutoAugment). This learned augmentation schedule can then be used to improve the training of different (i.e., larger and costlier to train) models on the same dataset. PBT executes as follows. To start, a fixed population of models is randomly initialized and trained in parallel. At certain intervals, an "exploit-and-explore" procedure is applied to the worse-performing population members, where the model clones the weights of a better-performing model (i.e., exploitation) and then perturbs the hyperparameters of the cloned model to search in the hyperparameter space (i.e., exploration). Because the weights of the models are cloned and never reinitialized, the total computation required is the computation to train a single model times the population size.

Computer Vision · Introduced 2000 · 3 papers

Adversarial Color Enhancement

Adversarial Color Enhancement is an approach to generating unrestricted adversarial images by optimizing a color filter via gradient descent.

Computer Vision · Introduced 2000 · 3 papers

CurricularFace

CurricularFace, or Adaptive Curriculum Learning, is a method for face recognition that embeds the idea of curriculum learning into the loss function to achieve a new training scheme. This training scheme mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages.

Computer Vision · Introduced 2000 · 3 papers

Class-MLP

Class-MLP is an alternative to average pooling, which is an adaptation of the class-attention token introduced in CaiT. In CaiT, this consists of two layers that have the same structure as the transformer, but in which only the class token is updated based on the frozen patch embeddings. In Class-MLP, the same approach is used, but after aggregating the patches with a linear layer, we replace the attention-based interaction between the class and patch embeddings by simple linear layers, still keeping the patch embeddings frozen. This increases the performance, at the expense of adding some parameters and computational cost. This pooling variant is referred to as “class-MLP”, since the purpose of these few layers is to replace average pooling.

Computer Vision · Introduced 2000 · 3 papers

ABCNet

Adaptive Bezier-Curve Network

Adaptive Bezier-Curve Network, or ABCNet, is an end-to-end framework for arbitrarily-shaped scene text spotting. It adaptively fits arbitrary-shaped text by a parameterized bezier curve. It also utilizes a feature alignment layer, BezierAlign, to calculate convolutional features of text instances in curved shapes. These features are then passed to a light-weight recognition head.

Computer Vision · Introduced 2000 · 3 papers

IoU-Net

IoU-Net is an object detection architecture that introduces localization confidence. IoU-Net learns to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective.

Computer Vision · Introduced 2000 · 3 papers

Local Mixup

Computer Vision · Introduced 2000 · 2 papers

DRPNN

Deep Residual Pansharpening Neural Network

In pan-sharpening (fusing multi-spectral and panchromatic images), deep neural networks have been employed to overcome the drawbacks of traditional linear models and boost fusion accuracy, but earlier work was mainly based on simple, flat networks with relatively shallow architectures, which limited performance. DRPNN introduces residual learning to form a very deep convolutional neural network that makes full use of the high non-linearity of deep learning models. In both quantitative and visual assessments on a large number of high-quality multi-spectral images from various sources, the proposed model outperformed the mainstream algorithms included in the comparison.

Computer Vision · Introduced 2000 · 2 papers

StereoLayers

Computer Vision · Introduced 2000 · 2 papers

SegSort

Segment Sorting

Computer Vision · Introduced 2000 · 2 papers

CKConv

Continuous Kernel Convolution

Computer Vision · Introduced 2000 · 2 papers
Page 8 of 56