Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

2,776 machine learning methods and techniques

Categories: All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

Convolution

A convolution is a type of matrix operation, consisting of a kernel - a small matrix of weights - that slides over the input data, performing element-wise multiplication with the part of the input it is currently on and then summing the results into an output. Intuitively, a convolution allows for weight sharing, reducing the number of effective parameters, and for translation equivariance, allowing the same feature to be detected in different parts of the input space. Image Source: https://arxiv.org/pdf/1603.07285.pdf
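As a rough illustration of the sliding-window operation described above, here is a minimal NumPy sketch of a single-channel, stride-1, unpadded convolution (technically a cross-correlation, as most deep learning libraries implement it); the function and variable names are illustrative only.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`, multiply element-wise and sum (stride 1, no padding)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Example: a 3x3 vertical-edge kernel applied to a random 5x5 "image"
image = np.random.rand(5, 5)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(conv2d(image, kernel).shape)  # (3, 3)
```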

Computer Vision · Introduced 1980 · 19588 papers

Max Pooling

Max Pooling is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs. Image Source: here
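A minimal NumPy sketch of the patch-wise maximum described above, assuming non-overlapping 2x2 windows with stride 2 (common defaults); names are illustrative.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Take the maximum over each `size` x `size` patch of the feature map."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fmap))  # [[ 5.  7.] [13. 15.]]
```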

Computer Vision · Introduced 2000 · 7126 papers

1x1 Convolution

A 1 x 1 Convolution is a convolution with some special properties in that it can be used for dimensionality reduction, efficient low dimensional embeddings, and applying non-linearity after convolutions. It maps an input pixel with all its channels to an output pixel which can be squeezed to a desired output depth. It can be viewed as an MLP looking at a particular pixel location. Image Credit: http://deeplearning.ai

Computer Vision · Introduced 2000 · 5641 papers

ALIGN

In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pulls the embeddings of matched image-text pairs together and pushes those of non-matched pairs apart. The resulting representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN enables zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and even search with joint image+text queries.

Computer Vision · Introduced 2000 · 5527 papers

Global Average Pooling

Global Average Pooling is a pooling operation designed to replace fully connected layers in classical CNNs. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of global average pooling over fully connected layers is that it is more native to the convolution structure, enforcing correspondences between feature maps and categories; the feature maps can thus be easily interpreted as category confidence maps. Another advantage is that there are no parameters to optimize in global average pooling, so overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, making it more robust to spatial translations of the input.
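A minimal NumPy sketch of the operation, assuming feature maps in (channels, height, width) layout with one map per class: each map is averaged over its spatial dimensions and the resulting vector goes straight into a softmax.

```python
import numpy as np

# feature maps from the last conv layer: one map per class (e.g. 10 classes)
feature_maps = np.random.rand(10, 7, 7)      # (channels, height, width)

# global average pooling: average each map over its spatial dimensions
pooled = feature_maps.mean(axis=(1, 2))      # shape (10,)

# the pooled vector is fed directly into a softmax
probs = np.exp(pooled) / np.exp(pooled).sum()
print(probs.shape)  # (10,)
```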

Computer Vision · Introduced 2000 · 4076 papers

CLIP

Contrastive Language-Image Pre-training

Contrastive Language-Image Pre-training (CLIP), consisting of a simplified version of ConVIRT trained from scratch, is an efficient method of image representation learning from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples: it learns a multi-modal embedding space by maximizing the cosine similarity of the image and text embeddings of the real pairs in the batch while minimizing the cosine similarity of the embeddings of the incorrect pairings. A symmetric cross-entropy loss is optimized over these similarity scores. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes. Image credit: Learning Transferable Visual Models From Natural Language Supervision

Computer Vision · Introduced 2000 · 3094 papers

Residual Block

Residual Blocks are skip-connection blocks that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. They were introduced as part of the ResNet architecture. Formally, denoting the desired underlying mapping as $\mathcal{H}(x)$, we let the stacked nonlinear layers fit another mapping $\mathcal{F}(x) = \mathcal{H}(x) - x$. The original mapping is recast into $\mathcal{F}(x) + x$. The $\mathcal{F}(x)$ acts like a residual, hence the name 'residual block'. The intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. Having skip connections allows the network to more easily learn identity-like mappings. Note that in practice, Bottleneck Residual Blocks are used for deeper ResNets, such as ResNet-50 and ResNet-101, as these bottleneck blocks are less computationally intensive.
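A minimal PyTorch-style sketch of a basic two-layer residual block under common assumptions (3x3 convolutions, batch norm, an identity shortcut with matching shapes); this is an illustration, not the exact ResNet code.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Computes F(x) + x, where F is conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: add the input back

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```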

Computer Vision · Introduced 2000 · 2807 papers

Vision Transformer

The Vision Transformer, or ViT, is a model for image classification that employs a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of which is then linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. In order to perform classification, the standard approach of adding an extra learnable "classification token" to the sequence is used.

Computer Vision · Introduced 2000 · 2145 papers

Bottleneck Residual Block

A Bottleneck Residual Block is a variant of the residual block that utilises 1x1 convolutions to create a bottleneck. The use of a bottleneck reduces the number of parameters and matrix multiplications. The idea is to make residual blocks as thin as possible in order to increase depth while having fewer parameters. They were introduced as part of the ResNet architecture, and are used as part of deeper ResNets such as ResNet-50 and ResNet-101.

Computer Vision · Introduced 2000 · 2049 papers

PCA

Principal Components Analysis

Principal Components Analysis (PCA) is an unsupervised method primarily used for dimensionality reduction within machine learning. PCA is calculated via a singular value decomposition (SVD) of the design matrix, or alternatively, by calculating the covariance matrix of the data and performing eigenvalue decomposition on the covariance matrix. The results of PCA provide a low-dimensional picture of the structure of the data and the leading (uncorrelated) latent factors determining variation in the data. Image Source: Wikipedia
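A minimal NumPy sketch of PCA via SVD of the centered design matrix; the function and variable names are illustrative.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)                 # center each feature
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                  # top right singular vectors
    explained_variance = (S ** 2) / (len(X) - 1)    # eigenvalues of the covariance matrix
    return X_centered @ components.T, explained_variance[:n_components]

X = np.random.rand(100, 5)
scores, var = pca(X, n_components=2)
print(scores.shape, var.shape)  # (100, 2) (2,)
```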

Computer Vision · Introduced 2000 · 1323 papers

Depthwise Convolution

Depthwise Convolution is a type of convolution where we apply a single convolutional filter to each input channel. In a regular 2D convolution performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. In contrast, depthwise convolutions keep each channel separate. To summarize the steps, we: 1. split the input and filter into channels; 2. convolve each input channel with its respective filter; 3. stack the convolved outputs together. Image Credit: Chi-Feng Wang

Computer Vision · Introduced 2016 · 1321 papers

Pointwise Convolution

Pointwise Convolution is a type of convolution that uses a 1x1 kernel: a kernel that iterates through every single point. This kernel has a depth of however many channels the input image has. It can be used in conjunction with depthwise convolutions to produce an efficient class of convolutions known as depthwise-separable convolutions. Image Credit: Chi-Feng Wang

Computer Vision · Introduced 2016 · 1306 papers

Depthwise Separable Convolution

While a standard convolution performs the channel-wise and spatial-wise computation in one step, Depthwise Separable Convolution splits the computation into two steps: a depthwise convolution applies a single convolutional filter to each input channel, and a pointwise convolution is then used to create a linear combination of the outputs of the depthwise convolution. The comparison of standard convolution and depthwise separable convolution is shown to the right. Credit: Depthwise Convolution Is All You Need for Learning Multiple Visual Domains
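A minimal PyTorch sketch of this two-step factorization, assuming a square kernel and "same" padding: a depthwise convolution (groups equal to the number of input channels) followed by a 1x1 pointwise convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # depthwise: one filter per input channel (groups=in_channels)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # pointwise: 1x1 convolution mixes channels into the desired output depth
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```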

Computer Vision · Introduced 2000 · 1174 papers

RPN

Region Proposal Network

A Region Proposal Network, or RPN, is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals. RPN and algorithms like Fast R-CNN can be merged into a single network by sharing their convolutional features - using the recently popular terminology of neural networks with attention mechanisms, the RPN component tells the unified network where to look. RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. RPNs use anchor boxes that serve as references at multiple scales and aspect ratios. The scheme can be thought of as a pyramid of regression references, which avoids enumerating images or filters of multiple scales or aspect ratios.

Computer Vision · Introduced 2000 · 1045 papers

SAM

Segment Anything Model

Computer Vision · Introduced 2000 · 905 papers

GPS

Greedy Policy Search

Greedy Policy Search (GPS) is a simple algorithm that learns a policy for test-time data augmentation based on the predictive performance on a validation set. GPS starts with an empty policy and builds it in an iterative fashion. Each step selects a sub-policy that provides the largest improvement in calibrated log-likelihood of ensemble predictions and adds it to the current policy.

Computer Vision · Introduced 2000 · 707 papers

Mixup

Mixup is a data augmentation technique that generates a weighted combination of random image pairs from the training data. Given two images and their ground-truth labels $(x_i, y_i)$ and $(x_j, y_j)$, a synthetic training example $(\hat{x}, \hat{y})$ is generated as $\hat{x} = \lambda x_i + (1 - \lambda) x_j$ and $\hat{y} = \lambda y_i + (1 - \lambda) y_j$, where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ is independently sampled for each augmented example.
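A minimal NumPy sketch of the formulas above, assuming one-hot labels and a Beta(alpha, alpha) mixing coefficient; names are illustrative.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Return a convex combination of two examples and their one-hot labels."""
    lam = np.random.beta(alpha, alpha)          # mixing coefficient, sampled per example
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

img_a, img_b = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
lbl_a, lbl_b = np.eye(10)[3], np.eye(10)[7]     # one-hot labels
x_mix, y_mix = mixup(img_a, lbl_a, img_b, lbl_b)
```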

Computer Vision · Introduced 2000 · 651 papers

RoIAlign

Region of Interest Align, or RoIAlign, is an operation for extracting a small feature map from each RoI in detection and segmentation based tasks. It removes the harsh quantization of RoI Pool, properly aligning the extracted features with the input. To avoid any quantization of the RoI boundaries or bins (using $x/16$ instead of $[x/16]$), RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the result is then aggregated (using max or average).

Computer Vision · Introduced 2000 · 611 papers

Grouped Convolution

A Grouped Convolution uses a group of convolutions - multiple kernels per layer - resulting in multiple channel outputs per layer. This leads to wider networks, helping a network learn a varied set of low-level and high-level features. The original motivation for using Grouped Convolutions in AlexNet was to distribute the model over multiple GPUs as an engineering compromise. But later, with models such as ResNeXt, it was shown this module could be used to improve classification accuracy: grouped convolutions expose a new dimension, cardinality (the size of the set of transformations), and increasing cardinality was shown to improve accuracy.

Computer Vision · Introduced 2000 · 575 papers

Squeeze-and-Excitation Block

The Squeeze-and-Excitation Block is an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. The process is: (1) the block takes a convolutional block as input; (2) each channel is "squeezed" into a single numeric value using average pooling; (3) a dense layer followed by a ReLU adds non-linearity, and the number of output channels is reduced by a ratio; (4) another dense layer followed by a sigmoid gives each channel a smooth gating value; (5) finally, each feature map of the convolutional block is weighted by the output of this side network - the "excitation".
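A minimal PyTorch sketch of the squeeze-and-excite steps above (global average pooling, a bottleneck dense layer with ReLU, a second dense layer with sigmoid gating, channel-wise reweighting); the reduction ratio of 16 is just a common choice.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # squeeze: one value per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # bottleneck dense layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                 # per-channel gate in [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                      # recalibrate each feature map

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```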

Computer Vision · Introduced 2000 · 543 papers

Faster R-CNN

Faster R-CNN is an object detection model that improves on Fast R-CNN by utilising a region proposal network (RPN) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. RPN and Fast R-CNN are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look. As a whole, Faster R-CNN consists of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.

Computer Vision · Introduced 2000 · 499 papers

Mask R-CNN

Mask R-CNN extends Faster R-CNN to solve instance segmentation tasks. It achieves this by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. In principle, Mask R-CNN is an intuitive extension of Faster R-CNN, but constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, Mask R-CNN utilises a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Secondly, Mask R-CNN decouples mask and class prediction: it predicts a binary mask for each class independently, without competition among classes, and relies on the network's RoI classification branch to predict the category. In contrast, an FCN usually performs per-pixel multi-class categorization, which couples segmentation and classification.

Computer Vision · Introduced 2000 · 420 papers

Non Maximum Suppression

Non Maximum Suppression is a computer vision method that selects a single entity out of many overlapping entities (for example, bounding boxes in object detection). The criterion is usually to discard entities that are below a given probability bound. From the remaining entities we repeatedly pick the entity with the highest probability, output it as a prediction, and discard any remaining box whose overlap with the box output in the previous step exceeds a given threshold (e.g., $\mathrm{IoU} \geq 0.5$). Image Credit: Martin Kersner
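A minimal NumPy sketch of the greedy procedure described above, assuming boxes given as (x1, y1, x2, y2) corners and a fixed IoU threshold; names are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou < iou_threshold]   # drop boxes that overlap too much
    return keep
```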

Computer Vision · Introduced 2000 · 389 papers

Spatial Pyramid Pooling

Spatial Pyramid Pooling (SPP) is a pooling layer that removes the fixed-size constraint of the network, i.e. a CNN does not require a fixed-size input image. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words, we perform some information aggregation at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning.

Computer Vision · Introduced 2000 · 285 papers

FCN

Fully Convolutional Network

Fully Convolutional Networks, or FCNs, are an architecture used mainly for semantic segmentation. They employ solely locally connected layers, such as convolution, pooling and upsampling. Avoiding the use of dense layers means fewer parameters (making the networks faster to train). It also means an FCN can work with variable image sizes, since all connections are local. The network consists of a downsampling path, used to extract and interpret the context, and an upsampling path, which allows for localization. FCNs also employ skip connections to recover the fine-grained spatial information lost in the downsampling path.

Computer Vision · Introduced 2000 · 285 papers

SSD

SSD is a single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. Improvements over competing single-stage methods include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.

Computer Vision · Introduced 2000 · 278 papers

Random Gaussian Blur

Random Gaussian Blur is an image data augmentation technique where we randomly blur the image using a Gaussian distribution. Image Source: Wikipedia

Computer Vision · Introduced 2000 · 260 papers

YOLOv3

YOLOv3 is a real-time, single-stage object detection model that builds on YOLOv2 with several improvements. Improvements include the use of a new backbone network, Darknet-53 that utilises residual connections, or in the words of the author, "those newfangled residual network stuff", as well as some improvements to the bounding box prediction step, and use of three different scales from which to extract features (similar to an FPN).

Computer Vision · Introduced 2000 · 258 papers

YOLOv8

You Only Look Once

Computer Vision · Introduced 2000 · 254 papers

TS

Spatio-temporal stability analysis

Spatio-temporal feature extraction that measures stability. The proposed method is based on a compression algorithm named Run Length Encoding. The workflow of the method is presented below.

Computer Vision · Introduced 2000 · 242 papers

RetinaNet

RetinaNet is a one-stage object detection model that utilizes a focal loss function to address class imbalance during training. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that the authors propose specifically for one-stage, dense detection. We can see the motivation for focal loss by comparing with two-stage object detectors. Here class imbalance is addressed by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search, EdgeBoxes, DeepMask, RPN) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio or online hard example mining (OHEM), are performed to maintain a manageable balance between foreground and background. In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. To tackle this, RetinaNet uses a focal loss function, a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Formally, the Focal Loss adds a factor $(1 - p_t)^{\gamma}$ to the standard cross entropy criterion: $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$. Setting $\gamma > 0$ reduces the relative loss for well-classified examples ($p_t > 0.5$), putting more focus on hard, misclassified examples. Here $\gamma \geq 0$ is a tunable focusing parameter.
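A rough NumPy sketch of the focal loss term above, assuming binary classification with predicted probability p and focusing parameter gamma (the alpha class-balancing weight from the paper is omitted for brevity).

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), with p_t = p if y == 1 else 1 - p."""
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t)

# an easy, well-classified example contributes far less than a hard one
print(focal_loss(np.array([0.9]), np.array([1])))   # ~0.001
print(focal_loss(np.array([0.1]), np.array([1])))   # ~1.87
```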

Computer Vision · Introduced 2000 · 210 papers

DINO

self-DIstillation with NO labels

DINO (self-distillation with no labels) is a self-supervised learning method that directly predicts the output of a teacher network - built with a momentum encoder - using a standard cross-entropy loss. In the example to the right, DINO is illustrated in the case of one single pair of views for simplicity. The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters. The output of the teacher network is centered with a mean computed over the batch. Each network outputs a $K$-dimensional feature that is normalized with a temperature softmax over the feature dimension. Their similarity is then measured with a cross-entropy loss. A stop-gradient (sg) operator is applied to the teacher to propagate gradients only through the student. The teacher parameters are updated with an exponential moving average (EMA) of the student parameters.

Computer Vision · Introduced 2000 · 208 papers

CutMix

CutMix is an image data augmentation strategy. Instead of simply removing pixels as in Cutout, we replace the removed regions with a patch from another image. The ground truth labels are also mixed proportionally to the number of pixels of combined images. The added patches further enhance localization ability by requiring the model to identify the object from a partial view.
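A minimal NumPy sketch of the idea, assuming images in (height, width, channels) layout and one-hot labels; the Beta-sampled area ratio and bounding-box construction follow the usual formulation, and all names are illustrative.

```python
import numpy as np

def cutmix(img_a, lab_a, img_b, lab_b, alpha=1.0):
    """Paste a random patch of img_b into img_a and mix the labels by patch area."""
    h, w = img_a.shape[:2]
    lam = np.random.beta(alpha, alpha)
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)      # random patch centre
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)                # weight labels by the actual pasted area
    return mixed, lam * lab_a + (1 - lam) * lab_b

img_a, img_b = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
x_mix, y_mix = cutmix(img_a, np.eye(10)[3], img_b, np.eye(10)[7])
```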

Computer Vision · Introduced 2000 · 208 papers

3D Convolution

A 3D Convolution is a type of convolution where the kernel slides in 3 dimensions, as opposed to the 2 dimensions of 2D convolutions. One example use case is medical imaging, where a model is constructed using 3D image slices. Video data is another fit, since it has a temporal dimension in addition to the spatial ones. Image: Lung nodule detection based on 3D convolutional neural networks, Fan et al

Computer Vision · Introduced 2015 · 208 papers

Inception Module

An Inception Module is an image model block that aims to approximate an optimal local sparse structure in a CNN. Put simply, it allows for us to use multiple types of filter size, instead of being restricted to a single filter size, in a single image block, which we then concatenate and pass onto the next layer.

Computer Vision · Introduced 2000 · 206 papers

VQ-VAE

VQ-VAE is a type of variational autoencoder that uses vector quantisation to obtain a discrete latent representation. It differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, ideas from vector quantisation (VQ) are incorporated. Using the VQ method allows the model to circumvent issues of posterior collapse - where the latents are ignored when they are paired with a powerful autoregressive decoder - typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes.

Computer Vision · Introduced 2000 · 197 papers

Denoising Autoencoder

A Denoising Autoencoder is a modification on the autoencoder to prevent the network learning the identity function. Specifically, if the autoencoder is too big, then it can just learn the data, so the output equals the input, and does not perform any useful representation learning or dimensionality reduction. Denoising autoencoders solve this problem by corrupting the input data on purpose, adding noise or masking some of the input values. Image Credit: Kumar et al

Computer Vision · Introduced 2008 · 182 papers

Non-Local Operation

A Non-Local Operation is a component for capturing long-range dependencies with deep neural networks. It is a generalization of the classical non-local mean operation in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, implying that these operations are applicable for image, sequence, and video problems.

Following the non-local mean operation, a generic non-local operation for deep neural networks is defined as $y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$. Here $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed and $j$ is the index that enumerates all possible positions. $x$ is the input signal (image, sequence, video; often their features) and $y$ is the output signal of the same size as $x$. A pairwise function $f$ computes a scalar (representing a relationship such as affinity) between $i$ and all $j$. The unary function $g$ computes a representation of the input signal at the position $j$. The response is normalized by a factor $\mathcal{C}(x)$.

The non-local behavior is due to the fact that all positions ($\forall j$) are considered in the operation. As a comparison, a convolutional operation sums up the weighted input in a local neighborhood (e.g., $i-1 \le j \le i+1$ in a 1D case with kernel size 3), and a recurrent operation at time $i$ is often based only on the current and the latest time steps (e.g., $j = i$ or $j = i-1$). The non-local operation is also different from a fully-connected (fc) layer. The equation above computes responses based on relationships between different locations, whereas fc uses learned weights. In other words, the relationship between $x_i$ and $x_j$ is not a function of the input data in fc, unlike in non-local layers. Furthermore, the formulation above supports inputs of variable sizes and maintains the corresponding size in the output. On the contrary, an fc layer requires a fixed-size input/output and loses positional correspondence (e.g., that from $x_i$ to $y_i$ at the position $i$).

A non-local operation is a flexible building block and can be easily used together with convolutional/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers that are often used in the end. This allows us to build a richer hierarchy that combines both non-local and local information. In terms of parameterisation, we usually parameterise $g$ as a linear embedding of the form $g(x_j) = W_g x_j$, where $W_g$ is a weight matrix to be learned. This is implemented as, e.g., a 1×1 convolution in space or a 1×1×1 convolution in spacetime. For $f$ we use an affinity function; common choices include the Gaussian, embedded Gaussian, dot product, and concatenation forms.

Computer Vision · Introduced 2000 · 181 papers

3D CNN

3 Dimensional Convolutional Neural Network

Computer Vision · Introduced 2000 · 178 papers

Spatial Transformer

A Spatial Transformer is an image model block that explicitly allows the spatial manipulation of data within a convolutional neural network. It gives CNNs the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample. The transformation is then performed on the entire feature map (non-locally) and can include scaling, cropping, rotations, as well as non-rigid deformations. The architecture is shown in the Figure to the right. The input feature map $U$ is passed to a localisation network which regresses the transformation parameters $\theta$. The regular spatial grid $G$ over the output $V$ is transformed to the sampling grid $\mathcal{T}_{\theta}(G)$, which is applied to $U$, producing the warped output feature map $V$. The combination of the localisation network and sampling mechanism defines a spatial transformer.

Computer Vision · Introduced 2000 · 169 papers

ConvNeXt

Computer Vision · Introduced 2000 · 165 papers

SMOTE

Synthetic Minority Over-sampling Technique.

Perhaps the most widely used approach to synthesizing new examples is the Synthetic Minority Oversampling Technique, or SMOTE for short. This technique was described by Nitesh Chawla, et al. in their 2002 paper titled "SMOTE: Synthetic Minority Over-sampling Technique." SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space, and drawing a new sample at a point along that line.
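A minimal NumPy sketch of that interpolation step, assuming a small minority-class matrix and brute-force nearest neighbours; in practice one would typically use a library implementation such as imbalanced-learn.

```python
import numpy as np

def smote_sample(X_minority, k=5):
    """Create one synthetic example by interpolating towards a random neighbour."""
    i = np.random.randint(len(X_minority))
    x = X_minority[i]
    # k nearest neighbours of x within the minority class (excluding itself)
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = X_minority[np.random.choice(neighbours)]
    # new point somewhere along the line between x and the neighbour
    return x + np.random.rand() * (neighbour - x)

X_min = np.random.rand(20, 4)          # 20 minority-class samples, 4 features
print(smote_sample(X_min).shape)       # (4,)
```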

Computer Vision · Introduced 2000 · 156 papers

Deformable Convolution

Deformable convolutions add 2D offsets to the regular grid sampling locations in the standard convolution, enabling free-form deformation of the sampling grid. The offsets are learned from the preceding feature maps, via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.

Computer Vision · Introduced 2000 · 152 papers

SAGAN

Self-Attention GAN

The Self-Attention Generative Adversarial Network, or SAGAN, allows for attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other.

Computer Vision · Introduced 2000 · 138 papers

ResNeXt Block

A ResNeXt Block is a type of residual block used as part of the ResNeXt CNN architecture. It uses a "split-transform-merge" strategy (branched paths within a single module) similar to an Inception module, i.e. it aggregates a set of transformations. Compared to a Residual Block, it exposes a new dimension, cardinality (the size of the set of transformations) $C$, as an essential factor in addition to depth and width. Formally, a set of aggregated transformations can be represented as $\mathcal{F}(x) = \sum_{i=1}^{C} \mathcal{T}_i(x)$, where $\mathcal{T}_i(x)$ can be an arbitrary function. Analogous to a simple neuron, $\mathcal{T}_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.

Computer Vision · Introduced 2000 · 132 papers

ResNeXt

A ResNeXt repeats a building block that aggregates a set of transformations with the same topology. Compared to a ResNet, it exposes a new dimension, cardinality (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width. Formally, a set of aggregated transformations can be represented as $\mathcal{F}(x) = \sum_{i=1}^{C} \mathcal{T}_i(x)$, where $\mathcal{T}_i(x)$ can be an arbitrary function. Analogous to a simple neuron, $\mathcal{T}_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.

Computer Vision · Introduced 2000 · 132 papers

CSPDarknet53

CSPDarknet53 is a convolutional neural network and backbone for object detection that uses DarkNet-53. It employs a CSPNet strategy to partition the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network. This CNN is used as the backbone for YOLOv4.

Computer Vision · Introduced 2000 · 132 papers

EBM

energy-based model

Computer Vision · Introduced 2000 · 128 papers

PAFPN

PAFPN is a feature pyramid module used in Path Aggregation Networks (PANet) that combines FPNs with bottom-up path augmentation, which shortens the information path between lower layers and the topmost features.

Computer Vision · Introduced 2000 · 124 papers
Page 1 of 56