Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

2,776 machine learning methods and techniques


InternVideo

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.

Computer Vision · Introduced 2000 · 5 papers

HyperDenseNet

Recently, dense connections have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. In particular, DenseNet, which connects each layer to every other layer in a feed-forward fashion, has shown impressive performance in natural image classification tasks. We propose HyperDenseNet, a 3-D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between pairs of layers within the same path but also between those across different paths. This contrasts with existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. The proposed network therefore has total freedom to learn more complex combinations between the modalities, within and in-between all levels of abstraction, which significantly improves the learned representation. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on six-month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of feature re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning.

Computer Vision · Introduced 2000 · 5 papers

2D DWT

2D Discrete Wavelet Transform

Computer Vision · Introduced 2000 · 4 papers

XCiT Layer

An XCiT Layer is the main building block of the XCiT architecture which uses a cross-covariance attention operator as its principal operation. The XCiT layer consists of three main blocks, each preceded by LayerNorm and followed by a residual connection: (i) the core cross-covariance attention (XCA) operation, (ii) the local patch interaction (LPI) module, and (iii) a feed-forward network (FFN). By transposing the query-key interaction, the computational complexity of XCA is linear in the number of data elements N, rather than quadratic as in conventional self-attention.
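
The transposed query-key interaction can be sketched in a few lines of numpy. This is a minimal single-head illustration, not the paper's implementation; the exact normalization and temperature handling are simplified:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def xca(q, k, v, tau=1.0):
    # q, k, v: (N, d) token matrices. The attention map is (d, d),
    # built over feature channels, so the cost is linear in N.
    qh = q / np.linalg.norm(q, axis=0, keepdims=True)  # illustrative L2 normalization
    kh = k / np.linalg.norm(k, axis=0, keepdims=True)
    attn = softmax(kh.T @ qh / tau, axis=-1)           # (d, d) channel-channel map
    return v @ attn                                    # (N, d) output tokens
```

Note that the (d, d) attention map never materializes an (N, N) matrix, which is the source of the linear complexity claim.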

Computer Vision · Introduced 2000 · 4 papers

Siamese U-Net

Siamese U-Net is a model with a pre-trained ResNet34 architecture as an encoder, designed for data-efficient change detection.

Computer Vision · Introduced 2000 · 4 papers

Social-STGCNN

Social-STGCNN is a method for human trajectory prediction. Pedestrian trajectories are influenced not only by the pedestrians themselves but also by their interactions with surrounding objects.

Computer Vision · Introduced 2000 · 4 papers

CoVR

Composed Video Retrieval

The composed video retrieval (CoVR) task is a task where the goal is to find a video that matches both a query image and a query text. The query image represents a visual concept that the user is interested in, and the query text specifies how the concept should be modified or refined. For example, given an image of a fountain and the text "during show at night", the CoVR task is to retrieve a video that shows the fountain at night with a show.

Computer Vision · Introduced 2000 · 4 papers

Big-Little Module

Big-Little Modules are blocks for image models with two branches, one representing a block from a deep model and the other from a less deep counterpart. They were proposed as part of the BigLittle-Net architecture. The two branches are fused with a linear combination and unit weights. These two branches are known as the Big-Branch (more layers and channels at low resolution) and the Little-Branch (fewer layers and channels at high resolution).

Computer Vision · Introduced 2000 · 4 papers

PixelRNN

Pixel Recurrent Neural Network

PixelRNNs are generative neural networks that sequentially predict the pixels in an image along the two spatial dimensions. They model the discrete probability of the raw pixel values and encode the complete set of dependencies in the image. Variants include the Row LSTM and the Diagonal BiLSTM, which scale more easily to larger datasets. Pixel values are treated as discrete random variables by using a softmax layer in the conditional distributions. Masked convolutions are employed to allow PixelRNNs to model full dependencies between the color channels.
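
The spatial part of the masking is easy to visualize. A hedged sketch of the raster-scan mask used in masked convolutions (the additional per-channel A/B masking over color channels is omitted here for brevity):

```python
import numpy as np

def causal_mask(kh, kw):
    # Spatial mask for a masked convolution: the centre pixel may only see
    # pixels above it, or to its left in the same row (raster-scan order).
    m = np.zeros((kh, kw), dtype=int)
    m[:kh // 2, :] = 1          # all rows strictly above the centre
    m[kh // 2, :kw // 2] = 1    # same row, strictly left of the centre
    return m
```

Multiplying a convolution kernel elementwise by this mask zeroes out any weight that would look at a not-yet-generated pixel.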

Computer Vision · Introduced 2000 · 4 papers

XCiT

Cross-Covariance Image Transformers, or XCiT, is a type of vision transformer that aims to combine the accuracy of conventional transformers with the scalability of convolutional architectures. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. The authors propose a "transposed" version of self-attention called cross-covariance attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.

Computer Vision · Introduced 2000 · 4 papers

Revision Network

Revision Network is a style transfer module that revises a rough stylized image by generating a residual details image; the final stylized image is obtained by combining the residual details image with the rough stylized image. This procedure ensures that the distribution of global style patterns in the rough stylized image is properly kept, while learning to revise only local style patterns via the residual details image is an easier task for the Revision Network. The Revision Network is designed as a simple yet effective encoder-decoder architecture, with only one down-sampling and one up-sampling layer. Further, a patch discriminator is used to help the Revision Network capture fine patch textures in an adversarial learning setting. The patch discriminator is defined following SinGAN, with 5 convolution layers and 32 hidden channels. A relatively shallow discriminator is chosen to (1) avoid overfitting, since only one style image is available, and (2) control the receptive field so that the discriminator can only capture local patterns.

Computer Vision · Introduced 2000 · 4 papers

EmbraceNet

EmbraceNet: A robust deep learning architecture for multimodal classification

Computer Vision · Introduced 2000 · 4 papers

Deformable RoI Pooling

Deformable RoI Pooling adds an offset to each bin position in the regular bin partition of the RoI Pooling. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.

Computer Vision · Introduced 2000 · 4 papers

DeepLabv2

DeepLabv2 is an architecture for semantic segmentation that builds on DeepLab with an atrous spatial pyramid pooling (ASPP) scheme. Parallel dilated convolutions with different rates are applied to the input feature map and then fused together. As objects of the same class can have different sizes in the image, ASPP helps to account for different object sizes.
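
A minimal 1-D numpy sketch of a dilated (atrous) convolution, the building block ASPP runs in parallel at several rates; this is illustrative only, not the DeepLab implementation:

```python
import numpy as np

def dilated_conv1d(x, k, rate):
    # 'same'-style dilated convolution with zero padding; ASPP applies
    # several such convolutions with different rates and fuses the outputs.
    r = rate * (len(k) // 2)
    xp = np.pad(np.asarray(x, dtype=float), r)
    return np.array([sum(k[m] * xp[i + m * rate] for m in range(len(k)))
                     for i in range(len(x))])
```

Increasing `rate` enlarges the receptive field without adding parameters, which is exactly why different rates capture different object sizes.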

Computer Vision · Introduced 2000 · 4 papers

FoveaBox

FoveaBox is an anchor-free framework for object detection. Instead of using predefined anchors to enumerate possible locations, scales and aspect ratios in the search for objects, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image. It is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs per-pixel classification on the backbone's output; the second subnet performs bounding box prediction for the corresponding position.

Computer Vision · Introduced 2000 · 4 papers

PP-OCR

PP-OCR is an OCR system that consists of three parts: text detection, detected-box rectification, and text recognition. The purpose of text detection is to locate the text area in the image. In PP-OCR, Differentiable Binarization (DB), which is based on a simple segmentation network, is used as the text detector. The recognizer integrates feature extraction and sequence modeling, and adopts the Connectionist Temporal Classification (CTC) loss to avoid the inconsistency between prediction and label.

Computer Vision · Introduced 2000 · 4 papers

Anti-Alias Downsampling

Anti-Alias Downsampling (AA) aims to improve the shift-equivariance of deep networks. Max-pooling is inherently composed of two operations: the first is to densely evaluate the max operator, and the second is naive subsampling. AA is proposed as a low-pass filter between them to achieve practical anti-aliasing in any existing strided layer such as strided convolution. The smoothing factor can be adjusted by changing the blur kernel filter size, where a larger filter size results in increased blur.
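
The three-step decomposition (dense max, blur, subsample) can be sketched in 1-D numpy; a hedged illustration using a binomial [1, 2, 1] blur kernel, not the exact implementation:

```python
import numpy as np

def blurpool1d(x, stride=2):
    x = np.asarray(x, dtype=float)
    # 1) densely evaluate the max operator (stride-1 max over pairs)
    dense_max = np.maximum(x[:-1], x[1:])
    # 2) low-pass blur (binomial [1, 2, 1] / 4) inserted before
    # 3) naive subsampling
    pad = np.pad(dense_max, 1, mode="edge")
    blurred = (pad[:-2] + 2.0 * pad[1:-1] + pad[2:]) / 4.0
    return blurred[::stride]
```

Without step 2, a one-pixel shift of the input can change the subsampled output drastically; the blur makes the result far less sensitive to such shifts.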

Computer Vision · Introduced 2000 · 4 papers

SimAug

Simulation as Augmentation

SimAug, or Simulation as Augmentation, is a data augmentation method for trajectory prediction. It augments the representation such that it is robust to variances in semantic scenes and camera views. First, to deal with the gap between real and synthetic semantic scenes, it represents each training trajectory by high-level scene semantic segmentation features, and defends the model from adversarial examples generated by white-box attack methods. Second, to overcome changes in camera views, it generates multiple views for the same trajectory and encourages the model to focus on the "hardest" view. The classification loss is adopted, and the view with the highest loss is favored during training. Finally, the augmented trajectory is computed as a convex combination of the trajectories generated in the previous steps. The trajectory prediction model is built on a multi-scale representation, and the final model is trained to minimize the empirical vicinal risk over the distribution of augmented trajectories.

Computer Vision · Introduced 2000 · 4 papers

VL-BERT

Visual-Linguistic BERT

VL-BERT is pre-trained on a large-scale image-captions dataset together with a text-only corpus. The inputs to the model are either words from the input sentences or regions-of-interest (RoIs) from input images. It can be fine-tuned to fit most visual-linguistic downstream tasks. Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual content, with a new type of visual feature embedding added to the input feature embeddings. VL-BERT takes both visual and linguistic elements as input, represented as RoIs in images and subwords in input sentences. Four different types of embeddings are used to represent each input: token embedding, visual feature embedding, segment embedding, and sequence position embedding. VL-BERT is pre-trained using Conceptual Captions and text-only datasets. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.

Computer Vision · Introduced 2000 · 4 papers

Precise RoI Pooling

Precise RoI Pooling, or PrRoI Pooling, is a region-of-interest feature extractor that avoids any quantization of coordinates and has a continuous gradient on bounding box coordinates. Given the feature map before RoI/PrRoI Pooling (e.g. from Conv4 in ResNet-50), let w_{i,j} be the feature at one discrete location (i, j) on the feature map. Using bilinear interpolation, the discrete feature map can be considered continuous at any continuous coordinates (x, y): f(x, y) = Σ_{i,j} IC(x, y; i, j) × w_{i,j}, where IC(x, y; i, j) = max(0, 1 − |x − i|) × max(0, 1 − |y − j|) is the interpolation coefficient. Then denote a bin of an RoI as bin = {(x1, y1), (x2, y2)}, where (x1, y1) and (x2, y2) are the continuous coordinates of the top-left and bottom-right points, respectively. We perform pooling (e.g. average pooling) given bin and feature map F by computing a two-order integral: PrPool(bin, F) = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 − x1) × (y2 − y1)).
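
A hedged numpy sketch of the idea: bilinear interpolation gives a continuous feature surface, and the average over a bin is approximated here by dense sampling (the actual method evaluates the integral in closed form):

```python
import numpy as np

def bilinear(F, x, y):
    # f(x, y) = sum over the four neighbours of IC(x, y; i, j) * w_{i,j}
    h, w = F.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    val = 0.0
    for i in (x0, x0 + 1):
        for j in (y0, y0 + 1):
            if 0 <= i < w and 0 <= j < h:
                val += max(0.0, 1 - abs(x - i)) * max(0.0, 1 - abs(y - j)) * F[j, i]
    return val

def prroi_avg(F, x1, y1, x2, y2, n=64):
    # approximate the two-order integral over the bin by dense sampling
    xs = np.linspace(x1, x2, n)
    ys = np.linspace(y1, y2, n)
    return float(np.mean([[bilinear(F, x, y) for x in xs] for y in ys]))
```

Because the bin corners are real-valued, no coordinate is ever rounded, which is what gives PrRoI Pooling its continuous gradient with respect to the box.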

Computer Vision · Introduced 2000 · 4 papers

EBC

Enhanced Blockwise Classification

Traditional methods are based on block-wise regression. Enhanced Blockwise Classification (EBC), in contrast, classifies the count value within each block into several pre-defined bins. The enhancement comes from three aspects: the discretization policy, label correction, and the loss function. Note that the original block-wise classification concept was introduced by Liu et al. in Counting Objects by Blockwise Classification.
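
The core classification step can be sketched simply. The bin edges below are a hypothetical discretization policy for illustration, not the one from the paper:

```python
import numpy as np

def count_to_bin(count, edges):
    # Classify a per-block count into one of the pre-defined bins.
    # `edges` is an increasing list of bin lower edges, e.g. [0, 1, 2, 4, 8]:
    # bin 0 covers [0, 1), bin 1 covers [1, 2), ..., the last bin is open-ended.
    return int(np.searchsorted(edges, count, side="right")) - 1
```

Turning regression into classification over such bins lets the loss penalize being in the wrong bin rather than an exact count error.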

Computer Vision · Introduced 2000 · 4 papers

MODNet

MODNet is a light-weight matting objective decomposition network that can process portrait matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. To overcome the domain shift problem, MODNet introduces a self-supervised strategy based on sub-objective consistency (SOC) and a one-frame delay trick to smooth the results when applying MODNet to portrait video sequences. Given an input image I, MODNet predicts human semantics s_p, boundary details d_p, and the final alpha matte α_p through three interdependent branches S, D, and F, which are constrained by specific supervisions generated from the ground-truth matte α_g. Since the decomposed sub-objectives are correlated and help strengthen each other, MODNet can be optimized end-to-end.

Computer Vision · Introduced 2000 · 4 papers

Grid R-CNN

Grid R-CNN is an object detection framework where the traditional regression formulation is replaced by a grid point guided localization mechanism. Grid R-CNN divides the object bounding box region into grids and employs a fully convolutional network (FCN) to predict the locations of grid points. Owing to the position-sensitive property of fully convolutional architectures, Grid R-CNN maintains explicit spatial information, and grid point locations can be obtained at the pixel level. When a certain number of grid points at specified locations are known, the corresponding bounding box is definitely determined. Guided by the grid points, Grid R-CNN can determine a more accurate object bounding box than regression methods, which lack the guidance of explicit spatial information.

Computer Vision · Introduced 2000 · 4 papers

Spatial Attention Module (ThunderNet)

Spatial Attention Module (SAM) is a feature extraction module for object detection used in ThunderNet. The ThunderNet SAM explicitly re-weights the feature map before RoI warping over the spatial dimensions. The key idea of SAM is to use the knowledge from RPN to refine the feature distribution of the feature map. RPN is trained to recognize foreground regions under the supervision of ground truths. Therefore, the intermediate features in RPN can be used to distinguish foreground features from background features. SAM accepts two inputs: the intermediate feature map F_RPN from RPN and the thin feature map F_CEM from the Context Enhancement Module. The output of SAM is defined as F_SAM = F_CEM · sigmoid(T(F_RPN)). Here T(·) is a dimension transformation to match the number of channels in both feature maps, and the sigmoid function constrains the values within [0, 1]. At last, F_CEM is re-weighted by the generated attention map for a better feature distribution. For computational efficiency, a 1×1 convolution is used as T, so the computational cost of SAM is negligible. SAM has two functions. The first is to refine the feature distribution by strengthening foreground features and suppressing background features. The second is to stabilize the training of RPN, as SAM enables extra gradient flow from the R-CNN subnet to RPN. As a result, RPN receives additional supervision from the R-CNN subnet, which helps the training of RPN.
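
A minimal numpy sketch of the re-weighting F_SAM = F_CEM · sigmoid(T(F_RPN)), with the 1×1 convolution T modelled as a channel-mixing matrix (shapes and names are illustrative assumptions):

```python
import numpy as np

def sam(f_cem, f_rpn, w):
    # f_cem: (C_cem, H, W); f_rpn: (C_rpn, H, W)
    # w: (C_cem, C_rpn) models the 1x1 convolution T that matches channels
    t = np.einsum('oc,chw->ohw', w, f_rpn)
    # sigmoid constrains the attention map to [0, 1], then re-weights F_CEM
    return f_cem * (1.0 / (1.0 + np.exp(-t)))
```

Foreground positions, where RPN features are strongly activated, end up with attention values near 1, while background positions are suppressed toward 0.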

Computer Vision · Introduced 2000 · 4 papers

TridentNet Block

A TridentNet Block is a feature extractor used in object detection models. Instead of feeding in multi-scale inputs like the image pyramid, in a TridentNet block we adapt the backbone network for different scales. These blocks create multiple scale-specific feature maps. With the help of dilated convolutions, different branches of trident blocks have the same network structure and share the same parameters yet have different receptive fields. Furthermore, to avoid training objects with extreme scales, a scale-aware training scheme is employed to make each branch specific to a given scale range matching its receptive field. Weight sharing is used to prevent overfitting.

Computer Vision · Introduced 2000 · 4 papers

WenLan

WenLan proposes a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. A cross-modal pre-training model is defined based on the image-text retrieval task. The main goal is thus to learn two encoders that can embed image and text samples into the same space for effective image-text retrieval. To enforce such cross-modal embedding learning, contrastive learning with the InfoNCE loss is introduced into the BriVL model. Given a text embedding, the learning objective aims to find the best image embedding from a batch of image embeddings. Similarly, for a given image embedding, the learning objective is to find the best text embedding from a batch of text embeddings. The pre-training model learns a cross-modal embedding space by jointly training the image and text encoders to maximize the cosine similarity of the image and text embeddings of the true pair for each sample in the batch, while minimizing the cosine similarity of the embeddings of the other, incorrect pairs.
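
The contrastive objective described above can be sketched as a standard InfoNCE loss over a batch of matched image-text pairs; a minimal numpy illustration, not the BriVL implementation (which also uses a momentum-based negative queue):

```python
import numpy as np

def infonce(img, txt, tau=0.07):
    # img, txt: (B, d) embeddings; matched pairs share the same row index
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                        # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))         # true pair on the diagonal
```

The loss is minimized when each image is most similar to its own caption and dissimilar to every other caption in the batch; in practice it is applied symmetrically in both retrieval directions.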

Computer Vision · Introduced 2000 · 4 papers

STDC

Short-Term Dense Concatenate

STDC, or Short-Term Dense Concatenate, is a module for semantic segmentation that extracts deep features with a scalable receptive field and multi-scale information. It aims to remove structural redundancy in the BiSeNet architecture; specifically, BiSeNet adds an extra path to encode spatial information, which can be time-consuming. Instead, STDC gradually reduces the dimension of the feature maps and uses their aggregation for image representation. Response maps from multiple continuous layers are concatenated, each encoding the input image/feature at a different scale and receptive field, leading to a multi-scale feature representation. To speed up, the filter size of the layers is gradually reduced, with negligible loss in segmentation performance.

Computer Vision · Introduced 2000 · 4 papers

Focal Transformers

Focal self-attention is built to make Transformer layers scalable to high-resolution inputs. Instead of attending to all tokens at fine granularity, the approach attends to fine-grained tokens only locally and to summarized tokens globally. As such, it can cover as many regions as standard self-attention but at much lower cost. An image is first partitioned into patches, resulting in visual tokens. A patch embedding layer, consisting of a convolutional layer whose filter and stride have the same size, then projects the patches into hidden features. This spatial feature map is then passed to four stages of focal Transformer blocks, each consisting of several focal Transformer layers. Patch embedding layers are used in between to reduce the spatial size of the feature map by a factor of 2, while the feature dimension is doubled.

Computer Vision · Introduced 2000 · 4 papers

DE-GAN

DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement

Documents often exhibit various forms of degradation, which make them hard to read and substantially deteriorate the performance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement Generative Adversarial Networks (DE-GAN) that uses conditional GANs (cGANs) to restore severely degraded document images. To the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We demonstrate that, in different tasks (document clean-up, binarization, deblurring and watermark removal), DE-GAN can produce an enhanced version of the degraded document with high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The results obtained on a wide variety of degradations reveal the flexibility of the proposed model to be exploited in other document enhancement problems.

Computer Vision · Introduced 2000 · 4 papers

ThunderNet

ThunderNet is a two-stage object detection model. Its design targets the computationally expensive structures in state-of-the-art two-stage detectors. The backbone utilises a ShuffleNetV2-inspired network called SNet designed for object detection. In the detection part, ThunderNet follows the detection head design of Light-Head R-CNN, and further compresses the RPN and R-CNN subnet. To eliminate the performance degradation induced by small backbones and small feature maps, ThunderNet uses two new efficient architecture blocks, the Context Enhancement Module (CEM) and the Spatial Attention Module (SAM). CEM combines the feature maps from multiple scales to leverage local and global context information, while SAM uses the information learned in RPN to refine the feature distribution in RoI warping.

Computer Vision · Introduced 2000 · 4 papers

MAVL

Multiscale Attention ViT with Late fusion

Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained with aligned image-text pairs, capable of performing targeted detection from human-understandable natural language text queries. It utilizes multi-scale image features and uses deformable convolutions with late multi-modal fusion. The authors demonstrate the excellent ability of MAVL as a class-agnostic object detector when queried with general natural language commands, such as "all objects", "all entities", etc.

Computer Vision · Introduced 2000 · 4 papers

SKNet

SKNet is a type of convolutional neural network that employs selective kernel units, with selective kernel convolutions, in its architecture. This allows for a type of attention where the network can learn to attend to different receptive fields.

Computer Vision · Introduced 2000 · 4 papers

LRNet

Local Relation Network

The Local Relation Network (LR-Net) is a network built with local relation layers, which together form an image feature extractor. This feature extractor adaptively determines aggregation weights based on the compositional relationship of local pixel pairs.

Computer Vision · Introduced 2000 · 4 papers

Feature-Centric Voting

Computer Vision · Introduced 2000 · 4 papers

PanNet

Pansharpening Network

We propose a deep network architecture for the pansharpening problem called PanNet. We incorporate domain-specific knowledge to design our PanNet architecture by focusing on the two aims of the pan-sharpening problem: spectral and spatial preservation. For spectral preservation, we add up-sampled multispectral images to the network output, which directly propagates the spectral information to the reconstructed image. To preserve the spatial structure, we train our network parameters in the high-pass filtering domain rather than the image domain. We show that the trained network generalizes well to images from different satellites without needing retraining. Experiments show significant improvement over state-of-the-art methods visually and in terms of standard quality metrics.

Computer Vision · Introduced 2000 · 4 papers

Position-Sensitive RoIAlign

Position-Sensitive RoIAlign is a position-sensitive version of RoIAlign: it performs selective alignment, allowing for the learning of position-sensitive region-of-interest aligning.

Computer Vision · Introduced 2000 · 4 papers

MUSIQ

MUSIQ, or Multi-scale Image Quality Transformer, is a Transformer-based model for multi-scale image quality assessment. It processes native-resolution images with varying sizes and aspect ratios. In MUSIQ, a multi-scale image representation is constructed as input, including the native-resolution image and its aspect-ratio-preserving (ARP) resized variants. Each image is split into fixed-size patches, which are embedded by a patch encoding module. To capture the 2D structure of the image and handle images of varying aspect ratios, the spatial embedding is encoded by hashing the patch position into a grid of learnable embeddings. A scale embedding is introduced to capture scale information. The Transformer encoder takes the input tokens and performs multi-head self-attention. To predict the image quality, MUSIQ follows a common strategy in Transformers and adds a [CLS] token to the sequence to represent the whole multi-scale input; the corresponding Transformer output is used as the final representation.
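
The hashed spatial embedding can be sketched as a simple position-to-grid mapping; grid size and names here are illustrative assumptions, not the paper's exact scheme:

```python
def spatial_hash(i, j, h, w, g):
    # Hash patch position (i, j) in an h x w patch grid into a g x g table
    # of learnable spatial embeddings; patches that land in the same cell
    # share an embedding, so any aspect ratio maps onto a fixed table.
    ti = min(i * g // h, g - 1)
    tj = min(j * g // w, g - 1)
    return ti * g + tj
```

Because the hash depends only on the relative position within the patch grid, images of any size or aspect ratio index into the same fixed-size embedding table.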

Computer Vision · Introduced 2000 · 4 papers

UNIMO

UNIMO is a multi-modal pre-training architecture that can effectively adapt to both single-modal and multi-modal understanding and generation tasks. UNIMO learns visual representations and textual representations simultaneously, and unifies them into the same semantic space via cross-modal contrastive learning (CMCL) based on a large-scale corpus of image collections, text corpora, and image-text pairs.

Computer Vision · Introduced 2000 · 4 papers

DG-Net

Discriminative and Generative Network

Computer Vision · Introduced 2000 · 4 papers

BASNet

Boundary-Aware Segmentation Network

BASNet, or Boundary-Aware Segmentation Network, is an image segmentation architecture for highly accurate segmentation that consists of a predict-refine architecture and a hybrid loss. The predict-refine architecture consists of a densely supervised encoder-decoder network and a residual refinement module, which are respectively used to predict and refine a segmentation probability map. The hybrid loss is a combination of binary cross-entropy, structural similarity, and intersection-over-union losses, which guide the network to learn three-level (i.e., pixel-, patch- and map-level) hierarchy representations.

Computer Vision · Introduced 2000 · 4 papers

HTCN

Hierarchical Transferability Calibration Network

Hierarchical Transferability Calibration Network (HTCN) is an adaptive object detector that hierarchically (local-region/image/instance) calibrates the transferability of feature representations for harmonizing transferability and discriminability. The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the complementary effect between the instance-level feature and the global context information for the instance-level feature alignment; (3) local feature masks that calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment.

Computer Vision · Introduced 2000 · 4 papers

PVTv2

Pyramid Vision Transformer v2

Pyramid Vision Transformer v2 (PVTv2) is a type of Vision Transformer for detection and segmentation tasks. It improves on PVTv1 through several design improvements: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear complexity attention layers that are orthogonal to the PVTv1 framework.

Computer Vision · Introduced 2000 · 4 papers

Local Patch Interaction

Local Patch Interaction, or LPI, is a module used for the XCiT layer to enable explicit communication across patches. LPI consists of two depth-wise 3×3 convolutional layers with Batch Normalization and GELU non-linearity in between. Due to its depth-wise structure, the LPI block has a negligible overhead in terms of parameters, as well as a limited overhead in terms of throughput and memory usage during inference.
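
The depth-wise structure that keeps LPI cheap can be sketched in numpy; a naive, loop-based illustration (one 3×3 filter per channel, so parameters scale with C rather than C²):

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    # x: (C, H, W); kernels: (C, 3, 3) -- one filter per channel,
    # zero-padded so the spatial size is preserved
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * kernels[c])
    return out
```

Stacking two such layers with a non-linearity in between, as LPI does, lets neighbouring patches exchange information at negligible parameter cost.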

Computer Vision · Introduced 2000 · 4 papers

PP-YOLO

PP-YOLO is an object detector based on YOLOv3. It mainly combines various existing tricks that add almost no model parameters or FLOPs, with the goal of improving detector accuracy as much as possible while keeping the speed almost unchanged. Some of these changes include:

- Replacing the DarkNet-53 backbone with ResNet50-vd; some of the convolutional layers in ResNet50-vd are also replaced with deformable convolutional layers.
- A larger batch size, changed from 64 to 192.
- An exponential moving average of the parameters.
- DropBlock applied to the FPN.
- An IoU loss.
- An IoU prediction branch to measure the accuracy of localization.
- Grid Sensitive, similar to YOLOv4.
- Matrix NMS.
- CoordConv for the FPN, replacing the 1×1 convolution layer, and also for the first convolution layer in the detection head.
- Spatial Pyramid Pooling on the top feature map.
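
One of the cheapest tricks in the list, the exponential moving average of the parameters, can be sketched in a few lines; a generic illustration, not PP-YOLO's code (the decay value is a typical assumption):

```python
import numpy as np

class EMA:
    # keeps "shadow" copies of parameters, updated as an exponential
    # moving average of the live training values
    def __init__(self, params, decay=0.9998):
        self.decay = decay
        self.shadow = {k: np.array(v, dtype=float) for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = (self.decay * self.shadow[k]
                              + (1 - self.decay) * np.asarray(v, dtype=float))
```

At evaluation time, the shadow parameters replace the live ones; averaging smooths out the noise of the final optimization steps at no inference cost.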

Computer Vision · Introduced 2000 · 4 papers

FreeAnchor

FreeAnchor is an anchor supervision method for object detection. Many CNN-based object detectors assign anchors for ground-truth objects under the restriction of object-anchor Intersection-over-Union (IoU). In contrast, FreeAnchor is a learning-to-match approach that breaks the IoU restriction, allowing objects to match anchors in a flexible manner. It updates hand-crafted anchor assignment to free anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. FreeAnchor targets learning features which best explain a class of objects in terms of both classification and localization.

Computer Vision · Introduced 2000 · 3 papers

Vokenization

Vokenization is an approach for extrapolating multimodal alignments to language-only data by contextually mapping language tokens to their related images ("vokens") by retrieval. Instead of directly supervising the language model with visually grounded language datasets (e.g., MS COCO), these relatively small datasets are used to train the vokenization processor (i.e. the vokenizer). Vokens are generated for large language corpora (e.g., English Wikipedia), and the visually supervised language model takes its input supervision from these large datasets, thus bridging the gap between different data sources.
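
The retrieval step at the heart of the vokenizer can be sketched as nearest-image lookup by cosine similarity; a minimal numpy illustration with assumed pre-computed embeddings, not the paper's trained model:

```python
import numpy as np

def vokenize(token_embs, image_embs):
    # token_embs: (T, d) contextual token embeddings
    # image_embs: (V, d) embeddings of a fixed image set
    # returns, for each token, the index of its nearest image ("voken")
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return np.argmax(t @ v.T, axis=1)
```

The retrieved indices then serve as classification targets when training the visually supervised language model on text-only corpora.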

Computer Vision · Introduced 2000 · 3 papers

Composite Fields

Composite Fields represent and associate semantic entities with a composite of primitive fields.

Computer Vision · Introduced 2000 · 3 papers

PFGM

Poisson Flow Generative Models

Computer Vision · Introduced 2000 · 3 papers

Blended Diffusion

Blended Diffusion enables zero-shot, local, text-guided editing of natural images. Given an input image, an input mask, and a target guiding text, the method changes the masked area within the image to correspond to the guiding text such that the unmasked area is left unchanged.

Computer Vision · Introduced 2000 · 3 papers

TridentNet

TridentNet is an object detection architecture that aims to generate scale-specific feature maps with a uniform representational power. A parallel multi-branch architecture is constructed in which each branch shares the same transformation parameters but with different receptive fields. A scale-aware training scheme is used to specialize each branch by sampling object instances of proper scales for training.

Computer Vision · Introduced 2000 · 3 papers
Page 7 of 56