Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

8,725 machine learning methods and techniques

All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

Matrix NMS

Matrix Non-Maximum Suppression

Matrix NMS, or Matrix Non-Maximum Suppression, performs non-maximum suppression with parallel matrix operations in one shot. It is motivated by Soft-NMS, which decays the other detection scores as a monotonically decreasing function of their overlaps. By recursively decaying the scores according to IoUs, higher-IoU detections are eliminated with a minimum score threshold. However, this process is sequential, like traditional greedy NMS, and cannot be parallelized. Matrix NMS views this process from another perspective by considering how a predicted mask $m_j$ gets suppressed. For $m_j$, its decay factor is affected by: (a) the penalty of each prediction $m_i$ on $m_j$ (with $s_i > s_j$, where $s_i$ and $s_j$ are the confidence scores); and (b) the probability of $m_i$ being suppressed. For (a), the penalty of each prediction $m_i$ on $m_j$ is easily computed as $f(\text{iou}_{i,j})$. For (b), the probability of $m_i$ being suppressed is not so elegant to compute; however, it usually correlates positively with the IoUs, so it is directly approximated by the most overlapped prediction on $m_i$ as $f(\text{iou}_{\cdot,i}) = \min_{\forall s_k > s_i} f(\text{iou}_{k,i})$. To this end, the final decay factor becomes
\begin{equation}
\text{decay}_j = \min_{\forall s_i > s_j} \frac{f(\text{iou}_{i,j})}{f(\text{iou}_{\cdot,i})},
\end{equation}
and the updated score is computed by $s_j = s_j \cdot \text{decay}_j$. The authors consider the two simplest decremented functions: linear, $f(\text{iou}_{i,j}) = 1 - \text{iou}_{i,j}$, and Gaussian, $f(\text{iou}_{i,j}) = \exp\left(-\frac{\text{iou}_{i,j}^2}{\sigma}\right)$.
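Because the decay of every score depends only on the pairwise IoU matrix, the whole computation reduces to a few matrix operations. A minimal NumPy sketch (function name and the toy inputs in the usage below are illustrative, not from the paper):

```python
import numpy as np

def matrix_nms(scores, ious, sigma=2.0, kernel="gaussian"):
    """One-shot score decay. `scores` is assumed sorted in decreasing order;
    `ious` is the symmetric (N, N) IoU matrix between the N predictions."""
    n = len(scores)
    ious = np.triu(ious, k=1)                  # keep only pairs (i, j) with s_i >= s_j
    # f(iou_{., i}): IoU with the most-overlapped higher-scoring prediction on i,
    # approximating the probability of i itself being suppressed.
    cmax = np.tile(ious.max(axis=0), (n, 1)).T
    if kernel == "gaussian":
        decay = np.exp(-(ious ** 2 - cmax ** 2) / sigma)
    else:                                      # linear kernel
        decay = (1.0 - ious) / (1.0 - cmax)
    # decay_j = min_i f(iou_ij) / f(iou_.i), taken column-wise in parallel
    return scores * decay.min(axis=0)
```

A score threshold is then applied to the decayed scores to drop suppressed predictions, with no sequential loop anywhere.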

Computer Vision · Introduced 2000 · 5 papers

MobileViTv2

MobileViTv2 is a vision transformer tuned to mobile devices. It introduces a separable self-attention method to reduce the computational cost relative to MobileViT.

Computer Vision · Introduced 2000 · 5 papers

Meta Pseudo Labels

Meta Pseudo Labels is a semi-supervised learning method that uses a teacher network to generate pseudo labels on unlabeled data to teach a student network. The teacher receives feedback on the student's performance and uses it to generate better pseudo labels. This feedback signal serves as a reward for training the teacher throughout the course of the student's learning.

General · Introduced 2000 · 5 papers

Multiscale Dilated Convolution Block

A Multiscale Dilated Convolution Block is an Inception-style convolutional block motivated by the ideas that image features naturally occur at multiple scales, that a network’s expressivity is proportional to the range of functions it can represent divided by its total number of parameters, and by the desire to efficiently expand a network’s receptive field. The Multiscale Dilated Convolution (MDC) block applies a single filter at multiple dilation factors, then performs a weighted elementwise sum of each dilated filter’s output, allowing the network to simultaneously learn a set of features and the relevant scales at which those features occur with a minimal increase in parameters. This also rapidly expands the network’s receptive field without requiring an increase in depth or the number of parameters.
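As a concrete sketch of the idea, the block below applies one shared kernel at several dilation factors via a naive "same"-padded cross-correlation and takes a weighted elementwise sum of the outputs; in the real block the kernel and the per-dilation weights are learned (the uniform weights here are a placeholder):

```python
import numpy as np

def dilated_conv2d(x, k, d):
    """Naive 'same'-padded 2D cross-correlation of x with kernel k at dilation d."""
    kh, kw = k.shape
    ph, pw = d * (kh - 1) // 2, d * (kw - 1) // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i * d:i * d + x.shape[0], j * d:j * d + x.shape[1]]
    return out

def mdc_block(x, kernel, dilations=(1, 2, 4), weights=None):
    """Apply ONE kernel at multiple dilation factors, then take a weighted
    elementwise sum, so features and their relevant scales are learned jointly."""
    if weights is None:
        weights = np.full(len(dilations), 1.0 / len(dilations))
    return sum(w * dilated_conv2d(x, kernel, d)
               for w, d in zip(weights, dilations))
```

Because the same kernel is reused at every dilation, the receptive field grows with the largest dilation factor while the parameter count stays that of a single filter.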

Computer Vision · Introduced 2000 · 5 papers

GRIN

Graph Recurrent Imputation Network

Sequential · Introduced 2000 · 5 papers

Gradient-Based Subword Tokenization

GBST

GBST, or Gradient-Based Subword Tokenization, is a soft, gradient-based subword tokenization module that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns a position-wise soft selection over them by scoring each block with a block scoring network. In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which enables easy inspection of lexical representations, and is more efficient than other byte-based models.

Natural Language Processing · Introduced 2000 · 5 papers

Minibatch Discrimination

Minibatch Discrimination is a discriminative technique for generative adversarial networks where we discriminate between whole minibatches of samples rather than between individual samples. This is intended to avoid collapse of the generator.
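The standard formulation projects each sample's features through a learned tensor, compares the projections across the whole minibatch with an L1 kernel, and appends the resulting statistics to the features. A NumPy sketch (the tensor `T` would be learned; here it is just passed in):

```python
import numpy as np

def minibatch_discrimination(feats, T):
    """feats: (N, A) per-sample features; T: (A, B, C) learned tensor.
    Returns (N, A + B): features with minibatch similarity statistics appended."""
    M = np.einsum("na,abc->nbc", feats, T)                # (N, B, C) projections
    # L1 distance between every pair of samples, per output row b
    dists = np.abs(M[:, None] - M[None, :]).sum(axis=3)   # (N, N, B)
    o = np.exp(-dists).sum(axis=1)                        # (N, B) closeness to the batch
    return np.concatenate([feats, o], axis=1)
```

Because each sample's appended statistics depend on every other sample in the batch, the discriminator can detect a generator that collapses to near-identical outputs.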

Computer Vision · Introduced 2000 · 5 papers

PLIP

Pathology Language and Image Pre-Training

Pathology Language and Image Pre-Training (PLIP) is a vision-and-language foundation model created by fine-tuning CLIP on pathology images.

Computer Vision · Introduced 2000 · 5 papers

CoaT

Co-Scale Conv-attentional Image Transformer

Co-Scale Conv-Attentional Image Transformer (CoaT) is a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities.

Computer Vision · Introduced 2000 · 5 papers

ScatNet

Scattering Transform

A wavelet scattering transform computes a translation invariant representation, which is stable to deformation, using a deep convolution network architecture. It computes non-linear invariants with modulus and averaging pooling functions. It helps to eliminate the image variability due to translation and is stable to deformations. Image source: Bruna and Mallat

Computer Vision · Introduced 2000 · 5 papers

InternVideo

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.

Computer Vision · Introduced 2000 · 5 papers

TextGrad

TextGrad is a powerful framework that builds "automatic differentiation" via text. TextGrad implements backpropagation through text feedback provided by LLMs, strongly building on the gradient metaphor.

General · Introduced 2000 · 5 papers

Large-scale spectral clustering

Spectral Clustering. Spectral clustering aims to partition data points into clusters using the spectrum of the graph Laplacian. Given a dataset $X$ with $n$ data points, a spectral clustering algorithm first constructs a similarity matrix $S \in \mathbb{R}^{n \times n}$, where $S_{ij}$ indicates the similarity between data points $x_i$ and $x_j$ under a similarity metric. Let $L = D - S$, where $L$ is called the graph Laplacian and $D$ is a diagonal matrix with $D_{ii} = \sum_{j} S_{ij}$. The objective function of spectral clustering can be formulated based on the graph Laplacian as follows:
\begin{equation} \label{eq:SCobj}
{\min_{U} \operatorname{tr}\left({U}^{T} {L} {U}\right)}, \quad {\text{s.t.} \quad {U}^{T} {{U}={I}}},
\end{equation}
where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The rows of the matrix $U \in \mathbb{R}^{n \times k}$ are the low-dimensional embeddings of the original data points. Generally, spectral clustering computes $U$ as the bottom $k$ eigenvectors of $L$, and finally applies $k$-means on $U$ to obtain the clustering results.
Large-scale Spectral Clustering. To capture the relationships between all data points in $X$, an $n \times n$ similarity matrix must be constructed in conventional spectral clustering, which costs $O(n^2)$ time and memory and is not feasible for large-scale clustering tasks. Instead of a full similarity matrix, many accelerated spectral clustering methods use a similarity sub-matrix that represents each data point by its cross-similarity to a set of representative data points (i.e., landmarks):
\begin{equation} \label{eq:cross-similarity}
B = \Phi(X, R),
\end{equation}
where $R$ ($m \ll n$) is a set of landmarks with the same dimension as $X$, $\Phi$ indicates a similarity metric, and $B \in \mathbb{R}^{n \times m}$ is the similarity sub-matrix representing $X$ with respect to $R$. For large-scale spectral clustering using such a similarity matrix, a symmetric similarity matrix can be designed as
\begin{equation} \label{eq:WusedB}
W=\left[\begin{array}{ll} \mathbf{0} & B \\ B^{T} & \mathbf{0} \end{array}\right],
\end{equation}
whose size is $(n+m) \times (n+m)$.
Taking advantage of this bipartite structure, fast eigendecomposition methods can then be used to obtain the spectral embedding. Finally, $k$-means is conducted on the embedding to obtain the clustering results. The clustering result is directly related to the quality of $B$, which consists of the similarities between data points and landmarks; the quality of landmark selection is therefore crucial to the clustering result.
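To make this concrete, here is a small NumPy sketch of the landmark-based embedding. The Gaussian similarity and the SVD route to the bipartite-graph embedding are illustrative choices (the top singular vectors of the degree-normalised $B$ give the spectral embedding of the bipartite graph $W$ without ever forming $W$):

```python
import numpy as np

def landmark_spectral_embedding(X, landmarks, k, gamma=1.0):
    """X: (n, dim) data; landmarks: (m, dim) representatives; returns (n, k).
    k-means on the returned embedding would give the final clusters."""
    d2 = ((X[:, None] - landmarks[None, :]) ** 2).sum(-1)
    B = np.exp(-gamma * d2)                        # (n, m) cross-similarity B = Phi(X, R)
    d_row, d_col = B.sum(1), B.sum(0)              # degrees of the two bipartite sides
    Bn = B / np.sqrt(d_row)[:, None] / np.sqrt(d_col)[None, :]
    # SVD of the small (n, m) matrix replaces eigendecomposition of (n+m, n+m) W
    U, s, Vt = np.linalg.svd(Bn, full_matrices=False)
    return U[:, :k]                                # n-point spectral embedding
```

The cost is dominated by the $O(nm)$ similarity computation and an SVD of an $n \times m$ matrix, instead of $O(n^2)$ for the full similarity graph.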

General · Introduced 2000 · 5 papers

Lovasz-Softmax

The Lovasz-Softmax loss is a loss function for multiclass semantic segmentation that incorporates the softmax operation in the Lovasz extension. The Lovasz extension is a means by which we can achieve direct optimization of the mean intersection-over-union loss in neural networks.
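A compact NumPy sketch of the flat (per-pixel) version: errors per class are sorted in decreasing order and dotted with the gradient of the Lovasz extension of the Jaccard loss (function names follow common open-source implementations, not an official API):

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss w.r.t. sorted errors."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]       # first differences
    return jaccard

def lovasz_softmax_flat(probs, labels, classes):
    """probs: (P, C) softmax outputs; labels: (P,) integer ground truth."""
    losses = []
    for c in classes:
        fg = (labels == c).astype(float)
        errors = np.abs(fg - probs[:, c])
        order = np.argsort(-errors)                # decreasing errors
        losses.append(np.dot(errors[order], lovasz_grad(fg[order])))
    return np.mean(losses)
```

Because the loss is a (piecewise-linear) extension of the Jaccard index itself, minimising it optimises mean IoU directly rather than a cross-entropy surrogate.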

General · Introduced 2000 · 5 papers

HyperDenseNet

Recently, dense connections have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. In particular, DenseNet, which connects each layer to every other layer in a feed-forward fashion, has shown impressive performance in natural image classification tasks. We propose HyperDenseNet, a 3-D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between the pairs of layers within the same path but also between those across different paths. This contrasts with the existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. Therefore, the proposed network has total freedom to learn more complex combinations between the modalities, within and in-between all the levels of abstraction, which increases significantly the learning representation. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on six month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of features re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning.

Computer Vision · Introduced 2000 · 5 papers

DRA

Dynamic Range Activator

Recursive functions with heteroscedastic, sparse, and high-variance target distributions introduce huge complexity that makes their accurate modeling with neural networks a difficult task. A main property of recursive maps (e.g. the factorial function) is their dramatic growth and drop. Learning this recursive behavior requires not only fitting high-frequency patterns within a bounded region but also successfully extrapolating those patterns beyond that region. In time series prediction tasks, capturing periodic behavior is a challenge. Various methods have been employed to model periodic patterns effectively. However, these approaches typically deal with uni-modal data that also exhibit relatively low variance in both In-Distribution (ID) and Out-Of-Distribution (OOD) regions, and they do not generalize well to recursive problems with the high variance observed in this context. Thus, to enable Transformers to capture such behavior and perform proper inference for multi-modal recursive problems, the authors enhance them by introducing the Dynamic Range Activator (DRA). The DRA is designed to handle the recursive and factorial growth properties inherent in enumerative problems with minimal computational overhead, and it can be integrated into existing neural networks without requiring significant architectural changes. DRA integrates both harmonic and hyperbolic components as follows:
\begin{equation}
\mathrm{DRA}(x) := x + a \sin^2\left(\frac{x}{b}\right) + c \cos(bx) + d \tanh(bx) \,,
\end{equation}
where $a$, $b$, $c$, and $d$ are learnable parameters. This allows the function to simultaneously model periodic data (through the sine and cosine terms) and rapid growth or attenuation (through the hyperbolic tangent term).
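The activation itself is a one-liner. A NumPy sketch, with arbitrary illustrative values standing in for the learnable parameters $a$, $b$, $c$, $d$:

```python
import numpy as np

def dra(x, a=1.0, b=2.0, c=0.5, d=0.5):
    """Dynamic Range Activator: identity plus harmonic terms (periodicity)
    plus a tanh term (rapid growth/attenuation). a, b, c, d are learnable
    in the paper; the defaults here are placeholders."""
    return x + a * np.sin(x / b) ** 2 + c * np.cos(b * x) + d * np.tanh(b * x)
```

Setting $a = c = d = 0$ recovers the identity, so the activation can fall back to a plain residual path where the extra terms are not needed.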

General · Introduced 2000 · 4 papers

BiGCN

Bi-Directional Graph Convolutional Network

Graphs · Introduced 2000 · 4 papers

2D DWT

2D Discrete Wavelet Transform

Computer Vision · Introduced 2000 · 4 papers

Generalized Focal Loss

Generalized Focal Loss (GFL) is a loss function for object detection that combines Quality Focal Loss and Distribution Focal Loss into a general form.

General · Introduced 2000 · 4 papers

XCiT Layer

An XCiT Layer is the main building block of the XCiT architecture, which uses a cross-covariance attention (XCA) operator as its principal operation. The XCiT layer consists of three main blocks, each preceded by LayerNorm and followed by a residual connection: (i) the core cross-covariance attention (XCA) operation, (ii) the local patch interaction (LPI) module, and (iii) a feed-forward network (FFN). By transposing the query-key interaction, the computational complexity of XCA is linear in the number of data elements N, rather than quadratic as in conventional self-attention.

Computer Vision · Introduced 2000 · 4 papers

scSE

Spatial and Channel SE Blocks

To aggregate global spatial information, an SE block applies global pooling to the feature map. However, it ignores pixel-wise spatial information, which is important in dense prediction tasks. Therefore, Roy et al. proposed spatial and channel SE blocks (scSE). Like BAM, spatial SE blocks are used to complement SE blocks, providing spatial attention weights that focus on important regions. Given an input feature map $X$, two parallel modules, spatial SE and channel SE, are applied to encode spatial and channel information respectively. The channel SE module is an ordinary SE block, while the spatial SE module adopts a $1 \times 1$ convolution for spatial squeezing. The outputs from the two modules are fused. The overall process can be written as
\begin{align}
s_c &= \sigma (W_{2}\, \delta (W_{1}\,\text{GAP}(X))) \\
X_{\text{chn}} &= s_c \odot X \\
s_s &= \sigma(\text{Conv}^{1\times 1}(X)) \\
X_{\text{spa}} &= s_s \odot X \\
Y &= f(X_{\text{spa}}, X_{\text{chn}})
\end{align}
where $f$ denotes the fusion function, which can be maximum, addition, multiplication or concatenation. The scSE block combines channel and spatial attention to enhance features while also capturing pixel-wise spatial information, and segmentation tasks benefit greatly as a result. Integrating an scSE block into F-CNNs yields a consistent improvement in semantic segmentation at negligible extra cost.
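A NumPy sketch of the two parallel branches (weights would be learned; here they are passed in, and the $1\times 1$ convolution is reduced to a per-channel weight vector):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scse(X, W1, W2, w_spatial, fusion=np.maximum):
    """X: (C, H, W) feature map.
    Channel SE: GAP -> W1 -> ReLU -> W2 -> sigmoid gate per channel.
    Spatial SE: 1x1 conv (vector w_spatial) -> sigmoid gate per pixel."""
    z = X.mean(axis=(1, 2))                              # global average pooling, (C,)
    sc = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))           # channel attention, (C,)
    X_chn = sc[:, None, None] * X
    ss = sigmoid(np.einsum("c,chw->hw", w_spatial, X))   # spatial attention, (H, W)
    X_spa = ss[None] * X
    return fusion(X_spa, X_chn)                          # max fusion by default
```

Swapping `fusion` for `np.add` or `np.multiply` gives the other fusion variants mentioned above.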

General · Introduced 2000 · 4 papers

Siamese U-Net

A Siamese U-Net model with a pre-trained ResNet34 encoder, used for data-efficient change detection.

Computer Vision · Introduced 2000 · 4 papers

Social-STGCNN

Social-STGCNN is a method for human trajectory prediction. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects.

Computer Vision · Introduced 2000 · 4 papers

Symbolic rule learning

Symbolic rule learning methods find regularities in data that can be expressed in the form of 'if-then' rules based on symbolic representations of the data.

General · Introduced 2000 · 4 papers

CoVR

Composed Video Retrieval

The composed video retrieval (CoVR) task is a new task in which the goal is to find a video that matches both a query image and a query text. The query image represents a visual concept that the user is interested in, and the query text specifies how the concept should be modified or refined. For example, given an image of a fountain and the text "during show at night", the CoVR task is to retrieve a video that shows the fountain at night with a show.

Computer Vision · Introduced 2000 · 4 papers

Big-Little Module

Big-Little Modules are blocks for image models with two branches, each of which represents a separate block from a deep model or a less deep counterpart. They were proposed as part of the BigLittle-Net architecture. The two branches are fused with a linear combination and unit weights. These branches are known as the Big-Branch (more layers and channels, at low resolution) and the Little-Branch (fewer layers and channels, at high resolution).

Computer Vision · Introduced 2000 · 4 papers

VSF

VisuoSpatial Foresight

VisuoSpatial Foresight is a method for robotic fabric manipulation that leverages a combination of RGB and depth information to learn goal conditioned fabric manipulation policies for a variety of long horizon tasks.

General · Introduced 2000 · 4 papers

Self-adaptive Training

Self-adaptive Training is a training algorithm that dynamically corrects problematic training labels using model predictions, to improve the generalization of deep learning models on potentially corrupted training data. Accumulated predictions are used to augment the training dynamics. The use of an exponential-moving-average scheme alleviates the instability of model predictions, smooths out the training target during the training process, and enables the algorithm to completely change the training labels if necessary.
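The core update is a simple exponential moving average of the soft targets toward the model's predictions. A minimal sketch (variable names and the fixed prediction `p` are illustrative; in training, `p` would change each epoch):

```python
import numpy as np

def update_targets(targets, probs, alpha=0.9):
    """One EMA step: accumulated predictions gradually replace noisy labels."""
    return alpha * targets + (1 - alpha) * probs

t = np.array([1.0, 0.0])   # possibly corrupted one-hot training label
p = np.array([0.1, 0.9])   # model's (here fixed) confident prediction
for _ in range(100):
    t = update_targets(t, p)
# t has drifted from the given label toward the model's prediction:
# the corrupted label has effectively been corrected
```

Because `alpha` is close to 1, a single noisy prediction barely moves the target, while a consistently different prediction eventually replaces the label entirely.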

General · Introduced 2000 · 4 papers

PixelRNN

Pixel Recurrent Neural Network

PixelRNNs are generative neural networks that sequentially predict the pixels in an image along the two spatial dimensions. They model the discrete probability of the raw pixel values and encode the complete set of dependencies in the image. Variants include the Row LSTM and the Diagonal BiLSTM, which scale more easily to larger datasets. Pixel values are treated as discrete random variables by using a softmax layer in the conditional distributions. Masked convolutions are employed to allow PixelRNNs to model full dependencies between the color channels.

Computer Vision · Introduced 2000 · 4 papers

SCCL

Supporting Clustering with Contrastive Learning

SCCL, or Supporting Clustering with Contrastive Learning, is a framework that leverages contrastive learning to promote better separation in unsupervised clustering. It combines top-down clustering with bottom-up instance-wise contrastive learning to achieve better inter-cluster and intra-cluster distances. During training, a clustering loss over the original data instances and an instance-wise contrastive loss over the associated augmented pairs are jointly optimized.

General · Introduced 2000 · 4 papers

XCiT

Cross-Covariance Image Transformers, or XCiT, is a type of vision transformer that aims to combine the accuracy of conventional transformers with the scalability of convolutional architectures. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. The authors propose a "transposed" version of self-attention called cross-covariance attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.
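The key point is that the attention map lives in channel space ($d \times d$) rather than token space ($N \times N$). A single-head NumPy sketch (head splitting and the learnable temperature of the full method are omitted; the normalisation axis and output arrangement here are one plausible reading, not the exact official implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def xca(X, Wq, Wk, Wv, tau=1.0):
    """Cross-covariance attention. X: (N, d) tokens; Wq/Wk/Wv: (d, d).
    The attention map is (d, d), so cost is linear in the token count N."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Qn = Q / np.linalg.norm(Q, axis=0, keepdims=True)   # L2-normalise over tokens
    Kn = K / np.linalg.norm(K, axis=0, keepdims=True)
    A = softmax((Qn.T @ Kn) / tau, axis=-1)             # (d, d) channel-mixing map
    return V @ A.T                                      # (N, d) output tokens
```

Doubling the number of tokens only doubles the cost of the matrix products, since the softmax is always over a fixed $d \times d$ map.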

Computer Vision · Introduced 2000 · 4 papers

Revision Network

Revision Network is a style transfer module that revises a rough stylized image by generating a residual details image; the final stylized image is produced by combining the residual details image with the rough stylized image. This procedure ensures that the distribution of global style patterns in the rough stylized image is properly kept, while learning to revise local style patterns via the residual details image is easier for the Revision Network. The Revision Network is designed as a simple yet effective encoder-decoder architecture, with only one down-sampling and one up-sampling layer. Further, a patch discriminator is used to help the Revision Network capture fine patch textures in an adversarial learning setting. The patch discriminator is defined following SinGAN, with 5 convolution layers and 32 hidden channels. A relatively shallow discriminator is chosen to (1) avoid overfitting, since there is only one style image, and (2) control the receptive field so that the discriminator can only capture local patterns.

Computer Vision · Introduced 2000 · 4 papers

LayerDrop

LayerDrop is a form of structured dropout for Transformer models which has a regularization effect during training and allows for efficient pruning at inference time. It randomly drops layers from the Transformer; under the "every other" strategy, pruning with a rate $p$ means dropping the layers at depth $d$ such that $d \equiv 0 \pmod{\lfloor 1/p \rfloor}$.
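Both halves of the method fit in a few lines. A sketch in plain Python (function names are illustrative; in a real Transformer, "skipping a layer" is well-defined because of the residual connections):

```python
import random

def layerdrop_forward(x, layers, p_drop, training=True, rng=random):
    """Training: apply a layer stack, independently skipping each layer
    with probability p_drop."""
    for layer in layers:
        if training and rng.random() < p_drop:
            continue                    # drop the whole layer this step
        x = layer(x)
    return x

def every_other_prune(layers, p):
    """Inference: keep only layers whose depth d (counted from 1) does NOT
    satisfy d % round(1/p) == 0 -- the 'every other' pruning strategy."""
    stride = round(1 / p)
    return [layer for d, layer in enumerate(layers, start=1) if d % stride != 0]
```

Because training already exposed the network to every such sub-network, the pruned model needs no fine-tuning.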

General · Introduced 2000 · 4 papers

EmbraceNet

EmbraceNet: A robust deep learning architecture for multimodal classification

Computer Vision · Introduced 2000 · 4 papers

Deformable RoI Pooling

Deformable RoI Pooling adds an offset to each bin position in the regular bin partition of RoI Pooling. As in deformable convolution, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.

Computer Vision · Introduced 2000 · 4 papers

DeepLabv2

DeepLabv2 is an architecture for semantic segmentation that builds on DeepLab with an atrous spatial pyramid pooling (ASPP) scheme: parallel dilated convolutions with different rates are applied to the input feature map and then fused together. As objects of the same class can have different sizes in the image, ASPP helps account for different object sizes.

Computer Vision · Introduced 2000 · 4 papers

ZeRO-Offload

ZeRO-Offload is a sharded data parallel method for distributed training. It exploits both CPU memory and compute for offloading, while offering a clear path towards efficiently scaling on multiple GPUs by working with ZeRO-powered data parallelism. The symbiosis allows ZeRO-Offload to maintain a single copy of the optimizer states on the CPU memory regardless of the data parallel degree. Furthermore, it keeps the aggregate communication volume between GPU and CPU, as well as the aggregate CPU computation a constant regardless of data parallelism, allowing ZeRO-Offload to effectively utilize the linear increase in CPU compute with the increase in the data parallelism degree.

General · Introduced 2000 · 4 papers

APPO

Asynchronous Proximal Policy Optimization

Reinforcement Learning · Introduced 2000 · 4 papers

BigBiGAN

BigBiGAN is a type of BiGAN with a BigGAN image generator. The authors initially used ResNet as a baseline for the encoder followed by a 4-layer MLP with skip connections, but they experimented with RevNets and found they outperformed with increased network width, so opted for this type of encoder for the final architecture.

General · Introduced 2000 · 4 papers

Graph2Tree

Graph-to-Tree MWP Solver

Sequential · Introduced 2000 · 4 papers

GANDALF

Gated Adaptive Network for Deep Automated Learning of Features

We propose a novel high-performance, interpretable, and parameter- and computation-efficient deep learning architecture for tabular data, Gated Adaptive Network for Deep Automated Learning of Features (GANDALF). GANDALF relies on a new tabular processing unit with a gating mechanism and in-built feature selection, called the Gated Feature Learning Unit (GFLU), as a feature representation learning unit. We demonstrate that GANDALF outperforms or stays at par with SOTA approaches like XGBoost, SAINT, FT-Transformers, etc. in experiments on multiple established public benchmarks. We have made the code available at github.com/manujosephv/pytorchtabular under the MIT License.

General · Introduced 2000 · 4 papers

Virtual Data Augmentation

Virtual Data Augmentation, or VDA, is a framework for robustly fine-tuning pre-trained language models. Based on the original token embeddings, a multinomial mixture for augmenting virtual data is constructed, where a masked language model guarantees semantic relevance and Gaussian noise provides augmentation diversity. Furthermore, a regularized training strategy is proposed to balance the two aspects.

General · Introduced 2000 · 4 papers

FoveaBox

FoveaBox is an anchor-free framework for object detection. Instead of using predefined anchors to enumerate possible locations, scales and aspect ratios in the search for objects, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image. FoveaBox is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over the entire input image and is an off-the-shelf convolutional network. The first subnet performs per-pixel classification on the backbone's output; the second subnet performs bounding box prediction for the corresponding position.

Computer Vision · Introduced 2000 · 4 papers

Mobile Neural Network

MNN

Mobile Neural Network (MNN) is a mobile inference engine tailored to mobile applications. The contributions of MNN include: (1) presenting a mechanism called pre-inference that manages to conduct runtime optimization; (2) delivering thorough kernel optimization on operators to achieve optimal computation performance; (3) introducing backend abstraction module which enables hybrid scheduling and keeps the engine lightweight.

General · Introduced 2000 · 4 papers

DouZero

DouZero is an AI system for the card game DouDizhu that enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. The Q-network of DouZero consists of an LSTM to encode historical actions and six layers of MLP with hidden dimension of 512. The network predicts a value for a given state-action pair based on the concatenated representation of action and state.

Reinforcement Learning · Introduced 2000 · 4 papers

PP-OCR

PP-OCR is an OCR system that consists of three parts: text detection, detected-box rectification, and text recognition. The purpose of text detection is to locate the text area in the image. In PP-OCR, Differentiable Binarization (DB), based on a simple segmentation network, is used as the text detector. The text recognizer, CRNN, integrates feature extraction and sequence modeling, and adopts the Connectionist Temporal Classification (CTC) loss to avoid the inconsistency between prediction and label.

Computer Vision · Introduced 2000 · 4 papers

Anti-Alias Downsampling

Anti-Alias Downsampling (AA) aims to improve the shift-equivariance of deep networks. Max-pooling is inherently composed of two operations: the first densely evaluates the max operator, and the second naively subsamples. AA inserts a low-pass filter between them to achieve practical anti-aliasing in any existing strided layer, such as a strided convolution. The smoothing factor can be adjusted by changing the blur kernel's filter size, where a larger filter size results in increased blur.
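The blur-then-subsample step can be sketched in a few lines of NumPy, using the binomial filter coefficients (e.g. [1, 2, 1] for size 3) commonly used for the blur kernel; the dense max evaluation would happen upstream of this function:

```python
import numpy as np
from math import comb

def blur_pool(x, stride=2, filt_size=3):
    """Low-pass blur with a normalised binomial kernel, then naive subsampling.
    x: (H, W) single-channel feature map."""
    f = np.array([comb(filt_size - 1, i) for i in range(filt_size)], float)
    k = np.outer(f, f) / (f.sum() ** 2)        # separable 2D blur kernel, sums to 1
    p = filt_size // 2
    xp = np.pad(x, p, mode="reflect")
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(filt_size):
        for j in range(filt_size):
            out += k[i, j] * xp[i:i + H, j:j + W]
    return out[::stride, ::stride]             # subsample after blurring
```

Without the blur, a one-pixel shift of a sharp edge can completely change which samples survive the stride; blurring first makes the subsampled signal vary smoothly under such shifts.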

Computer Vision · Introduced 2000 · 4 papers

SimAug

Simulation as Augmentation

SimAug, or Simulation as Augmentation, is a data augmentation method for trajectory prediction. It augments the representation such that it is robust to variances in semantic scenes and camera views. First, to deal with the gap between real and synthetic semantic scenes, it represents each training trajectory by high-level scene semantic segmentation features, and defends the model against adversarial examples generated by whitebox attack methods. Second, to overcome changes in camera views, it generates multiple views for the same trajectory and encourages the model to focus on the "hardest" view from which to learn: the classification loss is adopted and the view with the highest loss is favored during training. Finally, the augmented trajectory is computed as a convex combination of the trajectories generated in the previous steps. The trajectory prediction model is built on a multi-scale representation, and the final model is trained to minimize the empirical vicinal risk over the distribution of augmented trajectories.

Computer Vision · Introduced 2000 · 4 papers

MDPO

Mirror Descent Policy Optimization

Mirror Descent Policy Optimization (MDPO) is a policy gradient algorithm based on the idea of iteratively solving a trust-region problem that minimizes a sum of two terms: a linearization of the standard RL objective function and a proximity term that restricts two consecutive updates to be close to each other. It is based on Mirror Descent, which is a general trust region method that attempts to keep consecutive iterates close to each other.

Reinforcement Learning · Introduced 2000 · 4 papers

VL-BERT

Visual-Linguistic BERT

VL-BERT is pre-trained on a large-scale image-captions dataset together with a text-only corpus. The inputs to the model are either words from the input sentences or regions-of-interest (RoIs) from input images. It can be fine-tuned to fit most visual-linguistic downstream tasks. Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual content, with a new type of visual feature embedding added to the input feature embeddings. VL-BERT takes both visual and linguistic elements as input, represented as RoIs in images and subwords in input sentences. Four different types of embeddings are used to represent each input: token embedding, visual feature embedding, segment embedding, and sequence position embedding. VL-BERT is pre-trained using Conceptual Captions and text-only datasets. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.

Computer Vision · Introduced 2000 · 4 papers
Page 19 of 175