Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

8,725 machine learning methods and techniques


Mix-FFN

Mix-FFN is a feedforward layer used in the SegFormer architecture. ViT uses positional encoding (PE) to introduce location information. However, the resolution of the positional encoding is fixed, so when the test resolution differs from the training one, the positional code must be interpolated, which often hurts accuracy. To alleviate this problem, CPVT uses a convolution together with the PE to implement a data-driven PE. The authors of Mix-FFN argue that positional encoding is actually not necessary for semantic segmentation. Instead, they use Mix-FFN, which accounts for the effect of zero padding leaking location information by using a 3×3 convolution directly in the feed-forward network (FFN). Mix-FFN can be formulated as x_out = MLP(GELU(Conv_{3×3}(MLP(x_in)))) + x_in, where x_in is the feature from the self-attention module. Mix-FFN thus mixes a 3×3 convolution and an MLP into each FFN.

General · Introduced 2000 · 47 papers

PSPNet

PSPNet, or Pyramid Scene Parsing Network, is a semantic segmentation model that utilises a pyramid parsing module to exploit global context information through different-region-based context aggregation. The local and global clues together make the final prediction more reliable. Given an input image, PSPNet uses a pretrained CNN with the dilated network strategy to extract the feature map, whose final size is 1/8 of the input image. On top of this map, the pyramid pooling module gathers context information: using a 4-level pyramid, the pooling kernels cover the whole, half of, and small portions of the image, and the pooled features are fused as a global prior. This prior is then concatenated with the original feature map in the final part of the network, followed by a convolution layer that generates the final prediction map. An auxiliary, deeply supervised loss is also used to optimize training.

Computer Vision · Introduced 2000 · 47 papers
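The pyramid pooling idea above can be sketched in a few lines of NumPy: pool the feature map into several grid sizes, upsample each pooled map back, and concatenate everything with the original map as a global prior. This is a minimal illustration only — it uses nearest-neighbour upsampling and omits the 1×1 channel-reduction convolutions the real module applies; all names are hypothetical.

```python
import numpy as np

def adaptive_avg_pool(fmap, bins):
    """Average-pool a (C, H, W) feature map into a (C, bins, bins) grid."""
    C, H, W = fmap.shape
    out = np.zeros((C, bins, bins))
    for i in range(bins):
        for j in range(bins):
            h0, h1 = i * H // bins, (i + 1) * H // bins
            w0, w1 = j * W // bins, (j + 1) * W // bins
            out[:, i, j] = fmap[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def pyramid_pooling(fmap, bin_sizes=(1, 2, 3, 6)):
    """Pool at several scales, upsample back (nearest), concatenate with input."""
    C, H, W = fmap.shape
    priors = [fmap]
    for b in bin_sizes:
        pooled = adaptive_avg_pool(fmap, b)
        # nearest-neighbour upsample back to (H, W)
        up = pooled[:, (np.arange(H) * b // H), :][:, :, (np.arange(W) * b // W)]
        priors.append(up)
    return np.concatenate(priors, axis=0)

fmap = np.random.rand(4, 6, 6)   # toy feature map: 4 channels, 6x6
out = pyramid_pooling(fmap)
print(out.shape)                 # 4 original + 4 per pyramid level -> (20, 6, 6)
```

The 1-bin level reduces to a global average, matching the "whole image" prior described above.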

CenterNet

CenterNet is a one-stage object detector that detects each object as a triplet, rather than a pair, of keypoints. It utilizes two customized modules named cascade corner pooling and center pooling, which play the roles of enriching information collected by both top-left and bottom-right corners and providing more recognizable information at the central regions, respectively. The intuition is that, if a predicted bounding box has a high IoU with the ground-truth box, then the probability that the center keypoint in its central region is predicted as the same class is high, and vice versa. Thus, during inference, after a proposal is generated as a pair of corner keypoints, we determine if the proposal is indeed an object by checking if there is a center keypoint of the same class falling within its central region.

Computer Vision · Introduced 2000 · 47 papers

GTS

Goal-Driven Tree-Structured Neural Model

Sequential · Introduced 2000 · 46 papers

Cascade Corner Pooling

Cascade Corner Pooling is a pooling layer for object detection that builds upon the corner pooling operation. Corners often lie outside the objects, so they lack local appearance features. CornerNet uses corner pooling to address this issue, finding the maximum values along the boundary directions in order to determine corners. However, this makes corners sensitive to edges. To address this problem, corners need to see the visual patterns of objects. Cascade corner pooling first looks along a boundary to find a boundary maximum value, then looks inside the object from the location of that boundary maximum to find an internal maximum value, and finally adds the two maximum values together. In this way, the corners obtain both the boundary information and the visual patterns of objects.

Computer Vision · Introduced 2000 · 46 papers

Attention Pooling

Computer Vision · Introduced 2000 · 46 papers


MuZero

MuZero is a model-based reinforcement learning algorithm. It builds upon AlphaZero's search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure. The main idea of the algorithm is to predict those aspects of the future that are directly relevant for planning. The model receives the observation (e.g. an image of the Go board or the Atari screen) as an input and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At every one of these steps the model predicts the policy (e.g. the move to play), value function (e.g. the predicted winner), and immediate reward (e.g. the points scored by playing a move). The model is trained end-to-end, with the sole objective of accurately estimating these three important quantities, so as to match the improved estimates of policy and value generated by search as well as the observed reward. There is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict; nor is there any requirement for the hidden state to match the unknown, true state of the environment; nor any other constraints on the semantics of state. Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.

Reinforcement Learning · Introduced 2000 · 46 papers

LightGCN

LightGCN is a type of graph convolutional neural network (GCN), including only the most essential component in GCN (neighborhood aggregation) for collaborative filtering. Specifically, LightGCN learns user and item embeddings by linearly propagating them on the user-item interaction graph, and uses the weighted sum of the embeddings learned at all layers as the final embedding.

Graphs · Introduced 2000 · 46 papers
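LightGCN's propagation rule is simple enough to sketch directly: build the symmetrically normalised user-item adjacency, repeatedly multiply the embeddings by it, and average the per-layer embeddings. A minimal NumPy sketch on a toy graph — the random layer-0 embeddings stand in for the learned ones, and the uniform layer weighting is one common choice:

```python
import numpy as np

# Toy interaction graph: 3 users x 2 items (1 = interaction)
R = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=float)
n_users, n_items = R.shape

# Bipartite adjacency over all nodes [users; items]
A = np.zeros((n_users + n_items, n_users + n_items))
A[:n_users, n_users:] = R
A[n_users:, :n_users] = R.T

# Symmetrically normalised adjacency: D^{-1/2} A D^{-1/2}
d = A.sum(1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

# LightGCN propagation: E^{(k+1)} = A_hat @ E^{(k)} -- no weights, no nonlinearity
rng = np.random.default_rng(0)
E = rng.normal(size=(n_users + n_items, 8))   # layer-0 embeddings (learned in practice)
layers = [E]
for _ in range(3):                            # K = 3 propagation layers
    layers.append(A_hat @ layers[-1])

# Final embedding: weighted sum over layers; uniform 1/(K+1) weights here
E_final = np.mean(layers, axis=0)
users, items = E_final[:n_users], E_final[n_users:]
scores = users @ items.T                      # predicted preference scores
print(scores.shape)                           # (3, 2)
```

Because each layer is just a sparse matrix product, the whole model has no feature transformations or nonlinearities to train — only the layer-0 embeddings.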

Center Pooling

Center Pooling is a pooling technique for object detection that aims to capture richer and more recognizable visual patterns. The geometric centers of objects do not necessarily convey very recognizable visual patterns (e.g., the human head contains strong visual patterns, but the center keypoint is often in the middle of the human body). Center pooling works as follows: the backbone outputs a feature map, and to determine whether a pixel in the feature map is a center keypoint, we find the maximum value along both its horizontal and vertical directions and add them together. By doing this, center pooling helps better detect center keypoints.

Computer Vision · Introduced 2000 · 46 papers
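The core computation — adding the horizontal and vertical maxima at each pixel — is a two-line NumPy operation. Note that the real implementation assembles this from directional corner-pooling modules; this is just the conceptual version for a single-channel map:

```python
import numpy as np

def center_pooling(fmap):
    """For each pixel, add the max over its row and the max over its column."""
    row_max = fmap.max(axis=1, keepdims=True)   # max along the horizontal direction
    col_max = fmap.max(axis=0, keepdims=True)   # max along the vertical direction
    return row_max + col_max                    # broadcasts to fmap.shape

f = np.array([[1., 2., 0.],
              [0., 5., 1.],
              [3., 0., 2.]])
print(center_pooling(f))
# [[ 5.  7.  4.]
#  [ 8. 10.  7.]
#  [ 6.  8.  5.]]
```

The strongest response lands on the pixel whose row and column both contain large activations — here the centre pixel — even if its own value is not the global maximum.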

GLM

GLM is a bilingual (English and Chinese) pre-trained transformer-based language model that follows the traditional architecture of decoder-only autoregressive language modeling. It leverages autoregressive blank infilling as its training objective.

Natural Language Processing · Introduced 2000 · 46 papers

Stochastic Weight Averaging

Stochastic Weight Averaging is an optimization procedure that averages multiple points along the trajectory of SGD, with a cyclical or constant learning rate. On the one hand it averages weights, but it also has the property that, with a cyclical or constant learning rate, SGD proposals are approximately sampling from the loss surface of the network, leading to stochastic weights and helping to discover broader optima.

General · Introduced 2000 · 45 papers
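SWA reduces to keeping a running average of the weights visited by SGD. A toy sketch on a noisy quadratic — the `sgd_step` function and all constants here are illustrative stand-ins for a real training loop:

```python
import numpy as np

def sgd_step(w, lr, rng):
    # Stand-in for a real SGD update: gradient of a toy quadratic (optimum at 1.0) + noise
    grad = 2 * (w - 1.0) + rng.normal(scale=0.5, size=w.shape)
    return w - lr * grad

rng = np.random.default_rng(0)
w = np.zeros(4)
w_swa, n_avg = None, 0

for step in range(1, 501):
    w = sgd_step(w, lr=0.05, rng=rng)        # constant learning rate
    if step % 10 == 0:                        # average every c-th iterate
        n_avg += 1
        w_swa = w.copy() if w_swa is None else w_swa + (w - w_swa) / n_avg

print(np.round(w_swa, 2))   # averaged iterate, near the optimum at 1.0
```

The running-average update `w_swa + (w - w_swa) / n_avg` is the usual numerically stable incremental mean, so no history of snapshots needs to be stored.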

PixelCNN

A PixelCNN is a generative model that uses autoregressive connections to model images pixel by pixel, decomposing the joint image distribution as a product of conditionals. PixelCNNs are much faster to train than PixelRNNs because convolutions are inherently easier to parallelize; given the vast number of pixels present in large image datasets this is an important advantage.

Computer Vision · Introduced 2000 · 45 papers
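The autoregressive property is usually enforced with masked convolutions: each pixel's filters may only see pixels above it and to its left. A sketch of the standard mask construction (type 'A' for the first layer, which must also hide the current pixel; type 'B' for later layers, which may use it):

```python
import numpy as np

def causal_mask(k, mask_type):
    """k x k convolution mask. Type 'A' (first layer) also hides the centre
    pixel; type 'B' (later layers) allows it."""
    mask = np.ones((k, k))
    mask[k // 2, k // 2 + (mask_type == 'B'):] = 0  # zero right of centre (and centre for 'A')
    mask[k // 2 + 1:, :] = 0                        # zero all rows below centre
    return mask

print(causal_mask(3, 'A'))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
print(causal_mask(3, 'B'))
# [[1. 1. 1.]
#  [1. 1. 0.]
#  [0. 0. 0.]]
```

Multiplying a convolution kernel elementwise by this mask before applying it keeps the raster-scan factorization of the joint distribution intact.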

Relative Position Encodings

Relative Position Encodings are a type of position embedding for Transformer-based models that attempts to exploit pairwise, relative positional information. Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys: e_ij = (x_i W^Q)(x_j W^K + a^K_ij)^T / sqrt(d_z). Here a^K_ij is an edge representation for the inputs x_i and x_j. The softmax operation remains unchanged from vanilla self-attention. Then relative positional information is supplied again as a sub-component of the values matrix: z_i = Σ_j α_ij (x_j W^V + a^V_ij). In other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to keys and values on the fly during attention calculation. Source: Jake Tae. Image Source: [Relative Positional Encoding for Transformers with Linear Complexity](https://www.youtube.com/watch?v=qajudaEHuq8)

General · Introduced 2000 · 45 papers

SPL

Semi-Pseudo-Label

General · Introduced 2000 · 45 papers

LeNet

LeNet is a classic convolutional neural network employing convolutions, pooling and fully connected layers. It was used for the handwritten digit recognition task with the MNIST dataset. Its architectural design served as inspiration for future networks such as AlexNet and VGG.

Computer Vision · Introduced 1998 · 45 papers

PixelShuffle

PixelShuffle is an operation used in super-resolution models to implement efficient sub-pixel convolutions with a stride of 1/r. Specifically, it rearranges elements in a tensor of shape (*, C × r², H, W) to a tensor of shape (*, C, H × r, W × r). Image Source: Remote Sensing Single-Image Resolution Improvement Using A Deep Gradient-Aware Network with Image-Specific Enhancement

General · Introduced 2000 · 45 papers
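The rearrangement can be reproduced with a reshape and a transpose; the sketch below follows the (C·r², H, W) → (C, H·r, W·r) layout used by common deep learning frameworks:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r^2, H, W) -> (C, H*r, W*r), as in sub-pixel convolution."""
    Cr2, H, W = x.shape
    C = Cr2 // (r * r)
    x = x.reshape(C, r, r, H, W)      # split the channel dim into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (C, H, r, W, r)
    return x.reshape(C, H * r, W * r)

x = np.arange(4 * 2 * 2).reshape(4, 2, 2).astype(float)  # C=1, r=2
y = pixel_shuffle(x, 2)
print(y.shape)   # (1, 4, 4)
```

Each output pixel (c, h·r+i, w·r+j) is taken from input channel c·r²+i·r+j at position (h, w), so every r×r output block interleaves one value from each of the r² input channels.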

SRS

Sticker Response Selector

Sticker Response Selector, or SRS, is a model for multi-turn dialog that automatically selects a sticker response. SRS first employs a convolution-based sticker image encoder and a self-attention-based multi-turn dialog encoder to obtain representations of the stickers and utterances. Next, a deep interaction network conducts deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependencies among all interaction results with a fusion network to output the final matching score.

Natural Language Processing · Introduced 2000 · 45 papers

CRN

Conditional Relation Network

Conditional Relation Network, or CRN, is a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning.

Computer Vision · Introduced 2000 · 44 papers

EfficientNetV2

EfficientNetV2 is a type of convolutional neural network that has faster training speed and better parameter efficiency than previous models. To develop these models, the authors use a combination of training-aware neural architecture search and scaling to jointly optimize training speed, searching over a space enriched with new ops such as Fused-MBConv. Architecturally the main differences are: - EfficientNetV2 extensively uses both MBConv and the newly added Fused-MBConv in the early layers. - EfficientNetV2 prefers a smaller expansion ratio for MBConv, since smaller expansion ratios tend to have less memory access overhead. - EfficientNetV2 prefers smaller 3x3 kernel sizes, but adds more layers to compensate for the reduced receptive field resulting from the smaller kernel size. - EfficientNetV2 completely removes the last stride-1 stage of the original EfficientNet, perhaps due to its large parameter size and memory access overhead.

Computer Vision · Introduced 2000 · 43 papers

GIN

Graph Isomorphism Network

Per the authors, the Graph Isomorphism Network (GIN) generalizes the Weisfeiler-Lehman (WL) test and hence achieves maximum discriminative power among GNNs.

Graphs · Introduced 2000 · 43 papers

DEQ

Deep Equilibrium Models

A new kind of implicit model in which the output of the network is defined as the solution to an "infinite-level" fixed-point equation. Thanks to this, the gradient of the output can be computed by implicit differentiation, without storing intermediate activations, and therefore with a significantly reduced memory footprint.

General · Introduced 2000 · 42 papers

NODE

Neural Oblivious Decision Ensembles

Neural Oblivious Decision Ensembles (NODE) is a tabular-data architecture consisting of differentiable oblivious decision trees (ODTs) trained end-to-end by backpropagation. The core building block is the NODE layer, composed of differentiable ODTs of equal depth d. As input, all trees get a common vector x ∈ R^n containing n numeric features. In its essence, an ODT is a decision table that splits the data along d splitting features, comparing each feature to a learned threshold, and then returns one of the 2^d possible responses corresponding to the comparison results. Therefore, each ODT is completely determined by its splitting features f ∈ R^d, splitting thresholds b ∈ R^d, and a d-dimensional tensor of responses R ∈ R^{2 × 2 × … × 2}. In this notation, the tree output is defined as h(x) = R[1(f_1(x) − b_1), …, 1(f_d(x) − b_d)], where 1(·) denotes the Heaviside function.

General · Introduced 2000 · 42 papers

Deformable Attention Module

Deformable Attention Module is an attention module used in the Deformable DETR architecture, which seeks to overcome one issue of base Transformer attention: it looks over all possible spatial locations. Inspired by deformable convolution, the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small fixed number of keys for each query, the issues of convergence and feature spatial resolution can be mitigated. Given an input feature map x, let q index a query element with content feature z_q and 2-d reference point p_q; the deformable attention feature is calculated by: DeformAttn(z_q, p_q, x) = Σ_{m=1..M} W_m [ Σ_{k=1..K} A_mqk · W'_m x(p_q + Δp_mqk) ], where m indexes the attention head, k indexes the sampled keys, and K is the total number of sampled keys; Δp_mqk and A_mqk denote the sampling offset and attention weight of the k-th sampling point in the m-th attention head, respectively. The scalar attention weight A_mqk lies in the range [0, 1], normalized so that Σ_k A_mqk = 1, while Δp_mqk is a pair of 2-d real numbers with unconstrained range. As p_q + Δp_mqk is fractional, bilinear interpolation is applied, as in Dai et al. (2017), in computing x(p_q + Δp_mqk). Both Δp_mqk and A_mqk are obtained via linear projection over the query feature z_q. In implementation, the query feature z_q is fed to a linear projection operator of 3MK channels, where the first 2MK channels encode the sampling offsets Δp_mqk, and the remaining MK channels are fed to a softmax operator to obtain the attention weights A_mqk.

General · Introduced 2000 · 42 papers

GAIL

Generative Adversarial Imitation Learning

Generative Adversarial Imitation Learning presents a new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning.

General · Introduced 2000 · 41 papers

ULMFiT

Universal Language Model Fine-tuning

Universal Language Model Fine-tuning, or ULMFiT, is an architecture and transfer learning method that can be applied to NLP tasks. It involves a 3-layer AWD-LSTM architecture for its representations. The training consists of three steps: 1) general language model pre-training on a Wikipedia-based text, 2) fine-tuning the language model on a target task, and 3) fine-tuning the classifier on the target task. As different layers capture different types of information, they are fine-tuned to different extents using discriminative fine-tuning. Training is performed using Slanted Triangular Learning Rates (STLR), a learning rate scheduling strategy that first linearly increases the learning rate and then linearly decays it. Fine-tuning the target classifier is achieved in ULMFiT using gradual unfreezing. Rather than fine-tuning all layers at once, which risks catastrophic forgetting, ULMFiT gradually unfreezes the model starting from the last layer (i.e., closest to the output), as this contains the least general knowledge. First the last layer is unfrozen and all unfrozen layers are fine-tuned for one epoch; then the next group of frozen layers is unfrozen and fine-tuned, and this repeats until all layers are fine-tuned, training to convergence at the last iteration.

Natural Language Processing · Introduced 2000 · 40 papers

VERSE

VERtex Similarity Embeddings

VERtex Similarity Embeddings (VERSE) is a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network. Source: Tsitsulin et al. Image source: Tsitsulin et al.

Graphs · Introduced 2000 · 40 papers

Slanted Triangular Learning Rates

Slanted Triangular Learning Rates (STLR) is a learning rate schedule which first linearly increases the learning rate and then linearly decays it. It is a modification of Triangular Learning Rates, with a short increase and a long decay period.

General · Introduced 2000 · 40 papers
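The schedule can be written down directly from the ULMFiT paper's formulation; the `cut_frac`, `ratio`, and `lr_max` values below are illustrative defaults in the paper's style, so treat the exact numbers as assumptions:

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """STLR: short linear warmup up to step cut = T*cut_frac, long linear decay after.
    t: current training step, T: total number of steps.
    ratio: how much smaller the lowest learning rate is than lr_max."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                  # fraction of warmup done
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # fraction of decay remaining
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 1000
lrs = [slanted_triangular_lr(t, T) for t in range(T)]
# The peak equals lr_max, reached exactly at the cut point (t = 100 here).
```

Plotting `lrs` gives the characteristic slanted triangle: a steep ramp over the first 10% of steps, then a long shallow decay.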

1-bit Adam

1-bit Adam is a stochastic optimization technique, a variant of Adam with error-compensated 1-bit compression, based on the finding that Adam's variance term becomes stable at an early stage. First, vanilla Adam is used for a few epochs as a warm-up. After the warm-up stage, the compression stage starts: the variance term is no longer updated and is used as a fixed preconditioner. During the compression stage, communication is based on the momentum, applied with error-compensated 1-bit compression: the momentum is quantized into a 1-bit representation (the sign of each element), accompanied by a scaling factor that ensures the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression can reduce the communication cost by 32× and 16× compared to original float32 and float16 training, respectively.

General · Introduced 2000 · 40 papers
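The compression step itself is easy to sketch: send only the signs plus one scaling factor, and carry the quantization error forward into the next step. Taking the scaling factor to be the mean absolute value (which preserves the L1 magnitude) is an assumption of this sketch, not necessarily the exact choice in the paper:

```python
import numpy as np

def one_bit_compress(m, error):
    """Error-compensated 1-bit compression of a momentum vector.
    Returns the compressed vector and the new accumulated error."""
    corrected = m + error                     # fold in the previous compression error
    scale = np.mean(np.abs(corrected))        # one float sent alongside the sign bits (assumption)
    compressed = scale * np.sign(corrected)   # 1 bit per coordinate + the scale
    new_error = corrected - compressed        # remember what was lost for next time
    return compressed, new_error

rng = np.random.default_rng(0)
m = rng.normal(size=8)
err = np.zeros(8)
c, err = one_bit_compress(m, err)
print(np.unique(np.abs(c)).size)    # 1 -- a single magnitude, only the signs vary
```

Because the discarded residual is added back before the next compression, the quantization error does not accumulate across steps.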

LXMERT

Learning Cross-Modality Encoder Representations from Transformers

LXMERT is a model for learning vision-and-language cross-modality representations. It is a Transformer model consisting of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: an image and its related sentence. Each image is represented as a sequence of objects, and each sentence as a sequence of words. By combining self-attention and cross-attention layers, the model is able to generate language representations, image representations, and cross-modality representations from the input. The model is pre-trained on image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help the model learn both intra-modality and cross-modality relationships.

Computer Vision · Introduced 2000 · 40 papers

Differentiable NAS

Differentiable Neural Architecture Search

General · Introduced 2000 · 39 papers

DMA

Dual Multimodal Attention

In the image inpainting task, this mechanism extracts complementary features from the word embeddings along two paths, comparing the descriptive text with complementary image areas through reciprocal attention.

General · Introduced 2000 · 39 papers

ARMA

ARMA GNN

The ARMA GNN layer implements a rational graph filter with a recursive approximation.

Graphs · Introduced 2000 · 38 papers

PCT

Perceptual control theoretic architecture

Natural Language Processing · Introduced 2000 · 38 papers

SPADE

Spatially-Adaptive Normalization

SPADE, or Spatially-Adaptive Normalization, is a conditional normalization method for semantic image synthesis. As in Batch Normalization, the activation is normalized channel-wise and then modulated with a learned scale and bias. In SPADE, the segmentation mask is first projected onto an embedding space and then convolved to produce the modulation parameters γ and β. Unlike in prior conditional normalization methods, γ and β are not vectors but tensors with spatial dimensions. The produced γ and β are multiplied with and added to the normalized activation element-wise.

General · Introduced 2000 · 38 papers

EfficientDet

EfficientDet is a type of object detection model which utilizes several optimization and backbone tweaks, such as the use of a BiFPN, and a compound scaling method that uniformly scales the resolution, depth and width of all backbones, feature networks and box/class prediction networks at the same time.

Computer Vision · Introduced 2000 · 38 papers

Gradient Sparsification

Gradient Sparsification is a technique for distributed training that sparsifies stochastic gradients to reduce the communication cost, with minor increase in the number of iterations. The key idea behind our sparsification technique is to drop some coordinates of the stochastic gradient and appropriately amplify the remaining coordinates to ensure the unbiasedness of the sparsified stochastic gradient. The sparsification approach can significantly reduce the coding length of the stochastic gradient and only slightly increase the variance of the stochastic gradient.

General · Introduced 2000 · 38 papers
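The drop-and-amplify idea can be sketched in a few lines: keep coordinate i with probability p_i and rescale the survivors by 1/p_i, which makes the sparsified gradient unbiased in expectation. The probability vector here is hand-picked for illustration; the paper chooses it by solving an optimization problem that trades coding length against variance:

```python
import numpy as np

def sparsify(grad, p, seed=0):
    """Keep coordinate i with probability p[i]; rescale kept entries by 1/p[i]
    so that E[out] = grad (an unbiased sparsified gradient)."""
    keep = np.random.default_rng(seed).random(grad.shape) < p
    return np.where(keep, grad / p, 0.0)

g = np.array([0.5, -1.2, 0.01, 3.0])
p = np.array([0.9, 0.9, 0.1, 0.9])   # small-magnitude entries are kept rarely
s = sparsify(g, p)
print(s)                             # each entry is either 0 or g[i]/p[i]
```

Dropped coordinates cost nothing to transmit, while the 1/p_i amplification is exactly what keeps the estimator unbiased, at the price of slightly higher variance.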

Linear Warmup

Linear Warmup is a learning rate schedule where we linearly increase the learning rate from a low rate to a constant rate thereafter. This reduces volatility in the early stages of training. Image Credit: Chengwei Zhang

General · Introduced 2000 · 38 papers
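The schedule is a one-liner; `warmup_steps` and `base_lr` below are arbitrary illustrative values:

```python
def warmup_lr(step, warmup_steps=1000, base_lr=1e-3):
    """Linearly ramp the learning rate from 0 up to base_lr, then hold it constant."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

print(warmup_lr(500))    # halfway through warmup -> 0.0005
print(warmup_lr(5000))   # after warmup -> 0.001
```

In practice the same ramp is often composed with a decay schedule (e.g. cosine or linear decay) that takes over once warmup ends.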

GPT-Neo

An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library. Source: EleutherAI/GPT-Neo

Natural Language Processing · Introduced 2000 · 38 papers

CSL

Circular Smooth Label

Circular Smooth Label (CSL) is a classification-based rotation detection technique for arbitrary-oriented object detection. It treats the angle as a circularly distributed classification target, addressing the periodicity of the angle and increasing error tolerance to adjacent angles.

Computer Vision · Introduced 2000 · 38 papers

Fast R-CNN

Fast R-CNN is an object detection model that improves on its predecessor R-CNN in a number of ways. Instead of extracting CNN features independently for each region of interest, Fast R-CNN aggregates them into a single forward pass over the image; i.e. regions of interest from the same image share computation and memory in the forward and backward passes.

Computer Vision · Introduced 2000 · 38 papers

NesT

NesT stacks canonical transformer layers that conduct local self-attention on every image block independently, and then "nests" them hierarchically. Coupling of processed information between spatially adjacent blocks is achieved through a proposed block aggregation between every two hierarchies. The overall hierarchical structure is determined by two key hyper-parameters: the patch size and the number of block hierarchies. All blocks inside each hierarchy share one set of parameters. Given an input image, it is linearly projected to embeddings; all embeddings are partitioned into blocks and flattened to generate the final input. Each transformer layer is composed of a multi-head self-attention (MSA) layer followed by a feed-forward fully-connected network (FFN) with skip connections and layer normalization. Positional embeddings are added to encode spatial information before feeding into the blocks. Lastly, a nested hierarchy with block aggregation is built: every four spatially connected blocks are merged into one.

Computer Vision · Introduced 2000 · 37 papers

Sparse Convolutions

Computer Vision · Introduced 2000 · 37 papers

VGG Loss

VGG Loss is a type of content loss introduced in the Perceptual Losses for Real-Time Style Transfer and Super-Resolution framework. It is an alternative to pixel-wise losses that attempts to be closer to perceptual similarity. The VGG loss is based on the ReLU activation layers of the pre-trained 19-layer VGG network. With φ_{i,j} we indicate the feature map obtained by the j-th convolution (after activation) before the i-th maxpooling layer within the VGG19 network, which we consider given. We then define the VGG loss as the euclidean distance between the feature representations of a reconstructed image G(I^LR) and the reference image I^HR: l_VGG/i,j = (1 / (W_{i,j} H_{i,j})) Σ_{x=1..W_{i,j}} Σ_{y=1..H_{i,j}} (φ_{i,j}(I^HR)_{x,y} − φ_{i,j}(G(I^LR))_{x,y})². Here W_{i,j} and H_{i,j} describe the dimensions of the respective feature maps within the VGG network.

General · Introduced 2000 · 37 papers

PO

Parrot optimizer: Algorithm and applications to medical problems

Stochastic optimization methods have gained significant prominence as effective techniques in contemporary research, addressing complex optimization challenges efficiently. This paper introduces the Parrot Optimizer (PO), an efficient optimization method inspired by key behaviors observed in trained Pyrrhura Molinae parrots. The study features qualitative analysis and comprehensive experiments to showcase the distinct characteristics of the Parrot Optimizer in handling various optimization problems. Performance evaluation involves benchmarking the proposed PO on 35 functions, encompassing classical cases and problems from the IEEE CEC 2022 test sets, and comparing it with eight popular algorithms. The results vividly highlight the competitive advantages of the PO in terms of its exploratory and exploitative traits. Furthermore, parameter sensitivity experiments explore the adaptability of the proposed PO under varying configurations. The developed PO demonstrates effectiveness and superiority when applied to engineering design problems. To further extend the assessment to real-world applications, we included the application of PO to disease diagnosis and medical image segmentation problems, which are highly relevant and significant in the medical field. In conclusion, the findings substantiate that the PO is a promising and competitive algorithm, surpassing some existing algorithms in the literature. The supplementary files and open source codes of the proposed Parrot Optimizer (PO) are available at https://aliasgharheidari.com/PO.html

General · Introduced 2000 · 37 papers

Class Attention

A Class Attention layer, or CA layer, is an attention mechanism for vision transformers used in CaiT that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding x_class (initialized to CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings x_patches. Considering a network with h heads and p patches, and denoting by d the embedding size, the multi-head class-attention is parameterized with several projection matrices W_q, W_k, W_v, W_o ∈ R^{d×d} and the corresponding biases b_q, b_k, b_v, b_o ∈ R^d. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as z = [x_class, x_patches]. We then perform the projections Q = W_q x_class + b_q, K = W_k z + b_k, and V = W_v z + b_v. The class-attention weights are given by A = Softmax(Q K^T / sqrt(d/h)), where Q K^T ∈ R^{h×1×(p+1)}. This attention is involved in the weighted sum out_CA = W_o A V + b_o to produce the residual output vector, which is in turn added to x_class for subsequent processing.

General · Introduced 2000 · 36 papers

ETC

Extended Transformer Construction

Extended Transformer Construction, or ETC, is an extension of the Transformer architecture with a new attention mechanism that extends the original in two main ways: (1) it allows scaling up the input length from 512 to several thousands; and (2) it can ingest structured inputs instead of just linear sequences. The key ideas that enable ETC to achieve these are a new global-local attention mechanism, coupled with relative position encodings. ETC also allows lifting weights from existing BERT models, saving computational resources while training.

Natural Language Processing · Introduced 2000 · 36 papers

INFO

INFO: An Efficient Optimization Algorithm based on Weighted Mean of Vectors

This study presents the analysis and principle of an innovative optimizer named weIghted meaN oF vectOrs (INFO) to optimize different problems. INFO is a modified weight mean method, whereby the weighted mean idea is employed for a solid structure and updating the vectors’ position using three core procedures: updating rule, vector combining, and a local search. The updating rule stage is based on a mean-based law and convergence acceleration to generate new vectors. The vector combining stage creates a combination of obtained vectors with the updating rule to achieve a promising solution. The updating rule and vector combining steps were improved in INFO to increase the exploration and exploitation capacities. Moreover, the local search stage helps this algorithm escape low-accuracy solutions and improve exploitation and convergence. The performance of INFO was evaluated in 48 mathematical test functions, and five constrained engineering test cases including optimal design of 10-reservoir system and 4-reservoir system. According to the literature, the results demonstrate that INFO outperforms other basic and advanced methods in terms of exploration and exploitation. In the case of engineering problems, the results indicate that the INFO can converge to 0.99% of the global optimum solution. Hence, the INFO algorithm is a promising tool for optimal designs in optimization problems, which stems from the considerable efficiency of this algorithm for optimizing constrained cases. The source codes of INFO algorithm are publicly available at https://aliasgharheidari.com/INFO.html

General · Introduced 2000 · 36 papers

MADDPG

MADDPG, or Multi-agent DDPG, extends DDPG into a multi-agent policy gradient algorithm where decentralized agents learn a centralized critic based on the observations and actions of all agents. It leads to learned policies that only use local information (i.e. their own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents, and is applicable not only to cooperative interaction but to competitive or mixed interaction involving both physical and communicative behavior. The critic is augmented with extra information about the policies of other agents, while the actor only has access to local information. After training is completed, only the local actors are used at execution phase, acting in a decentralized manner.

Reinforcement Learning · Introduced 2000 · 36 papers

OSCAR

OSCAR is a learning method that uses object tags detected in images as anchor points to ease the learning of image-text alignment. The model takes a (word, tag, region) triple as input and is pre-trained with two losses: a masked token loss over words and tags, and a contrastive loss between original and polluted tags. OSCAR represents an image-text pair in a semantic space via dictionary lookup; object tags serve as anchor points to align image regions with word embeddings of pre-trained language models. The model is then fine-tuned for understanding and generation tasks.

Computer Vision · Introduced 2000 · 36 papers
Page 8 of 175