Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

8,725 machine learning methods and techniques

BiGG

BiGG is an autoregressive model for generative modeling of sparse graphs. It exploits sparsity to avoid generating the full adjacency matrix, reducing the time to generate a graph with $n$ nodes and $m$ edges to $O((n + m)\log n)$. Furthermore, during training this autoregressive model can be parallelized with $O(\log n)$ synchronization stages, which makes it much more efficient than other autoregressive models that require $O(n)$. The approach is based on three key elements: (1) an $O(\log n)$ process for generating each edge using a binary tree data structure, inspired by R-MAT; (2) a tree-structured autoregressive model for generating the set of edges associated with each node; and (3) an autoregressive model defined over the sequence of nodes.

Graphs · Introduced 2000 · 3 papers

Attention-augmented Convolution

Attention-augmented Convolution is a type of convolution with a two-dimensional relative self-attention mechanism that can replace convolutions as a stand-alone computational primitive for image classification. It employs scaled dot-product attention and multi-head attention, as with Transformers. It works by concatenating convolutional and attentional feature maps. Concretely, consider an original convolution operator with kernel size $k$, $F_{in}$ input filters and $F_{out}$ output filters. The corresponding attention-augmented convolution can be written as $\text{AAConv}(X) = \text{Concat}[\text{Conv}(X), \text{MHA}(X)]$, where the multi-head attention originates from an input tensor of shape $(H, W, F_{in})$. This is flattened to become $X \in \mathbb{R}^{HW \times F_{in}}$, which is passed into a multi-head attention module, as well as a convolution (see above). Similarly to the convolution, the attention-augmented convolution (1) is equivariant to translation and (2) can readily operate on inputs of different spatial dimensions.

General · Introduced 2000 · 3 papers

GridMask

GridMask is a data augmentation method that randomly removes some pixels of an input image. Unlike other methods, the region that the algorithm removes is neither a continuous region nor the random pixels of dropout. Instead, the algorithm removes a region with disconnected pixel sets, as shown in the Figure. We express the setting as $\tilde{x} = x \times M$, where $x$ is the input image, $M$ is the binary mask that stores the pixels to be removed, and $\tilde{x}$ is the result produced by the algorithm. For the binary mask $M$, if $M_{i,j} = 1$ we keep pixel $(i, j)$ in the input image; otherwise we remove it. GridMask is applied after the image normalization operation. The shape of $M$ looks like a grid, as shown in the Figure. Four numbers $(r, d, \delta_x, \delta_y)$ are used to represent a unique mask $M$, and every mask is formed by tiling the units. $r$ is the ratio of the shorter gray edge in a unit, $d$ is the length of one unit, and $\delta_x$ and $\delta_y$ are the distances between the first intact unit and the boundary of the image.
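
As an illustrative sketch (not the authors' code), the mask construction can be written in a few lines of NumPy. The function and parameter names (`gridmask`, `d`, `ratio`, `delta_x`, `delta_y`) mirror the four numbers above, but the keep/drop semantics here are simplified assumptions:

```python
import numpy as np

def gridmask(image, d=32, ratio=0.5, delta_x=0, delta_y=0):
    """Tile d x d units over the image and zero out a square of side
    int(d * ratio) inside each unit, starting from offset (delta_y, delta_x)."""
    h, w = image.shape[:2]
    mask = np.ones((h, w), dtype=image.dtype)
    drop = int(d * ratio)                       # removed square per unit
    for top in range(delta_y, h, d):
        for left in range(delta_x, w, d):
            mask[top:top + drop, left:left + drop] = 0
    return image * mask if image.ndim == 2 else image * mask[..., None]
```

With `d=4` and `ratio=0.5` on an 8x8 image, each 4x4 unit loses a 2x2 square, i.e. a quarter of its pixels.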

Computer Vision · Introduced 2000 · 3 papers

Compressive Transformer

The Compressive Transformer is an extension to the Transformer which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory. It builds on the ideas of Transformer-XL, which maintains a memory of past activations at each layer to preserve a longer history of context. The Transformer-XL discards past activations when they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories instead of discarding them, and store them in an additional compressed memory. At each time step, the oldest compressed memories are discarded (FIFO) and the oldest states from ordinary memory are compressed and shifted into the freed slots of compressed memory. During training, the compression network is optimized separately from the main language model, with its own reconstruction loss (a separate training loop).
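
A toy NumPy sketch of this memory update; the shapes, names, and the choice of mean-pooling as the compression function are illustrative assumptions (the paper explores several compression functions):

```python
import numpy as np

def update_memories(memory, comp_memory, new_states, mem_len=4, comp_len=4, rate=2):
    """One memory-update step: append new hidden states to ordinary memory,
    compress the overflow (oldest states) by mean-pooling groups of `rate`
    states into one slot each, and FIFO-evict the oldest compressed slots.
    Assumes the overflow length is divisible by `rate`."""
    memory = np.concatenate([memory, new_states])
    overflow, memory = memory[:-mem_len], memory[-mem_len:]
    if len(overflow):
        pooled = overflow.reshape(len(overflow) // rate, rate, -1).mean(axis=1)
        comp_memory = np.concatenate([comp_memory, pooled])[-comp_len:]
    return memory, comp_memory
```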

Natural Language Processing · Introduced 2000 · 3 papers

BezierAlign

BezierAlign is a feature sampling method for arbitrarily-shaped scene text recognition that exploits the parameterization of a compact Bezier-curve bounding box. Unlike RoIAlign, the sampling grid of BezierAlign is not rectangular. Instead, each column of the arbitrarily-shaped grid is orthogonal to the Bezier curve boundary of the text. The sampling points have equidistant intervals in width and height, respectively, and are bilinearly interpolated with respect to the coordinates. Formally, given an input feature map and the Bezier curve control points, all the output pixels of the rectangular output feature map are processed concurrently. For each output pixel, its horizontal position determines a parameter $t$ of the Bezier curves; evaluating the upper and lower Bezier boundaries at $t$ gives a pair of boundary points, and the sampling point is linearly indexed between them according to the pixel's vertical position. With the position of the sampling point, bilinear interpolation yields the result. Comparisons among previous sampling methods and BezierAlign are shown in the Figure.

Computer Vision · Introduced 2000 · 3 papers

Style-based Recalibration Module

A Style-based Recalibration Module (SRM) is a module for convolutional neural networks that adaptively recalibrates intermediate feature maps by exploiting their styles. SRM first extracts the style information from each channel of the feature maps by style pooling, then estimates per-channel recalibration weight via channel-independent style integration. By incorporating the relative importance of individual styles into feature maps, SRM is aimed at enhancing the representational ability of a CNN. The overall structure of SRM is illustrated in the Figure to the right. It consists of two main components: style pooling and style integration. The style pooling operator extracts style features from each channel by summarizing feature responses across spatial dimensions. It is followed by the style integration operator, which produces example-specific style weights by utilizing the style features via channel-wise operation. The style weights finally recalibrate the feature maps to either emphasize or suppress their information.
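
A minimal NumPy sketch of the recalibration path on a single (C, H, W) tensor; the paper's style integration uses a channel-wise fully connected layer with batch normalization and sigmoid, which is reduced here to an affine map plus sigmoid, and all names are illustrative:

```python
import numpy as np

def srm(feature_maps, weights, bias):
    """Style pooling extracts per-channel (mean, std) style features; style
    integration maps them to one gate per channel; the gates recalibrate
    each channel of the input."""
    c = feature_maps.shape[0]
    flat = feature_maps.reshape(c, -1)
    styles = np.stack([flat.mean(1), flat.std(1)], axis=1)   # (C, 2)
    z = (styles * weights).sum(1) + bias                     # channel-wise integration
    gate = 1.0 / (1.0 + np.exp(-z))                          # sigmoid -> (0, 1)
    return feature_maps * gate[:, None, None]                # recalibrate channels
```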

Computer Vision · Introduced 2000 · 3 papers

SimVLM

Simple Visual Language Model

SimVLM is a minimalist pretraining framework to reduce training complexity by exploiting large-scale weak supervision. It is trained end-to-end with a single prefix language modeling (PrefixLM) objective. PrefixLM enables bidirectional attention within the prefix sequence, and thus it is applicable for both decoder-only and encoder-decoder sequence-to-sequence language models.

Computer Vision · Introduced 2000 · 3 papers

CSPResNeXt Block

CSPResNeXt Block is an extended ResNeXt Block where we partition the feature map of the base layer into two parts and then merge them through a cross-stage hierarchy. The use of a split-and-merge strategy allows for more gradient flow through the network.

Computer Vision · Introduced 2000 · 3 papers

KIP

Kernel Inducing Points

Kernel Inducing Points, or KIP, is a meta-learning algorithm for learning datasets that can mitigate the challenges which occur for naturally occurring datasets without a significant sacrifice in performance. KIP uses kernel ridge regression to learn $\epsilon$-approximate datasets. It can be regarded as an adaptation of the inducing point method for Gaussian processes to the case of kernel ridge regression.

General · Introduced 2000 · 3 papers

Dynamic R-CNN

Dynamic R-CNN is an object detection method that adjusts the label assignment criteria (IoU threshold) and the shape of the regression loss function (the parameters of Smooth L1 Loss) automatically, based on the statistics of proposals during training. The motivation is that in previous two-stage object detectors there is an inconsistency between the fixed network settings and the dynamic training procedure: a fixed label assignment strategy and regression loss function cannot fit the distribution change of proposals, and are thus harmful to training high-quality detectors. Dynamic R-CNN consists of two components, Dynamic Label Assignment and Dynamic Smooth L1 Loss, which are designed for the classification and regression branches, respectively. For Dynamic Label Assignment, we want the model to be discriminative for high-IoU proposals, so the IoU threshold for positive/negative samples is gradually adjusted based on the proposal distribution during training. Specifically, the threshold is set to the IoU of the proposal at a certain percentile, since this reflects the quality of the overall distribution. For Dynamic Smooth L1 Loss, we want to change the shape of the regression loss function to adaptively fit the distribution change of the error and ensure the contribution of high-quality samples to training. This is achieved by adjusting the $\beta$ in Smooth L1 Loss based on the error distribution of the regression targets, where $\beta$ controls the magnitude of the gradient for small errors.
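
The Dynamic Label Assignment idea — setting the positive threshold at a percentile of the current proposal IoU distribution so it rises as proposals improve — can be sketched as below; `keep_percent` and `floor` are hypothetical knobs, not values from the paper:

```python
import numpy as np

def dynamic_iou_threshold(proposal_ious, keep_percent=75, floor=0.5):
    """Return the IoU threshold for positives as a fixed percentile of the
    current batch of proposal IoUs, never dropping below `floor`."""
    return max(floor, float(np.percentile(proposal_ious, keep_percent)))
```

Early in training, when proposals are poor, the percentile sits near the floor; later, the same percentile corresponds to a stricter threshold.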

Computer Vision · Introduced 2000 · 3 papers

Leverage Learning

Leverage learning suggests that it is possible to strategically use minimal task-specific data to enhance task-specific capabilities, while non-specific capabilities can be learned from more general data.

General · Introduced 2000 · 3 papers

Panoptic FPN

A Panoptic FPN is an extension of an FPN that can generate both instance and semantic segmentations via FPN. The approach starts with an FPN backbone and adds a branch for performing semantic segmentation in parallel with the existing region-based branch for instance segmentation. No changes are made to the FPN backbone when adding the dense-prediction branch, making it compatible with existing instance segmentation methods. The new semantic segmentation branch achieves its goal as follows. Starting from the deepest FPN level (at 1/32 scale), we perform three upsampling stages to yield a feature map at 1/4 scale, where each upsampling stage consists of 3×3 convolution, group norm, ReLU, and 2× bilinear upsampling. This strategy is repeated for FPN scales 1/16, 1/8, and 1/4 (with progressively fewer upsampling stages). The result is a set of feature maps at the same 1/4 scale, which are then element-wise summed. A final 1×1 convolution, 4× bilinear upsampling, and softmax are used to generate the per-pixel class labels at the original image resolution. In addition to stuff classes, this branch also outputs a special ‘other’ class for all pixels belonging to objects (to avoid predicting stuff classes for such pixels).

Computer Vision · Introduced 2000 · 3 papers

Flow Alignment Module

Flow Alignment Module, or FAM, is a flow-based alignment module for scene parsing that learns the Semantic Flow between feature maps of adjacent levels and broadcasts high-level features to high-resolution features effectively and efficiently. The concept of Semantic Flow is inspired by optical flow, which is widely used in video processing tasks to represent the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by relative motion. The authors postulate that the relationship between two feature maps of arbitrary resolutions from the same image can also be represented with the "motion" of every pixel from one feature map to the other. Once a precise Semantic Flow is obtained, the network can propagate semantic features with minimal information loss. In the FAM module, the transformed high-resolution feature map is combined with the low-resolution feature map to generate the semantic flow field, which is used to warp the low-resolution feature map to high resolution.

Computer Vision · Introduced 2000 · 3 papers

k-Sparse Autoencoder

k-Sparse Autoencoders are autoencoders with a linear activation function, where in the hidden layers only the $k$ highest activities are kept. This achieves exact sparsity in the hidden representation. Backpropagation only goes through the top $k$ activated units, which can be implemented with a ReLU layer with an adjustable threshold.
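
The top-k selection itself is simple to sketch in NumPy (names are illustrative):

```python
import numpy as np

def k_sparse(hidden, k):
    """Keep only the k largest activations per row (sample), zeroing the
    rest -- the support through which backpropagation would then flow."""
    out = np.zeros_like(hidden)
    idx = np.argpartition(hidden, -k, axis=1)[:, -k:]   # indices of top-k per row
    rows = np.arange(hidden.shape[0])[:, None]
    out[rows, idx] = hidden[rows, idx]
    return out
```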

Computer Vision · Introduced 2000 · 3 papers

QuantTree

QuantTree histograms

Given a training set drawn from an unknown multivariate probability distribution, QuantTree constructs a histogram by recursively splitting the input space. The splits are defined by a stochastic process so that each bin contains a fixed proportion of the training set. These histograms can be used to define test statistics (e.g., the Pearson statistic) to tell whether a batch of data is drawn from the same distribution or not. The most crucial property of QuantTree is that the distribution of any statistic computed over QuantTree histograms is independent of the data-generating distribution, thus enabling nonparametric statistical testing.
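
A toy version of the recursive construction, only to illustrate the "each bin holds a fixed share of training points" idea; the actual algorithm's split rules differ in detail, and all names here are assumptions:

```python
import numpy as np

def quanttree_bins(data, n_bins, rng=None):
    """Carve the sample into n_bins regions, each holding roughly an equal
    share of the points, by repeatedly cutting a randomly chosen feature
    at an empirical quantile of the points that remain."""
    rng = rng or np.random.default_rng(0)
    counts, remaining = [], data
    for b in range(n_bins - 1):
        feat = rng.integers(remaining.shape[1])            # random split feature
        cut = np.quantile(remaining[:, feat], 1.0 / (n_bins - b))
        in_bin = remaining[:, feat] <= cut                 # carve off one bin
        counts.append(int(in_bin.sum()))
        remaining = remaining[~in_bin]
    counts.append(len(remaining))                          # last bin: leftovers
    return counts
```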

General · Introduced 2000 · 3 papers

VisTR

VisTR is a Transformer-based video instance segmentation model. It views video instance segmentation as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR directly outputs the sequence of masks for each instance in the video, in order. At the core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments instances at the sequence level as a whole. VisTR frames instance segmentation and tracking in the same perspective of similarity learning, which considerably simplifies the overall pipeline and differs significantly from existing approaches.

Computer Vision · Introduced 2000 · 3 papers

Batch Nuclear-norm Maximization

Batch Nuclear-norm Maximization is an approach for aiding classification in label-insufficient situations. It involves maximizing the nuclear-norm of the batch output matrix. The nuclear-norm of a matrix is an upper bound of the Frobenius-norm of the matrix. Maximizing the nuclear-norm encourages a large Frobenius-norm of the batch matrix, which leads to increased discriminability. The nuclear-norm of the batch matrix is also a convex approximation of the matrix rank, which refers to the prediction diversity.
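
A quick numerical check of these norm relationships on toy prediction matrices (values made up for illustration):

```python
import numpy as np

# A "diverse" batch (different rows -> higher rank) vs. a collapsed batch
# (identical rows -> rank 1); each row is a valid probability vector.
diverse = np.array([[0.9, 0.1],
                    [0.1, 0.9]])
collapsed = np.array([[0.9, 0.1],
                      [0.9, 0.1]])

nuc_d = np.linalg.norm(diverse, 'nuc')      # nuclear norm: sum of singular values
nuc_c = np.linalg.norm(collapsed, 'nuc')
fro_d = np.linalg.norm(diverse, 'fro')

assert fro_d <= nuc_d        # nuclear norm upper-bounds the Frobenius norm
assert nuc_c < nuc_d         # more diverse predictions -> larger nuclear norm
```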

General · Introduced 2000 · 3 papers

SCAN-clustering

Semantic Clustering by Adopting Nearest Neighbours

SCAN automatically groups images into semantically meaningful clusters when ground-truth annotations are absent. SCAN is a two-step approach where feature learning and clustering are decoupled. First, a self-supervised task is employed to obtain semantically meaningful features. Second, the obtained features are used as a prior in a learnable clustering approach. Image source: Gansbeke et al.

General · Introduced 2000 · 3 papers

CIDA

Continuously Indexed Domain Adaptation

Continuously Indexed Domain Adaptation combines traditional adversarial adaptation with a novel discriminator that models the encoding-conditioned domain index distribution. Image Source: Wang et al.

General · Introduced 2000 · 3 papers

MDTVSFA

Computer Vision · Introduced 2000 · 3 papers

CS-GAN

CS-GAN is a type of generative adversarial network that uses a form of deep compressed sensing, and latent optimisation, to improve the quality of generated samples.

Computer Vision · Introduced 2000 · 3 papers

XGrad-CAM

XGrad-CAM, or Axiom-based Grad-CAM, is a class-discriminative visualization method that is able to highlight the regions belonging to the objects of interest. Two axiomatic properties are introduced in the derivation of XGrad-CAM: Sensitivity and Conservation. The resulting XGrad-CAM is still a linear combination of feature maps, but is able to meet the constraints of those two axioms.

Computer Vision · Introduced 2000 · 3 papers

Cluster-GCN

Cluster-GCN is a GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as follows: at each step, it samples a block of nodes associated with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while achieving test accuracy comparable to previous algorithms. Description and image from: Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Graphs · Introduced 2000 · 3 papers

CSPResNeXt

CSPResNeXt is a convolutional neural network where we apply the Cross Stage Partial Network (CSPNet) approach to ResNeXt. The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.

Computer Vision · Introduced 2000 · 3 papers

Hermite Activation

Hermite Polynomial Activation

A Hermite Activation is a type of activation function which uses a smooth, finite Hermite polynomial basis as a substitute for non-smooth ReLUs. Relevant paper: Lokhande et al.

General · Introduced 2019 · 3 papers

Gated Convolution Network

A Gated Convolutional Network is a type of language model that combines convolutional networks with a gating mechanism. Zero-padding is used to ensure that future context cannot be seen. Gated convolutional layers can be stacked hierarchically on top of one another. Model predictions are then obtained with an adaptive softmax layer.

Natural Language Processing · Introduced 2000 · 3 papers

GradDrop

Gradient Sign Dropout

GradDrop, or Gradient Sign Dropout, is a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. It is applied as a layer in any standard network forward pass, usually on the final layer before the prediction head to save on compute overhead and maximize benefits during backpropagation. Below, we develop the GradDrop formalism. Throughout, $\circ$ denotes elementwise multiplication after any necessary tiling operations (if any) are completed. To implement GradDrop, we first define the Gradient Positive Sign Purity, $\mathcal{P}$, as $\mathcal{P} = \frac{1}{2}\left(1 + \frac{\sum_i \nabla L_i}{\sum_i |\nabla L_i|}\right)$, which is bounded by $[0, 1]$. For multiple gradient values $\nabla L_i$ at some scalar position, we see that $\mathcal{P} = 0$ if $\nabla L_i < 0$ for all $i$, while $\mathcal{P} = 1$ if $\nabla L_i > 0$ for all $i$. Thus, $\mathcal{P}$ is a measure of how many positive gradients are present at any given value. We then form a mask for each gradient as $\mathcal{M}_i = \mathcal{I}[f(\mathcal{P}) > U] \circ \mathcal{I}[\nabla L_i > 0] + \mathcal{I}[f(\mathcal{P}) < U] \circ \mathcal{I}[\nabla L_i < 0]$, for the standard indicator function $\mathcal{I}$ and some monotonically increasing function $f$ (often just the identity) that maps $[0, 1] \mapsto [0, 1]$ and is odd around $(0.5, 0.5)$. $U$ is a tensor composed of i.i.d. uniform random variables. The mask $\mathcal{M}_i$ is then used to produce a final gradient $\mathcal{M}_i \circ \nabla L_i$.
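
A NumPy sketch of the masking rule, assuming f is the identity (function and variable names are illustrative):

```python
import numpy as np

def graddrop_mask(grads, rng=None):
    """Compute the Gradient Positive Sign Purity P per position, draw one
    uniform U per position, then keep positive gradients where P > U and
    negative gradients where P < U."""
    rng = rng or np.random.default_rng(0)
    g = np.stack(grads)                                    # (tasks, ...)
    purity = 0.5 * (1.0 + g.sum(0) / (np.abs(g).sum(0) + 1e-12))
    u = rng.uniform(size=purity.shape)
    keep = [((purity > u) & (gi > 0)) | ((purity < u) & (gi < 0)) for gi in g]
    return [gi * ki for gi, ki in zip(g, keep)]
```

When all task gradients at a position share a sign, purity is 0 or 1 and every gradient is kept; conflicting signs make some gradients drop.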

General · Introduced 2000 · 3 papers

TSDAE

TSDAE is an unsupervised sentence embedding method. During training, TSDAE encodes corrupted sentences into fixed-sized vectors and requires the decoder to reconstruct the original sentences from this sentence embedding. For good reconstruction quality, the semantics must be captured well in the sentence embedding from the encoder. Later, at inference, only the encoder is used for creating sentence embeddings. The model architecture of TSDAE is a modified encoder-decoder Transformer where the key and value of the cross-attention are both confined to the sentence embedding only. Formally, the formulation of the modified cross-attention is $H^{(k)} = \text{Attention}(H^{(k-1)}, [s], [s])$ with $\text{Attention}(Q, K, V) = \text{softmax}(QK^{T}/\sqrt{d})V$, where $H^{(k)} \in \mathbb{R}^{t \times d}$ is the decoder hidden states within $t$ decoding steps at the $k$-th layer, $d$ is the size of the sentence embedding, $[s] \in \mathbb{R}^{1 \times d}$ is a one-row matrix including the sentence embedding vector, and $Q$, $K$ and $V$ are the query, key and value, respectively. By exploring different configurations on the STS benchmark dataset, the authors discover that the best combination is: (1) adopting deletion as the input noise and setting the deletion ratio to 0.6; (2) using the output of the [CLS] token as the fixed-sized sentence representation; (3) tying the encoder and decoder parameters during training.

Natural Language Processing · Introduced 2000 · 3 papers

CRISS

CRISS, or Cross-lingual Retrieval for Iterative Self-Supervised Training, is a self-supervised learning method for multilingual sequence generation. CRISS is developed based on the finding that the encoder outputs of a multilingual denoising autoencoder can be used as language-agnostic representations to retrieve parallel sentence pairs, and training the model on these retrieved sentence pairs can further improve its sentence retrieval and translation capabilities in an iterative manner. Using only unlabeled data from many different languages, CRISS iteratively mines for parallel sentences across languages, trains a new better multilingual model using these mined sentence pairs, mines again for better parallel sentences, and repeats.

General · Introduced 2000 · 3 papers

Hopfield Layer

A Hopfield Layer is a module that enables a network to associate two sets of vectors. This general functionality allows for transformer-like self-attention, for decoder-encoder attention, for time series prediction (possibly with positional encoding), for sequence analysis, for multiple instance learning, for learning with point sets, for combining data sources by associations, for constructing a memory, for averaging and pooling operations, and for many more. In particular, the Hopfield layer can readily be used as a plug-in replacement for existing layers like pooling layers (max-pooling or average pooling), permutation-equivariant layers, GRU & LSTM layers, and attention layers. The Hopfield layer is based on modern Hopfield networks with continuous states that have very high storage capacity and converge after one update.

General · Introduced 2000 · 3 papers

Source Hypothesis Transfer

Source Hypothesis Transfer, or SHOT, is a representation learning framework for unsupervised domain adaptation. SHOT freezes the classifier module (hypothesis) of the source model and learns the target-specific feature extraction module by exploiting both information maximization and self-supervised pseudo-labeling to implicitly align representations from the target domains to the source hypothesis.

General · Introduced 2000 · 3 papers

CVRL

Contrastive Video Representation Learning

Contrastive Video Representation Learning, or CVRL, is a self-supervised contrastive learning framework for learning spatiotemporal visual representations from unlabeled videos. Representations are learned using a contrastive loss, where two clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. Data augmentations are designed involving spatial and temporal cues. Concretely, a temporally consistent spatial augmentation method is used to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. A sampling-based temporal augmentation method is also used to avoid overly enforcing invariance on clips that are distant in time. End-to-end, from a raw video, we first sample a temporal interval from a monotonically decreasing distribution. The temporal interval represents the number of frames between the start points of two clips, and we sample two clips from a video according to this interval. Afterwards we apply a temporally consistent spatial augmentation to each of the clips and feed them into a 3D backbone with an MLP head. The contrastive loss is used to train the network to attract the clips from the same video and repel the clips from different videos in the embedding space.

General · Introduced 2000 · 3 papers

TILDEv2

TILDEv2 is a BERT-based re-ranking method that stems from TILDE but addresses its limitations. It relies on contextualized exact term matching with expanded passages. Only the scores of tokens that appear in the expanded passages (rather than the whole vocabulary) need to be stored in the index, producing indexes that are 99% smaller than those of the original. Specifically, TILDE is modified in the following aspects:

- Exact Term Matching. The query likelihood matching originally employed in TILDE expands passages to the size of the BERT vocabulary, resulting in large indexes. To overcome this issue, relevance scores are estimated with contextualized exact term matching, which allows the model to index only the tokens present in the passage, reducing the index size. In addition, the query likelihood loss function is replaced with the noise contrastive estimation (NCE) loss, which better leverages negative training samples.
- Passage Expansion. To overcome the vocabulary mismatch problem that affects exact term matching methods, passage expansion is used to expand the original passage collection. Passages in the collection are expanded using deep LMs with a limited number of tokens, so TILDEv2 only needs to index a few extra tokens in addition to those in the original passages.
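
The scoring side of contextualized exact term matching can be sketched in plain Python; the index layout and the token scores below are invented for illustration (in practice the scores come from a BERT encoder over the expanded passage):

```python
def exact_term_match_score(query_tokens, passage_token_scores):
    """Score a query against one indexed passage: each query token that
    appears in the (expanded) passage contributes its best contextualized
    score; absent tokens contribute nothing."""
    return sum(max(passage_token_scores.get(t, [0.0])) for t in query_tokens)

# Hypothetical per-token contextualized scores for one expanded passage.
index = {"neural": [2.1, 1.4], "network": [1.7], "training": [0.9]}
score = exact_term_match_score(["neural", "training", "gpu"], index)  # 2.1 + 0.9 + 0.0
```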

General · Introduced 2000 · 3 papers

3D ResNet-RS

3D ResNet-RS is an architecture and scaling strategy for 3D ResNets for video recognition. The key additions are:

- 3D ResNet-D stem: The ResNet-D stem is adapted to 3D inputs by using three consecutive 3D convolutional layers. The first convolutional layer employs a temporal kernel size of 5 while the remaining two convolutional layers employ a temporal kernel size of 1.
- 3D Squeeze-and-Excitation: Squeeze-and-Excite is adapted to spatio-temporal inputs by using a 3D global average pooling operation for the squeeze operation. An SE ratio of 0.25 is applied in each 3D bottleneck block for all experiments.
- Self-gating: A self-gating module is used in each 3D bottleneck block after the SE module.

Computer Vision · Introduced 2000 · 3 papers

ACGPN

Adaptive Content Generating and Preserving Network

ACGPN, or Adaptive Content Generating and Preserving Network, is a generative adversarial network for virtual try-on clothing applications. In Step I, the Semantic Generation Module (SGM) takes the target clothing image, the pose map, and the fused body part mask as input to predict the semantic layout, and outputs the synthesized body part mask and the target clothing mask. In Step II, the Clothes Warping Module (CWM) warps the target clothing image according to the predicted semantic layout, where a second-order difference constraint is introduced to stabilize the warping process. In Steps III and IV, the Content Fusion Module (CFM) first produces the composited body part mask from the original clothing mask, the synthesized clothing mask, the body part mask, and the synthesized body part mask, and then exploits a fusion network to generate the try-on image from the information produced in the previous steps.

General · Introduced 2000 · 3 papers

PGHI

Phase Gradient Heap Integration

Z. Průša, P. Balazs and P. L. Søndergaard, "A Noniterative Method for Reconstruction of Phase From STFT Magnitude," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1154-1164, May 2017, doi: 10.1109/TASLP.2017.2678166.

Abstract: A noniterative method for the reconstruction of the short-time Fourier transform (STFT) phase from the magnitude is presented. The method is based on the direct relationship between the partial derivatives of the phase and the logarithm of the magnitude of the un-sampled STFT with respect to the Gaussian window. Although the theory holds in the continuous setting only, the experiments show that the algorithm performs well even in the discretized setting (discrete Gabor transform) with low redundancy using the sampled Gaussian window, the truncated Gaussian window and even other compactly supported windows such as the Hann window. Due to the noniterative nature, the algorithm is very fast and it is suitable for long audio signals. Moreover, solutions of iterative phase reconstruction algorithms can be improved considerably by initializing them with the phase estimate provided by the present algorithm. We present an extensive comparison with the state-of-the-art algorithms in a reproducible manner.

URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7890450&isnumber=7895265

Audio · Introduced 2000 · 3 papers

CLIPort

CLIPort is a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2].

Reinforcement Learning · Introduced 2000 · 3 papers

Channel Squeeze and Spatial Excitation

Channel Squeeze and Spatial Excitation (sSE)

Inspired by the widely known spatial squeeze and channel excitation (SE) block, the sSE block performs channel squeeze and spatial excitation to recalibrate the feature maps spatially and achieve more fine-grained image segmentation.

General · Introduced 2000 · 3 papers

ComiRec

ComiRec is a multi-interest framework for sequential recommendation. The multi-interest module captures multiple interests from user behavior sequences, which can be exploited for retrieving candidate items from the large-scale item pool. These items are then fed into an aggregation module to obtain the overall recommendation. The aggregation module leverages a controllable factor to balance the recommendation accuracy and diversity.

General · Introduced 2000 · 3 papers

Online Normalization

Online Normalization is a normalization technique for training deep neural networks. To define Online Normalization, we replace the arithmetic averages over the full dataset with exponentially decaying averages of online samples. The decay factors $\alpha_f$ and $\alpha_b$ for the forward and backward passes, respectively, are hyperparameters of the technique. We allow incoming samples $x_t$, such as images, to have multiple scalar components, and denote the feature-wide mean and variance by $\mu(x_t)$ and $\sigma^2(x_t)$. The algorithm also applies to outputs of fully connected layers with only one scalar output per feature; in fact, this case simplifies to $\mu(x_t) = x_t$ and $\sigma(x_t) = 0$. Scalars $\mu_t$ and $\sigma_t$ denote running estimates of the mean and variance across all samples, where the subscript $t$ denotes time steps corresponding to processing new incoming samples. Online Normalization uses an ongoing process during the forward pass to estimate activation means and variances. It implements the standard online computation of mean and variance, generalized to processing multi-value samples and exponential averaging of sample statistics. The resulting estimates directly lead to an affine normalization transform.
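
A scalar sketch of the forward-pass statistics update; the decay constant, epsilon, and the exact update order are illustrative assumptions (the actual algorithm also maintains separate backward-pass statistics with decay alpha_b, omitted here):

```python
import numpy as np

def online_norm_step(x, mu, var, alpha_f=0.99):
    """Normalize one incoming scalar sample with the current running
    estimates, then exponentially decay the estimates toward the sample."""
    y = (x - mu) / np.sqrt(var + 1e-5)                       # normalize first
    var = alpha_f * (var + (1 - alpha_f) * (x - mu) ** 2)    # then update variance
    mu = alpha_f * mu + (1 - alpha_f) * x                    # and the mean
    return y, mu, var
```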

General · Introduced 2000 · 3 papers

DCN-V2

DCN-V2 is an architecture for learning-to-rank that improves upon the original DCN model. It first learns explicit feature interactions of the inputs (typically the embedding layer) through cross layers, and then combines them with a deep network to learn complementary implicit interactions. The core of DCN-V2 is the cross layers, which inherit the simple structure of the cross network from DCN but are significantly more expressive at learning explicit and bounded-degree cross features.
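
The cross layer is compact enough to write out directly; a NumPy sketch where W and b stand for learned parameters:

```python
import numpy as np

def cross_layer(x0, xl, W, b):
    """One DCN-V2 cross layer: x_{l+1} = x0 * (W @ xl + b) + xl, i.e. an
    element-wise interaction of the input features x0 with a learned
    projection of the current layer, plus a residual connection."""
    return x0 * (W @ xl + b) + xl
```

Stacking l such layers produces feature crosses up to degree l + 1 of the original input.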

General · Introduced 2000 · 3 papers

Attention Free Transformer

Attention Free Transformer, or AFT, is an efficient variant of a multi-head attention module that eschews dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible with both large inputs and large model sizes. Given the input $X$, AFT first linearly transforms it into $Q = XW^{Q}$, $K = XW^{K}$, $V = XW^{V}$, then performs the following operation: $Y_t = \sigma_q(Q_t) \odot \frac{\sum_{t'=1}^{T} \exp(K_{t'} + w_{t,t'}) \odot V_{t'}}{\sum_{t'=1}^{T} \exp(K_{t'} + w_{t,t'})}$, where $\odot$ is the element-wise product, $\sigma_q$ is the nonlinearity applied to the query (sigmoid by default), and $w \in \mathbb{R}^{T \times T}$ is the learned pair-wise position biases. Explained in words, for each target position $t$, AFT performs a weighted average of values, the result of which is combined with the query by element-wise multiplication. In particular, the weighting is simply composed of the keys and a set of learned pair-wise position biases. This provides the immediate advantage of not needing to compute and store the expensive attention matrix, while maintaining the global interactions between query and values as MHA does.
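
A direct NumPy transcription of the AFT-full operation above (the weight matrices are omitted; Q, K, V are assumed to be already projected):

```python
import numpy as np

def aft_full(Q, K, V, w):
    """For each target position t, compute an element-wise weighted average
    of V with weights exp(K + position bias), gated by sigmoid(Q_t).
    Q, K, V: (T, d); w: (T, T) pair-wise position biases."""
    T, d = Q.shape
    Y = np.empty_like(V)
    for t in range(T):
        weights = np.exp(K + w[t][:, None])                  # (T, d)
        Y[t] = 1 / (1 + np.exp(-Q[t])) * (weights * V).sum(0) / weights.sum(0)
    return Y
```

Note that no T x T attention matrix over feature dimensions is materialized; memory stays linear in T and d.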

GeneralIntroduced 20003 papers

Parallel Layers

• Parallel Layers – We use a “parallel” formulation in each Transformer block (Wang & Komatsuzaki, 2021), rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as: y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))) Whereas the parallel formulation can be written as: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)) The parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.
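
The two formulations can be sketched with toy stand-ins for the sublayers (the scalar multipliers below are placeholders, not real Attention/MLP implementations):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(x):   # placeholder for multi-head attention
    return 0.5 * x

def mlp(x):         # placeholder for the feed-forward block
    return 0.1 * x

def serialized_block(x):
    # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
    h = x + attention(layer_norm(x))
    return h + mlp(layer_norm(h))

def parallel_block(x):
    # y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
    # a single LayerNorm(x) feeds both branches, so their input
    # matrix multiplications can be fused into one larger matmul
    h = layer_norm(x)
    return x + mlp(h) + attention(h)
```

The speedup comes from the shared LayerNorm output: both branches consume the same tensor, so their first matmuls can be fused.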

Natural Language ProcessingIntroduced 20003 papers

VATT

Video-Audio-Text Transformer, or VATT, is a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, it takes raw signals as inputs and extracts multidimensional representations that are rich enough to benefit a variety of downstream tasks. VATT borrows the exact architecture from BERT and ViT, except for the tokenization and linear projection layers, which are reserved for each modality separately. The design follows the same spirit as ViT in making minimal changes to the architecture, so that the learned model can transfer its weights to various frameworks and tasks. VATT linearly projects each modality into a feature vector and feeds it into a Transformer encoder. A semantically hierarchical common space is defined to account for the granularity of different modalities, and noise contrastive estimation is employed to train the model.

Computer VisionIntroduced 20003 papers

ShakeDrop

ShakeDrop regularization extends Shake-Shake regularization and can be applied not only to ResNeXt but also to ResNet, WideResNet, and PyramidNet. The proposed ShakeDrop is given as $G(x) = x + (b_l + \alpha - b_l \alpha) F(x)$, where $b_l$ is a Bernoulli random variable with probability given by the linear decay rule in each layer, and $\alpha$ and $\beta$ (the latter used in the backward pass) are independent uniform random variables drawn for each element. The most effective ranges of $\alpha$ and $\beta$ were experimentally found to be different from those of Shake-Shake: $\alpha \in [-1, 1]$ and $\beta \in [0, 1]$.
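
A training-time sketch of the forward pass (forward perturbation only; the backward-pass scaling by β and the test-time expected-value rule are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def shakedrop_forward(x, residual, p_l, alpha_range=(-1.0, 1.0)):
    """ShakeDrop forward pass (training-time sketch).

    x        : identity-branch input
    residual : F(x), the residual-branch output
    p_l      : per-layer probability from the linear decay rule
    Computes x + (b + alpha - b * alpha) * F(x) with b ~ Bernoulli(p_l)
    and alpha drawn uniformly per element."""
    b = rng.binomial(1, p_l)
    alpha = rng.uniform(*alpha_range, size=residual.shape)
    return x + (b + alpha - b * alpha) * residual
```

When b = 1 the coefficient collapses to 1 (an ordinary residual branch); when b = 0 the branch is "shaken" by the random alpha.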

GeneralIntroduced 20003 papers

Active Convolution

An Active Convolution is a type of convolution which does not have a fixed shape of receptive field, and can be used to take more diverse forms of receptive fields for convolutions. Its shape can be learned through backpropagation during training. It can be seen as a generalization of convolution; it can define not only all conventional convolutions, but also convolutions with fractional pixel coordinates. First, we can freely change the shape of the convolution, which provides greater freedom to form CNN structures. Second, the shape of the convolution is learned during training, and there is no need to tune it by hand.
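
A minimal sketch of the key mechanism: sampling the input at learnable fractional offsets via bilinear interpolation (function names are illustrative; coordinates are assumed to lie strictly inside the image):

```python
import numpy as np

def bilinear(img, y, x):
    """Sample img at fractional coordinates (y, x) by bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0]
            + (1 - dy) * dx * img[y0, x0 + 1]
            + dy * (1 - dx) * img[y0 + 1, x0]
            + dy * dx * img[y0 + 1, x0 + 1])

def active_conv_pixel(img, cy, cx, weights, offsets):
    """One output pixel of an active convolution: the receptive-field shape
    is given by learnable fractional offsets rather than a fixed grid."""
    return sum(w * bilinear(img, cy + oy, cx + ox)
               for w, (oy, ox) in zip(weights, offsets))
```

Because bilinear interpolation is differentiable in the coordinates, gradients flow to the offsets, which is what lets the receptive-field shape be learned by backpropagation.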

Computer VisionIntroduced 20003 papers

PowerSGD

PowerSGD is a distributed optimization technique that computes a low-rank approximation of the gradient using a generalized power iteration (known as subspace iteration). The approximation is computationally light-weight, avoiding any prohibitively expensive Singular Value Decomposition. To improve the quality of the efficient approximation, the authors warm-start the power iteration by reusing the approximation from the previous optimization step.
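A minimal single-worker sketch of the rank-r compression step (function name is illustrative; error feedback and the all-reduce across workers are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def powersgd_compress(M, Q):
    """One step of PowerSGD's rank-r compression of a gradient matrix M.

    Q is the (n, r) right factor reused from the previous optimization
    step (warm start). A single subspace iteration with a QR
    orthogonalization replaces an expensive SVD."""
    P = M @ Q                 # (m, r) left factor
    P, _ = np.linalg.qr(P)    # orthogonalize
    Q = M.T @ P               # updated right factor
    return P, Q               # decompressed gradient is P @ Q.T

m, n, r = 8, 6, 2
M = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))  # exactly rank-2 gradient
Q = rng.normal(size=(n, r))                            # cold start
for _ in range(3):                                     # warm-started iterations
    P, Q = powersgd_compress(M, Q)
approx = P @ Q.T
```

Only the small factors P and Q need to be communicated between workers, instead of the full m x n gradient.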

GeneralIntroduced 20003 papers

MPRNet

MPRNet is a multi-stage progressive image restoration architecture that progressively learns restoration functions for the degraded inputs, thereby breaking down the overall recovery process into more manageable steps. Specifically, the model first learns the contextualized features using encoder-decoder architectures and later combines them with a high-resolution branch that retains local information. At each stage, a per-pixel adaptive design is introduced that leverages in-situ supervised attention to reweight the local features.

Computer VisionIntroduced 20003 papers

SDNE

Structural Deep Network Embedding

GraphsIntroduced 20003 papers
Page 22 of 175