Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

8,725 machine learning methods and techniques

All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

MARLIN

General · Introduced 2000 · 7 papers

mLSTM

Multiplicative LSTM

A Multiplicative LSTM (mLSTM) is a recurrent neural network architecture for sequence modelling that combines the long short-term memory (LSTM) and multiplicative recurrent neural network (mRNN) architectures. The two architectures are combined by adding connections from the mRNN's intermediate state to each gating unit in the LSTM.
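As a sketch, one mLSTM step can be written in NumPy as follows; the weight names and shapes here are illustrative, not taken from any reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x, h, c, W):
    """One step of a multiplicative LSTM (illustrative sketch).

    The mRNN intermediate state m replaces the hidden state h as the
    recurrent input to every LSTM gate."""
    Wmx, Wmh, Wix, Wim, Wfx, Wfm, Wox, Wom, Wcx, Wcm = W
    m = (Wmx @ x) * (Wmh @ h)                   # multiplicative intermediate state
    i = sigmoid(Wix @ x + Wim @ m)              # input gate
    f = sigmoid(Wfx @ x + Wfm @ m)              # forget gate
    o = sigmoid(Wox @ x + Wom @ m)              # output gate
    c = f * c + i * np.tanh(Wcx @ x + Wcm @ m)  # cell update
    return o * np.tanh(c), c                    # new hidden and cell state
```

The key difference from a plain LSTM is the single line computing `m`: the element-wise product lets the recurrent transition matrix depend on the current input.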

Sequential · Introduced 2000 · 7 papers

KPE

Keypoint Pose Encoding

Computer Vision · Introduced 2000 · 7 papers

GraphCL

Graph contrastive learning with augmentations

Graphs · Introduced 2000 · 7 papers

Glow-TTS

Glow-TTS is a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech. The model is directly trained to maximize the log-likelihood of speech with the alignment. Enforcing hard monotonic alignments helps enable robust TTS, which generalizes to long utterances, and employing flows enables fast, diverse, and controllable speech synthesis.

Audio · Introduced 2000 · 7 papers

Conditional Positional Encoding

Conditional Positional Encoding, or CPE, is a type of positional encoding for vision transformers. Unlike previous fixed or learnable positional encodings, which are predefined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can generalize to input sequences longer than those the model has seen during training. CPE also keeps the desired translation invariance in image classification tasks. CPE can be implemented with a Position Encoding Generator (PEG) and incorporated into the current Transformer framework.

General · Introduced 2000 · 7 papers

ELR

Early Learning Regularization

General · Introduced 2000 · 7 papers

GCNII

GCNII is an extension of the Graph Convolutional Network (GCN) with two new techniques, initial residual and identity mapping, to tackle the problem of oversmoothing -- where stacking more layers and adding non-linearity tends to degrade performance. At each layer, the initial residual constructs a skip connection from the input layer, while the identity mapping adds an identity matrix to the weight matrix.

Graphs · Introduced 2000 · 7 papers

Spatial Attention-Guided Mask

A Spatial Attention-Guided Mask is a module for instance segmentation that predicts a segmentation mask on each detected box with a spatial attention map that helps to focus on informative pixels and suppress noise. The goal is to guide the mask head to spotlight meaningful pixels and repress uninformative ones. Once features inside the predicted RoIs are extracted by RoIAlign at 14×14 resolution, those features are fed into four conv layers and the spatial attention module (SAM) sequentially. To exploit the spatial attention map A_sag(X_i) as a feature descriptor given an input feature map X_i, the SAM first generates pooled features P_avg and P_max by average and max pooling operations respectively along the channel axis and aggregates them via concatenation. This is followed by a 3 × 3 conv layer and normalized by the sigmoid function. The computation process is summarized as: A_sag(X_i) = σ(F_3×3(P_max · P_avg)), where σ denotes the sigmoid function, F_3×3 is the 3 × 3 conv layer, and · represents the concatenation operation. Finally, the attention-guided feature map is computed as X_sag = A_sag(X_i) ⊗ X_i, where ⊗ denotes element-wise multiplication. After that, a 2 × 2 deconv upsamples the spatially attended feature map to 28 × 28 resolution. Lastly, a 1 × 1 conv is applied to predict class-specific masks.

General · Introduced 2000 · 7 papers

M2D

Masked Modeling Duo

Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encouraging the two networks in M2D to model the input. While M2D improves general-purpose audio representations, a specialized representation is essential for real-world applications, such as in industrial and medical domains. The often confidential and proprietary data in such domains is typically limited in size and has a different distribution from that in pre-training datasets. Therefore, we propose M2D for X (M2D-X), which extends M2D to enable the pre-training of specialized representations for an application X. M2D-X extends M2D with an additional learning task and a background-noise input. We make the additional task configurable to serve diverse applications, while the background noise helps learning on small data and forms a denoising task that makes the representation robust. With these design choices, M2D-X should learn a representation specialized to serve various application needs. Our experiments confirmed that the representations for general-purpose audio, specialized for the highly competitive AudioSet and speech domain, and a small-data medical task achieve top-level performance, demonstrating the potential of using our models as a universal audio pre-training framework.

General · Introduced 2000 · 7 papers

Self-Attention Guidance

Computer Vision · Introduced 2000 · 7 papers

MagFace

MagFace is a category of losses for face recognition that learn a universal feature embedding whose magnitude can measure the quality of a given face. Under the new loss, it can be proven that the magnitude of the feature embedding monotonically increases if the subject is more likely to be recognized. In addition, MagFace introduces an adaptive mechanism to learn a well-structured within-class feature distribution by pulling easy samples toward class centers while pushing hard samples away. This adaptive mechanism also helps prevent the model from overfitting on noisy and low-quality samples.

Computer Vision · Introduced 2000 · 7 papers

WFST

weighted finite state transducer

General · Introduced 2000 · 7 papers

TWEC

Temporal Word Embeddings with a Compass

TWEC is an efficient method for generating temporal word embeddings, based on a simple heuristic: first train an atemporal word embedding, the compass, then use this embedding to freeze one of the layers of the CBOW architecture. The partially frozen architecture is then used to train time-specific slices that are all comparable after training.

Natural Language Processing · Introduced 2000 · 7 papers

DiffPool

DiffPool is a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer. Description and image from: Hierarchical Graph Representation Learning with Differentiable Pooling
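A minimal NumPy sketch of one DiffPool coarsening step follows; for brevity the two GNNs of the method are each collapsed here to a single graph-convolution layer `A @ X @ W`, and all names are illustrative:

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diffpool(A, X, W_assign, W_embed):
    """One DiffPool coarsening step (sketch).

    A: (n, n) adjacency, X: (n, f) node features.
    Returns the coarsened adjacency (k, k) and features (k, d)."""
    S = softmax(A @ X @ W_assign, axis=1)  # (n, k) soft cluster assignment
    Z = np.tanh(A @ X @ W_embed)           # (n, d) node embeddings
    X_next = S.T @ Z                       # (k, d) coarsened features
    A_next = S.T @ A @ S                   # (k, k) coarsened adjacency
    return A_next, X_next
```

Because every operation above is differentiable, the assignment matrix S can be learned end-to-end together with the GNN layers before and after the pooling step.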

Graphs · Introduced 2000 · 7 papers

WaveGrad

WaveGrad is a conditional model for waveform generation through estimating gradients of the data density. This model is built on the prior work on score matching and diffusion probabilistic models. It starts from Gaussian white noise and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad is non-autoregressive, and requires only a constant number of generation steps during inference. It can use as few as 6 iterations to generate high fidelity audio samples.

Audio · Introduced 2000 · 7 papers

ENIGMA

ENIGMA is an evaluation framework for dialog systems based on Pearson and Spearman's rank correlations between the estimated rewards and the true rewards. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data, which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors.

Natural Language Processing · Introduced 2000 · 7 papers

Wide&Deep

Wide&Deep jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for real-world recommender systems. In summary, the wide component is a generalized linear model. The deep component is a feed-forward neural network. The deep and wide components are combined using a weighted sum of their output log odds as the prediction. This is then fed to a logistic loss function for joint training, which is done by back-propagating the gradients from the output to both the wide and deep parts of the model simultaneously using mini-batch stochastic optimization; in the original paper, the wide part is optimized with FTRL with L1 regularization and the deep part with AdaGrad. The combined model is illustrated in the figure (center).
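The combination of the two components can be sketched as follows; this is a minimal NumPy illustration assuming a dense wide feature vector and a small ReLU MLP, with all shapes and names illustrative:

```python
import numpy as np

def wide_deep_predict(x_wide, x_deep, w_wide, deep_weights):
    """Sum the wide model's logit and the deep model's logit, then
    apply a sigmoid -- the 'weighted sum of output log odds' step."""
    wide_logit = x_wide @ w_wide               # generalized linear model
    h = x_deep
    for W in deep_weights[:-1]:
        h = np.maximum(h @ W, 0.0)             # ReLU hidden layers
    deep_logit = h @ deep_weights[-1]          # final deep logit
    return 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit)))
```

Because the sum happens in logit space, a single logistic loss on the output trains both parts jointly.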

General · Introduced 2000 · 7 papers

ReZero

ReZero is a normalization approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer, a residual connection is introduced for the input signal x_i together with one trainable parameter α_i that modulates the non-trivial transformation F of the layer: x_{i+1} = x_i + α_i F(x_i), where α_i = 0 at the beginning of training. Initially the gradients for all parameters defining F vanish, but they dynamically evolve to suitable values during the initial stages of training. The architecture is illustrated in the Figure.
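The idea fits in a few lines; this sketch wraps an arbitrary transformation F with the ReZero residual form (the class name is illustrative):

```python
import numpy as np

class ReZeroBlock:
    """Residual block computing x + alpha * F(x), with the trainable
    scalar alpha initialized to zero so the block starts as the
    identity map."""
    def __init__(self, F):
        self.F = F
        self.alpha = 0.0   # learned during training; zero at initialization

    def __call__(self, x):
        return x + self.alpha * self.F(x)
```

At initialization an arbitrarily deep stack of such blocks is exactly the identity, which is what makes signal propagation well behaved regardless of depth.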

General · Introduced 2000 · 7 papers

Mixer Layer

MLP-Mixer Layer

A Mixer layer is a layer used in the MLP-Mixer architecture proposed by Tolstikhin et al. (2021) for computer vision. Mixer layers consist purely of MLPs, without convolutions or attention. A Mixer layer takes an input of embedded image patches (tokens), with its output having the same shape as its input, similar to a Vision Transformer encoder. As suggested by its name, a Mixer layer "mixes" tokens and channels through the "token mixing" and "channel mixing" MLPs contained in the layer. It utilizes techniques from previous architectures, such as layer normalization, skip-connections, and regularization methods. Image credit: Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., ... & Dosovitskiy, A. (2021). MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34, 24261-24272.
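One Mixer layer can be sketched in NumPy as below; the weight shapes are illustrative and the two MLPs are written with a single hidden layer each, as in the paper:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-6)

def mixer_layer(X, Wt1, Wt2, Wc1, Wc2):
    """One Mixer layer on X of shape (tokens, channels).

    Token mixing applies an MLP along the token axis (shared across
    channels); channel mixing applies an MLP along the channel axis
    (shared across tokens). Both use pre-LayerNorm and skip connections."""
    Y = X + Wt2 @ gelu(Wt1 @ layer_norm(X))       # token mixing
    return Y + gelu(layer_norm(Y) @ Wc1) @ Wc2    # channel mixing
```

Note that the output shape equals the input shape, so Mixer layers can be stacked like Transformer encoder blocks.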

Computer Vision · Introduced 2000 · 7 papers

DFA (Random Walk)

Detrended fluctuation analysis

In stochastic processes, chaos theory and time series analysis, detrended fluctuation analysis (DFA) is a method for determining the statistical self-affinity of a signal. It is useful for analysing time series that appear to be long-memory processes (diverging correlation time, e.g. power-law decaying autocorrelation function) or 1/f noise. The obtained exponent is similar to the Hurst exponent, except that DFA may also be applied to signals whose underlying statistics (such as mean and variance) or dynamics are non-stationary (changing with time). It is related to measures based upon spectral techniques such as autocorrelation and Fourier transform.
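The standard procedure (integrate, detrend in windows, fit the log-log slope) can be implemented compactly; this is a sketch with order-1 (linear) detrending and non-overlapping windows:

```python
import numpy as np

def dfa_exponent(x, window_sizes):
    """Order-1 detrended fluctuation analysis (sketch).

    Returns the scaling exponent: the slope of log F(n) vs log n,
    where F(n) is the RMS fluctuation at window size n."""
    y = np.cumsum(x - np.mean(x))                  # integrated profile
    F = []
    for n in window_sizes:
        t = np.arange(n)
        sq = []
        for i in range(len(y) // n):               # non-overlapping windows
            seg = y[i * n:(i + 1) * n]
            trend = np.polyval(np.polyfit(t, seg, 1), t)  # linear detrend
            sq.append(np.mean((seg - trend) ** 2))
        F.append(np.sqrt(np.mean(sq)))
    return np.polyfit(np.log(window_sizes), np.log(F), 1)[0]
```

For uncorrelated white noise the exponent is close to 0.5, analogous to the Hurst exponent; long-memory processes give values above 0.5.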

Sequential · Introduced 2000 · 7 papers

WaveGrad DBlock

WaveGrad DBlocks are used to downsample the temporal dimension of noisy waveform in WaveGrad. They are similar to UBlocks except that only one residual block is included. The dilation factors are 1, 2, 4 in the main branch. Orthogonal initialization is used.

Audio · Introduced 2000 · 7 papers

Bridge-net

Bridge-net is an audio model block used in the ClariNet text-to-speech architecture. Bridge-net maps frame-level hidden representation to sample-level through several convolution blocks and transposed convolution layers interleaved with softsign non-linearities.

Audio · Introduced 2000 · 7 papers

gMLP

gMLP is an MLP-based alternative to Transformers without self-attention, which simply consists of channel projections and spatial projections with static parameterization. It is built out of basic MLP layers with gating. The model consists of a stack of blocks with identical size and structure. Let X ∈ R^{n×d} be the token representations with sequence length n and dimension d. Each block is defined as: Z = σ(XU), Z̃ = s(Z), Y = Z̃V, where σ is an activation function such as GeLU. U and V define linear projections along the channel dimension - the same as those in the FFNs of Transformers (e.g., their shapes are d × d_ffn and d_ffn × d). A key ingredient is s(·), a layer which captures spatial interactions. When s is an identity mapping, the above transformation degenerates to a regular FFN, where individual tokens are processed independently without any cross-token communication. One of the major focuses is therefore to design a good s(·) capable of capturing complex spatial interactions across tokens. This leads to the use of a Spatial Gating Unit which involves a modified linear gating. The overall block layout is inspired by inverted bottlenecks, which define s(·) as a spatial depthwise convolution. Note, unlike Transformers, gMLP does not require position embeddings because such information will be captured in s(·).
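A minimal NumPy sketch of one gMLP block with a Spatial Gating Unit follows; the parameter names are illustrative:

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_block(X, U, V, W_spatial, b_spatial):
    """One gMLP block on X of shape (n, d).

    The Spatial Gating Unit splits the expanded representation in two
    along channels and gates one half with a linear projection of the
    other along the token axis. In the paper W_spatial is initialized
    near zero and b_spatial to one, so the unit starts near identity."""
    Z = gelu(X @ U)                        # (n, d_ffn) channel expansion
    Z1, Z2 = np.split(Z, 2, axis=-1)       # split for the gating unit
    gate = W_spatial @ Z2 + b_spatial      # spatial (token-axis) projection
    return X + (Z1 * gate) @ V             # gated output + residual
```

With `W_spatial` initialized to zeros and `b_spatial` to one, the block reduces to a plain FFN with a residual connection, which is the intended starting point of training.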

Computer Vision · Introduced 2000 · 7 papers

SPLASH

Simple Piecewise Linear and Adaptive with Symmetric Hinges

General · Introduced 2000 · 7 papers

Fast AutoAugment

Fast AutoAugment is an image data augmentation algorithm that finds effective augmentation policies via a search strategy based on density matching, motivated by Bayesian DA. The strategy is to improve the generalization performance of a given network by learning the augmentation policies which treat augmented data as missing data points of training data. However, different from Bayesian DA, the proposed method recovers those missing data points by the exploitation-and-exploration of a family of inference-time augmentations via Bayesian optimization in the policy search phase. This is realized by using an efficient density matching algorithm that does not require any back-propagation for network training for each policy evaluation.

Computer Vision · Introduced 2000 · 7 papers

FSAF

FSAF, or Feature Selective Anchor-Free, is a building block for single-shot object detectors. It can be plugged into single-shot detectors with a feature pyramid structure. The FSAF module addresses two limitations brought up by conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, each instance is dynamically assigned to the most suitable feature level. At the time of inference, the FSAF module can work jointly with anchor-based branches by outputting predictions in parallel. This concept is instantiated with simple implementations of anchor-free branches and an online feature selection strategy.

The general concept is presented in the Figure to the right. An anchor-free branch is built per level of the feature pyramid, independent of the anchor-based branch. Similar to the anchor-based branch, it consists of a classification subnet and a regression subnet (not shown in the figure). An instance can be assigned to an arbitrary level of the anchor-free branch. During training, the most suitable level of feature is dynamically selected for each instance based on the instance content instead of just the size of the instance box. The selected level of feature then learns to detect the assigned instances. At inference, the FSAF module can run independently or jointly with anchor-based branches. The FSAF module is agnostic to the backbone network and can be applied to single-shot detectors with a feature pyramid structure. Additionally, the instantiation of the anchor-free branches and online feature selection can take various forms.

Computer Vision · Introduced 2000 · 7 papers

ScaleNet

ScaleNet, or a Scale Aggregation Network, is a type of convolutional neural network which learns a neuron allocation for aggregating multi-scale information in different building blocks of a deep network. The most informative output neurons in each block are preserved while others are discarded, and thus neurons for multiple scales are competitively and adaptively allocated. The scale aggregation (SA) block concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, convolution and upsampling operations.

Computer Vision · Introduced 2000 · 7 papers

BLANC

BLANC is an automatic estimation approach for document summary quality. The goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. BLANC achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document's text.

Natural Language Processing · Introduced 2000 · 7 papers

DistDGL

DistDGL is a system for training GNNs in a mini-batch fashion on a cluster of machines. It is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality and light-weight mincut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces the communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability.

General · Introduced 2000 · 7 papers

Submanifold Convolution

Submanifold Convolution (SC) is a spatially sparse convolution operation used for tasks with sparse data like semantic segmentation of 3D point clouds. An SC convolution with filter size f computes the set of active sites in the same way as a regular convolution: it looks for the presence of any active sites in its receptive field of size f^d. If the input has size ℓ then the output will have size ℓ − f + 1 (for stride 1). Unlike a regular convolution, an SC convolution discards the ground state for non-active sites by assuming that the input from those sites is zero. For more details see the paper or the official code.

Computer Vision · Introduced 2000 · 7 papers

Scale Aggregation Block

A Scale Aggregation Block concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, convolution and upsampling operations. The proposed scale aggregation block is a standard computational module which readily replaces any given transformation T: X → Y, with X and Y being the input and output feature maps with C and C' channels respectively. T is any operator such as a convolution layer or a series of convolution layers. Assume we have S scales. Each scale s is generated by sequentially conducting a downsampling D_s, a transformation T_s and an upsampling operator U_s: Y_s = U_s(T_s(D_s(X))), for s = 1, …, S. Notably, T_s has a similar structure to T. We can concatenate all scales together, getting Y' = concat(Y_1, …, Y_S), where concat indicates concatenating feature maps along the channel dimension, and Y' is the final output feature map of the scale aggregation block. In the reference implementation, the downsampling D_s with factor r_s is implemented by a max pool layer with kernel size r_s and stride r_s. The upsampling is implemented by resizing with nearest neighbor interpolation.
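The downsample-transform-upsample-concatenate pipeline can be sketched in one dimension as below; this is a toy NumPy illustration (names and the identity transform are illustrative), assuming the signal length is divisible by every scale factor:

```python
import numpy as np

def scale_aggregation_1d(x, scales, transform):
    """1-D sketch of a scale aggregation block.

    For each scale factor r: max-pool with kernel and stride r,
    apply the transform, upsample back by nearest-neighbour repetition,
    then stack all scales as channels."""
    outs = []
    for r in scales:
        pooled = x.reshape(-1, r).max(axis=1)   # downsample by factor r
        up = np.repeat(transform(pooled), r)    # nearest-neighbour upsample
        outs.append(up)
    return np.stack(outs)                       # (len(scales), len(x))
```

Each row of the output corresponds to one scale, matching the channel-wise concatenation described above.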

Computer Vision · Introduced 2000 · 7 papers

VQ-VAE-2

VQ-VAE-2 is a type of variational autoencoder that combines a two-level hierarchical VQ-VAE with a self-attention autoregressive model (PixelCNN) as a prior. The encoder and decoder architectures are kept simple and light-weight as in the original VQ-VAE, with the only difference that hierarchical multi-scale latent maps are used for increased resolution.

Computer Vision · Introduced 2000 · 7 papers

FairMOT

FairMOT is a model for multi-object tracking which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The fairness between the two tasks enables high levels of detection and tracking accuracy. The detection branch is implemented in an anchor-free style which estimates object centers and sizes represented as position-aware measurement maps. Similarly, the re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at the pixel. Note that the two branches are completely homogeneous, which essentially differs from previous methods that perform detection and re-ID in a cascaded style. It is also worth noting that FairMOT operates on high-resolution feature maps of stride four while previous anchor-based methods operate on feature maps of stride 32. The elimination of anchors as well as the use of high-resolution feature maps better aligns re-ID features to object centers, which significantly improves the tracking accuracy.

Computer Vision · Introduced 2000 · 7 papers

CornerNet

CornerNet is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolutional neural network. By detecting objects as paired keypoints, it eliminates the need for designing a set of anchor boxes commonly used in prior single-stage detectors. It also utilises corner pooling, a new type of pooling layer that helps the network better localize corners.
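Corner pooling for the top-left corner can be sketched with running maxima; this NumPy version is a simplification of the idea (a single-channel 2-D map, no learned parameters):

```python
import numpy as np

def top_left_corner_pool(F):
    """Top-left corner pooling (sketch).

    At each location, take the running max over everything to its right
    and everything below it, then add the two maps, so responses peak
    where a top-left corner of an object would be."""
    right = np.flip(np.maximum.accumulate(np.flip(F, 1), axis=1), 1)
    below = np.flip(np.maximum.accumulate(np.flip(F, 0), axis=0), 0)
    return right + below
```

The bottom-right variant is symmetric: it scans leftwards and upwards instead.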

Computer Vision · Introduced 2000 · 7 papers

ClariNet

ClariNet is an end-to-end text-to-speech architecture. Unlike previous TTS systems which use text-to-spectrogram models with a separate waveform synthesizer (vocoder), ClariNet is a text-to-wave architecture that is fully convolutional and can be trained from scratch. In ClariNet, the WaveNet module is conditioned on the hidden states instead of the mel-spectrogram. The architecture is otherwise based on Deep Voice 3.

Sequential · Introduced 2000 · 7 papers

GPipe

GPipe is a distributed model parallel method for neural networks. With GPipe, each model can be specified as a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is then placed on a separate accelerator. Based on this partitioned setup, batch splitting is applied. A mini-batch of training examples is split into smaller micro-batches, then the execution of each set of micro-batches is pipelined over cells. Synchronous mini-batch gradient descent is applied for training, where gradients are accumulated across all micro-batches in a mini-batch and applied at the end of a mini-batch.
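The batch-splitting step can be illustrated without any accelerators: gradients computed per micro-batch and accumulated before a single update are exactly equivalent to a full mini-batch step. This NumPy sketch uses a least-squares model for simplicity and omits the pipelining of micro-batches across cells:

```python
import numpy as np

def microbatch_step(X, y, w, lr=0.1, n_micro=4):
    """Synchronous micro-batch gradient step (sketch).

    Gradients are accumulated across all micro-batches of the
    mini-batch and applied once at the end, as in GPipe's synchronous
    mini-batch gradient descent."""
    grad = np.zeros_like(w)
    for Xm, ym in zip(np.array_split(X, n_micro), np.array_split(y, n_micro)):
        grad += Xm.T @ (Xm @ w - ym)     # accumulate least-squares gradients
    return w - lr * grad / len(X)
```

This equivalence is why GPipe can pipeline micro-batches across accelerators without changing the optimization result of synchronous training.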

General · Introduced 2000 · 7 papers

Grab

Grab is a sensor processing system for cashier-free shopping. Grab needs to accurately identify and track customers, and associate each shopper with items he or she retrieves from shelves. To do this, it uses a keypoint-based pose tracker as a building block for identification and tracking, develops robust feature-based face trackers, and algorithms for associating and tracking arm movements. It also uses a probabilistic framework to fuse readings from camera, weight and RFID sensors in order to accurately assess which shopper picks up which item.

Computer Vision · Introduced 2000 · 6 papers

PLATO-2

Natural Language Processing · Introduced 2000 · 6 papers

CLRNet

Convolutional LSTM based Residual Network

Computer Vision · Introduced 2000 · 6 papers

UORO

Unbiased Online Recurrent Optimization

Sequential · Introduced 2000 · 6 papers

FractalNet

FractalNet is a type of convolutional neural network that eschews residual connections in favour of a "fractal" design. It involves repeated application of a simple expansion rule to generate deep networks whose structural layouts are precisely truncated fractals. These networks contain interacting subpaths of different lengths, but do not include any pass-through or residual connections; every internal signal is transformed by a filter and nonlinearity before being seen by subsequent layers.
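The expansion rule can be written recursively; this sketch treats the layer and join operations as plain functions (the paper joins sub-paths by element-wise averaging, which is used as the default here):

```python
def fractal(x, C, layer, join=lambda a, b: 0.5 * (a + b)):
    """FractalNet expansion rule (sketch):
    f_1(x) = layer(x)
    f_{C+1}(x) = join(f_C(f_C(x)), layer(x))
    so every signal passes through a layer before the next join."""
    if C == 1:
        return layer(x)
    return join(fractal(fractal(x, C - 1, layer), C - 1, layer), layer(x))
```

With `layer` standing in for a conv + nonlinearity, the recursion produces interacting sub-paths of depth 1 up to 2^(C-1), none of which is a pass-through.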

Computer Vision · Introduced 2000 · 6 papers

CR-NET

CR-NET is a YOLO-based model proposed for license plate character detection and recognition.

Computer Vision · Introduced 2000 · 6 papers

KnowPrompt

KnowPrompt is a prompt-tuning approach for relational understanding. It injects entity and relation knowledge into prompt construction with learnable virtual template words as well as answer words, and synergistically optimizes their representation with knowledge constraints. To be specific, a TYPED MARKER initialized with aggregated entity-type embeddings is placed around entities as learnable virtual template words to inject entity-type knowledge. The average embeddings of the tokens in each relation label are leveraged as virtual answer words to inject relation knowledge. Since there exist implicit structural constraints among entities and relations, and virtual words should be consistent with the surrounding contexts, synergistic optimization is introduced to obtain optimized virtual template and answer words. Concretely, a context-aware prompt calibration method with implicit structural constraints is used to inject structural knowledge implications among relational triples and associate prompt embeddings with each other.

General · Introduced 2000 · 6 papers

GBO

Gradient-based optimization

GBO is a metaheuristic optimization algorithm. Inspired by the gradient-based Newton's method, the GBO uses two main operators, the gradient search rule (GSR) and the local escaping operator (LEO), together with a set of vectors to explore the search space. The GSR employs the gradient-based method to enhance the exploration tendency and accelerate the convergence rate to achieve better positions in the search space. The LEO enables the proposed GBO to escape from local optima. The performance of the new algorithm was evaluated in two phases. 28 mathematical test functions were first used to evaluate various characteristics of the GBO, and then six engineering problems were optimized by the GBO. In the first phase, the GBO was compared with five existing optimization algorithms, indicating that the GBO yielded very promising results due to its enhanced capabilities of exploration, exploitation, convergence, and effective avoidance of local optima. The second phase also demonstrated the superior performance of the GBO in solving complex real-world engineering problems. The source codes of GBO are publicly available at https://imanahmadianfar.com/codes/.

General · Introduced 2000 · 6 papers

K-Net

K-Net is a framework for unified semantic and instance segmentation that segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. It begins with a set of randomly initialized kernels, and learns the kernels in accordance with the segmentation targets at hand, namely, semantic kernels for semantic categories and instance kernels for instance identities. A simple combination of semantic kernels and instance kernels allows panoptic segmentation naturally. In the forward pass, the kernels perform convolution on the image features to obtain the corresponding segmentation predictions. K-Net is formulated so that it dynamically updates the kernels to make them conditional on their activations on the image. Such a content-aware mechanism is crucial to ensure that each kernel, especially an instance kernel, responds accurately to varying objects in an image. Through applying this adaptive kernel update strategy iteratively, K-Net significantly improves the discriminative ability of the kernels and boosts the final segmentation performance. It is noteworthy that this strategy universally applies to kernels for all the segmentation tasks. K-Net also utilises a bipartite matching strategy to assign learning targets to each kernel. This training approach is advantageous over conventional training strategies as it builds a one-to-one mapping between kernels and instances in an image. It thus resolves the problem of dealing with a varying number of instances in an image. In addition, it is purely mask-driven without involving boxes. Hence, K-Net is naturally NMS-free and box-free, which is appealing to real-time applications.

Computer Vision · Introduced 2000 · 6 papers

NT-ASGD

Non-monotonically Triggered ASGD

NT-ASGD, or Non-monotonically Triggered ASGD, is an averaged stochastic gradient descent technique. In regular ASGD, we take steps identical to regular SGD, but instead of returning the last iterate as the solution, we return the average (1/(K − T + 1)) Σ_{i=T}^{K} w_i, where w_i is the iterate at step i, K is the total number of iterations, and T is a user-specified averaging trigger. NT-ASGD instead uses a non-monotonic criterion that conservatively triggers the averaging when the validation metric fails to improve for multiple cycles. Given that the choice of triggering is irreversible, this conservatism ensures that the randomness of training does not play a major role in the decision.
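One plausible form of the non-monotonic criterion can be sketched as follows; the exact rule in reference implementations may differ in its bookkeeping, so treat this as an illustration:

```python
def averaging_trigger(val_losses, n=5):
    """Non-monotonic averaging trigger (sketch).

    Return the first epoch whose validation loss is worse than the best
    loss achieved at least n epochs earlier, i.e. the metric has failed
    to improve for multiple cycles; return None if never triggered."""
    for t in range(n, len(val_losses)):
        if val_losses[t] > min(val_losses[:t - n + 1]):
            return t
    return None
```

Once the trigger fires, all subsequent iterates are averaged into the returned solution; because the decision is irreversible, the lookback window `n` keeps a single noisy epoch from firing it prematurely.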

General · Introduced 2000 · 6 papers

Mixed Attention Block

Mixed Attention Block is an attention module used in the ConvBERT architecture. It is a mixture of self-attention and span-based dynamic convolution (highlighted in pink in the figure). The two share the same Query but use different Keys to generate the attention map and the convolution kernel respectively. The number of attention heads is reduced by directly projecting the input to a smaller embedding space, forming a bottleneck structure for both self-attention and span-based dynamic convolution. Dimensions of the input and output of some blocks are labeled in the top-left corner of the figure to illustrate the overall framework, where d is the embedding size of the input and γ is the reduction ratio.

General · Introduced 2000 · 6 papers

TLA

Temporally Layered Architecture

Reinforcement Learning · Introduced 2000 · 6 papers

Single-Headed Attention

Single-Headed Attention is a single-headed attention module used in the SHA-RNN language model. The principal design reasons for single-headedness were simplicity (avoiding running out of memory) and scepticism about the benefits of using multiple heads.

General · Introduced 2000 · 6 papers
Page 16 of 175