Spatial Gating Unit, or SGU, is a gating unit used in the gMLP architecture to capture spatial interactions. To enable cross-token interactions, the layer must contain a contraction operation over the spatial dimension. The layer is formulated as the output of linear gating: $s(Z) = Z \odot f_{W,b}(Z)$, where $f_{W,b}(Z) = WZ + b$ is a linear projection along the spatial (token) dimension and $\odot$ denotes element-wise multiplication. For training stability, the authors find it critical to initialize $W$ with near-zero values and $b$ as ones, meaning that $f_{W,b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. This initialization ensures each gMLP block behaves like a regular FFN at the early stage of training, where each token is processed independently, and only gradually injects spatial information across tokens during the course of learning. The authors find it further effective to split $Z$ into two independent parts $(Z_1, Z_2)$ along the channel dimension, using $Z_2$ for the gating function and $Z_1$ for the multiplicative bypass: $s(Z) = Z_1 \odot f_{W,b}(Z_2)$. They also normalize the input to $f_{W,b}$, which empirically improved the stability of large NLP models.
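As a rough illustration, the split-and-gate computation can be sketched in NumPy; the shapes and the near-identity initialization ($W \approx 0$, $b = 1$) follow the description above, and the specific dimensions are made up for the example:

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """Minimal SGU sketch. Z has shape (n_tokens, d_channels).
    Split channels into a bypass half Z1 and a gating half Z2, apply a
    linear projection over the *spatial* (token) dimension to Z2, and
    multiply element-wise."""
    Z1, Z2 = np.split(Z, 2, axis=-1)   # (n, d/2) each
    gate = W @ Z2 + b[:, None]         # spatial projection, (n, d/2)
    return Z1 * gate

n, d = 4, 6
Z = np.random.randn(n, d)
# Near-zero spatial weights and unit bias: the gate is ~1, so the SGU
# output is approximately Z1 (identity on the bypass half).
W = np.zeros((n, n))
b = np.ones(n)
out = spatial_gating_unit(Z, W, b)
```

With exactly zero weights the gate is exactly one, so the block reduces to the bypass path, matching the "behaves like a regular FFN at the start of training" property.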
TabTransformer is a deep tabular data modeling architecture for supervised and semi-supervised learning. The TabTransformer is built upon self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. As an overview, the architecture comprises a column embedding layer, a stack of Transformer layers, and a multi-layer perceptron (MLP). The contextual embeddings (output by the Transformer layers) are concatenated with the continuous features, and the result is fed to the MLP. The loss function is then minimized to learn all the parameters in an end-to-end manner.
Varifocal Loss is a loss function for training a dense object detector to predict the IoU-aware classification score (IACS), inspired by the focal loss. Unlike the focal loss, which deals with positives and negatives equally, Varifocal Loss treats them asymmetrically: \begin{align} \text{VFL}(p, q) = \begin{cases} -q\left(q\log(p) + (1-q)\log(1-p)\right) & q > 0 \\ -\alpha p^{\gamma}\log(1-p) & q = 0 \end{cases} \end{align} where $p$ is the predicted IACS and $q$ is the target IoU score. For a positive training example, $q$ is set to the IoU between the generated bounding box and the ground-truth one (gt IoU), whereas for a negative training example, the training target $q$ for all classes is 0.
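A minimal NumPy sketch of the asymmetric treatment above (the example predictions and targets are made up; $\alpha = 0.75$, $\gamma = 2$ are common choices from the paper):

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Sketch of Varifocal Loss for a vector of predictions.
    p: predicted IACS in (0, 1); q: target IoU score (0 for negatives).
    Positives use a BCE weighted by the target q itself; negatives are
    down-weighted by the focal factor alpha * p**gamma."""
    pos = q > 0
    return np.where(
        pos,
        -q * (q * np.log(p) + (1 - q) * np.log(1 - p)),
        -alpha * p**gamma * np.log(1 - p),
    )

p = np.array([0.8, 0.2, 0.1])
q = np.array([0.9, 0.0, 0.0])  # one positive with gt IoU 0.9, two negatives
loss = varifocal_loss(p, q)
```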
STAC is a semi-supervised framework for visual object detection along with a data augmentation strategy. STAC deploys highly confident pseudo labels of localized objects from an unlabeled image and updates the model by enforcing consistency via strong augmentations. Pseudo labels (i.e., bounding boxes and their class labels) are generated for unlabeled data using test-time inference, including NMS, of the teacher model trained with labeled data. An unsupervised loss is then computed with respect to pseudo labels whose confidence scores are above a threshold $\tau$. The strong augmentations are applied for augmentation consistency during model training. Target boxes are augmented when global geometric transformations are used.
Denoised Smoothing is a method for obtaining a provably robust classifier from a fixed pretrained one, without any additional training or fine-tuning of the latter. The basic idea is to prepend a custom-trained denoiser $\mathcal{D}$ before the pretrained classifier $f$, and then apply randomized smoothing. Randomized smoothing is a certified defense that converts any given classifier $f$ into a new smoothed classifier $g$ that is characterized by a non-linear Lipschitz property. When queried at a point $x$, the smoothed classifier $g$ outputs the class that is most likely to be returned by $f$ under isotropic Gaussian perturbations of its inputs. Unfortunately, randomized smoothing requires that the underlying classifier is robust to relatively large random Gaussian perturbations of the input, which is not the case for off-the-shelf pretrained models. Applying the custom-trained denoiser $\mathcal{D}$ before the classifier $f$ effectively makes $f$ robust to such Gaussian perturbations, thereby making it "suitable" for randomized smoothing.
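The randomized-smoothing step that Denoised Smoothing builds on can be sketched as a Monte Carlo majority vote under Gaussian noise (a certified implementation would additionally compute a statistical confidence bound; the toy 1-D classifier here is purely illustrative):

```python
import numpy as np

def smoothed_predict(classifier, x, sigma=0.25, n_samples=1000, seed=0):
    """Sketch of a smoothed classifier g: return the class most frequently
    predicted by the base classifier f under isotropic Gaussian noise on x."""
    rng = np.random.default_rng(seed)
    votes = {}
    for _ in range(n_samples):
        noisy = x + sigma * rng.standard_normal(x.shape)
        c = classifier(noisy)
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)

# Toy 1-D base classifier: sign threshold at 0.
base = lambda z: int(z[0] > 0.0)
x = np.array([0.5])
pred = smoothed_predict(base, x)
```

In Denoised Smoothing, `classifier` would be the composition of the denoiser and the fixed pretrained model, so the base model only ever sees denoised inputs.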
Positional Encoding Generator, or PEG, is the module used to implement Conditional Positional Encoding (CPE). It dynamically produces the positional encodings conditioned on the local neighborhood of an input token. To condition on the local neighbors, the flattened input sequence $X \in \mathbb{R}^{B \times N \times C}$ of DeiT is first reshaped back to $X' \in \mathbb{R}^{B \times H \times W \times C}$ in the 2-D image space. Then, a function (denoted by $\mathcal{F}$ in the Figure) is repeatedly applied to the local patches in $X'$ to produce the conditional positional encodings. PEG can be efficiently implemented with a 2-D convolution with kernel size $k$ ($k \geq 3$) and $\frac{k-1}{2}$ zero paddings. Note that the zero paddings here are important to make the model aware of the absolute positions, and $\mathcal{F}$ can take various forms, such as separable convolutions and many others.
Gated Linear Network
A Gated Linear Network, or GLN, is a type of backpropagation-free neural architecture. What distinguishes GLNs from contemporary neural networks is the distributed and local nature of their credit assignment mechanism; each neuron directly predicts the target, forgoing the ability to learn feature representations in favor of rapid online learning. Individual neurons can model nonlinear functions via the use of data-dependent gating in conjunction with online convex optimization. GLNs are feedforward networks composed of many layers of gated geometric mixing neurons, as shown in the Figure. Each neuron in a given layer outputs a gated geometric mixture of the predictions from the previous layer, with the final layer consisting of just a single neuron. In a supervised learning setting, a GLN is trained on (side information, base predictions, label) triplets derived from input-label pairs. There are two types of input to neurons in the network: the first is the side information, which can be thought of as the input features; the second is the input to the neuron, which will be the predictions output by the previous layer, or, in the case of layer 0, some (optionally) provided base predictions that typically will be a function of the side information. Each neuron will also take in a constant bias prediction, which helps empirically and is essential for universality guarantees. Weights are learnt in a Gated Linear Network using Online Gradient Descent (OGD) locally at each neuron. The key observation is that, as each neuron in the subsequent layers is itself a gated geometric mixture, all of these neurons can be thought of as individually predicting the target. Given side information, each neuron suffers the logarithmic loss of its own prediction, which is convex in its active weights.
Gradual self-training is a method for domain adaptation under gradual distribution shift. The goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. This comes up, for example, in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces, where machine learning systems must adapt to data distributions that evolve over time. The gradual self-training algorithm begins with a classifier trained on labeled examples from the source domain (Figure a). For each successive domain, the algorithm generates pseudolabels for unlabeled examples from that domain, and then trains a regularized supervised classifier on the pseudolabeled examples. The intuition, visualized in the Figure, is that after a single gradual shift most examples are pseudolabeled correctly, so self-training learns a good classifier on the shifted data, whereas the shift directly from the source to the target can be too large for self-training to correct.
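The loop described above can be sketched in a few lines; the threshold classifier, the drifting Gaussian domains, and the `fit`/`predict` helpers are all invented for the example (the method itself is classifier-agnostic):

```python
import numpy as np

def pseudolabel_self_train(fit, predict, clf, domains):
    """Sketch of gradual self-training: adapt `clf` through a sequence of
    unlabeled domains by pseudolabeling each one with the current model
    and refitting a supervised classifier on those labels."""
    for X in domains:               # domains ordered by gradual shift
        y_pseudo = predict(clf, X)  # label the shifted data
        clf = fit(X, y_pseudo)      # refit on pseudolabels
    return clf

# Toy example: two 1-D Gaussian classes whose means drift over time.
rng = np.random.default_rng(0)

def fit(X, y):
    # "Classifier" is just a threshold at the midpoint of the class means.
    return 0.5 * (X[y == 0].mean() + X[y == 1].mean())

def predict(t, X):
    return (X > t).astype(int)

# Source classes at means 0 and 2; each later domain shifts both by +0.5.
domains = [np.concatenate([rng.normal(0 + s, 0.2, 50),
                           rng.normal(2 + s, 0.2, 50)])
           for s in (0.5, 1.0, 1.5)]
clf0 = 1.0  # threshold fit on the (labeled) source domain
clf = pseudolabel_self_train(fit, predict, clf0, domains)
```

After the three gradual steps the threshold has tracked the shift to roughly 2.5, while applying it directly to the final domain from the source threshold of 1.0 would misclassify part of the lower class.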
Embedded Gaussian Affinity is a type of affinity or self-similarity function between two points $x_i$ and $x_j$ that uses a Gaussian function in an embedding space: $f(x_i, x_j) = e^{\theta(x_i)^{T}\phi(x_j)}$. Here $\theta(x_i) = W_{\theta}x_i$ and $\phi(x_j) = W_{\phi}x_j$ are two embeddings. Note that the self-attention module used in the original Transformer model is a special case of non-local operations in the embedded Gaussian version. This can be seen from the fact that, for a given $i$, $\frac{1}{\mathcal{C}(x)}f(x_i, x_j)$ becomes the softmax computation along the dimension $j$. So we have $y = \text{softmax}(x^{T}W_{\theta}^{T}W_{\phi}x)g(x)$, which is the self-attention form in the Transformer model. This shows how we can relate this recent self-attention model to the classic computer vision method of non-local means.
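The normalized affinity is just a row-wise softmax over embedded dot products; a NumPy sketch (shapes chosen arbitrarily for the example):

```python
import numpy as np

def embedded_gaussian_affinity(X, W_theta, W_phi):
    """Pairwise affinity f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)),
    normalized along j, i.e. a row-wise softmax as in self-attention.
    X: (n, d) point features; W_theta, W_phi: (d, d_k) embedding matrices."""
    theta = X @ W_theta                 # (n, d_k)
    phi = X @ W_phi                     # (n, d_k)
    logits = theta @ phi.T              # (n, n)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
A = embedded_gaussian_affinity(X,
                               rng.standard_normal((8, 4)),
                               rng.standard_normal((8, 4)))
```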
Snapshot Ensembles: Train 1, get M for free
The overhead cost of training multiple deep neural networks can be very high in terms of training time, hardware, and computational resources, and often acts as an obstacle to creating deep ensembles. To overcome these barriers, Huang et al. proposed a method to create an ensemble which, at the cost of training one model, yields multiple constituent model snapshots that can be ensembled together to create a strong learner. The core idea is to make the model converge to several local minima along the optimization path and save the model parameters at these local minima points. During the training phase, a neural network traverses many such points. The lowest of all local minima is the global minimum. The larger the model, the more parameters it has and the larger the number of local minima. This implies there are many discrete sets of weights and biases at which the model makes fewer errors, so every such minimum can be considered a weak but potentially useful learner for the problem being solved. Multiple such snapshots of weights and biases are recorded, which can later be ensembled to get a better-generalized model that makes fewer mistakes.
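In the paper, convergence to multiple minima is driven by a cyclic cosine-annealing learning-rate schedule: the rate restarts high at the start of each of $M$ cycles and anneals toward zero, where a snapshot is saved. A sketch of that schedule (the specific `T`, `M`, and `lr_max` values are illustrative):

```python
import math

def snapshot_lr(t, T, M, lr_max=0.1):
    """Cyclic cosine-annealing schedule used for snapshot ensembling.
    T total iterations are split into M cycles; t is the current
    iteration (0-indexed). The LR restarts at lr_max at the start of
    each cycle and anneals to ~0, where a snapshot is taken."""
    cycle_len = math.ceil(T / M)
    return lr_max / 2 * (math.cos(math.pi * (t % cycle_len) / cycle_len) + 1)

T, M = 300, 3
lrs = [snapshot_lr(t, T, M) for t in range(T)]
```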
Pretext-Invariant Representation Learning (PIRL, pronounced as “pearl”) learns invariant representations based on pretext tasks. PIRL is used with a commonly used pretext task that involves solving jigsaw puzzles. Specifically, PIRL constructs image representations that are similar to the representation of transformed versions of the same image and different from the representations of other images.
Rotary Position Embedding
Rotary Position Embedding, or RoPE, is a type of position embedding that encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility to extend to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
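A NumPy sketch of the rotation for a single token vector, pairing consecutive feature dimensions and rotating each pair by a position-dependent angle (the vectors and positions are made up; `base=10000` follows the common convention):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary embedding sketch for one token. x has even length d;
    pair i of features is rotated by angle pos * base**(-2i/d)."""
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
k = np.array([0.3, -0.7, 0.5, 0.2])
q_rot = rope(q, pos=3)
# Relative-position property: both dot products below use offset 3,
# so they should agree even though the absolute positions differ.
a = rope(q, 5) @ rope(k, 2)
b = rope(q, 8) @ rope(k, 5)
```

Because each pair is an orthogonal 2-D rotation, norms are preserved and the query-key dot product depends only on the relative offset, which is exactly the relative-position dependency mentioned above.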
Neural Image Assessment
In the context of image enhancement, maximizing the NIMA score as a prior can increase the likelihood of enhancing the perceptual quality of an image.
Gated Channel Transformation
GCT first collects global information by computing the $\ell_2$-norm of each channel. Next, a learnable vector $\alpha$ is applied to scale the feature. Then a competition mechanism is adopted, via channel normalization, to model interaction between channels. Like other common normalization methods, a learnable scale parameter $\gamma$ and bias $\beta$ are applied to rescale the normalization. However, unlike previous methods, GCT adopts a tanh activation to control the attention vector. Finally, it not only multiplies the input by the attention vector but also adds an identity connection. GCT can be written as: \begin{align} s = F_\text{gct}(X, \theta) & = \tanh (\gamma CN(\alpha \text{Norm}(X)) + \beta) \end{align} \begin{align} Y & = s X + X \end{align} where $\alpha$, $\beta$ and $\gamma$ are trainable parameters, $\text{Norm}(X)$ indicates the $\ell_2$-norm of each channel, and $CN$ is channel normalization. A GCT block has fewer parameters than an SE block and, as it is lightweight, can be added after each convolutional layer of a CNN.
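A NumPy sketch of the equations above for a single feature map of shape (C, H, W); the per-channel parameter vectors and the epsilon constant are assumptions for the example:

```python
import numpy as np

def gct(X, alpha, beta, gamma, eps=1e-5):
    """Gated Channel Transformation sketch. X: (C, H, W);
    alpha, beta, gamma: per-channel trainable vectors of shape (C,)."""
    # Global context: l2-norm of each channel, scaled by alpha.
    s = alpha * np.sqrt((X**2).sum(axis=(1, 2)) + eps)       # (C,)
    # Channel normalization: competition between channels.
    C = X.shape[0]
    cn = np.sqrt(C) * s / np.sqrt((s**2).sum() + eps)        # (C,)
    gate = np.tanh(gamma * cn + beta)                        # (C,)
    return X * gate[:, None, None] + X                       # identity path

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4, 4))
# With gamma = 0 and beta = 0, the gate is tanh(0) = 0, so the block
# reduces to the identity mapping through the residual connection.
Y = gct(X, alpha=np.ones(8), beta=np.zeros(8), gamma=np.zeros(8))
```

This also illustrates why $\gamma$ and $\beta$ are typically initialized to zero: the block then starts as an identity and learns to gate gradually.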
Online Hard Example Mining
Some object detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient. OHEM, or Online Hard Example Mining, is a bootstrapping technique that modifies SGD to sample from examples in a non-uniform way depending on the current loss of each example under consideration. The method takes advantage of detection-specific problem structure in which each SGD mini-batch consists of only one or two images, but thousands of candidate examples. The candidate examples are subsampled according to a distribution that favors diverse, high loss instances.
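The selection step can be sketched as ranking candidate RoIs by their current loss and keeping the top few for the backward pass (the full method also applies NMS over RoIs to avoid selecting near-duplicate regions; the loss values here are made up):

```python
import numpy as np

def ohem_select(losses, batch_size):
    """Sketch of online hard example mining: keep the `batch_size`
    candidate examples with the highest current loss."""
    order = np.argsort(losses)[::-1]  # indices sorted by descending loss
    return order[:batch_size]

losses = np.array([0.1, 2.3, 0.05, 1.7, 0.4])
hard_idx = ohem_select(losses, 2)
```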
Feature Information Entropy Regularized Cross Entropy
FIERCE is an entropic regularization on the feature space
Utterance-Level Permutation Invariant Training
Conditional Positional Encoding, or CPE, is a type of positional encoding for vision transformers. Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE aims to generalize to input sequences longer than those the model has seen during training. CPE can also keep the desired translation invariance in the image classification task. CPE can be implemented with a Positional Encoding Generator (PEG) and incorporated into the current Transformer framework.
Early Learning Regularization
A Spatial Attention-Guided Mask is a module for instance segmentation that predicts a segmentation mask on each detected box with a spatial attention map that helps to focus on informative pixels and suppress noise. The goal is to guide the mask head toward spotlighting meaningful pixels and repressing uninformative ones. Once the features inside the predicted RoIs are extracted by RoIAlign at 14×14 resolution, they are fed into four conv layers and the spatial attention module (SAM) sequentially. To exploit the spatial attention map $A_{sag}(X_i)$ as a feature descriptor given the input feature map $X_i$, the SAM first generates pooled features $P_{max}$ and $P_{avg}$ by max and average pooling operations respectively along the channel axis and aggregates them via concatenation. This is followed by a 3 × 3 conv layer and normalized by the sigmoid function. The computation process is summarized as follows: \begin{align} A_{sag}(X_i) = \sigma(F_{3\times3}(P_{max} \circ P_{avg})) \end{align} where $\sigma$ denotes the sigmoid function, $F_{3\times3}$ is the 3 × 3 conv layer and $\circ$ represents the concatenation operation. Finally, the attention-guided feature map $X_{sag}$ is computed as: \begin{align} X_{sag} = A_{sag}(X_i) \otimes X_i \end{align} where $\otimes$ denotes element-wise multiplication. After that, a 2 × 2 deconv upsamples the spatially attended feature map to 28 × 28 resolution, and lastly a 1 × 1 conv is applied to predict class-specific masks.
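A simplified NumPy sketch of the spatial-attention idea: pool the channel axis by average and max, combine the two maps (a fixed mix stands in for the learned 3 × 3 conv here, which is a deliberate simplification), squash with a sigmoid, and gate the features element-wise:

```python
import numpy as np

def spatial_attention(X, w_avg=0.5, w_max=0.5):
    """Simplified SAM sketch. X: (C, H, W). A fixed linear mix of the
    average- and max-pooled maps replaces the learned 3x3 conv."""
    p_avg = X.mean(axis=0)                # (H, W)
    p_max = X.max(axis=0)                 # (H, W)
    a = 1.0 / (1.0 + np.exp(-(w_avg * p_avg + w_max * p_max)))  # sigmoid
    return a[None, :, :] * X              # attention-gated features

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6, 6))
X_sag = spatial_attention(X)
```

Since the attention map lies in (0, 1), the gate can only attenuate features, never amplify them, which is the "suppress noise" behavior described above.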
Masked Modeling Duo
Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encouraging the two networks in M2D to model the input. While M2D improves general-purpose audio representations, a specialized representation is essential for real-world applications, such as in industrial and medical domains. The often confidential and proprietary data in such domains is typically limited in size and has a different distribution from that in pre-training datasets. Therefore, we propose M2D for X (M2D-X), which extends M2D to enable the pre-training of specialized representations for an application X. M2D-X learns from M2D and an additional task and inputs background noise. We make the additional task configurable to serve diverse applications, while the background noise helps learn on small data and forms a denoising task that makes representation robust. With these design choices, M2D-X should learn a representation specialized to serve various application needs. Our experiments confirmed that the representations for general-purpose audio, specialized for the highly competitive AudioSet and speech domain, and a small-data medical task achieve top-level performance, demonstrating the potential of using our models as a universal audio pre-training framework.
Weighted Finite-State Transducer
Wide&Deep jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for real-world recommender systems. In summary, the wide component is a generalized linear model and the deep component is a feed-forward neural network. The two components are combined using a weighted sum of their output log odds as the prediction. This is then fed to a logistic loss function for joint training, which is done by back-propagating the gradients from the output to both the wide and deep parts of the model simultaneously using mini-batch stochastic optimization. In the original paper, FTRL with L1 regularization is used as the optimizer for the wide part and AdaGrad for the deep part. The combined model is illustrated in the figure (center).
ReZero is a normalization approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer $i$, a residual connection is introduced for the input signal $x_i$ together with one trainable parameter $\alpha_i$ that modulates the non-trivial transformation $F$ of the layer: \begin{align} x_{i+1} = x_i + \alpha_i F(x_i) \end{align} where $\alpha_i = 0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, but the $\alpha_i$ dynamically evolve to suitable values during the initial stages of training. The architecture is illustrated in the Figure.
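The update is a one-liner; a sketch with a stand-in layer transformation (the tanh layer and dimensions are invented for the example):

```python
import numpy as np

def rezero_block(x, F, alpha):
    """ReZero residual update: x + alpha * F(x). With alpha initialized
    to 0 the block is exactly the identity."""
    return x + alpha * F(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((8, 8))
F = lambda v: np.tanh(W @ v)           # stand-in layer transformation
y0 = rezero_block(x, F, alpha=0.0)     # identity at initialization
y1 = rezero_block(x, F, alpha=0.1)     # after alpha has grown a bit
```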
Simple Piecewise Linear and Adaptive with Symmetric Hinges
DistDGL is a system for training GNNs in a mini-batch fashion on a cluster of machines. It is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality, lightweight min-cut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability.
GPipe is a distributed model parallel method for neural networks. With GPipe, each model can be specified as a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is then placed on a separate accelerator. Based on this partitioned setup, batch splitting is applied. A mini-batch of training examples is split into smaller micro-batches, then the execution of each set of micro-batches is pipelined over cells. Synchronous mini-batch gradient descent is applied for training, where gradients are accumulated across all micro-batches in a mini-batch and applied at the end of a mini-batch.
KnowPrompt is a prompt-tuning approach for relation extraction. It injects entity and relation knowledge into prompt construction with learnable virtual template words and answer words, and synergistically optimizes their representations with knowledge constraints. Specifically, typed markers are placed around entities and initialized with aggregated entity-type embeddings, serving as learnable virtual template words that inject entity-type knowledge. The average embeddings of the tokens in relation labels are used as virtual answer words to inject relation knowledge. Since there exist implicit structural constraints among entities and relations, and virtual words should be consistent with their surrounding contexts, synergistic optimization is introduced to obtain optimized virtual template and answer words. Concretely, a context-aware prompt calibration method is used with implicit structural constraints to inject structural knowledge implications among relational triples and associate prompt embeddings with each other.
Gradient-based optimization
GBO is a metaheuristic optimization algorithm. Inspired by the gradient-based Newton's method, the GBO uses two main operators, the gradient search rule (GSR) and the local escaping operator (LEO), together with a set of vectors to explore the search space. The GSR employs the gradient-based method to enhance the exploration tendency and accelerate the convergence rate, achieving better positions in the search space. The LEO enables the GBO to escape from local optima. The performance of the algorithm was evaluated in two phases: 28 mathematical test functions were first used to evaluate various characteristics of the GBO, and then six engineering problems were optimized by it. In the first phase, the GBO was compared with five existing optimization algorithms and yielded very promising results due to its enhanced capabilities of exploration, exploitation, convergence, and effective avoidance of local optima. The second phase also demonstrated the superior performance of the GBO in solving complex real-world engineering problems. The source code of the GBO is publicly available at https://imanahmadianfar.com/codes/.
Non-monotonically Triggered ASGD
NT-ASGD, or Non-monotonically Triggered ASGD, is an averaged stochastic gradient descent technique. In regular ASGD, we take steps identical to regular SGD but, instead of returning the last iterate as the solution, we return the average $\frac{1}{K - T + 1}\sum_{i=T}^{K} w_i$, where $K$ is the total number of iterations and $T < K$ is a user-specified averaging trigger. NT-ASGD instead uses a non-monotonic criterion that conservatively triggers the averaging when the validation metric fails to improve for multiple cycles. Given that the choice of triggering is irreversible, this conservatism ensures that the randomness of training does not play a major role in the decision.
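The non-monotonic trigger can be sketched as a check against the best validation loss seen more than a few evaluations ago (the history values and the patience `n=5` are illustrative; the original recipe uses a similar patience over validation checkpoints):

```python
def nt_asgd_trigger(val_losses, n=5):
    """Sketch of the non-monotonic trigger: start averaging once the
    latest validation loss is no better than the best loss recorded
    more than n checks ago."""
    t = len(val_losses) - 1
    return t > n and val_losses[t] > min(val_losses[: t - n])

history = [3.0, 2.5, 2.2, 2.1, 2.05, 2.04, 2.06, 2.07, 2.08, 2.09, 2.10]
fired = [i for i in range(1, len(history))
         if nt_asgd_trigger(history[: i + 1])]
```

Brief early fluctuations do not fire the trigger; only a sustained failure to improve does, which matches the "conservative and irreversible" framing above.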
Mixed Attention Block is an attention module used in the ConvBERT architecture. It is a mixture of self-attention and span-based dynamic convolution (highlighted in pink in the Figure). The two share the same Query but use different Keys to generate the attention map and the convolution kernel respectively. The number of attention heads is reduced by directly projecting the input to a smaller embedding space, forming a bottleneck structure for both self-attention and span-based dynamic convolution. The dimensions of the inputs and outputs of some blocks are labeled in their top-left corners to illustrate the overall framework, where $d$ is the embedding size of the input and $\gamma$ is the reduction ratio.
Single-Headed Attention is a single-headed attention module used in the SHA-RNN language model. The principal design reasons for single-headedness were simplicity (avoiding running out of memory) and scepticism about the benefits of using multiple heads.
ADAHESSIAN
ADAHESSIAN is a new stochastic optimization algorithm that directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including a fast Hutchinson based method to approximate the curvature matrix with low computational overhead.
Temporal Pyramid Network
Temporal Pyramid Network, or TPN, is a feature-level pyramid module for action recognition that can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. The source of features and the fusion of features form a feature hierarchy for the backbone so that it can capture action instances at various tempos. In the TPN, a backbone network extracts multi-level features; a spatial semantic modulation spatially downsamples features to align semantics; a temporal rate modulation temporally downsamples features to adjust the relative tempo among levels; an information flow aggregates features in various directions to enhance and enrich level-wise representations; and a final prediction rescales and concatenates all levels of the pyramid along the channel dimension.
Pattern-Exploiting Training is a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. In the case of PET for sentiment classification, first a number of patterns encoding some form of task description are created to convert training examples to cloze questions; for each pattern, a pretrained language model is finetuned. Secondly, the ensemble of trained models annotates unlabeled data. Lastly, a classifier is trained on the resulting soft-labeled dataset.
Data augmentation using Polya-Gamma latent variables.
This method applies Polya-Gamma latent variables as a way to obtain closed form expressions for full-conditionals of posterior distributions in sampling algorithms like MCMC.
Part Affinity Fields
AdvProp is an adversarial training scheme which treats adversarial examples as additional training examples, to prevent overfitting. Key to the method is the use of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions from normal examples.
modReLU is an activation that is a modification of the ReLU. It is a pointwise nonlinearity, $\sigma_{\text{modReLU}}(z)$, which affects only the absolute value of a complex number, defined as: \begin{align} \sigma_{\text{modReLU}}(z) = \text{ReLU}(|z| + b)\frac{z}{|z|} \end{align} where $b$ is a bias parameter of the nonlinearity. For a $d$-dimensional hidden space we learn $d$ nonlinearity bias parameters, one per dimension.
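A NumPy sketch of the definition above for complex inputs (the small epsilon guarding division by zero and the example values are assumptions):

```python
import numpy as np

def modrelu(z, b, eps=1e-8):
    """modReLU for complex z: rescale the magnitude by ReLU(|z| + b)
    while keeping the phase z/|z| unchanged; b is a learned bias
    (one per dimension in practice)."""
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) * z / (mag + eps)

z = np.array([3 + 4j, 0.1 + 0.0j])
# With b = -1: |3+4j| = 5 shrinks to 4 (same phase); |0.1| is zeroed out.
out = modrelu(z, b=-1.0)
```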
GShard is an intra-layer parallel distributed training method. It consists of a set of simple APIs for annotations, and a compiler extension in XLA for automatic parallelization.
Discriminative Regularization is a regularization technique for variational autoencoders that uses representations from discriminative classifiers to augment the VAE objective function (the lower bound) corresponding to a generative model. Specifically, it encourages the model’s reconstructions to be close to the data example in a representation space defined by the hidden layers of highly-discriminative, neural network based classifiers.
Rigging the Lottery
Rigging the Lottery (RigL) is a dynamic sparse training method that updates the sparsity mask during training: the active weights with the smallest magnitudes are removed, and new connections are added where the gradient magnitude is largest, keeping the total sparsity fixed throughout training.
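A NumPy sketch of one such mask update (the toy weights, gradients, and update size `k` are invented; a real schedule also decays `k` over training):

```python
import numpy as np

def rigl_update_mask(W, grad, mask, k):
    """RigL-style mask update sketch: drop the k active weights with the
    smallest magnitude, then grow the k inactive connections with the
    largest gradient magnitude. Total sparsity stays fixed."""
    mask = mask.copy()
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(~mask)
    drop = active[np.argsort(np.abs(W.ravel()[active]))[:k]]
    grow = inactive[np.argsort(-np.abs(grad.ravel()[inactive]))[:k]]
    mask.ravel()[drop] = False
    mask.ravel()[grow] = True
    return mask

W = np.array([[0.5, 0.0], [0.01, 0.9]])
mask = np.array([[True, False], [True, True]])
grad = np.array([[0.1, 2.0], [0.1, 0.1]])
new_mask = rigl_update_mask(W, grad, mask, k=1)
```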
Temporal Dropout or TempD
A method that randomly masks out all features coming from a specific time-step in time-series data. If the model is agnostic to uneven sequences or missing data in time-series, as attention-based Transformers are, the masked time-steps can simply be ignored in the model's forward prediction. Otherwise, they have to be masked out with some numerical value.
Movement Pruning is a simple, deterministic first-order weight pruning method that is more adaptive to pretrained-model fine-tuning. Magnitude pruning can be seen as utilizing zeroth-order information (the absolute value) of the running model. In contrast, movement pruning derives importance from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process.
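The first-order importance can be sketched as a score accumulated from the product of gradient and weight, where negative accumulated products indicate weights moving away from zero (the example weights and accumulated values are made up, and the accumulator here stands in for quantities tracked across fine-tuning steps):

```python
import numpy as np

def movement_prune_mask(W, accum_grad_dot, keep_frac):
    """Sketch of movement pruning scores: importance S = -(sum over
    training of dL/dW * W); weights that moved away from zero get high
    scores. Keep the top `keep_frac` fraction by score."""
    S = -accum_grad_dot
    k = int(np.ceil(keep_frac * W.size))
    thresh = np.sort(S.ravel())[-k]
    return S >= thresh

W = np.array([0.5, -0.4, 0.1, -0.05])
# Hypothetical accumulated grad*W: negative entries mean the weight grew
# in magnitude (moved away from zero) during fine-tuning.
acc = np.array([-2.0, -1.5, 0.3, 0.8])
mask = movement_prune_mask(W, acc, keep_frac=0.5)
```

Note the contrast with magnitude pruning: the mask depends on the movement scores, not on `|W|`, so even a currently small weight would survive if it were trending away from zero.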
Hierarchical Average Precision training for Pertinent ImagE Retrieval