Spatial Gating Unit, or SGU, is a gating unit used in the gMLP architecture to capture spatial interactions. To enable cross-token interactions, the layer must contain a contraction operation over the spatial dimension. The layer is formulated as the output of linear gating: $s(Z) = Z \odot f_{W,b}(Z)$, where $f_{W,b}(Z) = WZ + b$ is a linear projection along the spatial (token) dimension and $\odot$ denotes element-wise multiplication. For training stability, the authors find it critical to initialize $W$ with near-zero values and $b$ as ones, meaning that $f_{W,b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. This initialization ensures each gMLP block behaves like a regular FFN at the early stage of training, where each token is processed independently, and only gradually injects spatial information across tokens during the course of learning. The authors find it further effective to split $Z$ into two independent parts $(Z_1, Z_2)$ along the channel dimension, using $Z_2$ for the gating function and $Z_1$ for the multiplicative bypass: $s(Z) = Z_1 \odot f_{W,b}(Z_2)$. They also normalize the input to $f_{W,b}$, which empirically improved the stability of large NLP models.
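As a rough illustration, the split-and-gate computation can be sketched in NumPy; the shapes and the near-identity initialization ($W \approx 0$, $b = 1$) follow the description above, and the specific dimensions are made up for the example:

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """Minimal SGU sketch. Z has shape (n_tokens, d_channels).
    Split channels into a bypass half Z1 and a gating half Z2, apply a
    linear projection over the *spatial* (token) dimension to Z2, and
    multiply element-wise."""
    Z1, Z2 = np.split(Z, 2, axis=-1)   # (n, d/2) each
    gate = W @ Z2 + b[:, None]         # spatial projection, (n, d/2)
    return Z1 * gate

n, d = 4, 6
Z = np.random.randn(n, d)
# Near-zero spatial weights and unit bias: the gate is ~1, so the SGU
# output is approximately Z1 (identity on the bypass half).
W = np.zeros((n, n))
b = np.ones(n)
out = spatial_gating_unit(Z, W, b)
```

With exactly zero weights the gate is exactly one, so the block reduces to the bypass path, matching the "behaves like a regular FFN at the start of training" property.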
TabTransformer is a deep tabular data modeling architecture for supervised and semi-supervised learning. The TabTransformer is built upon self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. As an overview, the architecture comprises a column embedding layer, a stack of Transformer layers, and a multi-layer perceptron (MLP). The contextual embeddings (output by the Transformer layers) are concatenated with the continuous features, and the result is fed to the MLP. The loss function is then minimized to learn all the parameters in an end-to-end manner.
Varifocal Loss is a loss function for training a dense object detector to predict the IoU-aware classification score (IACS), inspired by the focal loss. Unlike the focal loss, which deals with positives and negatives equally, Varifocal Loss treats them asymmetrically: \begin{align} \text{VFL}(p, q) = \begin{cases} -q\left(q\log(p) + (1-q)\log(1-p)\right) & q > 0 \\ -\alpha p^{\gamma}\log(1-p) & q = 0 \end{cases} \end{align} where $p$ is the predicted IACS and $q$ is the target IoU score. For a positive training example, $q$ is set to the IoU between the generated bounding box and the ground-truth one (gt IoU), whereas for a negative training example, the training target $q$ for all classes is 0.
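A minimal NumPy sketch of the asymmetric treatment above (the example predictions and targets are made up; $\alpha = 0.75$, $\gamma = 2$ are common choices from the paper):

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Sketch of Varifocal Loss for a vector of predictions.
    p: predicted IACS in (0, 1); q: target IoU score (0 for negatives).
    Positives use a BCE weighted by the target q itself; negatives are
    down-weighted by the focal factor alpha * p**gamma."""
    pos = q > 0
    return np.where(
        pos,
        -q * (q * np.log(p) + (1 - q) * np.log(1 - p)),
        -alpha * p**gamma * np.log(1 - p),
    )

p = np.array([0.8, 0.2, 0.1])
q = np.array([0.9, 0.0, 0.0])  # one positive with gt IoU 0.9, two negatives
loss = varifocal_loss(p, q)
```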
STAC is a semi-supervised framework for visual object detection along with a data augmentation strategy. STAC deploys highly confident pseudo labels of localized objects from an unlabeled image and updates the model by enforcing consistency via strong augmentations. Pseudo labels (i.e., bounding boxes and their class labels) are generated for unlabeled data using test-time inference, including NMS, of the teacher model trained with labeled data. An unsupervised loss is then computed with respect to pseudo labels whose confidence scores are above a threshold $\tau$. The strong augmentations are applied for augmentation consistency during model training. Target boxes are augmented when global geometric transformations are used.
Denoised Smoothing is a method for obtaining a provably robust classifier from a fixed pretrained one, without any additional training or fine-tuning of the latter. The basic idea is to prepend a custom-trained denoiser $\mathcal{D}$ before the pretrained classifier $f$, and then apply randomized smoothing. Randomized smoothing is a certified defense that converts any given classifier $f$ into a new smoothed classifier $g$ that is characterized by a non-linear Lipschitz property. When queried at a point $x$, the smoothed classifier $g$ outputs the class that is most likely to be returned by $f$ under isotropic Gaussian perturbations of its inputs. Unfortunately, randomized smoothing requires that the underlying classifier is robust to relatively large random Gaussian perturbations of the input, which is not the case for off-the-shelf pretrained models. Applying the custom-trained denoiser $\mathcal{D}$ before the classifier $f$ effectively makes $f$ robust to such Gaussian perturbations, thereby making it "suitable" for randomized smoothing.
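The randomized-smoothing step that Denoised Smoothing builds on can be sketched as a Monte Carlo majority vote under Gaussian noise (a certified implementation would additionally compute a statistical confidence bound; the toy 1-D classifier here is purely illustrative):

```python
import numpy as np

def smoothed_predict(classifier, x, sigma=0.25, n_samples=1000, seed=0):
    """Sketch of a smoothed classifier g: return the class most frequently
    predicted by the base classifier f under isotropic Gaussian noise on x."""
    rng = np.random.default_rng(seed)
    votes = {}
    for _ in range(n_samples):
        noisy = x + sigma * rng.standard_normal(x.shape)
        c = classifier(noisy)
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)

# Toy 1-D base classifier: sign threshold at 0.
base = lambda z: int(z[0] > 0.0)
x = np.array([0.5])
pred = smoothed_predict(base, x)
```

In Denoised Smoothing, `classifier` would be the composition of the denoiser and the fixed pretrained model, so the base model only ever sees denoised inputs.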
Positional Encoding Generator, or PEG, is the module used to implement Conditional Positional Encoding (CPE). It dynamically produces the positional encodings conditioned on the local neighborhood of an input token. To condition on the local neighbors, the flattened input sequence $X \in \mathbb{R}^{B \times N \times C}$ of DeiT is first reshaped back to $X' \in \mathbb{R}^{B \times H \times W \times C}$ in the 2-D image space. Then, a function (denoted by $\mathcal{F}$ in the Figure) is repeatedly applied to the local patches in $X'$ to produce the conditional positional encodings. PEG can be efficiently implemented with a 2-D convolution with kernel size $k$ ($k \geq 3$) and $\frac{k-1}{2}$ zero paddings. Note that the zero paddings here are important to make the model aware of the absolute positions, and $\mathcal{F}$ can take various forms, such as separable convolutions and many others.
Gated Linear Network
A Gated Linear Network, or GLN, is a type of backpropagation-free neural architecture. What distinguishes GLNs from contemporary neural networks is the distributed and local nature of their credit assignment mechanism; each neuron directly predicts the target, forgoing the ability to learn feature representations in favor of rapid online learning. Individual neurons can model nonlinear functions via the use of data-dependent gating in conjunction with online convex optimization. GLNs are feedforward networks composed of many layers of gated geometric mixing neurons, as shown in the Figure. Each neuron in a given layer outputs a gated geometric mixture of the predictions from the previous layer, with the final layer consisting of just a single neuron. In a supervised learning setting, a GLN is trained on (side information, base predictions, label) triplets derived from input-label pairs. There are two types of input to neurons in the network: the first is the side information, which can be thought of as the input features; the second is the input to the neuron, which will be the predictions output by the previous layer, or, in the case of layer 0, some (optionally) provided base predictions that typically will be a function of the side information. Each neuron will also take in a constant bias prediction, which helps empirically and is essential for universality guarantees. Weights are learnt in a Gated Linear Network using Online Gradient Descent (OGD) locally at each neuron. The key observation is that, as each neuron in the subsequent layers is itself a gated geometric mixture, all of these neurons can be thought of as individually predicting the target. Given side information, each neuron suffers the logarithmic loss of its own prediction, which is convex in its active weights.
Gradual self-training is a method for domain adaptation under gradual distribution shift. The goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. This comes up, for example, in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces, where machine learning systems must adapt to data distributions that evolve over time. The gradual self-training algorithm begins with a classifier trained on labeled examples from the source domain (Figure a). For each successive domain, the algorithm generates pseudolabels for unlabeled examples from that domain, and then trains a regularized supervised classifier on the pseudolabeled examples. The intuition, visualized in the Figure, is that after a single gradual shift most examples are pseudolabeled correctly, so self-training learns a good classifier on the shifted data, whereas the shift directly from the source to the target can be too large for self-training to correct.
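The loop described above can be sketched in a few lines; the threshold classifier, the drifting Gaussian domains, and the `fit`/`predict` helpers are all invented for the example (the method itself is classifier-agnostic):

```python
import numpy as np

def pseudolabel_self_train(fit, predict, clf, domains):
    """Sketch of gradual self-training: adapt `clf` through a sequence of
    unlabeled domains by pseudolabeling each one with the current model
    and refitting a supervised classifier on those labels."""
    for X in domains:               # domains ordered by gradual shift
        y_pseudo = predict(clf, X)  # label the shifted data
        clf = fit(X, y_pseudo)      # refit on pseudolabels
    return clf

# Toy example: two 1-D Gaussian classes whose means drift over time.
rng = np.random.default_rng(0)

def fit(X, y):
    # "Classifier" is just a threshold at the midpoint of the class means.
    return 0.5 * (X[y == 0].mean() + X[y == 1].mean())

def predict(t, X):
    return (X > t).astype(int)

# Source classes at means 0 and 2; each later domain shifts both by +0.5.
domains = [np.concatenate([rng.normal(0 + s, 0.2, 50),
                           rng.normal(2 + s, 0.2, 50)])
           for s in (0.5, 1.0, 1.5)]
clf0 = 1.0  # threshold fit on the (labeled) source domain
clf = pseudolabel_self_train(fit, predict, clf0, domains)
```

After the three gradual steps the threshold has tracked the shift to roughly 2.5, while applying it directly to the final domain from the source threshold of 1.0 would misclassify part of the lower class.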
Embedded Gaussian Affinity is a type of affinity or self-similarity function between two points $x_i$ and $x_j$ that uses a Gaussian function in an embedding space: $f(x_i, x_j) = e^{\theta(x_i)^{T}\phi(x_j)}$. Here $\theta(x_i) = W_{\theta}x_i$ and $\phi(x_j) = W_{\phi}x_j$ are two embeddings. Note that the self-attention module used in the original Transformer model is a special case of non-local operations in the embedded Gaussian version. This can be seen from the fact that, for a given $i$, $\frac{1}{\mathcal{C}(x)}f(x_i, x_j)$ becomes the softmax computation along the dimension $j$. So we have $y = \text{softmax}(x^{T}W_{\theta}^{T}W_{\phi}x)g(x)$, which is the self-attention form in the Transformer model. This shows how we can relate this recent self-attention model to the classic computer vision method of non-local means.
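The normalized affinity is just a row-wise softmax over embedded dot products; a NumPy sketch (shapes chosen arbitrarily for the example):

```python
import numpy as np

def embedded_gaussian_affinity(X, W_theta, W_phi):
    """Pairwise affinity f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)),
    normalized along j, i.e. a row-wise softmax as in self-attention.
    X: (n, d) point features; W_theta, W_phi: (d, d_k) embedding matrices."""
    theta = X @ W_theta                 # (n, d_k)
    phi = X @ W_phi                     # (n, d_k)
    logits = theta @ phi.T              # (n, n)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
A = embedded_gaussian_affinity(X,
                               rng.standard_normal((8, 4)),
                               rng.standard_normal((8, 4)))
```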
Snapshot Ensembles: Train 1, get M for free
The overhead cost of training multiple deep neural networks can be very high in terms of training time, hardware, and computational resources, and often acts as an obstacle to creating deep ensembles. To overcome these barriers, Huang et al. proposed a method to create an ensemble which, at the cost of training one model, yields multiple constituent model snapshots that can be ensembled together to create a strong learner. The core idea is to make the model converge to several local minima along the optimization path and save the model parameters at these local minima points. During the training phase, a neural network traverses many such points. The lowest of all local minima is the global minimum. The larger the model, the more parameters it has and the larger the number of local minima. This implies there are many discrete sets of weights and biases at which the model makes fewer errors, so every such minimum can be considered a weak but potentially useful learner for the problem being solved. Multiple such snapshots of weights and biases are recorded, which can later be ensembled to get a better-generalized model that makes fewer mistakes.
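In the paper, convergence to multiple minima is driven by a cyclic cosine-annealing learning-rate schedule: the rate restarts high at the start of each of $M$ cycles and anneals toward zero, where a snapshot is saved. A sketch of that schedule (the specific `T`, `M`, and `lr_max` values are illustrative):

```python
import math

def snapshot_lr(t, T, M, lr_max=0.1):
    """Cyclic cosine-annealing schedule used for snapshot ensembling.
    T total iterations are split into M cycles; t is the current
    iteration (0-indexed). The LR restarts at lr_max at the start of
    each cycle and anneals to ~0, where a snapshot is taken."""
    cycle_len = math.ceil(T / M)
    return lr_max / 2 * (math.cos(math.pi * (t % cycle_len) / cycle_len) + 1)

T, M = 300, 3
lrs = [snapshot_lr(t, T, M) for t in range(T)]
```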
Pretext-Invariant Representation Learning (PIRL, pronounced as “pearl”) learns invariant representations based on pretext tasks. PIRL is used with a commonly used pretext task that involves solving jigsaw puzzles. Specifically, PIRL constructs image representations that are similar to the representation of transformed versions of the same image and different from the representations of other images.
Rotary Position Embedding
Rotary Position Embedding, or RoPE, is a type of position embedding that encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility to extend to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
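A NumPy sketch of the rotation for a single token vector, pairing consecutive feature dimensions and rotating each pair by a position-dependent angle (the vectors and positions are made up; `base=10000` follows the common convention):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary embedding sketch for one token. x has even length d;
    pair i of features is rotated by angle pos * base**(-2i/d)."""
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
k = np.array([0.3, -0.7, 0.5, 0.2])
q_rot = rope(q, pos=3)
# Relative-position property: both dot products below use offset 3,
# so they should agree even though the absolute positions differ.
a = rope(q, 5) @ rope(k, 2)
b = rope(q, 8) @ rope(k, 5)
```

Because each pair is an orthogonal 2-D rotation, norms are preserved and the query-key dot product depends only on the relative offset, which is exactly the relative-position dependency mentioned above.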
Neural Image Assessment
In the context of image enhancement, maximizing the NIMA score as a prior can increase the likelihood of enhancing the perceptual quality of an image.
Gated Channel Transformation
GCT first collects global information by computing the $\ell_2$-norm of each channel. Next, a learnable vector $\alpha$ is applied to scale the feature. Then a competition mechanism is adopted, via channel normalization, to model interaction between channels. Like other common normalization methods, a learnable scale parameter $\gamma$ and bias $\beta$ are applied to rescale the normalization. However, unlike previous methods, GCT adopts a tanh activation to control the attention vector. Finally, it not only multiplies the input by the attention vector but also adds an identity connection. GCT can be written as: \begin{align} s = F_\text{gct}(X, \theta) & = \tanh (\gamma CN(\alpha \text{Norm}(X)) + \beta) \end{align} \begin{align} Y & = s X + X \end{align} where $\alpha$, $\beta$ and $\gamma$ are trainable parameters, $\text{Norm}(X)$ indicates the $\ell_2$-norm of each channel, and $CN$ is channel normalization. A GCT block has fewer parameters than an SE block and, as it is lightweight, can be added after each convolutional layer of a CNN.
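A NumPy sketch of the equations above for a single feature map of shape (C, H, W); the per-channel parameter vectors and the epsilon constant are assumptions for the example:

```python
import numpy as np

def gct(X, alpha, beta, gamma, eps=1e-5):
    """Gated Channel Transformation sketch. X: (C, H, W);
    alpha, beta, gamma: per-channel trainable vectors of shape (C,)."""
    # Global context: l2-norm of each channel, scaled by alpha.
    s = alpha * np.sqrt((X**2).sum(axis=(1, 2)) + eps)       # (C,)
    # Channel normalization: competition between channels.
    C = X.shape[0]
    cn = np.sqrt(C) * s / np.sqrt((s**2).sum() + eps)        # (C,)
    gate = np.tanh(gamma * cn + beta)                        # (C,)
    return X * gate[:, None, None] + X                       # identity path

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4, 4))
# With gamma = 0 and beta = 0, the gate is tanh(0) = 0, so the block
# reduces to the identity mapping through the residual connection.
Y = gct(X, alpha=np.ones(8), beta=np.zeros(8), gamma=np.zeros(8))
```

This also illustrates why $\gamma$ and $\beta$ are typically initialized to zero: the block then starts as an identity and learns to gate gradually.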
Online Hard Example Mining
Some object detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient. OHEM, or Online Hard Example Mining, is a bootstrapping technique that modifies SGD to sample from examples in a non-uniform way depending on the current loss of each example under consideration. The method takes advantage of detection-specific problem structure in which each SGD mini-batch consists of only one or two images, but thousands of candidate examples. The candidate examples are subsampled according to a distribution that favors diverse, high loss instances.
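The selection step can be sketched as ranking candidate RoIs by their current loss and keeping the top few for the backward pass (the full method also applies NMS over RoIs to avoid selecting near-duplicate regions; the loss values here are made up):

```python
import numpy as np

def ohem_select(losses, batch_size):
    """Sketch of online hard example mining: keep the `batch_size`
    candidate examples with the highest current loss."""
    order = np.argsort(losses)[::-1]  # indices sorted by descending loss
    return order[:batch_size]

losses = np.array([0.1, 2.3, 0.05, 1.7, 0.4])
hard_idx = ohem_select(losses, 2)
```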
Feature Information Entropy Regularized Cross Entropy
FIERCE is an entropic regularization on the feature space
Utterance-Level Permutation Invariant Training
Conditional Positional Encoding, or CPE, is a type of positional encoding for vision transformers. Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE aims to generalize to input sequences longer than those the model has seen during training. CPE can also keep the desired translation invariance in the image classification task. CPE can be implemented with a Positional Encoding Generator (PEG) and incorporated into the current Transformer framework.
Early Learning Regularization
A Spatial Attention-Guided Mask is a module for instance segmentation that predicts a segmentation mask on each detected box with a spatial attention map that helps to focus on informative pixels and suppress noise. The goal is to guide the mask head toward spotlighting meaningful pixels and repressing uninformative ones. Once the features inside the predicted RoIs are extracted by RoIAlign at 14×14 resolution, they are fed into four conv layers and the spatial attention module (SAM) sequentially. To exploit the spatial attention map $A_{sag}(X_i)$ as a feature descriptor given the input feature map $X_i$, the SAM first generates pooled features $P_{max}$ and $P_{avg}$ by max and average pooling operations respectively along the channel axis and aggregates them via concatenation. This is followed by a 3 × 3 conv layer and normalized by the sigmoid function. The computation process is summarized as follows: \begin{align} A_{sag}(X_i) = \sigma(F_{3\times3}(P_{max} \circ P_{avg})) \end{align} where $\sigma$ denotes the sigmoid function, $F_{3\times3}$ is the 3 × 3 conv layer and $\circ$ represents the concatenation operation. Finally, the attention-guided feature map $X_{sag}$ is computed as: \begin{align} X_{sag} = A_{sag}(X_i) \otimes X_i \end{align} where $\otimes$ denotes element-wise multiplication. After that, a 2 × 2 deconv upsamples the spatially attended feature map to 28 × 28 resolution, and lastly a 1 × 1 conv is applied to predict class-specific masks.
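A simplified NumPy sketch of the spatial-attention idea: pool the channel axis by average and max, combine the two maps (a fixed mix stands in for the learned 3 × 3 conv here, which is a deliberate simplification), squash with a sigmoid, and gate the features element-wise:

```python
import numpy as np

def spatial_attention(X, w_avg=0.5, w_max=0.5):
    """Simplified SAM sketch. X: (C, H, W). A fixed linear mix of the
    average- and max-pooled maps replaces the learned 3x3 conv."""
    p_avg = X.mean(axis=0)                # (H, W)
    p_max = X.max(axis=0)                 # (H, W)
    a = 1.0 / (1.0 + np.exp(-(w_avg * p_avg + w_max * p_max)))  # sigmoid
    return a[None, :, :] * X              # attention-gated features

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6, 6))
X_sag = spatial_attention(X)
```

Since the attention map lies in (0, 1), the gate can only attenuate features, never amplify them, which is the "suppress noise" behavior described above.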
Masked Modeling Duo
Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encouraging the two networks in M2D to model the input. While M2D improves general-purpose audio representations, a specialized representation is essential for real-world applications, such as in industrial and medical domains. The often confidential and proprietary data in such domains is typically limited in size and has a different distribution from that in pre-training datasets. Therefore, we propose M2D for X (M2D-X), which extends M2D to enable the pre-training of specialized representations for an application X. M2D-X learns from M2D and an additional task and inputs background noise. We make the additional task configurable to serve diverse applications, while the background noise helps learn on small data and forms a denoising task that makes representation robust. With these design choices, M2D-X should learn a representation specialized to serve various application needs. Our experiments confirmed that the representations for general-purpose audio, specialized for the highly competitive AudioSet and speech domain, and a small-data medical task achieve top-level performance, demonstrating the potential of using our models as a universal audio pre-training framework.
Weighted Finite-State Transducer
Wide&Deep jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for real-world recommender systems. In summary, the wide component is a generalized linear model and the deep component is a feed-forward neural network. The two components are combined using a weighted sum of their output log odds as the prediction. This is then fed to a logistic loss function for joint training, which is done by back-propagating the gradients from the output to both the wide and deep parts of the model simultaneously using mini-batch stochastic optimization. In the original paper, FTRL with L1 regularization is used as the optimizer for the wide part and AdaGrad for the deep part. The combined model is illustrated in the figure (center).
ReZero is a normalization approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer $i$, a residual connection is introduced for the input signal $x_i$ together with one trainable parameter $\alpha_i$ that modulates the non-trivial transformation $F$ of the layer: \begin{align} x_{i+1} = x_i + \alpha_i F(x_i) \end{align} where $\alpha_i = 0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, but the $\alpha_i$ dynamically evolve to suitable values during the initial stages of training. The architecture is illustrated in the Figure.
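The update is a one-liner; a sketch with a stand-in layer transformation (the tanh layer and dimensions are invented for the example):

```python
import numpy as np

def rezero_block(x, F, alpha):
    """ReZero residual update: x + alpha * F(x). With alpha initialized
    to 0 the block is exactly the identity."""
    return x + alpha * F(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((8, 8))
F = lambda v: np.tanh(W @ v)           # stand-in layer transformation
y0 = rezero_block(x, F, alpha=0.0)     # identity at initialization
y1 = rezero_block(x, F, alpha=0.1)     # after alpha has grown a bit
```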
Simple Piecewise Linear and Adaptive with Symmetric Hinges
DistDGL is a system for training GNNs in a mini-batch fashion on a cluster of machines. It is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality, lightweight min-cut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability.
GPipe is a distributed model parallel method for neural networks. With GPipe, each model can be specified as a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is then placed on a separate accelerator. Based on this partitioned setup, batch splitting is applied. A mini-batch of training examples is split into smaller micro-batches, then the execution of each set of micro-batches is pipelined over cells. Synchronous mini-batch gradient descent is applied for training, where gradients are accumulated across all micro-batches in a mini-batch and applied at the end of a mini-batch.
KnowPrompt is a prompt-tuning approach for relation extraction. It injects entity and relation knowledge into prompt construction with learnable virtual template words and answer words, and synergistically optimizes their representations with knowledge constraints. Specifically, typed markers are placed around entities and initialized with aggregated entity-type embeddings, serving as learnable virtual template words that inject entity-type knowledge. The average embeddings of the tokens in relation labels are used as virtual answer words to inject relation knowledge. Since there exist implicit structural constraints among entities and relations, and virtual words should be consistent with their surrounding contexts, synergistic optimization is introduced to obtain optimized virtual template and answer words. Concretely, a context-aware prompt calibration method is used with implicit structural constraints to inject structural knowledge implications among relational triples and associate prompt embeddings with each other.
Gradient-based optimization
GBO is a metaheuristic optimization algorithm. Inspired by the gradient-based Newton's method, the GBO uses two main operators, the gradient search rule (GSR) and the local escaping operator (LEO), together with a set of vectors to explore the search space. The GSR employs the gradient-based method to enhance the exploration tendency and accelerate the convergence rate, achieving better positions in the search space. The LEO enables the GBO to escape from local optima. The performance of the algorithm was evaluated in two phases: 28 mathematical test functions were first used to evaluate various characteristics of the GBO, and then six engineering problems were optimized by it. In the first phase, the GBO was compared with five existing optimization algorithms and yielded very promising results due to its enhanced capabilities of exploration, exploitation, convergence, and effective avoidance of local optima. The second phase also demonstrated the superior performance of the GBO in solving complex real-world engineering problems. The source code of the GBO is publicly available at https://imanahmadianfar.com/codes/.
Non-monotonically Triggered ASGD
NT-ASGD, or Non-monotonically Triggered ASGD, is an averaged stochastic gradient descent technique. In regular ASGD, we take steps identical to regular SGD but, instead of returning the last iterate as the solution, we return the average $\frac{1}{K - T + 1}\sum_{i=T}^{K} w_i$, where $K$ is the total number of iterations and $T < K$ is a user-specified averaging trigger. NT-ASGD instead uses a non-monotonic criterion that conservatively triggers the averaging when the validation metric fails to improve for multiple cycles. Given that the choice of triggering is irreversible, this conservatism ensures that the randomness of training does not play a major role in the decision.
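The non-monotonic trigger can be sketched as a check against the best validation loss seen more than a few evaluations ago (the history values and the patience `n=5` are illustrative; the original recipe uses a similar patience over validation checkpoints):

```python
def nt_asgd_trigger(val_losses, n=5):
    """Sketch of the non-monotonic trigger: start averaging once the
    latest validation loss is no better than the best loss recorded
    more than n checks ago."""
    t = len(val_losses) - 1
    return t > n and val_losses[t] > min(val_losses[: t - n])

history = [3.0, 2.5, 2.2, 2.1, 2.05, 2.04, 2.06, 2.07, 2.08, 2.09, 2.10]
fired = [i for i in range(1, len(history))
         if nt_asgd_trigger(history[: i + 1])]
```

Brief early fluctuations do not fire the trigger; only a sustained failure to improve does, which matches the "conservative and irreversible" framing above.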
Mixed Attention Block is an attention module used in the ConvBERT architecture. It is a mixture of self-attention and span-based dynamic convolution (highlighted in pink in the Figure). The two share the same Query but use different Keys to generate the attention map and the convolution kernel respectively. The number of attention heads is reduced by directly projecting the input to a smaller embedding space, forming a bottleneck structure for both self-attention and span-based dynamic convolution. The dimensions of the inputs and outputs of some blocks are labeled in their top-left corners to illustrate the overall framework, where $d$ is the embedding size of the input and $\gamma$ is the reduction ratio.
Single-Headed Attention is a single-headed attention module used in the SHA-RNN language model. The principal design reasons for single-headedness were simplicity (avoiding running out of memory) and scepticism about the benefits of using multiple heads.
ADAHESSIAN
ADAHESSIAN is a new stochastic optimization algorithm that directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including a fast Hutchinson based method to approximate the curvature matrix with low computational overhead.
Temporal Pyramid Network
Temporal Pyramid Network, or TPN, is a feature-level pyramid module for action recognition that can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. The source of features and the fusion of features form a feature hierarchy for the backbone so that it can capture action instances at various tempos. In the TPN, a backbone network extracts multi-level features; a spatial semantic modulation spatially downsamples features to align semantics; a temporal rate modulation temporally downsamples features to adjust the relative tempo among levels; an information flow aggregates features in various directions to enhance and enrich level-wise representations; and a final prediction rescales and concatenates all levels of the pyramid along the channel dimension.
Pattern-Exploiting Training is a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. In the case of PET for sentiment classification, first a number of patterns encoding some form of task description are created to convert training examples to cloze questions; for each pattern, a pretrained language model is finetuned. Secondly, the ensemble of trained models annotates unlabeled data. Lastly, a classifier is trained on the resulting soft-labeled dataset.
Data augmentation using Polya-Gamma latent variables.
This method applies Polya-Gamma latent variables as a way to obtain closed form expressions for full-conditionals of posterior distributions in sampling algorithms like MCMC.
Part Affinity Fields
AdvProp is an adversarial training scheme which treats adversarial examples as additional training examples, to prevent overfitting. Key to the method is the use of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions from normal examples.
modReLU is an activation that is a modification of the ReLU. It is a pointwise nonlinearity, $\sigma_{\text{modReLU}}(z)$, which affects only the absolute value of a complex number, defined as: \begin{align} \sigma_{\text{modReLU}}(z) = \text{ReLU}(|z| + b)\frac{z}{|z|} \end{align} where $b$ is a bias parameter of the nonlinearity. For a $d$-dimensional hidden space we learn $d$ nonlinearity bias parameters, one per dimension.
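A NumPy sketch of the definition above for complex inputs (the small epsilon guarding division by zero and the example values are assumptions):

```python
import numpy as np

def modrelu(z, b, eps=1e-8):
    """modReLU for complex z: rescale the magnitude by ReLU(|z| + b)
    while keeping the phase z/|z| unchanged; b is a learned bias
    (one per dimension in practice)."""
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) * z / (mag + eps)

z = np.array([3 + 4j, 0.1 + 0.0j])
# With b = -1: |3+4j| = 5 shrinks to 4 (same phase); |0.1| is zeroed out.
out = modrelu(z, b=-1.0)
```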
GShard is an intra-layer parallel distributed training method. It consists of a set of simple APIs for annotations, and a compiler extension in XLA for automatic parallelization.
Discriminative Regularization is a regularization technique for variational autoencoders that uses representations from discriminative classifiers to augment the VAE objective function (the lower bound) corresponding to a generative model. Specifically, it encourages the model’s reconstructions to be close to the data example in a representation space defined by the hidden layers of highly-discriminative, neural network based classifiers.
Rigging the Lottery
Rigging the Lottery (RigL) is a dynamic sparse training method that updates the sparsity mask during training: the active weights with the smallest magnitudes are removed, and new connections are added where the gradient magnitude is largest, keeping the total sparsity fixed throughout training.
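A NumPy sketch of one such mask update (the toy weights, gradients, and update size `k` are invented; a real schedule also decays `k` over training):

```python
import numpy as np

def rigl_update_mask(W, grad, mask, k):
    """RigL-style mask update sketch: drop the k active weights with the
    smallest magnitude, then grow the k inactive connections with the
    largest gradient magnitude. Total sparsity stays fixed."""
    mask = mask.copy()
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(~mask)
    drop = active[np.argsort(np.abs(W.ravel()[active]))[:k]]
    grow = inactive[np.argsort(-np.abs(grad.ravel()[inactive]))[:k]]
    mask.ravel()[drop] = False
    mask.ravel()[grow] = True
    return mask

W = np.array([[0.5, 0.0], [0.01, 0.9]])
mask = np.array([[True, False], [True, True]])
grad = np.array([[0.1, 2.0], [0.1, 0.1]])
new_mask = rigl_update_mask(W, grad, mask, k=1)
```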
Temporal Dropout or TempD
A method that randomly masks out all features coming from a specific time-step in time-series data. If the model is agnostic to uneven sequences or missing data in time-series, as attention-based Transformers are, the masked time-steps can simply be ignored in the model's forward prediction. Otherwise, they have to be masked out with some numerical value.
Movement Pruning is a simple, deterministic first-order weight pruning method that is more adaptive to pretrained-model fine-tuning. Magnitude pruning can be seen as utilizing zeroth-order information (the absolute value) of the running model. In contrast, movement pruning derives importance from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process.
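The first-order importance can be sketched as a score accumulated from the product of gradient and weight, where negative accumulated products indicate weights moving away from zero (the example weights and accumulated values are made up, and the accumulator here stands in for quantities tracked across fine-tuning steps):

```python
import numpy as np

def movement_prune_mask(W, accum_grad_dot, keep_frac):
    """Sketch of movement pruning scores: importance S = -(sum over
    training of dL/dW * W); weights that moved away from zero get high
    scores. Keep the top `keep_frac` fraction by score."""
    S = -accum_grad_dot
    k = int(np.ceil(keep_frac * W.size))
    thresh = np.sort(S.ravel())[-k]
    return S >= thresh

W = np.array([0.5, -0.4, 0.1, -0.05])
# Hypothetical accumulated grad*W: negative entries mean the weight grew
# in magnitude (moved away from zero) during fine-tuning.
acc = np.array([-2.0, -1.5, 0.3, 0.8])
mask = movement_prune_mask(W, acc, keep_frac=0.5)
```

Note the contrast with magnitude pruning: the mask depends on the movement scores, not on `|W|`, so even a currently small weight would survive if it were trending away from zero.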
Hierarchical Average Precision training for Pertinent ImagE Retrieval