Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

8,725 machine learning methods and techniques

All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

PoolFormer

PoolFormer is instantiated from MetaFormer by specifying the token mixer as an extremely simple operator: pooling. It is used as a tool to verify the MetaFormer hypothesis that "MetaFormer is actually what you need" (as opposed to "Attention is all you need").

Computer Vision · 9 papers

OCD

Overfitting Conditional Diffusion Model

General · 9 papers

MViT

Multiscale Vision Transformer

Multiscale Vision Transformer, or MViT, is a transformer architecture for modeling visual data such as images and videos. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features.

Computer Vision · 9 papers

Collaborative Distillation

Collaborative Distillation is a knowledge distillation method for encoder-decoder based neural style transfer that reduces the number of convolutional filters. The main idea is underpinned by the finding that encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for style transfer models.

General · 9 papers

Expected Sarsa

Expected Sarsa is like Q-learning, but instead of taking the maximum over next state-action pairs, it uses the expected value, taking into account how likely each action is under the current policy. Except for this change to the update rule, the algorithm otherwise follows the scheme of Q-learning. It is more computationally expensive than Sarsa, but it eliminates the variance due to the random selection of the next action $A_{t+1}$. Source: Sutton and Barto, Reinforcement Learning: An Introduction, 2nd Edition.
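The update rule fits in a few lines. A minimal sketch with NumPy; the function name and the toy two-state setup are illustrative, not from the source:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, policy_probs, alpha=0.1, gamma=0.99):
    """One Expected Sarsa update: the target uses the expectation of Q over the
    policy's action distribution in the next state, rather than the max
    (Q-learning) or a single sampled next action (Sarsa)."""
    expected_q = np.dot(policy_probs, Q[s_next])          # E_pi[Q(s', A')]
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q

# Toy usage: 2 states, 2 actions, uniform policy in the next state.
Q = np.zeros((2, 2))
Q = expected_sarsa_update(Q, s=0, a=1, r=1.0, s_next=1,
                          policy_probs=np.array([0.5, 0.5]))
```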

Reinforcement Learning · 9 papers

Dense Contrastive Learning

Dense Contrastive Learning is a self-supervised learning method for dense prediction tasks. It implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of the input images. In contrast to the regular contrastive loss, which is computed between the single feature vectors output by the global projection head (at the level of global features), the dense contrastive loss is computed between the dense feature vectors output by the dense projection head (at the level of local features).

General · 9 papers

EoM

Excess of Mass

Excess of Mass aims to maximize cluster stability.

General · 9 papers

LMU

Legendre Memory Unit

The Legendre Memory Unit (LMU) is mathematically derived to orthogonalize its continuous-time history, doing so by solving d coupled ordinary differential equations (ODEs) whose phase space linearly maps onto sliding windows of time via the Legendre polynomials up to degree d−1. It is optimal for compressing temporal information. See the paper for the full derivation. Official GitHub repo: https://github.com/abr/lmu
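As a sketch of the underlying dynamics (following the derivation in Voelker et al., 2019, with memory state $m(t)$, input $u(t)$, window length $\theta$, and order $d$):

```latex
\theta \, \dot{m}(t) = A\,m(t) + B\,u(t), \qquad
A = [a]_{ij}, \quad a_{ij} = (2i+1)\begin{cases} -1 & i < j \\ (-1)^{i-j+1} & i \ge j \end{cases}, \qquad
B = [b]_i, \quad b_i = (2i+1)(-1)^i, \qquad i, j \in [0, d-1].
```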

Sequential · 9 papers

Siren

Sinusoidal Representation Network

Siren, or Sinusoidal Representation Network, is a neural network architecture for implicit neural representations that uses a periodic activation function. Specifically, each layer applies the sine as its activation: $\phi_i(\mathbf{x}_i) = \sin(W_i \mathbf{x}_i + \mathbf{b}_i)$.

General · 9 papers

RevNet

A Reversible Residual Network, or RevNet, is a variant of a ResNet where each layer's activations can be reconstructed exactly from the next layer's. Therefore, the activations for most layers need not be stored in memory during backpropagation. The result is a network architecture whose activation storage requirements are independent of depth, and typically at least an order of magnitude smaller compared with equally sized ResNets. RevNets are composed of a series of reversible blocks. Units in each layer are partitioned into two groups, denoted $x_1$ and $x_2$; the authors find what works best is partitioning the channels. Each reversible block takes inputs $(x_1, x_2)$ and produces outputs $(y_1, y_2)$ according to the following additive coupling rules, inspired by the transformation in NICE (nonlinear independent components estimation), with residual functions $F$ and $G$ analogous to those in standard ResNets: $y_1 = x_1 + F(x_2)$ and $y_2 = x_2 + G(y_1)$. Each layer's activations can be reconstructed from the next layer's activations as follows: $x_2 = y_2 - G(y_1)$, then $x_1 = y_1 - F(x_2)$. Note that unlike residual blocks, reversible blocks must have a stride of 1, because otherwise the layer discards information and therefore cannot be reversible. Standard ResNet architectures typically have a handful of layers with a larger stride; if we define a RevNet architecture analogously, the activations must be stored explicitly for all non-reversible layers.
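The additive coupling and its exact inverse can be sketched directly; `F` and `G` below are arbitrary stand-ins for the residual functions:

```python
import numpy as np

# Toy residual functions F and G (any functions of one partition will do).
def F(x): return np.tanh(x)
def G(x): return 0.5 * x**2

def revblock_forward(x1, x2):
    # Additive coupling: y1 = x1 + F(x2); y2 = x2 + G(y1)
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def revblock_inverse(y1, y2):
    # Exact reconstruction: x2 = y2 - G(y1); x1 = y1 - F(x2)
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
r1, r2 = revblock_inverse(*revblock_forward(x1, x2))  # recovers (x1, x2)
```

Because the inverse is exact, the forward activations can be discarded and recomputed on the fly during the backward pass.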

Computer Vision · 9 papers

GALA

Global-and-Local attention

Most attention mechanisms learn where to focus using only weak supervisory signals from class labels, which inspired Linsley et al. to investigate how explicit human supervision can affect the performance and interpretability of attention models. As a proof of concept, they proposed the global-and-local attention (GALA) module, which extends an SE block with a spatial attention mechanism. Given an input feature map $X$, GALA uses an attention mask that combines global and local attention to tell the network where and on what to focus. As in SE blocks, global attention aggregates global information by global average pooling and then produces a channel-wise attention weight vector using a multilayer perceptron. In local attention, two consecutive $1\times1$ convolutions are conducted on the input to produce a positional weight map. The outputs of the local and global pathways are combined by addition and multiplication. Formally, GALA can be represented as: \begin{align} s_g &= W_2 \delta (W_1 \text{GAP}(X)) \end{align} \begin{align} s_l &= \text{Conv}_2^{1\times 1} (\delta(\text{Conv}_1^{1\times1}(X))) \end{align} \begin{align} s_g^* &= \text{Expand}(s_g) \end{align} \begin{align} s_l^* &= \text{Expand}(s_l) \end{align} \begin{align} s &= \tanh(a(s_g^* + s_l^*) + m \cdot (s_g^* s_l^*)) \end{align} \begin{align} Y &= sX \end{align} where $a$ and $m$ are learnable parameters representing channel-wise weight vectors. Supervised by human-provided feature importance maps, GALA significantly improves representational power and can be combined with any CNN backbone.

General · 9 papers

FastPitch

FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the Figure. It is based on FastSpeech and composed mainly of two feed-forward Transformer (FFTr) stacks. The first operates at the resolution of input tokens, the second at the resolution of the output frames. Let $x = (x_1, \ldots, x_n)$ be the sequence of input lexical units and $y = (y_1, \ldots, y_t)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $h = \text{FFTr}(x)$. The hidden representation is used to predict the duration $\hat{d}$ and average pitch $\hat{p}$ of every character with a 1-D CNN, where $\hat{d} \in \mathbb{N}^n$ and $\hat{p} \in \mathbb{R}^n$. Next, the pitch is projected to match the dimensionality of the hidden representation and added to $h$. The resulting sum is discretely upsampled according to the durations and passed to the output FFTr, which produces the output mel-spectrogram sequence $\hat{y}$. Ground-truth $p$ and $d$ are used during training, and predicted $\hat{p}$ and $\hat{d}$ are used during inference. The model minimizes the mean-squared error (MSE) between the predicted and ground-truth modalities.

Audio · 9 papers

APPNP

Approximation of Personalized Propagation of Neural Predictions

Neural message-passing algorithms for semi-supervised classification on graphs have recently achieved great success. However, for classifying a node these methods only consider nodes that are a few propagation steps away and the size of this utilized neighbourhood is hard to extend. This paper uses the relationship between graph convolutional networks (GCN) and PageRank to derive an improved propagation scheme based on personalized PageRank. We utilize this propagation procedure to construct a simple model, personalized propagation of neural predictions (PPNP), and its fast approximation, APPNP. Our model's training time is on par or faster and its number of parameters is on par or lower than previous models. It leverages a large, adjustable neighbourhood for classification and can be easily combined with any neural network. We show that this model outperforms several recently proposed methods for semi-supervised classification in the most thorough study done so far for GCN-like models.
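The propagation scheme amounts to a truncated personalized-PageRank power iteration applied to the network's predictions. A minimal NumPy sketch; the function name and toy graph below are illustrative, not from the source:

```python
import numpy as np

def appnp_propagate(H, A, alpha=0.1, K=10):
    """APPNP-style propagation: iterate Z <- (1-alpha) * A_norm @ Z + alpha * H,
    approximating personalized PageRank with teleport probability alpha.
    H: per-node predictions from any neural network; A: adjacency matrix."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * np.outer(d_inv_sqrt, d_inv_sqrt)  # symmetric normalization
    Z = H
    for _ in range(K):
        Z = (1 - alpha) * (A_norm @ Z) + alpha * H     # teleport back to H
    return Z

# Toy path graph 0 - 1 - 2 with two-class predictions per node.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.array([[1., 0.], [0., 1.], [1., 0.]])
Z = appnp_propagate(H, A)
```

The separation of prediction (any network producing `H`) from propagation is what lets the neighborhood size be tuned via `alpha` and `K` without adding parameters.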

Graphs · 9 papers

Rainbow DQN

Rainbow DQN is an extended DQN that combines several improvements into a single learner. Specifically:

- It uses Double Q-Learning to tackle overestimation bias.
- It uses Prioritized Experience Replay to prioritize important transitions.
- It uses dueling networks.
- It uses multi-step learning.
- It uses distributional reinforcement learning, modeling the full return distribution instead of only its expectation.
- It uses noisy linear layers for exploration.

Reinforcement Learning · 9 papers

ASU

Amplifying Sine Unit: An Oscillatory Activation Function for Deep Neural Networks to Recover Nonlinear Oscillations Efficiently

2023

General · 9 papers

GATv2

Graph Attention Network v2

The GATv2 operator from the "How Attentive are Graph Attention Networks?" paper, which fixes the static attention problem of the standard GAT layer: since the linear layers in the standard GAT are applied right after each other, the ranking of attended nodes is unconditioned on the query node. In contrast, in GATv2, every node can attend to any other node. GATv2 scoring function: $e(h_i, h_j) = \mathbf{a}^\top \text{LeakyReLU}(W [h_i \, \| \, h_j])$.

Graphs · 9 papers

Euclidean Norm Regularization

Euclidean Norm Regularization is a regularization step used in generative adversarial networks, typically added to both the generator and discriminator losses; the scalar weight of the regularization term is a hyperparameter. Image credit: LOGAN.

General · 9 papers

ZeRO

Zero Redundancy Optimizer (ZeRO) is a sharded data-parallel method for distributed training. ZeRO-DP removes the memory-state redundancies across data-parallel processes by partitioning the model states instead of replicating them, and it retains the compute/communication efficiency of data parallelism by preserving its computational granularity and communication volume, using a dynamic communication schedule during training.

General · 9 papers

Image Scale Augmentation

Image Scale Augmentation is an augmentation technique where we randomly pick the short side of an image within a dimension range. One use case of this augmentation technique is in object detection tasks.

Computer Vision · 9 papers

Spatial Gating Unit

Spatial Gating Unit, or SGU, is a gating unit used in the gMLP architecture to capture spatial interactions. To enable cross-token interactions, the layer must contain a contraction operation over the spatial dimension. The layer is formulated as the output of linear gating: $s(Z) = Z \odot f_{W,b}(Z)$, where $f_{W,b}(Z) = WZ + b$ is a linear projection over the spatial (token) dimension and $\odot$ denotes element-wise multiplication. For training stability, the authors find it critical to initialize $W$ as near-zero values and $b$ as ones, meaning that $f_{W,b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. This initialization ensures each gMLP block behaves like a regular FFN at the early stage of training, where each token is processed independently, and only gradually injects spatial information across tokens during the course of learning. The authors find it further effective to split $Z$ into two independent parts $(Z_1, Z_2)$ along the channel dimension, one for the gating function and one for the multiplicative bypass: $s(Z) = Z_1 \odot f_{W,b}(Z_2)$. They also normalize the input to $f_{W,b}$, which empirically improved the stability of large NLP models.
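The split-and-gate computation, including the near-identity initialization, can be sketched in NumPy; the function name is illustrative, not from the source:

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """SGU sketch: split channels into two halves, spatially project one half
    (a linear map over the token dimension), and use it to gate the other."""
    Z1, Z2 = np.split(Z, 2, axis=-1)   # split along the channel dimension
    f = W @ Z2 + b                     # contraction over the spatial dimension
    return Z1 * f                      # element-wise gating

n, d = 4, 6                            # tokens, channels
Z = np.random.randn(n, d)
W = np.zeros((n, n))                   # near-zero init of W ...
b = np.ones((n, 1))                    # ... and ones for b
out = spatial_gating_unit(Z, W, b)     # so the unit is the identity on Z1 at init
```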

General · 9 papers

TabTransformer

TabTransformer is a deep tabular data modeling architecture for supervised and semi-supervised learning. The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. As an overview, the architecture comprises a column embedding layer, a stack of Transformer layers, and a multi-layer perceptron (MLP). The contextual embeddings (output by the Transformer layers) are concatenated with the continuous features and fed to the MLP. The loss function is then minimized to learn all the parameters in end-to-end learning.

General · 9 papers

Varifocal Loss

Varifocal Loss is a loss function for training a dense object detector to predict the IoU-aware classification score (IACS), inspired by the focal loss. Unlike the focal loss, which deals with positives and negatives equally, Varifocal Loss treats them asymmetrically: $\text{VFL}(p, q) = -q(q \log p + (1-q)\log(1-p))$ if $q > 0$, and $-\alpha p^\gamma \log(1-p)$ if $q = 0$, where $p$ is the predicted IACS and $q$ is the target IoU score. For a positive training example, $q$ is set as the IoU between the generated bounding box and the ground-truth one (gt IoU), whereas for a negative training example, the training target $q$ for all classes is $0$.
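The asymmetric treatment of positives and negatives is easy to see numerically. A minimal sketch, assuming the formulation above with the paper's default-style hyperparameters (the function name and example values are illustrative):

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Per-prediction Varifocal Loss. p: predicted IACS in (0,1); q: target IoU
    score (q > 0 for positives, q = 0 for negatives). Positives are weighted by
    q itself; negatives are down-weighted by p**gamma, focal-loss style."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return np.where(q > 0,
                    -q * (q * np.log(p) + (1 - q) * np.log(1 - p)),
                    -alpha * p**gamma * np.log(1 - p))

# A confident positive (p=0.9, gt IoU 0.8) vs. a low-scoring negative (p=0.1).
loss = varifocal_loss(np.array([0.9, 0.1]), np.array([0.8, 0.0]))
```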

General · 8 papers

STAC

STAC is a semi-supervised framework for visual object detection along with a data augmentation strategy. STAC deploys highly confident pseudo labels of localized objects from an unlabeled image and updates the model by enforcing consistency via strong augmentations. We generate pseudo labels (i.e., bounding boxes and their class labels) for unlabeled data using test-time inference, including NMS, of the teacher model trained with labeled data. We then compute an unsupervised loss with respect to pseudo labels whose confidence scores are above a threshold $\tau$. The strong augmentations are applied for augmentation consistency during model training. Target boxes are augmented when global geometric transformations are used.

General · 8 papers

VFNet

VarifocalNet

VarifocalNet is a method aimed at accurately ranking a huge number of candidate detections in object detection. It consists of a new loss function, named Varifocal Loss, for training a dense object detector to predict the IACS, and a new efficient star-shaped bounding box feature representation for estimating the IACS and refining coarse bounding boxes. Combining these two new components and a bounding box refinement branch, results in a dense object detector on the FCOS architecture, what the authors call VarifocalNet or VFNet for short.

Computer Vision · 8 papers

classifier-guidance

Computer Vision · 8 papers

Denoised Smoothing

Denoised Smoothing is a method for obtaining a provably robust classifier from a fixed pretrained one, without any additional training or fine-tuning of the latter. The basic idea is to prepend a custom-trained denoiser $D$ to the pretrained classifier $f$, and then apply randomized smoothing. Randomized smoothing is a certified defense that converts any given classifier $f$ into a new smoothed classifier $g$ that is characterized by a non-linear Lipschitz property. When queried at a point $x$, the smoothed classifier $g$ outputs the class that is most likely to be returned by $f$ under isotropic Gaussian perturbations of its inputs. Unfortunately, randomized smoothing requires that the underlying classifier is robust to relatively large random Gaussian perturbations of the input, which is not the case for off-the-shelf pretrained models. Applying the custom-trained denoiser $D$ before the classifier $f$ effectively makes $f$ robust to such Gaussian perturbations, thereby making it "suitable" for randomized smoothing.
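The smoothing step itself is just majority voting under Gaussian noise. A minimal sketch, where `base_classifier` is a toy stand-in for the (denoiser + pretrained classifier) pipeline and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def base_classifier(x):
    """Toy stand-in for D followed by f: classify by the sign of the mean."""
    return int(x.mean() > 0)

def smoothed_classify(x, sigma=0.25, n=1000):
    """Randomized smoothing: return the class most frequently predicted under
    isotropic Gaussian perturbations of the input (a Monte Carlo estimate of
    the smoothed classifier g)."""
    votes = np.zeros(2, dtype=int)
    for _ in range(n):
        votes[base_classifier(x + sigma * rng.standard_normal(x.shape))] += 1
    return int(votes.argmax())

pred = smoothed_classify(np.full(8, 0.5))
```

In the full method, the vote counts also feed a statistical test that yields a certified robustness radius around `x`.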

General · 8 papers

Pyramidal Residual Unit

A Pyramidal Residual Unit is a type of residual unit where the number of channels gradually increases as a function of the depth at which the layer occurs, similar to a pyramid whose shape gradually widens from the top downwards. It was introduced as part of the PyramidNet architecture.

Computer Vision · 8 papers

TDN

Temporal Difference Network

TDN, or Temporal Difference Network, is an action recognition model that aims to capture multi-scale temporal information. To fully capture temporal information over the entire video, the TDN is established with a two-level difference modeling paradigm. Specifically, for local motion modeling, the temporal difference over consecutive frames is used to supply 2D CNNs with finer motion patterns, while for global motion modeling, the temporal difference across segments is incorporated to capture long-range structure for motion feature excitation.

Computer Vision · 8 papers

DD-PPO

Decentralized Distributed Proximal Policy Optimization

Decentralized Distributed Proximal Policy Optimization (DD-PPO) is a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever 'stale'), making it conceptually simple and easy to implement. Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning; its motivation was to obtain an algorithm with the data efficiency and reliable performance of TRPO while using only first-order optimization. Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$, so $r_t(\theta_{old}) = 1$. TRPO maximizes a "surrogate" objective: $L^{CPI}(\theta) = \hat{\mathbb{E}}_t[r_t(\theta) \hat{A}_t]$. As a general abstraction, DD-PPO implements the following: at step $k$, worker $n$ has a copy of the parameters $\theta_k^n$, calculates the gradient $\partial\theta_k^n$, and updates $\theta$ via $\theta_{k+1}^n = \text{ParamUpdate}(\theta_k^n, \text{AllReduce}(\partial\theta_k^1, \ldots, \partial\theta_k^N))$, where ParamUpdate is any first-order optimization technique (e.g. gradient descent) and AllReduce performs a reduction (e.g. mean) over all copies of a variable and returns the result to all workers. Distributed DataParallel scales very well (near-linear scaling up to 32,000 GPUs) and is reasonably simple to implement (all workers synchronously running identical code).

Reinforcement Learning · 8 papers

Context Enhancement Module

Context Enhancement Module (CEM) is a feature extraction module used in object detection (specifically, in ThunderNet) that aims to enlarge the receptive field. The key idea of CEM is to aggregate multi-scale local context information and global context information to generate more discriminative features. In CEM, the feature maps from three scales are merged: $C_4$, $C_5$ and $C_{glb}$, where $C_{glb}$ is the global context feature vector obtained by applying global average pooling on $C_5$. A $1\times1$ convolution is then applied to each feature map to squeeze them to a common number of channels. Afterwards, $C_5$ is upsampled by $2\times$ and $C_{glb}$ is broadcast so that the spatial dimensions of the three feature maps are equal. At last, the three generated feature maps are aggregated. By leveraging both local and global context, CEM effectively enlarges the receptive field and refines the representation ability of the thin feature map. Compared with prior FPN structures, CEM involves only two $1\times1$ convolutions and an fc layer.

Computer Vision · 8 papers

AGCN

Adaptive Graph Convolutional Neural Networks

AGCN is a spectral graph convolution network that can feed on original data of diverse graph structures. Image credit: Adaptive Graph Convolutional Neural Networks.

Graphs · 8 papers

Positional Encoding Generator

Positional Encoding Generator, or PEG, is the module used to produce Conditional Positional Encodings. It dynamically produces the positional encodings conditioned on the local neighborhood of an input token. To condition on the local neighbors, the flattened input sequence of DeiT is first reshaped back to the 2-D image space. Then a function (denoted by $\mathcal{F}$ in the Figure) is repeatedly applied to local patches to produce the conditional positional encodings. PEG can be efficiently implemented with a 2-D convolution with kernel $k$ and zero padding. Note that the zero padding here is important to make the model aware of absolute positions, and $\mathcal{F}$ can take various forms, such as separable convolutions and many others.

General · 8 papers

Spatial Feature Transform

Spatial Feature Transform, or SFT, is a layer that generates affine transformation parameters for spatial-wise feature modulation, and was originally proposed within the context of image super-resolution. A Spatial Feature Transform (SFT) layer learns a mapping function $\mathcal{M}$ that outputs a modulation parameter pair $(\gamma, \beta)$ based on some prior condition $\Psi$: $(\gamma, \beta) = \mathcal{M}(\Psi)$. The learned parameter pair adaptively influences the outputs by applying an affine transformation spatially to each intermediate feature map in an SR network. During testing, only a single forward pass is needed to generate the HR image given the LR input and segmentation probability maps. After obtaining $(\gamma, \beta)$ from the conditions, the transformation is carried out by scaling and shifting the feature maps of a specific layer: $\text{SFT}(F \mid \gamma, \beta) = \gamma \odot F + \beta$, where $F$ denotes the feature maps, whose dimensions are the same as $\gamma$ and $\beta$, and $\odot$ refers to element-wise multiplication, i.e. the Hadamard product. Since the spatial dimensions are preserved, the SFT layer not only performs feature-wise manipulation but also spatial-wise transformation.

Computer Vision · 8 papers

GLN

Gated Linear Network

A Gated Linear Network, or GLN, is a type of backpropagation-free neural architecture. What distinguishes GLNs from contemporary neural networks is the distributed and local nature of their credit assignment mechanism; each neuron directly predicts the target, forgoing the ability to learn feature representations in favor of rapid online learning. Individual neurons can model nonlinear functions via the use of data-dependent gating in conjunction with online convex optimization. GLNs are feedforward networks composed of many layers of gated geometric mixing neurons, as shown in the Figure. Each neuron in a given layer outputs a gated geometric mixture of the predictions from the previous layer, with the final layer consisting of just a single neuron. In a supervised learning setting, a GLN is trained on (side information, base predictions, label) triples derived from input-label pairs. There are two types of input to neurons in the network: the first is the side information, which can be thought of as the input features; the second is the input to the neuron, which will be the predictions output by the previous layer, or, in the case of layer 0, some (optionally) provided base predictions that typically will be a function of the side information. Each neuron will also take in a constant bias prediction, which helps empirically and is essential for universality guarantees. Weights are learnt in a Gated Linear Network using Online Gradient Descent (OGD) locally at each neuron. The key observation is that, as each neuron beyond the first layer is itself a gated geometric mixture, all of these neurons can be thought of as individually predicting the target. Given side information, each neuron therefore suffers a loss that is convex in its active weights.

General · 8 papers

UL2

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

Natural Language Processing · 8 papers

Gradual Self-Training

Gradual self-training is a method for semi-supervised domain adaptation. The goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. This comes up, for example, in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces, where machine learning systems must adapt to data distributions that evolve over time. The gradual self-training algorithm begins with a classifier trained on labeled examples from the source domain (Figure a). For each successive domain, the algorithm generates pseudolabels for unlabeled examples from that domain, and then trains a regularized supervised classifier on the pseudolabeled examples. The intuition, visualized in the Figure, is that after a single gradual shift, most examples are pseudolabeled correctly, so self-training learns a good classifier on the shifted data; the shift from the source directly to the target, by contrast, can be too large for self-training to correct.

General · 8 papers

Embedded Gaussian Affinity

Embedded Gaussian Affinity is a type of affinity or self-similarity function between two points $\mathbf{x}_i$ and $\mathbf{x}_j$ that uses a Gaussian function in an embedding space: $f(\mathbf{x}_i, \mathbf{x}_j) = e^{\theta(\mathbf{x}_i)^{T} \phi(\mathbf{x}_j)}$. Here $\theta(\mathbf{x}_i) = W_\theta \mathbf{x}_i$ and $\phi(\mathbf{x}_j) = W_\phi \mathbf{x}_j$ are two embeddings. Note that the self-attention module used in the original Transformer model is a special case of non-local operations in the embedded Gaussian version. This can be seen from the fact that, for a given $i$, the normalized affinity $\frac{1}{\mathcal{C}(\mathbf{x})} f(\mathbf{x}_i, \mathbf{x}_j)$ becomes the softmax computation along the dimension $j$, which is the self-attention form in the Transformer model. This shows how we can relate this recent self-attention model to the classic computer vision method of non-local means.
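The softmax equivalence is straightforward to verify numerically. A minimal NumPy sketch (function and weight names are illustrative):

```python
import numpy as np

def embedded_gaussian_affinity(X, W_theta, W_phi):
    """Normalized embedded Gaussian affinity: exponentiating the dot product of
    the two embeddings and normalizing over j is exactly a row-wise softmax,
    i.e. the attention weights of Transformer self-attention."""
    theta, phi = X @ W_theta, X @ W_phi
    logits = theta @ phi.T                         # theta(x_i)^T phi(x_j)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable exp
    return e / e.sum(axis=1, keepdims=True)        # normalize over j

X = np.random.randn(5, 3)                          # 5 points, 3 features
W_theta, W_phi = np.random.randn(3, 4), np.random.randn(3, 4)
attn = embedded_gaussian_affinity(X, W_theta, W_phi)
```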

General · 8 papers

Snapshot Ensembles

Snapshot Ensembles: Train 1, get M for free

The overhead cost of training multiple deep neural networks can be very high in terms of training time, hardware, and computational resource requirements, and often acts as an obstacle to creating deep ensembles. To overcome these barriers, Huang et al. proposed a method to create an ensemble which, at the cost of training one model, yields multiple constituent model snapshots that can be ensembled together to create a strong learner. The core idea is to make the model converge to several local minima along the optimization path and save the model parameters at these local minima points. During the training phase, a neural network traverses many such points. The lowest of all such local minima is known as the global minimum. The larger the model, the more parameters it has and the larger the number of local minima. This implies there are discrete sets of weights and biases at which the model makes fewer errors, so every such minimum can be considered a weak but potentially useful learner for the problem being solved. Multiple such snapshots of weights and biases are recorded, which can later be ensembled to get a better-generalized model that makes the fewest mistakes.
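The convergence to several local minima is driven by a cyclic cosine-annealed learning rate that restarts M times over training, with a snapshot saved at the end of each cycle (each local minimum). A minimal sketch of that schedule; the function name is illustrative:

```python
import math

def snapshot_lr(t, T, M, alpha0=0.1):
    """Cyclic cosine-annealed learning rate: over T total iterations the rate
    decays from alpha0 to ~0 within each of M cycles, then restarts. A model
    snapshot is saved at the end of each cycle, when the rate is smallest."""
    cycle_len = math.ceil(T / M)
    return alpha0 / 2 * (math.cos(math.pi * (t % cycle_len) / cycle_len) + 1)

T, M = 300, 3
lrs = [snapshot_lr(t, T, M) for t in range(T)]
# The rate restarts at alpha0 at t = 0, 100, 200 and is near zero just before.
```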

General · 8 papers

PIRL

Pretext-Invariant Representation Learning (PIRL, pronounced as “pearl”) learns invariant representations based on pretext tasks. PIRL is used with a commonly used pretext task that involves solving jigsaw puzzles. Specifically, PIRL constructs image representations that are similar to the representation of transformed versions of the same image and different from the representations of other images.

General · 8 papers

Rotary Embeddings

Rotary Position Embedding

Rotary Position Embedding, or RoPE, is a type of position embedding which encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties, such as the flexibility of being extended to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
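The relative-position property (attention scores depend only on the positional offset) can be checked numerically. A minimal NumPy sketch that rotates consecutive coordinate pairs; the function name is illustrative:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding: rotate consecutive coordinate pairs of
    x by position-dependent angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q, k = np.random.randn(8), np.random.randn(8)
# The q.k score depends only on the relative offset m - n:
s1 = rope(q, 7) @ rope(k, 4)    # positions (7, 4), offset 3
s2 = rope(q, 10) @ rope(k, 7)   # positions (10, 7), offset 3 -> same score
```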

General · 8 papers

NIMA

Neural Image Assessment

In the context of image enhancement, maximizing NIMA score as a prior can increase the likelihood of enhancing perceptual quality of an image.

General · 8 papers

RFB

Receptive Field Block

Receptive Field Block (RFB) is a module for strengthening the deep features learned from lightweight CNN models so that they can contribute to fast and accurate detectors. Specifically, RFB makes use of multi-branch pooling with varying kernels corresponding to RFs of different sizes, applies dilated convolution layers to control their eccentricities, and reshapes them to generate final representation.

Computer Vision · 8 papers

Pyramidal Bottleneck Residual Unit

A Pyramidal Bottleneck Residual Unit is a type of residual unit where the number of channels gradually increases as a function of the depth at which the layer occurs, similar to a pyramid whose shape gradually widens from the top downwards. It also contains a bottleneck using 1x1 convolutions. It was introduced as part of the PyramidNet architecture.

Computer Vision · 8 papers

GCT

Gated Channel Transformation

GCT first collects global information by computing the $\ell_2$-norm of each channel. Next, a learnable vector $\alpha$ is applied to scale the feature. Then a competition mechanism is adopted by channel normalization to enable interaction between channels. Like other common normalization methods, a learnable scale parameter $\gamma$ and bias $\beta$ are applied to rescale the normalization. However, unlike previous methods, GCT adopts a tanh activation to control the attention vector. Finally, it not only multiplies the input by the attention vector but also adds an identity connection. GCT can be written as: \begin{align} s = F_{\text{gct}}(X, \theta) &= \tanh (\gamma \, CN(\alpha \, \text{Norm}(X)) + \beta) \end{align} \begin{align} Y &= sX + X \end{align} where $\alpha$, $\beta$ and $\gamma$ are trainable parameters, $\text{Norm}(\cdot)$ indicates the $\ell_2$-norm of each channel, and $CN$ is channel normalization. A GCT block has fewer parameters than an SE block and, as it is lightweight, can be added after each convolutional layer of a CNN.

General · 8 papers

OHEM

Online Hard Example Mining

Some object detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient. OHEM, or Online Hard Example Mining, is a bootstrapping technique that modifies SGD to sample from examples in a non-uniform way depending on the current loss of each example under consideration. The method takes advantage of detection-specific problem structure in which each SGD mini-batch consists of only one or two images, but thousands of candidate examples. The candidate examples are subsampled according to a distribution that favors diverse, high loss instances.
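The selection step can be sketched simply: given per-candidate losses for one mini-batch, keep only the highest-loss examples for the backward pass. The function name and batch size below are illustrative, not from the source:

```python
import numpy as np

def ohem_select(losses, batch_size=128):
    """OHEM sketch: from thousands of candidate examples, select the indices
    of the highest-loss ('hard') ones to form the training mini-batch."""
    order = np.argsort(losses)[::-1]   # sort candidates by loss, descending
    return order[:batch_size]

losses = np.random.rand(2000)          # losses for 2000 candidate RoIs
hard = ohem_select(losses, batch_size=128)
```

The full method also applies NMS to the candidates before selection, so that highly overlapping hard examples are not redundantly chosen.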

General · 8 papers

FIERCE

Feature Information Entropy Regularized Cross Entropy

FIERCE is an entropic regularization on the feature space.

General · 7 papers

uPIT

Utterance-level Permutation Invariant Training

General · 7 papers

GraphSAINT

Graph sampling based inductive learning method

A scalable method to train large-scale GNN models by sampling small subgraphs.

Graphs · 7 papers
Page 15 of 175