VirTex, or Visual representations from Textual annotations, is a pretraining approach that uses semantically dense captions to learn visual representations. First, a ConvNet and a Transformer are jointly trained from scratch to generate natural language captions for images. The learned features are then transferred to downstream visual recognition tasks.
PCA Whitening is a processing step for image-based data that makes the input less redundant. Adjacent pixel or feature values can be highly correlated, and whitening via PCA reduces this correlation. Image Source: Wikipedia
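The whitening transform can be sketched in a few lines of NumPy: center the data, eigendecompose the covariance, rotate into the principal-component basis, and rescale each component to unit variance (a minimal illustration with synthetic correlated data, not a production implementation):

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """PCA-whiten rows of X (n_samples x n_features): rotate into the
    principal-component basis and rescale each component to unit variance."""
    X = X - X.mean(axis=0)                        # center the data
    cov = X.T @ X / X.shape[0]                    # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigendecomposition
    return (X @ eigvecs) / np.sqrt(eigvals + eps) # rotate, then rescale

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.2], [0.0, 0.5]])  # correlated features
Xw = pca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))  # ≈ identity: decorrelated, unit variance
```

The small `eps` guards against division by near-zero eigenvalues, which is also common practice when whitening real image data.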
ADAHESSIAN
ADAHESSIAN is a stochastic optimization algorithm that directly incorporates approximate curvature information from the loss function. It includes several novel performance-improving features, including a fast Hutchinson-based method that approximates the curvature matrix with low computational overhead.
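The Hutchinson idea can be illustrated in isolation: for Rademacher vectors z, the expectation of z * (Hz) equals the diagonal of the Hessian H, so only Hessian-vector products are needed. A minimal sketch on a quadratic whose Hessian is known exactly (the matrix A and sample count are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5], [0.5, 3.0]])   # f(x) = x^T A x has Hessian H = 2A

def hvp(z):
    """Exact Hessian-vector product for the quadratic above."""
    return 2.0 * A @ z

# Hutchinson's estimator: E[z * (Hz)] over Rademacher z approximates diag(H)
n = 10000
est = np.zeros(2)
for _ in range(n):
    z = rng.choice([-1.0, 1.0], size=2)   # Rademacher probe vector
    est += z * hvp(z)
est /= n
print(np.round(est, 1))   # ≈ diag(2A) = [2., 6.]
```

In ADAHESSIAN the same estimator is applied to the network loss, with Hz obtained via a second backward pass rather than analytically.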
ZFNet is a classic convolutional neural network. The design was motivated by visualizing intermediate feature layers and the operation of the classifier. Compared to AlexNet, the filter sizes are reduced and the stride of the convolutions is reduced.
Temporal Pyramid Network
Temporal Pyramid Network, or TPN, is a feature-level pyramid module for action recognition which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. The source of features and the fusion of features form a feature hierarchy for the backbone so that it can capture action instances at various tempos. In the TPN, a backbone network extracts features at multiple levels; spatial semantic modulation spatially downsamples features to align semantics; temporal rate modulation temporally downsamples features to adjust the relative tempo among levels; information flow aggregates features in various directions to enhance and enrich level-wise representations; and final prediction rescales and concatenates all pyramid levels along the channel dimension.
YOLOv1 is a single-stage object detection model. Object detection is framed as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. The network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means the network reasons globally about the full image and all the objects in the image.
SNet is a convolutional neural network architecture and object detection backbone used for the ThunderNet two-stage object detector. SNet uses ShuffleNetV2 basic blocks but replaces all 3×3 depthwise convolutions with 5×5 depthwise convolutions.
ChebNet involves a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs. Description from: Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering
LOGAN is a generative adversarial network that uses latent optimization with natural gradient descent (NGD). For the Fisher matrix in NGD, the authors use the empirical Fisher with Tikhonov damping, and they apply Euclidean norm regularization to the optimization step. LOGAN's base architecture is BigGAN-deep with a few modifications: increasing the size of the latent source z, to compensate for the randomness of the source lost when optimising z; using the uniform distribution instead of the standard normal distribution for z, to be consistent with the clipping operation; and using leaky ReLU (with a slope of 0.2 for the negative part) instead of ReLU as the non-linearity, for smoother gradient flow with respect to z.
context2vec is an unsupervised model for learning generic context embeddings of wide sentential contexts using a bidirectional LSTM. A neural model is trained on a large plain-text corpus to embed entire sentential contexts and target words in the same low-dimensional space, which is optimized to reflect inter-dependencies between targets and their entire sentential context as a whole. In contrast to word2vec, which uses context modeling mostly internally and considers the target word embeddings as its main output, the focus of context2vec is the context representation. context2vec achieves its objective by assigning similar embeddings to sentential contexts and their associated target words.
Pattern-Exploiting Training is a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. In the case of PET for sentiment classification, first a number of patterns encoding some form of task description are created to convert training examples to cloze questions; for each pattern, a pretrained language model is finetuned. Secondly, the ensemble of trained models annotates unlabeled data. Lastly, a classifier is trained on the resulting soft-labeled dataset.
Generalized Mean Pooling (GeM) computes the generalized mean of each channel in a tensor. Formally, for the set of values X_c in channel c, GeM computes e_c = ( (1/|X_c|) Σ_{x ∈ X_c} x^p )^(1/p), where p is a (possibly learnable) parameter. Setting the exponent p > 1 increases the contrast of the pooled feature map and focuses on the salient features of the image. GeM is a generalization of the average pooling commonly used in classification networks (p = 1) and of the spatial max-pooling layer (p → ∞). Source: MultiGrain. Image Source: Eva Mohedano
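The two limiting cases can be checked numerically; a minimal NumPy sketch for a (C, H, W) feature map (the clip to a small epsilon assumes non-negative, e.g. post-ReLU, activations):

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized Mean Pooling over the spatial dims of a (C, H, W) tensor."""
    x = np.clip(x, eps, None)                       # GeM assumes non-negative inputs
    return np.mean(x ** p, axis=(1, 2)) ** (1.0 / p)

x = np.random.default_rng(0).uniform(size=(4, 7, 7))  # toy post-ReLU feature map
assert np.allclose(gem_pool(x, p=1.0), x.mean(axis=(1, 2)))              # p = 1: average pooling
assert np.allclose(gem_pool(x, p=100.0), x.max(axis=(1, 2)), atol=0.05)  # large p: approaches max pooling
```

In retrieval networks p is typically treated as a single learnable scalar (often initialized around 3), trained by backpropagation along with the rest of the model.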
Data augmentation using Polya-Gamma latent variables.
This method uses Polya-Gamma latent variables to obtain closed-form expressions for the full conditionals of posterior distributions in sampling algorithms such as MCMC.
Part Affinity Fields
AdvProp is an adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to the method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions to normal examples.
Mirror-BERT converts pretrained language models into effective universal text encoders without any supervision, in 20-30 seconds. It is an extremely simple, fast, and effective contrastive learning technique. It relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during identity fine-tuning.
A Fractal Block is an image model block that utilizes an expansion rule yielding a structural layout of truncated fractals. With the base case f_1(z) = conv(z), a single convolutional layer, the recursive fractals have the form f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z), where C is the number of columns and ⊕ is the join operation. For the join layer (green in Figure), the element-wise mean is used rather than concatenation or addition.
Directed Acyclic Graph Neural Network
A GNN for DAGs that injects their topological order as an inductive bias via asynchronous message passing.
IFBlock is a video model block used in the IFNet architecture for video frame interpolation. IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 convolution and deconvolution as building blocks. Each IFBlock has a feed-forward structure consisting of several convolutional layers and an upsampling operator. Except for the layer that outputs the optical flow residuals and the fusion map, PReLU activations are used.
Strip Pooling is a pooling strategy for scene parsing which considers a long but narrow kernel, i.e., 1 × N or N × 1. As an alternative to global pooling, strip pooling offers two advantages. First, it deploys a long kernel shape along one spatial dimension and hence enables capturing long-range relations of isolated regions. Second, it keeps a narrow kernel shape along the other spatial dimension, which facilitates capturing local context and prevents irrelevant regions from interfering with label prediction. Integrating such long but narrow pooling kernels enables scene parsing networks to simultaneously aggregate both global and local context. This is essentially different from traditional spatial pooling, which collects context from a fixed square region.
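On a single (H, W) feature map, the two strip kernels reduce to averaging along rows and along columns; a minimal NumPy sketch (real strip pooling modules then expand these vectors back and fuse them with 1D convolutions, which is omitted here):

```python
import numpy as np

def strip_pool(x):
    """Strip pooling on a (H, W) map: average with 1 x W row kernels
    and H x 1 column kernels, the long-but-narrow shapes described above."""
    horizontal = x.mean(axis=1)   # (H,) one value per row    (1 x W kernel)
    vertical   = x.mean(axis=0)   # (W,) one value per column (H x 1 kernel)
    return horizontal, vertical

x = np.arange(12, dtype=float).reshape(3, 4)
h, v = strip_pool(x)
print(h)  # [1.5 5.5 9.5]
print(v)  # [4. 5. 6. 7.]
```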
A Global Convolutional Network, or GCN, is a semantic segmentation building block that utilizes a large kernel to help perform classification and localization tasks simultaneously. It can be used in an FCN-like structure, where the GCN is used to generate semantic score maps. Instead of directly using larger kernels or global convolution, the GCN module employs a combination of 1 × k + k × 1 and k × 1 + 1 × k convolutions, which enables dense connections within a large k × k region in the feature map.
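The parameter saving from this decomposition follows from quick arithmetic; a toy per-channel-pair count, with an illustrative kernel size k = 15:

```python
k = 15                   # illustrative large kernel size

dense = k * k            # one full k x k kernel: 225 weights per channel pair
# GCN: two branches, one (1 x k -> k x 1) and one (k x 1 -> 1 x k),
# each contributing k + k weights per channel pair
separable = 2 * (k + k)  # 60 weights per channel pair

assert separable < dense  # same k x k connectivity at roughly 4k / k^2 the cost
```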
DeepMask is an object proposal algorithm based on a convolutional neural network. Given an input image patch, DeepMask generates a class-agnostic mask and an associated score which estimates the likelihood of the patch fully containing a centered object (without any notion of an object category). The core of the model is a ConvNet which jointly predicts the mask and the object score. A large part of the network is shared between those two tasks: only the last few network layers are specialized for separately outputting a mask and score prediction.
Criss-Cross Network
Criss-Cross Network (CCNet) aims to obtain full-image contextual information in an effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. With a further recurrent operation, each pixel can finally capture full-image dependencies. CCNet has the following merits: 1) GPU memory friendliness: compared with the non-local block, the recurrent criss-cross attention module requires 11× less GPU memory. 2) High computational efficiency: recurrent criss-cross attention reduces the FLOPs of the non-local block by about 85%. 3) State-of-the-art performance.
Large convolutional kernels
Usage of larger than typical convolutional kernel sizes, as also seen in 'Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs'
IFNet is an architecture for video frame interpolation that adopts a coarse-to-fine strategy with progressively increased resolutions: it iteratively updates intermediate flows and a soft fusion mask via successive IFBlocks. Conceptually, according to the iteratively updated flow fields, corresponding pixels can be moved from the two input frames to the same location in a latent intermediate frame, with the fusion mask combining pixels from the two inputs. Unlike most previous optical flow models, IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 convolution and deconvolution as building blocks.
Content-Aware ReAssembly of FEatures (CARAFE) is an operator for feature upsampling in convolutional neural networks. CARAFE has several appealing properties: (1) Large field of view. Unlike previous works (e.g. bilinear interpolation) that only exploit subpixel neighborhood, CARAFE can aggregate contextual information within a large receptive field. (2) Content-aware handling. Instead of using a fixed kernel for all samples (e.g. deconvolution), CARAFE enables instance-specific content-aware handling, which generates adaptive kernels on-the-fly. (3) Lightweight and fast to compute.
Dilated convolution with learnable spacings
Dilated convolution with learnable spacings (DCLS) is a type of convolution in which the spacings between the non-zero elements of the kernel are learned during training. A standard dilated convolution increases the receptive field without increasing the number of parameters by inserting a fixed number of zeros between the non-zero kernel elements, effectively skipping over some input features. DCLS takes this idea one step further: because the spacings themselves are learnable, the network can adapt which input positions each kernel element attends to depending on the task at hand. This is particularly helpful for tasks that require long-range dependencies, such as image segmentation and object detection, and DCLS has been shown to be effective for image classification, object detection, and semantic segmentation. It is a promising technique with the potential to improve the performance of convolutional neural networks on a variety of tasks.
modReLU is an activation function that modifies ReLU. It is a pointwise nonlinearity that affects only the absolute value of a complex number z, defined as: modReLU(z) = (|z| + b) · z/|z| if |z| + b ≥ 0, and 0 otherwise, where b is a bias parameter of the nonlinearity. For an n_h-dimensional hidden space, n_h nonlinearity bias parameters are learned, one per dimension.
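A minimal NumPy sketch of the definition above (the small epsilon in the denominator, added here to avoid dividing by a zero modulus, is an implementation detail, not part of the definition):

```python
import numpy as np

def modrelu(z, b):
    """modReLU for complex z: shrink the modulus by bias b,
    zero the output if |z| + b < 0, and keep the phase otherwise."""
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / (mag + 1e-8)
    return scale * z

z = np.array([3 + 4j, 0.1 + 0.1j])
out = modrelu(z, b=-1.0)
# first entry: modulus 5 -> 4, phase preserved; second: modulus ~0.14 < 1 -> zeroed
print(out)
```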
A Relativistic GAN is a type of generative adversarial network. It has a relativistic discriminator which estimates the probability that given real data is more realistic than randomly sampled fake data. The idea is to endow GANs with the property that the probability of real data being real, D(x_r), should decrease as the probability of fake data being real, D(x_f), increases. With a standard GAN, we can achieve this as follows. The standard GAN discriminator can be defined, in terms of the non-transformed layer C(x), as D(x) = sigmoid(C(x)). A simple way to make the discriminator relativistic, i.e. having its output depend on both real and fake data, is to sample from real/fake data pairs and define it as D(x_r, x_f) = sigmoid(C(x_r) − C(x_f)). The modification can be interpreted as: the discriminator estimates the probability that the given real data is more realistic than a randomly sampled fake data. More generally, a Relativistic GAN can be interpreted as having a discriminator of the form a(C(x_r) − C(x_f)), where a is the activation function.
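A minimal numeric sketch of the two formulations above, with hypothetical critic outputs C(x) standing in for a trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical non-transformed discriminator outputs C(x) for a real/fake pair
c_real, c_fake = 0.5, 1.5

d_standard     = sigmoid(c_real)           # standard GAN: P(x_r is real)
d_relativistic = sigmoid(c_real - c_fake)  # relativistic: P(x_r more realistic than x_f)

# Here the fake sample scored higher than the real one, so the relativistic
# discriminator's output falls below 0.5 even though d_standard is above it
print(d_standard, d_relativistic)
```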
PolarNet is an improved grid representation for online, single-scan LiDAR point clouds. Instead of using common spherical or bird's-eye-view projection, the polar bird's-eye-view representation balances the points across grid cells in a polar coordinate system, indirectly aligning a segmentation network's attention with the long-tailed distribution of the points along the radial axis.
GShard is an intra-layer parallel distributed training method. It consists of a set of simple APIs for annotations and a compiler extension in XLA for automatic parallelization.
Continuous Bag-of-Words Word2Vec
Continuous Bag-of-Words Word2Vec is an architecture for creating word embeddings that uses future words as well as past words to create a word embedding. The objective function for CBOW is the average log probability of the centre word given its surrounding context: J = (1/T) Σ_{t=1}^{T} log p(w_t | w_{t−n}, …, w_{t−1}, w_{t+1}, …, w_{t+n}). In the CBOW model, the distributed representations of the context are used to predict the word in the middle of the window. This contrasts with Skip-gram Word2Vec, where the distributed representation of the input word is used to predict the context.
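The "average the context, predict the centre" step can be sketched as a single forward pass; a toy NumPy illustration with a made-up vocabulary size and randomly initialized (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                          # toy vocabulary size and embedding dimension
W_in  = rng.normal(0, 0.1, (V, d))    # input (context) embedding matrix
W_out = rng.normal(0, 0.1, (d, V))    # output projection to vocabulary logits

def cbow_probs(context_ids):
    """CBOW forward pass: average the context word embeddings, then
    softmax over the vocabulary to predict the centre word."""
    h = W_in[context_ids].mean(axis=0)     # distributed context representation
    logits = h @ W_out
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

p = cbow_probs([1, 2, 4, 5])   # context words around a hypothetical centre word
assert p.shape == (10,) and np.isclose(p.sum(), 1.0)
```

Training then maximizes log p[w_centre] over a corpus, updating both matrices by gradient ascent (in practice with hierarchical softmax or negative sampling instead of the full softmax shown here).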
Discriminative Regularization is a regularization technique for variational autoencoders that uses representations from discriminative classifiers to augment the VAE objective function (the lower bound) corresponding to a generative model. Specifically, it encourages the model’s reconstructions to be close to the data example in a representation space defined by the hidden layers of highly-discriminative, neural network based classifiers.
Span-Based Dynamic Convolution is a type of convolution used in the ConvBERT architecture to capture local dependencies between tokens. Kernels are generated by taking in a local span around the current token, which better utilizes local dependency and discriminates between different meanings of the same token (e.g., if “a” is in front of “can” in the input sentence, “can” is apparently a noun, not a verb). Specifically, with classic convolution, we would have fixed parameters shared for all input tokens. Dynamic convolution is therefore preferable because it has higher flexibility in capturing local dependencies of different tokens: it uses a kernel generator to produce different kernels for different input tokens. However, such dynamic convolution cannot differentiate the same token within different contexts, and generates the same kernels (e.g., the three “can” in Figure (b)). Span-based dynamic convolution is therefore developed to produce more adaptive convolution kernels by receiving an input span instead of only a single token, which enables discrimination of generated kernels for the same token within different contexts. For example, as shown in Figure (c), span-based dynamic convolution produces different kernels for different “can” tokens.
Fast Feedforward Networks
A log-time alternative to feedforward layers that outperforms both vanilla feedforward and mixture-of-experts approaches.
Vision-and-Language Transformer
ViLT is a minimal vision-and-language pre-training transformer model in which the processing of visual inputs is simplified to the same convolution-free manner used for text inputs. The model-specific components of ViLT require less computation than the transformer component for multimodal interactions. The model is pre-trained on the following objectives: image-text matching, masked language modeling, and word-patch alignment.
Prioritized Sweeping is a reinforcement learning technique for model-based algorithms that prioritizes updates according to a measure of urgency, and performs these updates first. A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change. When the top pair in the queue is updated, the effect on each of its predecessor pairs is computed. If the effect is greater than some threshold, then the pair is inserted in the queue with the new priority. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
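The queue scheme described above can be sketched on a hypothetical deterministic 4-state chain with a single action and a rewarding, self-absorbing terminal state (all values here are illustrative, not from Sutton and Barto):

```python
import heapq

# Toy deterministic model: (state, action) -> (reward, next_state).
# States 0..3; state 3 loops on itself with reward 1.
model = {(s, 'right'): (1.0 if s == 3 else 0.0, min(s + 1, 3)) for s in range(4)}
predecessors = {s: [sa for sa, (_, ns) in model.items() if ns == s] for s in range(4)}

Q = {sa: 0.0 for sa in model}
gamma, theta = 0.9, 1e-3   # discount factor and priority threshold

def best_value(state):
    return max(Q[(s, a)] for (s, a) in model if s == state)

# Priority queue of (negated |value change|, state-action), seeded
# with the rewarding transition
pq = [(-1.0, (3, 'right'))]
while pq:
    _, (s, a) = heapq.heappop(pq)
    r, s2 = model[(s, a)]
    Q[(s, a)] = r + gamma * best_value(s2)      # full backup of the top pair
    for (ps, pa) in predecessors[s]:            # compute effect on predecessors
        pr, _ = model[(ps, pa)]
        err = abs(pr + gamma * best_value(s) - Q[(ps, pa)])
        if err > theta:                         # re-queue if change is nontrivial
            heapq.heappush(pq, (-err, (ps, pa)))
```

The values converge to Q*(s) = 0.9^(3 − s) · 10 for this chain, with updates always flowing from the largest pending changes backward through the predecessors.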
Sample Consistency Network (SCNet) is a method for instance segmentation which ensures that the IoU distribution of the samples at training time is as close as possible to that at inference time. To this end, only the outputs of the last box stage are used for mask predictions at both training and inference. The Figure shows the IoU distribution of the samples going to the mask branch at training time with/without sample consistency compared to that at inference time.
Visual Parsing is a vision-and-language pretrained model that adopts self-attention for visual feature learning, where each visual token is an approximate weighted mixture of all tokens; visual parsing thus provides the dependencies of each visual token pair. This helps the model better learn visual relations with language and promotes inter-modal alignment. The model is composed of a vision Transformer that takes an image as input and outputs the visual tokens, and a multimodal Transformer. A linear layer and Layer Normalization embed the vision tokens, and word embeddings follow BERT. Vision and language tokens are concatenated to form the input sequences, and the multimodal Transformer fuses the two modalities. A metric named Inter-Modality Flow (IMF) is used to quantify the interactions between the two modalities. Three pretraining tasks are adopted: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Feature Regression (MFR). MFR is a novel task, included in this framework to mask visual tokens with similar or correlated semantics.
SimpleNet is a convolutional neural network with 13 layers. The network employs a homogeneous design utilizing 3 × 3 kernels for convolutional layers and 2 × 2 kernels for pooling operations. The only layers that do not use 3 × 3 kernels are the 11th and 12th layers, which use 1 × 1 convolutional kernels. Feature-map down-sampling is carried out using non-overlapping 2 × 2 max-pooling. To cope with vanishing gradients and over-fitting, SimpleNet also uses batch normalization with a moving-average fraction of 0.95 before every ReLU non-linearity.
Detrended Partial-Cross-Correlation Analysis
Based on detrended cross-correlation analysis (DCCA), this method is improved by including partial-correlation technique, which can be applied to quantify the relations of two non-stationary signals (with influences of other signals removed) on different time scales.
Soft Actor Critic (Autotuned Temperature) is a modification of the SAC reinforcement learning algorithm. SAC can suffer from brittleness with respect to the temperature hyperparameter. Unlike in conventional reinforcement learning, where the optimal policy is independent of the scaling of the reward function, in maximum entropy reinforcement learning the scaling factor has to be compensated by the choice of a suitable temperature, and a sub-optimal temperature can drastically degrade performance. To resolve this issue, SAC with Autotuned Temperature uses an automatic gradient-based temperature tuning method that adjusts the expected entropy over the visited states to match a target value.
Rigging the Lottery
A dynamic sparse training method that uses weight magnitude to select which connections to remove and gradient magnitude to select which to add in each mask update.
Perturbed-Attention Guidance
Temporal Dropout or TempD
A method that randomly masks out all features coming from a specific time-step in time-series data. If the model is insensitive to uneven sequences or missing data, like attention-based transformers, the masked time-steps can simply be ignored in the model's forward prediction. Otherwise, they have to be masked out with some numerical value.
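A minimal NumPy sketch for a (T, F) series: one keep/drop decision per time-step, broadcast across all features, returning the boolean mask so a downstream model can either ignore or consume the masked steps (function name and defaults are illustrative):

```python
import numpy as np

def temporal_dropout(x, drop_prob=0.2, mask_value=0.0, rng=None):
    """Randomly mask ALL features at some time-steps of a (T, F) series.
    Returns the masked array and the boolean keep-mask, so attention-based
    models can simply skip the dropped steps instead of reading mask_value."""
    rng = rng if rng is not None else np.random.default_rng()
    keep = rng.random(x.shape[0]) >= drop_prob     # one decision per time-step
    out = np.where(keep[:, None], x, mask_value)   # broadcast across features
    return out, keep

x = np.ones((10, 3))
out, keep = temporal_dropout(x, drop_prob=0.5, rng=np.random.default_rng(0))
```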
Principal Neighbourhood Aggregation
Principal Neighbourhood Aggregation (PNA) is a general and flexible architecture for graphs combining multiple aggregators with degree-scalers (which generalize the sum aggregator).
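The aggregator/scaler combination can be sketched for a single node; a simplified NumPy illustration (real PNA learns the degree normalizer from the training graph's degree statistics and feeds the result through an MLP, which is omitted here):

```python
import numpy as np

def pna_aggregate(neighbour_feats, avg_degree=4.0):
    """PNA-style aggregation for one node: four aggregators (mean, max,
    min, std) combined with three degree-scalers (identity,
    amplification, attenuation)."""
    d = neighbour_feats.shape[0]                     # node degree
    aggs = np.concatenate([neighbour_feats.mean(axis=0),
                           neighbour_feats.max(axis=0),
                           neighbour_feats.min(axis=0),
                           neighbour_feats.std(axis=0)])
    log_ratio = np.log(d + 1) / np.log(avg_degree + 1)  # simplified degree-scaler
    scalers = [1.0, log_ratio, 1.0 / log_ratio]
    return np.concatenate([s * aggs for s in scalers])

feats = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])  # 3 neighbours, 2 features
out = pna_aggregate(feats)
assert out.shape == (2 * 4 * 3,)   # features x aggregators x scalers
```

Concatenating, rather than summing, the aggregator/scaler products is what lets the network distinguish neighbourhoods that a single aggregator (e.g. sum or mean alone) would map to the same value.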
Graph Representation with Global structure