8,725 machine learning methods and techniques
One-Shot Aggregation (OSA) is an image model block that serves as an alternative to Dense Blocks by aggregating intermediate features. It was proposed as part of the VoVNet architecture. Each convolution layer has a two-way connection: one path feeds the subsequent layer to produce a feature with a larger receptive field, while the other is aggregated only once into the final output feature map. Unlike DenseNet, the output of each layer is not routed to all subsequent intermediate layers, which keeps the input size of intermediate layers constant.
Simple Neural Attention Meta-Learner
The Simple Neural Attention Meta-Learner, or SNAIL, combines the benefits of temporal convolutions and attention to solve meta-learning tasks. The authors introduce positional dependence through temporal convolutions to make the model applicable to reinforcement learning tasks, where the observations, actions, and rewards are intrinsically sequential. They also introduce attention in order to provide pinpoint access over an infinitely large context. SNAIL is constructed by combining the two: temporal convolutions produce the context over which a causal attention operation is applied.
An Eligibility Trace is a memory vector $\mathbf{z}_t \in \mathbb{R}^d$ that parallels the long-term weight vector $\mathbf{w}_t \in \mathbb{R}^d$. The idea is that when a component of $\mathbf{w}_t$ participates in producing an estimated value, the corresponding component of $\mathbf{z}_t$ is bumped up and then begins to fade away. Learning will then occur in that component of $\mathbf{w}_t$ if a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter $\lambda \in [0, 1]$ determines the rate at which the trace falls. Intuitively, eligibility traces tackle the credit assignment problem by capturing both a frequency heuristic (states that are visited more often deserve more credit) and a recency heuristic (states that are visited more recently deserve more credit). Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
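The bump-and-fade behaviour described above can be sketched with tabular TD(λ) using accumulating traces. This is a minimal illustration, not code from the book: the function name, the `(state, reward, next_state)` transition format, and the default step size, discount and decay values are all assumptions made for the example.

```python
def td_lambda_episode(transitions, w, alpha=0.1, gamma=1.0, lam=0.8):
    """Tabular TD(lambda) with accumulating traces over one episode.

    transitions: list of (state, reward, next_state); next_state is None
                 when the episode terminates.
    w:           list of state-value estimates, updated in place.
    """
    z = [0.0] * len(w)  # eligibility trace, one entry per weight
    for s, r, s_next in transitions:
        # Every trace fades by gamma * lambda ...
        z = [gamma * lam * zi for zi in z]
        # ... and the component that produced the current estimate is bumped up.
        z[s] += 1.0
        v_next = 0.0 if s_next is None else w[s_next]
        delta = r + gamma * v_next - w[s]  # TD error
        # Learning happens only in components whose trace is still nonzero.
        for i in range(len(w)):
            w[i] += alpha * delta * z[i]
    return w, z


# Three-state chain 0 -> 1 -> 2 -> terminal, with reward only on the last step.
values, trace = td_lambda_episode(
    [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)], [0.0, 0.0, 0.0]
)
```

In this example the single nonzero TD error at the end of the episode updates all three states, with earlier states receiving less credit because their traces have decayed further.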
Track objects as points
Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That's it. CenterTrack is simple, online (no peeking into the future), and real-time.
TrOCR is an end-to-end Transformer-based OCR model for text recognition with pre-trained CV and NLP models. It leverages the Transformer architecture for both image understanding and wordpiece-level text generation. It first resizes the input text image to $384 \times 384$, and then the image is split into a sequence of $16 \times 16$ patches which are used as the input to the image Transformer. The standard Transformer architecture with the self-attention mechanism is used in both the encoder and the decoder, and wordpiece units are generated as the recognized text from the input image.
Dialogue-Adaptive Pre-training Objective
Dialogue-Adaptive Pre-training Objective (DAPO) is a pre-training objective for dialogue adaptation, designed to measure the quality of dialogues along multiple important aspects. These include aspects already targeted by general LM pre-training objectives, such as Readability, Consistency and Fluency, and aspects that are also significant for assessing dialogues but ignored by general LM pre-training objectives, such as Diversity and Specificity.
Bilateral grid is a new data structure that enables fast edge-aware image processing. It enables edge-aware image manipulations such as local tone mapping on high resolution images in real time. Source: Chen et al. Image source: Chen et al.
online deep learning
Deep Neural Networks (DNNs) are typically trained by backpropagation in a batch learning setting, which requires the entire training data to be made available prior to the learning task. This is not scalable for many real-world scenarios where new data arrives sequentially in a stream form. We aim to address an open challenge of "Online Deep Learning" (ODL) for learning DNNs on the fly in an online setting. Unlike traditional online learning that often optimizes some convex objective function with respect to a shallow model (e.g., a linear/kernel-based hypothesis), ODL is significantly more challenging since the optimization of the DNN objective function is non-convex, and regular backpropagation does not work well in practice, especially for online learning settings.
Wasserstein GAN (Gradient Penalty)
Wasserstein GAN + Gradient Penalty, or WGAN-GP, is a generative adversarial network that uses the Wasserstein loss formulation plus a gradient norm penalty to achieve Lipschitz continuity. The original WGAN uses weight clipping to achieve 1-Lipschitz functions, but this can lead to undesirable behaviour by creating pathological value surfaces and capacity underuse, as well as gradient explosion/vanishing without careful tuning of the weight clipping parameter $c$. A Gradient Penalty is a soft version of the Lipschitz constraint, which follows from the fact that differentiable functions are 1-Lipschitz if and only if their gradients have norm at most 1 everywhere. The squared difference from norm 1 is used as the gradient penalty.
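As a sketch of how the Wasserstein loss and the penalty fit together, the critic loss is commonly written as below. The symbols are not defined in the description above and are introduced here for illustration: $D$ is the critic, $\mathbb{P}_r$ and $\mathbb{P}_g$ are the data and generator distributions, $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples, and $\lambda$ is the penalty coefficient. \begin{align} L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim \mathbb{P}_r}\left[D(x)\right] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \end{align} The last term is the squared difference from norm 1 mentioned above; it replaces the hard constraint imposed by weight clipping.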
NAS-FPN is a Feature Pyramid Network that is discovered via Neural Architecture Search in a novel scalable search space covering all cross-scale connections. The discovered architecture consists of a combination of top-down and bottom-up connections to fuse features across scales.
Self-Attention Network
Self-Attention Network (SANet) proposes two variations of self-attention used for image recognition: 1) pairwise self-attention which generalizes standard dot-product attention and is fundamentally a set operator, and 2) patchwise self-attention which is strictly more powerful than convolution.
ProphetNet is a sequence-to-sequence pre-training model that introduces a novel self-supervised objective named future n-gram prediction, together with an n-stream self-attention mechanism. Instead of optimizing one-step-ahead prediction as in the traditional sequence-to-sequence model, ProphetNet is optimized by $n$-step-ahead prediction, which predicts the next $n$ tokens simultaneously based on the previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for future tokens and helps it predict multiple future tokens.
Relation-aware Global Attention
Relation-aware global attention (RGA) stresses the importance of global structural information provided by pairwise relations, and uses it to produce attention maps. RGA comes in two forms, spatial RGA (RGA-S) and channel RGA (RGA-C). RGA-S first reshapes the input feature map $X$ to $C \times (H \times W)$, and the pairwise relation matrix $R \in \mathbb{R}^{(H \times W) \times (H \times W)}$ is computed using \begin{align} Q &= \delta(W^QX) \end{align} \begin{align} K &= \delta(W^KX) \end{align} \begin{align} R &= Q^TK \end{align} The relation vector $r_i$ at position $i$ is defined by stacking pairwise relations at all positions: \begin{align} r_i = [R(i, :); R(:,i)] \end{align} and the spatial relation-aware feature $y_i$ can be written as \begin{align} y_i = [g^c_\text{avg}(\delta(W^\varphi x_i)); \delta(W^\phi r_i)] \end{align} where $g^c_\text{avg}$ denotes global average pooling in the channel domain. Finally, the spatial attention score at position $i$ is given by \begin{align} a_i = \sigma(W_2\delta(W_1y_i)) \end{align} RGA-C has the same form as RGA-S, except that it takes the input feature map as a set of $H \times W$-dimensional features. RGA uses global relations to generate the attention score for each feature node, so it provides valuable structural information and significantly enhances the representational power. RGA-S and RGA-C are flexible enough to be used in any CNN; Zhang et al. propose using them jointly in sequence to better capture both spatial and cross-channel relationships.
A Noisy Linear Layer is a linear layer with parametric noise added to the weights. This induced stochasticity can be used in reinforcement learning networks for the agent's policy to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. Factorized Gaussian noise is the type of noise usually employed. The noisy linear layer takes the form: \begin{align} y = (\mu^w + \sigma^w \odot \varepsilon^w)x + \mu^b + \sigma^b \odot \varepsilon^b \end{align} where $\varepsilon^w$ and $\varepsilon^b$ are random variables.
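A forward pass through such a layer, y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b, can be sketched in plain Python as follows. The factorized noise construction used here (one Gaussian sample per input and per output, each passed through f(x) = sign(x) * sqrt(|x|), with the weight noise formed as their product) is the scheme usually associated with NoisyNets; the function names and list-based layout are assumptions made for the example, and no learning step is shown.

```python
import math
import random


def scaled_noise(n, rng):
    """Draw n unit Gaussian samples and apply f(x) = sign(x) * sqrt(|x|)."""
    out = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        out.append(math.copysign(math.sqrt(abs(x)), x))
    return out


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """Forward pass of a noisy linear layer with factorized Gaussian noise.

    mu_w and sigma_w are n_out x n_in nested lists; mu_b and sigma_b have
    length n_out. Fresh noise is drawn on every call.
    """
    n_out, n_in = len(mu_w), len(x)
    eps_in = scaled_noise(n_in, rng)    # one noise value per input unit
    eps_out = scaled_noise(n_out, rng)  # one noise value per output unit
    y = []
    for j in range(n_out):
        # Bias noise reuses the per-output noise value.
        total = mu_b[j] + sigma_b[j] * eps_out[j]
        for i in range(n_in):
            # Weight noise is the product of the output and input noise values.
            eps_w = eps_out[j] * eps_in[i]
            total += (mu_w[j][i] + sigma_w[j][i] * eps_w) * x[i]
        y.append(total)
    return y


rng = random.Random(0)
y = noisy_linear(
    [1.0, 1.0],
    [[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]],
    [0.5, -0.5], [0.5, 0.5],
    rng,
)
```

Setting every sigma to zero recovers an ordinary linear layer, which is why the noise scale can be learned down when exploration is no longer useful.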
RepPoints is a representation for object detection that consists of a set of points which indicate the spatial extent of an object and semantically significant local areas. This representation is learned via weak localization supervision from rectangular ground-truth boxes and implicit recognition feedback. Based on the richer RepPoints representation, the authors develop an anchor-free object detector that yields improved performance compared to using bounding boxes.
Triplet attention comprises three branches, each responsible for capturing cross-dimension interaction between the spatial dimensions and the channel dimension of the input. Given an input tensor with shape (C × H × W), each branch is responsible for aggregating cross-dimensional interactive features between either the spatial dimension H or W and the channel dimension C.
BigGAN-deep is a deeper version (4x) of BigGAN. The main difference is a slightly differently designed residual block. Here the vector $z$ is concatenated with the conditional vector without splitting it into chunks. It is also based on residual blocks with bottlenecks. BigGAN-deep uses a different strategy than BigGAN aimed at preserving identity throughout the skip connections. In G, where the number of channels needs to be reduced, BigGAN-deep simply retains the first group of channels and drops the rest to produce the required number of channels. In D, where the number of channels should be increased, BigGAN-deep passes the input channels unperturbed, and concatenates them with the remaining channels produced by a 1 × 1 convolution. As far as the network configuration is concerned, the discriminator is an exact reflection of the generator. There are two blocks at each resolution (BigGAN uses one), and as a result BigGAN-deep is four times deeper than BigGAN. Despite their increased depth, the BigGAN-deep models have significantly fewer parameters, mainly due to the bottleneck structure of their residual blocks.
MODEL EDITOR NETWORKS WITH GRADIENT DECOMPOSITION
GPT-NeoX is an autoregressive transformer decoder model whose architecture largely follows that of GPT-3, with a few notable deviations. The model has 20 billion parameters with 44 layers, a hidden dimension size of 6144, and 64 heads. The main differences from GPT-3 are the change in tokenizer, the addition of Rotary Positional Embeddings, the parallel computation of attention and feed-forward layers, and a different initialization scheme and hyperparameters.
Dynamic Convolution
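DynamicConv is a type of convolution for sequence modelling whose kernels vary over time as a learned function of the individual time steps. It builds upon LightConv and takes the same form but uses a time-step dependent kernel: \begin{align} \text{DynamicConv}(X, i, c) = \text{LightConv}(X, f(X_i)_{h,:}, i, c) \end{align} where $f$ is the learned function that maps the input at time step $i$ to the kernel weights.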
Distributed Distributional DDPG
D4PG, or Distributed Distributional DDPG, is a policy gradient algorithm that extends DDPG. The improvements include distributional updates to the DDPG algorithm, combined with the use of multiple distributed workers all writing into the same replay table. Among the other, simpler changes, the biggest performance gain came from the use of $N$-step returns. The authors found that the use of prioritized experience replay was less crucial to the overall D4PG algorithm, especially on harder problems.
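The multi-step return credited above with the largest gain can be sketched as follows. This is a scalar simplification for illustration only: D4PG applies the same idea to a distribution over returns, and the function name, argument layout and default discount are assumptions rather than part of the algorithm's published code.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of N observed rewards plus a discounted bootstrap value.

    rewards:         the N rewards collected after the current state.
    bootstrap_value: value estimate for the state reached after N steps.
    """
    n = len(rewards)
    discounted_rewards = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted_rewards + (gamma ** n) * bootstrap_value


target = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With a single reward in the list this reduces to the usual one-step TD target.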
A Global Context Network, or GCNet, utilises global context blocks to model long-range dependencies in images. It is based on the Non-Local Network, but it modifies the architecture so less computation is required. Global context blocks are applied to multiple layers in a backbone network to construct the GCNet.
Model-based Subsampling
To avoid the problem caused by low-frequency entity-relation pairs, MBS uses the estimated probabilities from a trained model to calculate frequencies for each triplet and query. By using the parameters $\mathbf{\theta}'$ of the trained model, the NS loss in KGE with MBS is represented as follows: \begin{align} &\ell_{mbs}(\mathbf{\theta};\mathbf{\theta}') \nonumber \\ =&-\frac{1}{|D|}\sum_{(x,y) \in D} \Bigl[A_{mbs}(\mathbf{\theta}')\log(\sigma(s_{\mathbf{\theta}}(x,y)+\gamma))\nonumber\\ &+\frac{1}{\nu}\sum_{y_{i}\sim p_n(y_{i}|x)}^{\nu}B_{mbs}(\mathbf{\theta}')\log(\sigma(-s_{\mathbf{\theta}}(x,y_i)-\gamma))\Bigr], \end{align}
DeepCluster is a self-supervision approach for learning image representations. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network.
Galactica is a language model which uses a Transformer architecture in a decoder-only setup with the following modifications: - It uses GeLU activations on all model sizes - It uses a 2048 length context window for all model sizes - It does not use biases in any of the dense kernels or layer norms - It uses learned positional embeddings for the model - A vocabulary of 50k tokens was constructed using BPE. The vocabulary was generated from a randomly selected 2% subset of the training data
Domain Adaptive Neighborhood Clustering via Entropy Optimization
Domain Adaptive Neighborhood Clustering via Entropy Optimization (DANCE) is a self-supervised clustering method that harnesses the cluster structure of the target domain using self-supervision. This is done with a neighborhood clustering technique that self-supervises feature learning in the target. At the same time, useful source features and class boundaries are preserved and adapted with a partial domain alignment loss that the authors refer to as entropy separation loss. This loss allows the model to either match each target example with the source, or reject it as unknown.
Polynomial Rate Decay is a learning rate schedule where we polynomially decay the learning rate.
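One common form of this schedule can be sketched as below. The description above does not fix a formula, so the parameter names (`initial_lr`, `end_lr`, `decay_steps`, `power`) and the choice to hold the rate at `end_lr` after `decay_steps` are assumptions, modelled on the way several deep learning frameworks define polynomial decay.

```python
def polynomial_decay(step, initial_lr, end_lr, decay_steps, power=1.0):
    """Decay the learning rate from initial_lr to end_lr over decay_steps.

    power=1.0 gives a linear decay; larger powers decay faster early on.
    After decay_steps the rate stays at end_lr.
    """
    step = min(step, decay_steps)
    fraction_remaining = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_remaining ** power + end_lr


# Learning rate at a few points of a 100-step quadratic decay.
schedule = [polynomial_decay(s, 0.1, 0.0, 100, power=2.0) for s in (0, 50, 100)]
```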
FixRes is an image scaling strategy that seeks to optimize classifier performance. It is motivated by the observation that data augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: in fact, a lower train resolution improves the classification at test time! FixRes is a simple strategy to optimize the classifier performance, that employs different train and test resolutions. The calibrations are: (a) calibrating the object sizes by adjusting the crop size and (b) adjusting statistics before spatial pooling.
A Bottleneck Transformer Block is a block used in Bottleneck Transformers that replaces the spatial 3 × 3 convolution layer in a Residual Block with Multi-Head Self-Attention (MHSA).
Edge-augmented Graph Transformer
Transformer neural networks have achieved state-of-the-art results for unstructured data such as text and images but their adoption for graph-structured data has been limited. This is partly due to the difficulty of incorporating complex structural information in the basic transformer framework. We propose a simple yet powerful extension to the transformer - residual edge channels. The resultant framework, which we call Edge-augmented Graph Transformer (EGT), can directly accept, process and output structural information as well as node information. It allows us to use global self-attention, the key element of transformers, directly for graphs and comes with the benefit of long-range interaction among nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. In addition, we introduce a generalized positional encoding scheme for graphs based on Singular Value Decomposition which can improve the performance of EGT. Our framework, which relies on global node feature aggregation, achieves better performance compared to Convolutional/Message-Passing Graph Neural Networks, which rely on local feature aggregation within a neighborhood. We verify the performance of EGT in a supervised learning setting on a wide range of experiments on benchmark datasets. Our findings indicate that convolutional aggregation is not an essential inductive bias for graphs and global self-attention can serve as a flexible and adaptive alternative.
A Scatter Connection is a type of connection that allows a vector to be "scattered" onto a layer representing a map, so that a vector at a specific location corresponds to objects of interest at that location (e.g. units in Starcraft II). This allows for the integration of spatial and non-spatial features.
Patch AutoAugment
Patch AutoAugment (PAA) is a patch-level automatic data augmentation (DA) algorithm that automatically searches for the optimal augmentation policies for the patches of an image. Specifically, PAA allows each patch DA operation to be controlled by an agent and models it as a Multi-Agent Reinforcement Learning (MARL) problem. At each step, the policy network samples the most effective operation for each patch based on its content and the semantics of the whole image. The agents cooperate as a team and share a unified team reward for achieving the joint optimal DA policy of the whole image. PAA is co-trained with a target network through adversarial training.
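MoCo v3 aims to stabilize training of self-supervised ViTs. MoCo v3 is an incremental improvement of MoCo v1/2. Two crops are used for each image under random data augmentation. They are encoded by two encoders $f_q$ and $f_k$ with output vectors $q$ and $k$. $q$ behaves like a "query", where the goal of learning is to retrieve the corresponding "key". The objective is to minimize a contrastive loss function of the following form: \begin{align} \mathcal{L}_q = -\log \frac{\exp(q \cdot k^+ / \tau)}{\exp(q \cdot k^+ / \tau) + \sum_{k^-} \exp(q \cdot k^- / \tau)} \end{align} Here $k^+$ is the output of $f_k$ on the same image as $q$ (the positive sample), the set $\{k^-\}$ contains the outputs of $f_k$ on other images (the negative samples), and $\tau$ is a temperature hyper-parameter. This approach aims to train the Transformer in the contrastive/Siamese paradigm. The encoder $f_q$ consists of a backbone (e.g., ResNet or ViT), a projection head, and an extra prediction head. The encoder $f_k$ has the backbone and projection head but not the prediction head. $f_k$ is updated by the moving average of $f_q$, excluding the prediction head.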
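A contrastive loss of this kind is commonly the InfoNCE form: the negative log of the softmax probability that the query picks out its own key among a set of negatives. The sketch below computes it for a single query in plain Python. The function names, the temperature value of 0.2 and the explicit L2 normalization are assumptions made for the example; it does not include the encoders or the momentum update.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(q, k_pos, k_negs, tau=0.2):
    """InfoNCE loss for one query, its positive key, and a list of negative keys."""
    q = l2_normalize(q)
    keys = [l2_normalize(k) for k in [k_pos] + list(k_negs)]
    # Similarity logits; index 0 is the positive pair.
    logits = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
    # Numerically stable log-sum-exp over all keys.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


matched = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
mismatched = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the query is closest to its own key and grows when a negative key is more similar, which is what drives the query encoder to retrieve the matching key.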
Test-time Local Converter
TLC converts the global operation into a local one, so that it extracts representations based on a local spatial region of the features, as in the training phase.
Ape-X is a distributed architecture for deep reinforcement learning. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors. In contrast to Gorila, Ape-X uses a shared, centralized replay memory, and instead of sampling uniformly, it prioritizes, to sample the most useful data more often. All communications are batched with the centralized replay, increasing the efficiency and throughput at the cost of some latency. And by learning off-policy, Ape-X has the ability to combine data from many distributed actors, by giving the different actors different exploration policies, broadening the diversity of the experience they jointly encounter.
ResNet-D is a modification of the ResNet architecture that uses an average pooling tweak for downsampling. The motivation is that in the unmodified ResNet, the stride-2 1 × 1 convolution in the downsampling block ignores 3/4 of the input feature maps; the modification ensures that no information is ignored.
DeepMind AlphaStar
AlphaStar is a reinforcement learning agent for tackling the game of Starcraft II. It learns a policy $\pi_{\theta}(a_t \mid s_t, z)$ using a neural network with parameters $\theta$ that receives observations $s_t$ as inputs and chooses actions $a_t$ as outputs. Additionally, the policy conditions on a statistic $z$ that summarizes a strategy sampled from human data, such as a build order [1]. AlphaStar uses numerous types of architecture to incorporate different types of features. Observations of player and enemy units are processed with a Transformer. Scatter connections are used to integrate spatial and non-spatial information. The temporal sequence of observations is processed by a core LSTM. Minimap features are extracted with a Residual Network. To manage the combinatorial action space, the agent uses an autoregressive policy and a recurrent pointer network. The agent is trained first with supervised learning from human replays. Parameters are subsequently trained using reinforcement learning that maximizes the win rate against opponents. The RL algorithm is based on a policy-gradient algorithm similar to actor-critic. Updates are performed asynchronously and off-policy. To deal with this, a combination of TD($\lambda$) and V-trace is used, as well as a new self-imitation algorithm (UPGO). Lastly, to address game-theoretic challenges, AlphaStar is trained with league training to try to approximate a fictitious self-play (FSP) setting, which avoids cycles by computing a best response against a uniform mixture of all previous policies. The league of potential opponents includes a diverse range of agents, including policies from current and previous agents. Image Credit: Yekun Chai References 1. Chai, Yekun. "Deciphering AlphaStar on StarCraft II." (2019). https://cyk1337.github.io/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/ Code Implementation 1. https://github.com/opendilab/DI-star
Metropolis-Hastings is a Markov Chain Monte Carlo (MCMC) algorithm for approximate inference. It allows for sampling from a probability distribution where direct sampling is difficult, usually owing to the presence of an intractable integral. M-H consists of a proposal distribution $q(\theta' \mid \theta)$ used to draw a parameter value $\theta'$. To decide whether $\theta'$ is accepted or rejected, we then calculate a ratio: \begin{align} \frac{p(\theta' \mid D)\, q(\theta \mid \theta')}{p(\theta \mid D)\, q(\theta' \mid \theta)} \end{align} For a symmetric proposal the $q$ terms cancel, leaving $p(\theta' \mid D) / p(\theta \mid D)$. We then draw a random number $r \in [0, 1]$ and accept if it is under the ratio, reject otherwise. If we accept, we set $\theta = \theta'$ and repeat. By the end we have a sample of $\theta$ values that we can use to form quantities over an approximate posterior, such as the expectation and uncertainty bounds. In practice, we typically have a period of tuning to achieve an acceptable acceptance ratio for the algorithm, as well as a warmup period to reduce bias towards initialization values. Image: Samuel Hudec
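The accept/reject loop described above can be sketched as a one-dimensional random-walk sampler. This is a minimal illustration under stated assumptions: the proposal is a symmetric Gaussian centred on the current value (so the acceptance ratio is just the ratio of target densities), the target is supplied as an unnormalized log density, and the function name, step size, warmup length and seed are all choices made for the example.

```python
import math
import random


def metropolis_hastings(log_target, initial, n_samples,
                        step_size=1.0, warmup=500, seed=0):
    """Random-walk Metropolis-Hastings for a one-dimensional target.

    log_target returns the log of an unnormalized density.
    Returns the post-warmup samples and the overall acceptance rate.
    """
    rng = random.Random(seed)
    theta = initial
    samples = []
    accepted = 0
    for t in range(warmup + n_samples):
        # Draw a candidate from the symmetric proposal.
        proposal = theta + rng.gauss(0.0, step_size)
        log_ratio = log_target(proposal) - log_target(theta)
        # Accept when a uniform draw falls under min(1, ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
            accepted += 1
        # Keep the current value (repeated on rejection) once warmup is over.
        if t >= warmup:
            samples.append(theta)
    return samples, accepted / (warmup + n_samples)


# Sample from a standard normal, given only its unnormalized log density.
draws, acceptance_rate = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 5000)
posterior_mean = sum(draws) / len(draws)
```

Tuning `step_size` trades off acceptance rate against how far each accepted move travels, which is the tuning period mentioned above.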
Deep Graph Infomax
Deep Graph Infomax (DGI) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs, both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. Description and image from: DEEP GRAPH INFOMAX
ZCA Whitening is an image preprocessing method that leads to a transformation of data such that the covariance matrix is the identity matrix, leading to decorrelated features. Image Source: Alex Krizhevsky
Residual Multi-Layer Perceptrons
Residual Multi-Layer Perceptrons, or ResMLP, is an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. At the end of the network, the patch representations are average pooled and fed to a linear classifier. Layer normalization is replaced with a simpler affine transformation $\text{Aff}_{\alpha, \beta}(x) = \text{Diag}(\alpha)x + \beta$, thanks to the absence of self-attention layers, which makes training more stable. The affine operator is applied at the beginning ("pre-normalization") and end ("post-normalization") of each residual block. As a pre-normalization, Aff replaces LayerNorm without using channel-wise statistics. It is initialized with $\alpha = 1$ and $\beta = 0$. As a post-normalization, Aff is similar to LayerScale and $\alpha$ is initialized with the same small value.
First Integer Neighbor Clustering Hierarchy
Negative Face Recognition
Negative Face Recognition, or NFR, is a face recognition approach that enhances soft-biometric privacy at the template level by representing face templates in a complementary (negative) domain. While ordinary templates characterize facial properties of an individual, negative templates describe facial properties that do not exist for this individual. This suppresses privacy-sensitive information from stored templates. Experiments are conducted on two publicly available datasets captured under controlled and uncontrolled scenarios on three privacy-sensitive attributes.
Self-Supervised Deep Supervision
The method exploits the finding that a high correlation of segmentation performance among the decoder layers of a U-Net, each with a discriminative layer attached, tends to go with higher segmentation performance in the final segmentation map. It introduces an "Inter-layer Divergence Loss", based on Kullback-Leibler divergence, to promote consistency between the discriminative outputs of the decoder layers by minimizing their divergence. If we assume that each decoder layer is equivalent to PDE function parameterized by weight parameter : Then our objective is trying to make each discriminative output similar to each other: Hence the objective is to .
Dual Attention Network
In the field of scene segmentation, encoder-decoder structures cannot make use of the global relationships between objects, whereas RNN-based structures heavily rely on the output of the long-term memorization. To address the above problems, Fu et al. proposed a novel framework, the dual attention network (DANet), for natural scene image segmentation. Unlike CBAM and BAM, it adopts a self-attention mechanism instead of simply stacking convolutions to compute the spatial attention map, which enables the network to capture global information directly. DANet uses a position attention module and a channel attention module in parallel to capture feature dependencies in the spatial and channel domains. Given the input feature map $X$, convolution layers are applied first in the position attention module to obtain new feature maps. Then the position attention module selectively aggregates the features at each position using a weighted sum of features at all positions, where the weights are determined by feature similarity between corresponding pairs of positions. The channel attention module has a similar form, except that it omits dimensional reduction when modelling cross-channel relations. Finally, the outputs from the two branches are fused to obtain the final feature representations. For simplicity, we reshape the feature map $X$ to $C \times (H \times W)$, whereupon the overall process can be written as \begin{align} Q,\quad K,\quad V &= W_qX,\quad W_kX,\quad W_vX \end{align} \begin{align} Y^\text{pos} &= X+ V\text{Softmax}(Q^TK) \end{align} \begin{align} Y^\text{chn} &= X+ \text{Softmax}(XX^T)X \end{align} \begin{align} Y &= Y^\text{pos} + Y^\text{chn} \end{align} where $W_q$, $W_k$, $W_v$ are used to generate new feature maps. The position attention module enables DANet to capture long-range contextual information and adaptively integrate similar features at any scale from a global viewpoint, while the channel attention module is responsible for enhancing useful channels as well as suppressing noise.
Taking spatial and channel relationships into consideration explicitly improves the feature representation for scene segmentation. However, it is computationally costly, especially for large input feature maps.
Factor Graph Attention
Factor Graph Attention is a general multimodal attention unit for any number of modalities. It is inspired by graphical models: it infers several attention beliefs via aggregated interaction messages.
SAINT is a hybrid deep learning approach to solving tabular data problems. SAINT performs attention over both rows and columns, and it includes an enhanced embedding method. The architecture, pre-training and training pipeline are as follows: - $L$ layers are used, each with 2 attention blocks: one self-attention block and one novel intersample attention block that computes attention across samples. - Pre-training minimizes the contrastive and denoising losses between a given data point and its views generated by CutMix and mixup. During finetuning/regular training, data passes through an embedding layer and then the SAINT model. Lastly, of the contextual embeddings produced by SAINT, only the embedding corresponding to the CLS token is passed through an MLP to obtain the final prediction.
Kernel Activation Function
A Kernel Activation Function is a non-parametric activation function defined as a one-dimensional kernel approximator: \begin{align} f(s) = \sum_{i=1}^{D} \alpha_i \kappa(s, d_i) \end{align} where: 1. The dictionary of the kernel elements $d_1, \ldots, d_D$ is fixed by sampling the $x$-axis with a uniform step around 0. 2. The user selects the kernel function $\kappa$ (e.g., Gaussian, ReLU, Softplus) and the number of kernel elements $D$ as a hyper-parameter. A larger dictionary leads to more expressive activation functions and a larger number of trainable parameters. 3. The linear coefficients $\alpha_i$ are adapted independently at every neuron via standard back-propagation. In addition, the linear coefficients can be initialized using kernel ridge regression to behave similarly to a known function at the beginning of the optimization process.
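The three ingredients listed above (a fixed dictionary, a chosen kernel, and per-neuron linear coefficients) can be sketched for a single neuron as f(s) = sum_i alpha_i * kappa(s, d_i). The function names, the dictionary boundary of 3.0 and the Gaussian bandwidth parameter `gamma` are assumptions made for the example; training of the coefficients is not shown.

```python
import math


def make_dictionary(num_elements, boundary=3.0):
    """Fix the dictionary by sampling the x-axis uniformly around 0."""
    step = 2.0 * boundary / (num_elements - 1)
    return [-boundary + i * step for i in range(num_elements)]


def gaussian_kernel(s, d, gamma=1.0):
    """One-dimensional Gaussian kernel between an activation and a dictionary element."""
    return math.exp(-gamma * (s - d) ** 2)


def kaf(s, alphas, dictionary, kernel=gaussian_kernel):
    """Evaluate the kernel activation function at a scalar pre-activation s."""
    return sum(a * kernel(s, d) for a, d in zip(alphas, dictionary))


dictionary = make_dictionary(5, boundary=2.0)
alphas = [0.0, 0.0, 1.0, 0.0, 0.0]  # trainable in practice, fixed here
output = kaf(0.5, alphas, dictionary)
```

Only `alphas` would be updated by back-propagation; the dictionary stays fixed, which keeps the number of trainable parameters per neuron equal to the dictionary size.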