Aging Evolution, or Regularized Evolution, is an evolutionary algorithm for neural architecture search. Whereas in tournament selection, the best architectures are kept, in aging evolution we associate each genotype with an age, and bias the tournament selection to choose the younger genotypes. In the context of architecture search, aging evolution allows us to explore the search space more, instead of zooming in on good models too early, as non-aging evolution would.
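The loop above can be sketched on a toy bit-string search space (the genotype, fitness, and mutation functions below are illustrative stand-ins for real architectures and their evaluation):

```python
import random

def aging_evolution(fitness, random_genotype, mutate,
                    cycles=200, population_size=20, sample_size=5, seed=0):
    """Minimal aging-evolution sketch. The population is a FIFO queue:
    each cycle, a random sample is drawn, the fittest sample member is
    mutated into a child, and the *oldest* genotype is removed --
    biasing selection toward younger genotypes."""
    rng = random.Random(seed)
    population, history = [], []
    for _ in range(population_size):
        g = random_genotype(rng)
        population.append(g)
        history.append(g)
    for _ in range(cycles):
        sample = rng.sample(population, sample_size)
        parent = max(sample, key=fitness)  # tournament over the sample
        child = mutate(parent, rng)
        population.append(child)
        history.append(child)
        population.pop(0)                  # discard oldest, regardless of fitness
    return max(history, key=fitness)

def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(10)]

def flip_one_bit(g, rng):
    j = rng.randrange(len(g))
    return g[:j] + [1 - g[j]] + g[j + 1:]

# Toy search space: maximize the number of 1-bits ("fitness" = sum).
best = aging_evolution(sum, random_bits, flip_one_bit)
```

Because genotypes are evicted by age rather than by fitness, even a good model must be rediscovered through mutation to survive, which is what keeps the search exploring.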
Accuracy-Robustness Area
In the space of adversarial perturbation against classifier accuracy, the ARA is the area between a classifier's curve and the straight line defined by a naive classifier's maximum accuracy. Intuitively, the ARA measures a combination of the classifier's predictive power and its ability to overcome an adversary. Importantly, when contrasted against existing robustness metrics, the ARA takes into account the classifier's performance against all adversarial examples, without bounding them by some arbitrary perturbation budget.
OPT-IML is a version of OPT fine-tuned on a large collection of 1500+ NLP tasks divided into various task categories.
A Unitary RNN is a recurrent neural network architecture that uses a unitary hidden-to-hidden matrix. Specifically, it concerns dynamics of the form h_t = f(W h_{t-1} + V x_t), where W is a unitary matrix (W* W = I). The product of unitary matrices is a unitary matrix, so W can be parameterised as a product of simpler unitary matrices: W = D_3 R_2 F^{-1} D_2 P R_1 F D_1, where D_1, D_2, D_3 are learned diagonal complex matrices, and R_1, R_2 are learned reflection matrices. The matrices F and F^{-1} are the discrete Fourier transform and its inverse, and P is a constant random permutation. The activation function applies a rectified linear unit with a learned bias to the modulus of each complex number. Only the diagonal and reflection matrices, D_i and R_i, are learned, so Unitary RNNs have fewer parameters than LSTMs with comparable numbers of hidden units. Source: Associative LSTMs
This method proposes first discretizing observations and calculating the action distribution distance under comparable cases (intersection states).
The Enhanced Fusion Framework proposes three different ideas to improve the existing MI-based BCI frameworks. Image source: Fumanal-Idocin et al.
ConViT is a type of vision transformer that uses a gated positional self-attention module (GPSA), a form of positional self-attention which can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers, then each attention head is given the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information.
Adaptive Non-Maximum Suppression is a non-maximum suppression algorithm that applies a dynamic suppression threshold to an instance according to the target density. The motivation is to find an NMS algorithm that works well for pedestrian detection in a crowd. Intuitively, a high NMS threshold keeps more crowded instances while a low NMS threshold wipes out more false positives. The adaptive-NMS thus applies a dynamic suppression strategy, where the threshold rises as instances gather and occlude each other and decays when instances appear separately. To this end, an auxiliary and learnable sub-network is designed to predict the adaptive NMS threshold for each instance.
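The dynamic-threshold rule can be sketched as a small variant of greedy NMS (the per-instance density values are plain inputs here, standing in for the predictions of the auxiliary sub-network):

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def adaptive_nms(boxes, scores, densities, base_threshold=0.5):
    """Greedy NMS where each kept box M suppresses neighbours using
    max(base_threshold, density(M)): crowded instances suppress less."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)
        keep.append(m)
        thresh = max(base_threshold, densities[m])
        order = [i for i in order if iou(boxes[m], boxes[i]) <= thresh]
    return keep

# Two heavily overlapping detections (IoU = 2/3): both survive in a
# "crowded" region, but are merged in a "sparse" one.
crowded = adaptive_nms([(0, 0, 10, 10), (2, 0, 12, 10)], [0.9, 0.8], [0.7, 0.7])
sparse = adaptive_nms([(0, 0, 10, 10), (2, 0, 12, 10)], [0.9, 0.8], [0.2, 0.2])
```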
Network Embedding as Matrix Factorization
Differential Diffusion is an enhancement of image-to-image diffusion models that adds the ability to control the amount of change applied to each image fragment via a change map.
SqueezeNeXt is a type of convolutional neural network that uses the SqueezeNet architecture as a baseline, but makes a number of changes. First, a more aggressive channel reduction is achieved by incorporating a two-stage squeeze module, which significantly reduces the total number of parameters used with the 3×3 convolutions. Second, it uses separable 3×3 convolutions to further reduce the model size, and removes the additional 1×1 branch after the squeeze module. Third, the network uses an element-wise addition skip connection similar to that of the ResNet architecture.
Canonical Tensor Decomposition with N3 Regularizer
Canonical Tensor Decomposition, trained with N3 regularizer
Gated Positional Self-Attention
Gated Positional Self-Attention (GPSA) is a self-attention module for vision transformers, used in the ConViT architecture, that can be initialized as a convolutional layer -- helping a ViT learn inductive biases about locality.
Seesaw Loss is a loss function for long-tailed instance segmentation. It dynamically re-balances the gradients of positive and negative samples on a tail class with two complementary factors: a mitigation factor and a compensation factor. The mitigation factor reduces punishments to tail categories w.r.t. the ratio of cumulative training instances between different categories. Meanwhile, the compensation factor increases the penalty of misclassified instances to avoid false positives of tail categories. The synergy of the two factors enables Seesaw Loss to mitigate the overwhelming punishments on tail classes as well as compensate for the risk of misclassification caused by diminished penalties. Concretely, a tunable balancing factor S_ij adjusts the punishment on class j exerted by positive samples of class i. Seesaw Loss determines S_ij by a mitigation factor M_ij and a compensation factor C_ij, as S_ij = M_ij · C_ij. The mitigation factor decreases the penalty on a tail class j according to the ratio of instance numbers between tail class j and head class i. The compensation factor increases the penalty on class j whenever an instance of class i is misclassified as class j.
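The two factors can be sketched as follows. The function below is one natural reading of the description above, with illustrative variable names: instance counts n_i, n_j, predicted probabilities p_i, p_j, and tunable exponents p, q:

```python
def seesaw_factor(n_i, n_j, p_i, p_j, p=0.8, q=2.0):
    """Balancing factor S_ij applied to negative class j for a positive
    sample of class i. Mitigation shrinks the penalty when class j is
    rarer than class i; compensation restores it when class j is scored
    above the true class."""
    mitigation = 1.0 if n_j >= n_i else (n_j / n_i) ** p    # M_ij
    compensation = 1.0 if p_j <= p_i else (p_j / p_i) ** q  # C_ij
    return mitigation * compensation                        # S_ij
```

For a head class i with 1000 instances and a tail class j with 10, the mitigation factor alone shrinks the penalty to roughly (10/1000)^0.8 ≈ 0.025; if the sample is then misclassified so that p_j exceeds p_i, the compensation factor scales the penalty back up.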
Gated Transformer-XL
Gated Transformer-XL, or GTrXL, is a Transformer-based architecture for reinforcement learning. It introduces architectural modifications that improve the stability and learning speed of the original Transformer and its XL variant. Changes include: - Placing the layer normalization on only the input stream of the submodules. A key benefit of this reordering is that it enables an identity map from the input of the transformer at the first layer to the output of the transformer after the last layer. This is in contrast to the canonical transformer, where a series of layer normalization operations non-linearly transform the state encoding. - Replacing residual connections with gating layers. The authors' experiments found that GRUs were the most effective form of gating.
Parallel GAN
PipeDream-2BW is an asynchronous pipeline-parallel method that supports memory-efficient pipeline parallelism, a hybrid form of parallelism that combines data and model parallelism with input pipelining. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with double buffering of weights, to ensure high throughput, a low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as the memory capacities of accelerators and the topologies and bandwidths of interconnects. PipeDream-2BW also determines when to employ existing memory-saving techniques, such as activation recomputation, that trade off extra computation for a lower memory footprint. Its two main features, a double-buffered weight update (2BW) and flush mechanisms, ensure high throughput. PipeDream-2BW splits models into stages over multiple workers, and each stage is replicated an equal number of times (with data-parallel updates across replicas of the same stage). Such parallel pipelines work well for models where each layer is repeated a fixed number of times (e.g., transformer models).
Branch attention, used with a multi-branch structure, can be seen as a dynamic branch selection mechanism that decides which branch to pay attention to.
DeeBERT is a method for accelerating BERT inference. It inserts extra classification layers (which are referred to as off-ramps) between each transformer layer of BERT. All transformer layers and off-ramps are jointly fine-tuned on a given downstream dataset. At inference time, after a sample goes through a transformer layer, it is passed to the following off-ramp. If the off-ramp is confident of the prediction, the result is returned; otherwise, the sample is sent to the next transformer layer.
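The early-exit control flow can be sketched with plain functions standing in for the fine-tuned transformer layers and off-ramp classifiers; entropy of the output distribution serves as the confidence measure:

```python
import math

def entropy(probs):
    """Entropy of a categorical distribution; low entropy = confident."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def deebert_inference(hidden, layers, off_ramps, entropy_threshold):
    """Run layers in order; after each one, consult its off-ramp and exit
    as soon as the prediction entropy falls below the threshold (the last
    layer always returns)."""
    for depth, (layer, ramp) in enumerate(zip(layers, off_ramps), start=1):
        hidden = layer(hidden)
        probs = ramp(hidden)
        if entropy(probs) < entropy_threshold or depth == len(layers):
            label = max(range(len(probs)), key=probs.__getitem__)
            return label, depth

# Stand-in off-ramps whose confidence grows with depth.
ramps = [lambda h: [0.5, 0.5], lambda h: [0.9, 0.1], lambda h: [0.99, 0.01]]
label, depth = deebert_inference(0.0, [lambda h: h] * 3, ramps,
                                 entropy_threshold=0.4)
```

With the threshold at 0.4, the sample exits at the second off-ramp: entropy([0.5, 0.5]) ≈ 0.69 is too uncertain, while entropy([0.9, 0.1]) ≈ 0.33 clears the bar.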
Double Deep Q-Learning
time-causal limit kernel
The time-causal limit kernel is a temporal smoothing kernel that is (i) time-causal, (ii) time-recursive and (iii) obeys temporal scale covariance. This kernel constitutes the limit case of coupling an infinite number of truncated exponential kernels in cascade, with specifically chosen time constants to obtain temporal scale covariance. For practical purposes, the infinite convolution operation can often be well approximated by a moderate number (4-8) of truncated exponential kernels coupled in cascade. The discrete implementation can, in turn, be performed by a set of first-order recursive filters coupled in cascade.
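A minimal discrete sketch of such a cascade follows; the specific first-order update rule and the geometrically distributed time constants below are illustrative choices, not the paper's exact discretization:

```python
def recursive_filter_cascade(signal, time_constants):
    """Approximate the time-causal limit kernel by running a cascade of
    first-order recursive filters, one per truncated exponential kernel.
    Each stage applies y[t] = y[t-1] + (x[t] - y[t-1]) / (1 + mu), a
    standard causal first-order smoother with time constant mu."""
    out = list(signal)
    for mu in time_constants:
        y = 0.0
        for t, x in enumerate(out):
            y = y + (x - y) / (1.0 + mu)
            out[t] = y
    return out

# Four stages with geometrically spaced time constants (ratio 2),
# smoothing an impulse at index 2 -- only later samples are affected.
smoothed = recursive_filter_cascade([0, 0, 1, 0, 0, 0, 0, 0], [1, 2, 4, 8])
```

Note that the response is strictly causal: samples before the impulse stay exactly zero, which is the point of using time-causal rather than symmetric (e.g. Gaussian) smoothing.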
Neural adjoint method
The NA method can be divided into two steps: (i) training a neural network approximation of f, and (ii) inference of x̂. Step (i) is conventional and involves training a generic neural network on a dataset of input/output pairs from the simulator, denoted D, resulting in f̂, an approximation of the forward model. This is illustrated in the left inset of Fig 1. In step (ii), our goal is to use ∂f̂/∂x to help us gradually adjust x so that we achieve a desired output of the forward model, y. This is similar to many classical inverse modeling approaches, such as the popular Adjoint method [8, 9]. For many practical inverse problems, however, obtaining ∂f/∂x requires significant expertise and/or effort, making these approaches challenging. Crucially, f̂ from step (i) provides us with a closed-form differentiable expression for the simulator, from which it is trivial to compute ∂f̂/∂x, and furthermore, we can use modern deep learning software packages to efficiently estimate gradients, given a loss function L. More formally, let y be our target output, and let x̂_i be our current estimate of the solution, where i indexes each solution we obtain in an iterative gradient-based estimation procedure. Then we compute x̂_{i+1} by taking a gradient descent step on L with respect to x̂_i.
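Step (ii) can be sketched with a hand-written differentiable surrogate standing in for the trained network f̂ (in practice autodiff would supply the gradient, and the squared-error loss below is an illustrative choice):

```python
def neural_adjoint_inference(f, dfdx, y_target, x0, lr=0.01, steps=500):
    """Gradient descent on the *input* of a differentiable surrogate f:
    minimize L(x) = (f(x) - y_target)^2 via x <- x - lr * dL/dx,
    where dL/dx = 2 * (f(x) - y_target) * df/dx by the chain rule."""
    x = x0
    for _ in range(steps):
        grad_loss = 2.0 * (f(x) - y_target) * dfdx(x)
        x = x - lr * grad_loss
    return x

# Toy surrogate f(x) = x^2 standing in for the trained forward model;
# inverting it for y = 4 should recover x close to 2.
x_hat = neural_adjoint_inference(f=lambda x: x * x,
                                 dfdx=lambda x: 2.0 * x,
                                 y_target=4.0, x0=1.0)
```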
Composite Backbone Network
CBNet is a backbone architecture that consists of multiple identical backbones (termed Assistant Backbones and a Lead Backbone) and composite connections between neighboring backbones. From left to right, the output of each stage in an Assistant Backbone, namely higher-level features, flows to the parallel stage of the succeeding backbone as part of its inputs through composite connections. Finally, the feature maps of the last backbone, named the Lead Backbone, are used for object detection. The features extracted by CBNet for object detection fuse the high-level and low-level features of multiple backbones, hence improving detection performance.
The Routing Transformer is a Transformer that endows self-attention with a sparse routing module based on online k-means. Each attention module considers a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other words, the current timestep's query is routed to a limited number of contexts through its cluster assignment.
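The routing constraint can be sketched as follows; the centroids are given rather than learned (the online k-means update is omitted), and the function merely builds the boolean attention mask:

```python
def cluster_attention_mask(queries, centroids):
    """Assign each timestep's query to its nearest centroid, then allow
    position t to attend to position s only if s <= t (causality) and
    both positions share a cluster."""
    def nearest(q):
        return min(range(len(centroids)),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(q, centroids[c])))
    assign = [nearest(q) for q in queries]
    n = len(queries)
    mask = [[s <= t and assign[s] == assign[t] for s in range(n)]
            for t in range(n)]
    return mask, assign

# Four 2-d queries forming two obvious clusters around the two centroids.
mask, assign = cluster_attention_mask(
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)],
    [(0.0, 0.0), (5.0, 5.0)])
```

Position 3 may attend to position 2 (same cluster, earlier in time) but not to position 1 (different cluster), and no position attends to its future.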
Population Based Augmentation, or PBA, is a data augmentation strategy that generates nonstationary augmentation policy schedules instead of a fixed augmentation policy. PBA considers the augmentation policy search problem as a special case of hyperparameter schedule learning. It leverages Population Based Training (PBT), a hyperparameter search algorithm which optimizes the parameters of a network jointly with their hyperparameters to maximize performance. The output of PBT is not an optimal hyperparameter configuration but rather a trained model and a schedule of hyperparameters. In PBA, we are only interested in the learned schedule and discard the child model result (similar to AutoAugment). This learned augmentation schedule can then be used to improve the training of different (i.e., larger and costlier to train) models on the same dataset. PBT executes as follows. To start, a fixed population of models is randomly initialized and trained in parallel. At certain intervals, an "exploit-and-explore" procedure is applied to the worst-performing population members: each such model clones the weights of a better-performing model (i.e., exploitation) and then perturbs the hyperparameters of the cloned model to search in the hyperparameter space (i.e., exploration). Because the weights of the models are cloned and never reinitialized, the total computation required is the computation to train a single model times the population size.
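The exploit-and-explore step can be sketched in a few lines; population members are plain dicts here, and `perturb` stands in for PBA's hyperparameter perturbation:

```python
import random

def exploit_and_explore(population, perturb, frac=0.25, rng=None):
    """One PBT interval: the bottom `frac` of members clone the weights of
    a random top performer (exploit) and take a perturbed copy of its
    hyperparameters (explore). Members are dicts with 'weights',
    'hyperparams', and 'score'."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=lambda m: m['score'], reverse=True)
    cutoff = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        source = rng.choice(top)
        member['weights'] = list(source['weights'])                  # exploit
        member['hyperparams'] = perturb(source['hyperparams'], rng)  # explore
    return population

# Toy population of four members; the worst clones the best.
population = [{'weights': [float(i)], 'hyperparams': {'lr': 0.1}, 'score': i}
              for i in range(4)]
exploit_and_explore(population,
                    lambda h, rng: {'lr': h['lr'] * rng.choice([0.5, 2.0])})
```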
PyTorch DDP (Distributed Data Parallel) is a distributed data parallel implementation for PyTorch. To guarantee mathematical equivalence, all replicas start from the same initial values for model parameters and synchronize gradients to keep parameters consistent across training iterations. To minimize the intrusiveness, the implementation exposes the same forward API as the user model, allowing applications to seamlessly replace subsequent occurrences of a user model with the distributed data parallel model object with no additional code changes. Several techniques are integrated into the design to deliver high-performance training, including bucketing gradients, overlapping communication with computation, and skipping synchronization.
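The gradient-bucketing idea can be illustrated with a pure-Python simulation; this is not the `torch.nn.parallel.DistributedDataParallel` API, and the "allreduce" here is simply an elementwise average over replica lists:

```python
def bucket_gradients(grads, bucket_cap):
    """Group (name, gradient-list) pairs into fixed-capacity buckets, the
    way DDP coalesces many small allreduce calls into fewer large ones."""
    buckets, current, size = [], [], 0
    for name, g in grads:
        current.append((name, g))
        size += len(g)
        if size >= bucket_cap:
            buckets.append(current)
            current, size = [], 0
    if current:
        buckets.append(current)
    return buckets

def allreduce_mean(replica_buckets):
    """Average one bucket elementwise across replicas (a stand-in for the
    real allreduce collective)."""
    n = len(replica_buckets)
    out = []
    for items in zip(*replica_buckets):
        name = items[0][0]
        grads = [g for _, g in items]
        out.append((name, [sum(vals) / n for vals in zip(*grads)]))
    return out

# Two replicas with identical parameter layouts but different gradients.
grads_r0 = [('w1', [1.0, 1.0]), ('w2', [2.0]), ('w3', [4.0])]
grads_r1 = [('w1', [3.0, 3.0]), ('w2', [0.0]), ('w3', [0.0])]
synced = [allreduce_mean(pair)
          for pair in zip(bucket_gradients(grads_r0, 2),
                          bucket_gradients(grads_r1, 2))]
```

In real DDP the per-bucket allreduce is launched as soon as a bucket's gradients are ready, which is what lets communication overlap with the remaining backward computation.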
CuBERT, or Code Understanding BERT, is a BERT-based model for code understanding. In order to achieve this, the authors curate a massive corpus of Python programs collected from GitHub. GitHub projects are known to contain a large amount of duplicate code. To avoid biasing the model to such duplicated code, the authors perform deduplication using the method of Allamanis (2018). The resulting corpus has 7.4 million files with a total of 9.3 billion tokens (16 million unique).
Adversarial Color Enhancement is an approach to generating unrestricted adversarial images by optimizing a color filter via gradient descent.
CurricularFace, or Adaptive Curriculum Learning, is a method for face recognition that embeds the idea of curriculum learning into the loss function to achieve a new training scheme. This training scheme mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages.
Replica exchange stochastic gradient Langevin Dynamics
reSGLD simulates a high-temperature particle for exploration and a low-temperature particle for exploitation, and allows the two to swap. A correction term is included in the swap test to avoid the bias introduced by stochastic gradients.
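A minimal sketch of one reSGLD step on a toy double-well energy follows; the correction term is passed in as a constant (its actual value depends on an estimate of the stochastic-gradient noise), and the exact gradients here stand in for minibatch gradients:

```python
import math, random

def resgld_step(x_low, x_high, u, grad_u, lr, tau_low, tau_high,
                correction, rng):
    """One replica-exchange step: a Langevin update for each temperature,
    then a Metropolis-style swap test whose energy difference includes the
    supplied correction term."""
    def langevin(x, tau):
        return x - lr * grad_u(x) + rng.gauss(0.0, math.sqrt(2.0 * lr * tau))

    x_low, x_high = langevin(x_low, tau_low), langevin(x_high, tau_high)
    log_ratio = (1.0 / tau_low - 1.0 / tau_high) * (
        u(x_low) - u(x_high) - correction)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        x_low, x_high = x_high, x_low  # hand the explorer's find to the exploiter
    return x_low, x_high

# Toy double-well energy U(x) = (x^2 - 1)^2 with modes at x = +/- 1.
rng = random.Random(0)
x_low = x_high = 0.0
double_well = lambda x: (x * x - 1.0) ** 2
grad = lambda x: 4.0 * x * (x * x - 1.0)
for _ in range(500):
    x_low, x_high = resgld_step(x_low, x_high, double_well, grad,
                                lr=0.01, tau_low=0.1, tau_high=1.0,
                                correction=0.0, rng=rng)
```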
Class-MLP is an alternative to average pooling, which is an adaptation of the class-attention token introduced in CaiT. In CaiT, this consists of two layers that have the same structure as the transformer, but in which only the class token is updated based on the frozen patch embeddings. In Class-MLP, the same approach is used, but after aggregating the patches with a linear layer, we replace the attention-based interaction between the class and patch embeddings by simple linear layers, still keeping the patch embeddings frozen. This increases the performance, at the expense of adding some parameters and computational cost. This pooling variant is referred to as “class-MLP”, since the purpose of these few layers is to replace average pooling.
Adaptive Bezier-Curve Network
Adaptive Bezier-Curve Network, or ABCNet, is an end-to-end framework for arbitrarily-shaped scene text spotting. It adaptively fits arbitrary-shaped text by a parameterized bezier curve. It also utilizes a feature alignment layer, BezierAlign, to calculate convolutional features of text instances in curved shapes. These features are then passed to a light-weight recognition head.
Primal Wasserstein Imitation Learning
Primal Wasserstein Imitation Learning, or PWIL, is a method for imitation learning based on the primal form of the Wasserstein distance between the expert and agent state-action distributions. The reward function is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and it requires little fine-tuning.
3 Dimensional Soft Attention
IoU-Net is an object detection architecture that introduces localization confidence. IoU-Net learns to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective.
Recurrent Back Projection Network
Learning to Match
L2M is a learning algorithm that can work for most cross-domain distribution matching tasks. It automatically learns the cross-domain distribution matching without relying on hand-crafted priors on the matching loss. Instead, L2M reduces the inductive bias by using a meta-network to learn the distribution matching loss in a data-driven way.
Proxy Anchor Loss for Deep Metric Learning
Relational Reflection Entity Alignment
Deep Residual Pansharpening Neural Network
In the field of fusing multi-spectral and panchromatic images (pan-sharpening), deep neural networks have recently been employed to overcome the drawbacks of traditional linear models and to boost fusing accuracy. However, to the best of our knowledge, existing works are mainly based on simple and flat networks with relatively shallow architectures, which severely limits their performance. In this paper, the concept of residual learning is introduced to form a very deep convolutional neural network that makes full use of the high non-linearity of deep learning models. Both quantitative and visual assessments on a large number of high-quality multi-spectral images from various sources show that the proposed model is superior to all mainstream algorithms included in the comparison, achieving the highest spatial-spectral unified accuracy.
Adaptive EMA Mixture
Segment Sorting
Continuous Kernel Convolution
DSelect-k is a continuously differentiable and sparse gate for Mixture-of-Experts (MoE), based on a novel binary encoding formulation. Given a user-specified parameter k, the gate selects at most k out of the n experts. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. This explicit control over sparsity leads to a cardinality-constrained optimization problem, which is computationally challenging. To circumvent this challenge, the authors use an unconstrained reformulation that is equivalent to the original problem. The reformulated problem uses a binary encoding scheme to implicitly enforce the cardinality constraint. By carefully smoothing the binary encoding variables, the reformulated problem can be effectively optimized using first-order methods such as SGD. The motivation for this method is that existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods.
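The binary encoding idea can be sketched for a single selector (the full gate combines k such selectors with learned combination weights; the smooth-step width `gamma` and the variable names here are illustrative):

```python
def smooth_step(t, gamma=1.0):
    """Smooth-step function: 0 below -gamma/2, 1 above +gamma/2, and a
    continuously differentiable cubic in between -- a smooth relaxation
    of a binary variable."""
    x = t / gamma + 0.5
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 3 * x ** 2 - 2 * x ** 3  # classic smoothstep polynomial

def binary_encoding_weights(z):
    """Weights over n = 2^r experts from r bit logits: expert i receives
    prod_b s(z_b) if bit b of i is 1, else (1 - s(z_b)). When every s(z_b)
    saturates at 0 or 1, exactly one expert has weight 1 -- the encoding
    enforces single-expert sparsity without a hard argmax."""
    s = [smooth_step(t) for t in z]
    weights = []
    for i in range(2 ** len(z)):
        w = 1.0
        for b, s_b in enumerate(s):
            w *= s_b if (i >> b) & 1 else 1.0 - s_b
        weights.append(w)
    return weights
```

With saturated logits the selection is exactly sparse, while weights always sum to one because the per-bit factors sum to one, so unsaturated logits still give a valid soft mixture.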