Compressed Memory is a secondary FIFO memory component proposed as part of the Compressive Transformer model. The Compressive Transformer keeps a fine-grained memory of past activations, which are then compressed into coarser compressed memories. As compression functions, the authors consider: (1) max/mean pooling, where the kernel and stride are set to the compression rate; (2) 1D convolution, also with kernel and stride set to the compression rate; (3) dilated convolutions; and (4) most-used, where the memories are sorted by their average attention (usage) and the most-used ones are preserved.
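As an illustration of the simplest of these choices, the sketch below (a hypothetical NumPy helper, not the authors' implementation) compresses a memory of past activations by mean pooling with kernel and stride both equal to the compression rate:

```python
import numpy as np

def compress_memory(memory, rate):
    """Mean-pool a memory of past activations along the time axis.

    memory: (seq_len, d_model) array of activations; rate: compression
    rate, used as both pooling kernel size and stride.
    """
    seq_len, d_model = memory.shape
    usable = (seq_len // rate) * rate          # drop any remainder
    blocks = memory[:usable].reshape(-1, rate, d_model)
    return blocks.mean(axis=1)                 # (seq_len // rate, d_model)

old_mems = np.arange(12.0).reshape(6, 2)       # 6 timesteps, d_model = 2
print(compress_memory(old_mems, rate=3).shape)  # (2, 2)
```

Swapping `mean` for `max` gives the max-pooling variant; the convolutional variants replace the fixed averaging with learned kernels.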
Gradient-Based Decision Tree Ensembles
Contrastive Multiview Coding (CMC) is a self-supervised learning approach, based on CPC, that learns representations capturing information shared between multiple sensory views. The core idea is to set an anchor view, then sample positive and negative data points from the other view, and maximise agreement between positive pairs when learning from the two views. Contrastive learning is used to build the embedding.
Nearest-Neighbor Contrastive Learning of Visual Representations
Singular Value Decomposition Parameterization
Kernel Density Matrices
Kernel density matrices provide a simpler yet effective mechanism for representing joint probability distributions of both continuous and discrete random variables. This abstraction allows the construction of differentiable models for density estimation, inference, and sampling, and enables their integration into end-to-end deep neural models.
Efficient Channel Attention
An ECA block has a similar formulation to an SE block, including a squeeze module for aggregating global spatial information and an efficient excitation module for modeling cross-channel interaction. Instead of indirect correspondence, an ECA block only considers direct interaction between each channel and its $k$ nearest neighbors to control model complexity. Overall, the formulation of an ECA block is: \begin{align} s = F_\text{eca}(X, \theta) & = \sigma (\text{Conv1D}(\text{GAP}(X))) \end{align} \begin{align} Y & = s X \end{align} where $\text{Conv1D}$ denotes a 1D convolution with a kernel of size $k$ across the channel domain, used to model local cross-channel interaction. The parameter $k$ decides the coverage of interaction, and in ECA the kernel size $k$ is adaptively determined from the channel dimensionality $C$ instead of by manual tuning via cross-validation: \begin{equation} k = \psi(C) = \left | \frac{\log_2(C)}{\gamma}+\frac{b}{\gamma}\right |_\text{odd} \end{equation} where $\gamma$ and $b$ are hyperparameters and $|t|_\text{odd}$ indicates the odd number nearest to $t$. Compared to SENet, ECANet has an improved excitation module, and provides an efficient and effective block which can readily be incorporated into various CNNs.
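A minimal sketch of the adaptive kernel-size rule, assuming the commonly used defaults $\gamma = 2$ and $b = 1$ (rounding up to the next odd number, as in typical reference implementations):

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """k = | log2(C)/gamma + b/gamma |, mapped to the nearest odd number."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

print(eca_kernel_size(64))   # 3
print(eca_kernel_size(256))  # 5
```

The kernel size therefore grows only logarithmically with the channel count, which is what keeps the excitation module cheap.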
Tanh Exponential Activation Function
Lightweight or mobile neural networks used for real-time computer vision tasks contain fewer parameters than normal networks, which leads to constrained performance. In this work, we propose a novel activation function named Tanh Exponential Activation Function (TanhExp), which can significantly improve the performance of these networks on image classification tasks. TanhExp is defined as $\text{TanhExp}(x) = x \tanh(e^x)$. We demonstrate the simplicity, efficiency, and robustness of TanhExp on various datasets and network models; TanhExp outperforms its counterparts in both convergence speed and accuracy. Its behaviour also remains stable even with noise added and the dataset altered. We show that, without increasing the size of the network, the capacity of lightweight neural networks can be enhanced by TanhExp with only a few training epochs and no extra parameters added.
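TanhExp is a one-liner in NumPy (a sketch for illustration, not the authors' code):

```python
import numpy as np

def tanh_exp(x):
    """TanhExp(x) = x * tanh(exp(x))."""
    return x * np.tanh(np.exp(x))

x = np.array([-2.0, 0.0, 2.0])
print(tanh_exp(x))  # smooth, slightly negative below 0, ~identity for large x
```

For large positive inputs `tanh(exp(x))` saturates at 1, so the function approaches the identity, while negative inputs are softly damped.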
Strip Pooling Network
Spatial pooling usually operates on a small region, which limits its capability to capture long-range dependencies and focus on distant regions. To overcome this, Hou et al. proposed strip pooling, a novel pooling method capable of encoding long-range context in either the horizontal or vertical spatial domain. Strip pooling has two branches, for horizontal and vertical strip pooling. The horizontal strip pooling part first pools the input feature map in the horizontal direction: \begin{align} y^1 = \text{GAP}^w (X) \end{align} Then a 1D convolution with kernel size 3 is applied to $y^1$ to capture the relationship between different rows and channels. The result is expanded (repeated $w$ times) to make the output consistent with the input shape: \begin{align} y^h = \text{Expand}(\text{Conv1D}(y^1)) \end{align} Vertical strip pooling is performed in a similar way. Finally, the outputs of the two branches are fused using element-wise summation to produce the attention map: \begin{align} s &= \sigma(\text{Conv}^{1\times 1}(y^{v} + y^{h})) \end{align} \begin{align} Y &= s X \end{align} The strip pooling module (SPM) is further developed into the mixed pooling module (MPM). Both consider spatial and channel relationships to overcome the locality of convolutional neural networks. SPNet achieves state-of-the-art results on several complex semantic segmentation benchmarks.
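The horizontal branch can be sketched in NumPy as follows (sharing one set of 1D conv weights across channels is a simplifying assumption made for brevity, not the paper's parameterization):

```python
import numpy as np

def horizontal_strip_pool(x, kernel):
    """Horizontal strip pooling branch on a (C, H, W) feature map.

    kernel: (3,) 1D convolution weights applied along the H axis
    (assumed shared across channels in this sketch).
    """
    c, h, w = x.shape
    pooled = x.mean(axis=2)                       # (C, H): average over width
    conv = np.stack([np.convolve(row, kernel, mode="same") for row in pooled])
    return np.repeat(conv[:, :, None], w, axis=2)  # expand back to (C, H, W)

x = np.random.randn(4, 8, 8)
y = horizontal_strip_pool(x, np.array([0.25, 0.5, 0.25]))
print(y.shape)  # (4, 8, 8)
```

Each output row is constant along the width, which is exactly what lets the module propagate context across the whole horizontal strip.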
A Cyclical Learning Rate Policy combines a linear learning rate decay with warm restarts.
Elastic Margin Loss for Deep Face Recognition
Gaussian Mixture Variational Autoencoder
GMVAE, or Gaussian Mixture Variational Autoencoder, is a stochastic regularization layer for transformers. A GMVAE layer is trained using a 700-dimensional internal representation of the first MLP layer. For every output from the first MLP layer, the GMVAE layer first computes a latent low-dimensional representation by sampling from the GMVAE posterior distribution, and then provides at the output a reconstruction sampled from a generative model.
Prompt Gradient Alignment
MeshGraphNet is a framework for learning mesh-based simulations using graph neural networks. The model can be trained to pass messages on a mesh graph and to adapt the mesh discretization during forward simulation. The model uses an Encode-Process-Decode architecture trained with one-step supervision, and can be applied iteratively to generate long trajectories at inference time. The encoder transforms the input mesh into a graph, adding extra world-space edges. The processor performs several rounds of message passing along mesh edges and world edges, updating all node and edge embeddings. The decoder extracts the acceleration for each node, which is used to update the mesh and produce the next mesh state.
The NVAE Encoder Residual Cell is a residual connection block used in the NVAE architecture for the encoder. It applies two series of BN-Swish-Conv layers without changing the number of channels.
Hierarchical Multi-Task Learning
Multi-task learning (MTL) introduces an inductive bias based on a priori relations between tasks: the trainable model is compelled to model more general dependencies by using the aforementioned relations as an important data feature. Hierarchical MTL, in which different tasks use different levels of the deep neural network, provides a more effective inductive bias than "flat" MTL. Hierarchical MTL also helps to mitigate the vanishing gradient problem in deep learning.
A scalable second order optimization algorithm for deep learning. Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
Residual Normal Distributions are used to help the optimization of VAEs, preventing optimization from entering an unstable region. This can happen due to sharp gradients caused in situations where the encoder and decoder produce distributions far away from each other. The residual distribution parameterizes the approximate posterior relative to the prior. Let $p(z_i) := \mathcal{N}(\mu_i, \sigma_i)$ be the Normal distribution for the $i$-th variable in the prior. Define $q(z_i|x) := \mathcal{N}(\mu_i + \Delta\mu_i, \sigma_i \cdot \Delta\sigma_i)$, where $\Delta\mu_i$ and $\Delta\sigma_i$ are the relative location and scale of the approximate posterior with respect to the prior. With this parameterization, when the prior moves, the approximate posterior moves accordingly, if the residuals are not changed.
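The parameterization itself is a one-liner; the toy sketch below (illustrative only, with made-up numbers) just shows how the posterior tracks a shift in the prior when the residual parameters are held fixed:

```python
def residual_posterior(mu_prior, sigma_prior, delta_mu, delta_sigma):
    """Posterior parameterized relative to the prior:
    mu_q = mu_prior + delta_mu, sigma_q = sigma_prior * delta_sigma."""
    return mu_prior + delta_mu, sigma_prior * delta_sigma

# If the prior mean shifts by 1 and the residuals stay fixed,
# the posterior mean shifts by 1 as well.
mu_q1, _ = residual_posterior(0.0, 1.0, 0.3, 0.9)
mu_q2, _ = residual_posterior(1.0, 1.0, 0.3, 0.9)
print(mu_q2 - mu_q1)  # -> 1.0 (up to floating-point rounding)
```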
Neo-fuzzy-neuron
Neo-fuzzy-neuron is a type of artificial neural network that combines the characteristics of both fuzzy logic and neural networks. It uses a fuzzy inference system to model non-linear relationships between inputs and outputs, and a feedforward neural network to learn the parameters of the fuzzy system. The combination of these two approaches provides a flexible and powerful tool for solving a wide range of problems in areas such as pattern recognition, control, and prediction.
Harris Hawks optimization
HHO is a popular swarm-based, gradient-free optimization algorithm with several active and time-varying phases of exploration and exploitation. The algorithm was initially published in the Journal of Future Generation Computer Systems (FGCS) in 2019, and from the first day it has gained increasing attention among researchers due to its flexible structure, high performance, and high-quality results. The main logic of the HHO method is designed around the cooperative behaviour and chasing styles of Harris' hawks in nature, called the "surprise pounce". Currently, there are many suggestions on how to enhance the functionality of HHO, and several enhanced variants of HHO have appeared in leading Elsevier and IEEE Transactions journals. From an algorithmic-behaviour viewpoint, HHO has several effective features: The escaping energy parameter has a dynamic, randomized, time-varying nature, which can further improve and harmonize the exploratory and exploitative patterns of HHO. This factor also supports HHO in conducting a smooth transition between exploration and exploitation. Different exploration mechanisms with respect to the average location of the hawks can increase the exploratory trends of HHO throughout the initial iterations. Diverse Lévy-flight (LF)-based patterns with short-length jumps enrich the exploitative behaviours of HHO when directing a local search. The progressive selection scheme supports search agents in progressively advancing their position and only selecting a better position, which can improve the quality of solutions and the intensification powers of HHO throughout the optimization procedure. HHO evaluates a series of searching strategies and then selects the best movement step. This feature also has a constructive influence on the exploitation inclinations of HHO. The randomized jump strength can assist candidate solutions in harmonizing their exploration and exploitation leanings.
The application of adaptive and time-varying components allows HHO to handle the difficulties of a feature space, including locally optimal solutions, multi-modality, and deceptive optima. The source code of HHO is publicly available at https://aliasgharheidari.com/HHO.html
The NVAE Generative Residual Cell is a skip connection block used as part of the NVAE architecture for the generator. The residual cell expands the number of channels by a factor of $E$ before applying the depthwise separable convolution, and then maps it back to $C$ channels. The design motivation was to help model long-range correlations in the data by increasing the receptive field of the network, which explains the expanding path, but also the use of depthwise convolutions to keep a handle on the parameter count.
Meta Pseudo Labels is a semi-supervised learning method that uses a teacher network to generate pseudo labels on unlabeled data to teach a student network. The teacher receives feedback from the student to inform the teacher to generate better pseudo labels. This feedback signal is used as a reward to train the teacher throughout the course of the student’s learning.
TextGrad is a powerful framework for building "automatic differentiation" via text. TextGrad implements backpropagation through text feedback provided by LLMs, building strongly on the gradient metaphor.
Spectral Clustering
Spectral clustering aims to partition the data points into clusters using the spectrum of graph Laplacians. Given a dataset $X$ with $n$ data points, a spectral clustering algorithm first constructs a similarity matrix $S \in \mathbb{R}^{n \times n}$, where $S_{ij}$ indicates the similarity between data points $x_i$ and $x_j$ via a similarity metric. Let $L = D - S$, where $L$ is called the graph Laplacian and $D$ is a diagonal matrix with $D_{ii} = \sum_{j} S_{ij}$. The objective function of spectral clustering can be formulated based on the graph Laplacian as follows: \begin{equation} \label{eq:SCobj} {\min_{{U}} \operatorname{tr}\left({U}^{T} {L} {U}\right)}, \\ {\text { s.t. } \quad {U}^{T} {{U}={I}}}, \end{equation} where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The rows of the matrix $U \in \mathbb{R}^{n \times k}$ are the low-dimensional embedding of the original data points. Generally, spectral clustering computes $U$ as the bottom $k$ eigenvectors of $L$, and finally applies $k$-means on $U$ to obtain the clustering results.
Large-scale Spectral Clustering
To capture the relationship between all data points in $X$, an $n \times n$ similarity matrix needs to be constructed in conventional spectral clustering, which costs $O(n^2)$ time and memory and is not feasible for large-scale clustering tasks. Instead of a full similarity matrix, many accelerated spectral clustering methods use a similarity sub-matrix that represents each data point by its cross-similarity to a set of representative data points (i.e., landmarks) via some similarity measure, as \begin{equation} \label{eq: cross-similarity} B = \Phi(X,R), \end{equation} where $R$ (with $m \ll n$ landmarks) is a set of landmarks with the same dimension as $X$, $\Phi$ indicates a similarity metric, and $B \in \mathbb{R}^{n \times m}$ is the similarity sub-matrix representing $X$ with respect to $R$. For large-scale spectral clustering using such a similarity matrix, a symmetric similarity matrix can be designed as \begin{equation} \label{eq: WusedB } W=\left[\begin{array}{ll} \mathbf{0} & B \\ B^{T} & \mathbf{0} \end{array}\right]. \end{equation} The size of the matrix $W$ is $(n+m) \times (n+m)$.
Taking advantage of this bipartite structure, fast eigen-decomposition methods can then be used to obtain the spectral embedding. Finally, $k$-means is conducted on the embedding to obtain the clustering results. The clustering result is directly related to the quality of $B$, which consists of the similarities between data points and landmarks. Thus, the performance of landmark selection is crucial to the clustering result.
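A toy end-to-end sketch of conventional (small-scale) spectral clustering, assuming a Gaussian similarity and a deterministic farthest-point initialization for the final k-means step (both choices are illustrative, not prescribed by any particular paper):

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, iters=20):
    """Gaussian similarity, bottom-k eigenvectors of L = D - S,
    then a tiny k-means on the rows of the embedding."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma ** 2))                 # similarity matrix
    L = np.diag(S.sum(axis=1)) - S                     # graph Laplacian
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                                    # spectral embedding
    # k-means with farthest-point initialization (deterministic)
    centers = [U[0]]
    for _ in range(k - 1):
        d = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(8, 0.3, (20, 2))])
labels = spectral_clustering(X, k=2)
```

With two well-separated blobs the cross-blob similarities vanish, the bottom eigenvectors become (rotations of) component indicators, and k-means recovers the two groups.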
The Lovasz-Softmax loss is a loss function for multiclass semantic segmentation that incorporates the softmax operation in the Lovasz extension. The Lovasz extension is a means by which we can achieve direct optimization of the mean intersection-over-union loss in neural networks.
Dynamic Range Activator
Recursive functions with heteroscedastic, sparse, and high-variance target distributions introduce a huge complexity that makes their accurate modeling with neural networks a difficult task. A main property of recursive maps (e.g. the factorial function) is their dramatic growth and drop. Learning this recursive behavior requires not only fitting high-frequency patterns within a bounded region but also successfully extrapolating those patterns beyond that region. In time series prediction tasks, capturing even periodic behavior is a challenge. Various methods have been employed to model periodic patterns effectively. However, these approaches typically deal with uni-modal data that also exhibit relatively low variance in both In-Distribution (ID) and Out-Of-Distribution (OOD) regions, and they do not generalize well to recursive problems with the high variance observed in our context. Thus, to enable Transformers to capture such behavior and perform proper inference for multi-modal recursive problems, we enhance them by introducing the Dynamic Range Activator (DRA). The DRA is designed to handle the recursive and factorial growth properties inherent in enumerative problems with minimal computational overhead, and it can be integrated into existing neural networks without requiring significant architectural changes. DRA integrates both harmonic and hyperbolic components as follows, \begin{equation} \mathrm{DRA}(x) := x + a \sin^2\left(\frac{x}{b}\right) + c \cos(bx) + d \tanh(bx) \,, \end{equation} where $a$, $b$, $c$, and $d$ are learnable parameters. This allows the function to simultaneously model periodic data (through the sine and cosine terms) and rapid growth or attenuation (through the hyperbolic tangent term).
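For fixed parameter values the activation is straightforward to evaluate; the sketch below uses arbitrary constants for $a$, $b$, $c$, $d$ (in the method itself they are learned):

```python
import numpy as np

def dra(x, a, b, c, d):
    """DRA(x) = x + a*sin^2(x/b) + c*cos(b*x) + d*tanh(b*x)."""
    return x + a * np.sin(x / b) ** 2 + c * np.cos(b * x) + d * np.tanh(b * x)

x = np.linspace(-3.0, 3.0, 7)
print(dra(x, a=1.0, b=2.0, c=0.5, d=0.5))
```

The identity term `x` preserves unbounded growth, while the bounded harmonic and hyperbolic terms superimpose the periodic and saturating structure on top of it.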
Generalized Focal Loss (GFL) is a loss function for object detection that combines Quality Focal Loss and Distribution Focal Loss into a general form.
Spatial and Channel SE Blocks
To aggregate global spatial information, an SE block applies global pooling to the feature map. However, it ignores pixel-wise spatial information, which is important in dense prediction tasks. Therefore, Roy et al. proposed spatial and channel SE blocks (scSE). Like BAM, spatial SE blocks are used, complementing SE blocks, to provide spatial attention weights that focus on important regions. Given the input feature map $X$, two parallel modules, spatial SE and channel SE, are applied to the feature map to encode spatial and channel information respectively. The channel SE module is an ordinary SE block, while the spatial SE module adopts a $1\times 1$ convolution for spatial squeezing. The outputs from the two modules are fused. The overall process can be written as \begin{align} s_c & = \sigma (W_{2} \delta (W_{1}\text{GAP}(X))) \end{align} \begin{align} X_\text{chn} & = s_c X \end{align} \begin{align} s_s &= \sigma(\text{Conv}^{1\times 1}(X)) \end{align} \begin{align} X_\text{spa} & = s_s X \end{align} \begin{align} Y &= f(X_\text{spa},X_\text{chn}) \end{align} where $f$ denotes the fusion function, which can be maximum, addition, multiplication or concatenation. The proposed scSE block combines channel and spatial attention to enhance features as well as capturing pixel-wise spatial information. Segmentation tasks benefit greatly as a result: the integration of an scSE block into F-CNNs gives a consistent improvement in semantic segmentation at negligible extra cost.
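A NumPy sketch of the two branches and their fusion, here using element-wise maximum (one of the listed options); the weight shapes and the reduction ratio are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scse(X, W1, W2, w_spatial):
    """scSE sketch on a (C, H, W) map. W1, W2: SE bottleneck weights;
    w_spatial: 1x1 conv weights of shape (C,)."""
    z = X.mean(axis=(1, 2))                            # squeeze: GAP -> (C,)
    s_c = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))        # channel attention (C,)
    X_chn = s_c[:, None, None] * X
    s_s = sigmoid(np.tensordot(w_spatial, X, axes=1))  # spatial map (H, W)
    X_spa = s_s[None] * X
    return np.maximum(X_spa, X_chn)                    # fusion: element-wise max

X = np.random.randn(8, 4, 4)
W1 = np.random.randn(2, 8)
W2 = np.random.randn(8, 2)
Y = scse(X, W1, W2, np.random.randn(8))
print(Y.shape)  # (8, 4, 4)
```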
Symbolic rule learning methods find regularities in data that can be expressed in the form of 'if-then' rules based on symbolic representations of the data.
VisuoSpatial Foresight
VisuoSpatial Foresight is a method for robotic fabric manipulation that leverages a combination of RGB and depth information to learn goal conditioned fabric manipulation policies for a variety of long horizon tasks.
Self-adaptive Training is a training algorithm that dynamically corrects problematic training labels by model predictions to improve generalization of deep learning for potentially corrupted training data. Accumulated predictions are used to augment the training dynamics. The use of an exponential-moving-average scheme alleviates the instability issue of model predictions, smooths out the training target during the training process and enables the algorithm to completely change the training labels if necessary.
Supporting Clustering with Contrastive Learning
SCCL, or Supporting Clustering with Contrastive Learning, is a framework to leverage contrastive learning to promote better separation in unsupervised clustering. It combines the top-down clustering with the bottom-up instance-wise contrastive learning to achieve better inter-cluster distance and intra-cluster distance. During training, we jointly optimize a clustering loss over the original data instances and an instance-wise contrastive loss over the associated augmented pairs.
LayerDrop is a form of structured dropout for Transformer models which has a regularization effect during training and allows for efficient pruning at inference time. It randomly drops layers from the Transformer; pruning can then follow an "every other" strategy, where pruning with a rate $p$ means dropping the layers at depth $d$ such that $d \equiv 0 \pmod{\lfloor 1/p \rfloor}$.
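The "every other" pruning rule can be sketched as follows (1-indexed depths are an assumption of this sketch):

```python
def layers_to_keep(num_layers, p):
    """Keep the layers NOT dropped by "every other" LayerDrop pruning:
    drop depths d with d % round(1/p) == 0, depths counted from 1."""
    step = round(1 / p)
    return [d for d in range(1, num_layers + 1) if d % step != 0]

print(layers_to_keep(12, 0.5))  # [1, 3, 5, 7, 9, 11]
```

With `p = 0.5` every second layer is pruned, halving inference depth without retraining.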
ZeRO-Offload is a sharded data parallel method for distributed training. It exploits both CPU memory and compute for offloading, while offering a clear path towards efficiently scaling on multiple GPUs by working with ZeRO-powered data parallelism. The symbiosis allows ZeRO-Offload to maintain a single copy of the optimizer states on the CPU memory regardless of the data parallel degree. Furthermore, it keeps the aggregate communication volume between GPU and CPU, as well as the aggregate CPU computation a constant regardless of data parallelism, allowing ZeRO-Offload to effectively utilize the linear increase in CPU compute with the increase in the data parallelism degree.
BigBiGAN is a type of BiGAN with a BigGAN image generator. The authors initially used a ResNet as a baseline for the encoder, followed by a 4-layer MLP with skip connections, but they experimented with RevNets and found that these outperformed ResNets as network width increased, so they opted for this type of encoder in the final architecture.
Gated Adaptive Network for Deep Automated Learning of Features
We propose a novel high-performance, interpretable, and parameter- and computation-efficient deep learning architecture for tabular data, Gated Adaptive Network for Deep Automated Learning of Features (GANDALF). GANDALF relies on a new tabular processing unit with a gating mechanism and in-built feature selection, called the Gated Feature Learning Unit (GFLU), as a feature representation learning unit. We demonstrate that GANDALF outperforms or stays at par with SOTA approaches like XGBoost, SAINT, FT-Transformers, etc. in experiments on multiple established public benchmarks. We have made the code available at github.com/manujosephv/pytorchtabular under the MIT License.
Virtual Data Augmentation, or VDA, is a framework for robustly fine-tuning pre-trained language models. Based on the original token embeddings, a multinomial mixture for augmenting virtual data is constructed, where a masked language model guarantees semantic relevance and Gaussian noise provides augmentation diversity. Furthermore, a regularized training strategy is proposed to balance the two aspects.
MNN
Mobile Neural Network (MNN) is a mobile inference engine tailored to mobile applications. The contributions of MNN include: (1) presenting a mechanism called pre-inference that manages to conduct runtime optimization; (2) delivering thorough kernel optimization on operators to achieve optimal computation performance; (3) introducing backend abstraction module which enables hybrid scheduling and keeps the engine lightweight.
The Hard Sigmoid is an activation function used in neural networks; it is a piecewise-linear approximation of the sigmoid, one common form being $f(x) = \max(0, \min(1, 0.2x + 0.5))$.
Holographic Reduced Representations are a simple mechanism to represent an associative array of key-value pairs in a fixed-size vector. Each individual key-value pair is the same size as the entire associative array; the array is represented by the sum of the pairs. Concretely, consider a complex vector key $r$, which is the same size as the complex vector value $x$. The pair is "bound" together by element-wise complex multiplication, which multiplies the moduli and adds the phases of the elements: $r \circledast x$. Given keys $r_1$, $r_2$, $r_3$ and input vectors $x_1$, $x_2$, $x_3$, the associative array is $c = r_1 \circledast x_1 + r_2 \circledast x_2 + r_3 \circledast x_3$, where we call $c$ a memory trace. Define the key inverse $r^{-1}$ as the key with reciprocal moduli and negated phases. To retrieve the item associated with key $r_k$, we multiply the memory trace element-wise by the vector $r_k^{-1}$. For example, $r_2^{-1} \circledast c = x_2 + r_2^{-1} \circledast (r_1 \circledast x_1 + r_3 \circledast x_3)$. The product is exactly $x_2$ together with a noise term. If the phases of the elements of the key vectors are randomly distributed, the noise term has zero mean. Source: Associative LSTMs
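A NumPy sketch of binding and retrieval with unit-modulus complex keys (an illustration of the mechanism, not the Associative LSTM code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

def random_key(n, rng):
    """Unit-modulus complex key: random phases, modulus 1."""
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n))

r1, r2 = random_key(n, rng), random_key(n, rng)
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

trace = r1 * x1 + r2 * x2   # bind each pair, superpose into one trace
r2_inv = 1 / r2             # reciprocal modulus, negated phase
retrieved = r2_inv * trace  # = x2 plus the noise term r2_inv * r1 * x1

print(np.allclose(retrieved - r2_inv * r1 * x1, x2))  # True
```

With unit-modulus keys the inverse coincides with the complex conjugate, so retrieval is cheap; the residual noise term averages out when many random keys are superposed in a high-dimensional trace.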
Distributional Generalization is a type of generalization that roughly states that outputs of a classifier at train and test time are close as distributions, as opposed to close in just their average error. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain.
Graph Contrastive Coding is a self-supervised graph neural network pre-training framework to capture the universal network topological properties across multiple networks. GCC's pre-training task is designed as subgraph instance discrimination in and across networks and leverages contrastive learning to empower graph neural networks to learn the intrinsic and transferable structural representations.
Mixture Normalization is a normalization technique that relies on an approximation of the probability density function of the internal representations. Any continuous distribution can be approximated with arbitrary precision using a Gaussian Mixture Model (GMM). Hence, instead of computing one set of statistical measures from the entire population (of instances in the mini-batch) as Batch Normalization does, Mixture Normalization works on sub-populations which can be identified by disentangling modes of the distribution, estimated via GMM. While BN can only scale and/or shift the whole underlying probability density function, mixture normalization operates like a soft piecewise normalizing transform, capable of completely re-structuring the data distribution by independently scaling and/or shifting individual modes of the distribution.
Feedback Memory is a type of attention module used in the Feedback Transformer architecture. It allows a transformer to use the most abstract representations from the past directly as inputs for the current timestep. This means that the model does not form its representations in parallel, but sequentially, token by token. More precisely, the context inputs to the attention modules are replaced with memory vectors computed over the past: a memory vector $m_t$ is computed by summing the representations $x_t^l$ of each layer $l$ at the $t$-th time step, $m_t = \sum_{l} \text{softmax}(w^l)\, x_t^l$, where the $w^l$ are learnable scalar parameters and $x_t^0$ corresponds to the token embeddings. The weighting of the different layers by a softmax output gives the model more flexibility, as it can average them or select one of them. This modification of the self-attention input changes the computation of the Transformer from parallel to sequential, as summarized in the Figure. Indeed, it gives the model the ability to formulate its representation based on past representations from any layer $l'$, while in a standard Transformer this is only true for $l' < l$. This change can be viewed as exposing all previous computations to all future computations, providing better representations of the input. Such capacity allows much shallower models to capture the same level of abstraction as a deeper architecture.
Sensor Dropout or SensD
A method that randomly masks out all features coming from a specific sensor in multi-sensor models for Earth observation. Depending on the fusion strategy, the masking can be done at the input, feature, or decision level.
Accordion is a gradient communication scheduling algorithm that is generic across models while imposing low computational overheads. Accordion inspects the change in the gradient norms to detect critical regimes and adjusts the communication schedule dynamically. Accordion works for both adjusting the gradient compression rate or the batch size without additional parameter tuning.
Rank-based loss
Low Variance Regularization
This method introduces a novel unlabeled debiasing technique that reduces the bias of transformer-based language models on downstream classification tasks. In their method, the authors use the classes as a metric for regularization and penalize the network if the embeddings produced by the model are far from each other. By doing so, the authors claim to reduce the domain shift caused by any unwanted attribute information, resulting in fairer embeddings.
Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model
Parametric UMAP extends UMAP, a non-parametric graph-based dimensionality reduction algorithm, by replacing its second optimization step with a parametric optimization over neural network weights, learning a parametric relationship between data and embedding.