Activation Regularization (AR) is regularization performed on activations as opposed to weights. It is usually used in conjunction with RNNs. It is defined as: $\alpha L_{2}\left(m \circ h_{t}\right)$, where $m$ is a dropout mask used by later parts of the model, $L_{2}$ is the $L_{2}$ norm, $h_{t}$ is the output of an RNN at timestep $t$, and $\alpha$ is a scaling coefficient. When applied to the output of a dense layer, AR penalizes activations that are substantially away from 0, encouraging activations to remain small.
Direct Feedback Alignment
Adaptive Parameter-wise Diagonal Quasi-Newton Method
Group Normalization is a normalization layer that divides channels into groups and normalizes the features within each group. GN does not exploit the batch dimension, and its computation is independent of batch sizes. In the case where the group size is 1, it is equivalent to Instance Normalization. As motivation for the method, many classical features like SIFT and HOG had group-wise features and involved group-wise normalization. For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram. Formally, Group Normalization is defined as: $$\hat{x}_{i} = \frac{1}{\sigma_{i}}\left(x_{i} - \mu_{i}\right), \quad \mu_{i} = \frac{1}{m}\sum_{k \in \mathcal{S}_{i}}x_{k}, \quad \sigma_{i} = \sqrt{\frac{1}{m}\sum_{k \in \mathcal{S}_{i}}\left(x_{k} - \mu_{i}\right)^{2} + \epsilon}$$ Here $x$ is the feature computed by a layer, and $i$ is an index. Formally, a Group Norm layer computes $\mu$ and $\sigma$ in a set $\mathcal{S}_{i}$ defined as: $$\mathcal{S}_{i} = \left\{k \mid k_{N} = i_{N}, \left\lfloor\frac{k_{C}}{C/G}\right\rfloor = \left\lfloor\frac{i_{C}}{C/G}\right\rfloor\right\}$$ Here $G$ is the number of groups, which is a pre-defined hyper-parameter ($G = 32$ by default). $C/G$ is the number of channels per group. $\lfloor\cdot\rfloor$ is the floor operation, and the final term means that the indexes $i$ and $k$ are in the same group of channels, assuming each group of channels are stored in a sequential order along the $C$ axis.
Swapping Assignments between Views
SwAV, or Swapping Assignments Between Views, is a self-supervised learning approach that takes advantage of contrastive methods without requiring pairwise comparisons to be computed. Specifically, it simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, SwAV uses a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view.
A ShuffleNet Block is an image model block that utilises a channel shuffle operation, along with depthwise convolutions, for an efficient architectural design. It was proposed as part of the ShuffleNet architecture. The starting point is the Residual Block unit from ResNets, which is then modified with a pointwise group convolution and a channel shuffle operation.
The Maxout Unit is a generalization of the ReLU and the leaky ReLU functions. It is a piecewise linear function that returns the maximum of the inputs, designed to be used in conjunction with dropout. Both ReLU and leaky ReLU are special cases of Maxout. The main drawback of Maxout is that it is computationally expensive as it doubles the number of parameters for each neuron.
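As a sketch of the idea (assuming the linear pre-activations are already computed; the function name and interface are illustrative), a maxout unit simply takes the maximum over groups of linear pieces:

```python
import numpy as np

def maxout(x, num_pieces=2):
    """Maxout activation: max over groups of `num_pieces` pre-activations.

    x has shape (batch, features) with features divisible by num_pieces;
    in a real layer each piece would come from its own affine transform.
    """
    batch, feats = x.shape
    assert feats % num_pieces == 0, "features must divide evenly into pieces"
    return x.reshape(batch, feats // num_pieces, num_pieces).max(axis=-1)

# ReLU as a special case: pair every pre-activation with a constant zero piece.
x = np.array([[-1.0, 2.0, -3.0]])
paired = np.stack([x, np.zeros_like(x)], axis=-1).reshape(1, -1)
relu_like = maxout(paired, num_pieces=2)  # equals np.maximum(x, 0)
```

The doubled parameter count mentioned above corresponds to each output unit needing `num_pieces` separate affine transforms.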
AMSGrad is a stochastic optimization method that seeks to fix a convergence issue with Adam-based optimizers. AMSGrad uses the maximum of past squared gradients rather than the exponential average to update the parameters: $$m_{t} = \beta_{1}m_{t-1} + \left(1 - \beta_{1}\right)g_{t}$$ $$v_{t} = \beta_{2}v_{t-1} + \left(1 - \beta_{2}\right)g_{t}^{2}$$ $$\hat{v}_{t} = \max\left(\hat{v}_{t-1}, v_{t}\right)$$ $$\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_{t}} + \epsilon}m_{t}$$
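A minimal NumPy sketch of the AMSGrad step (hyper-parameter names and defaults here follow the usual Adam conventions and are illustrative): the only change from Adam is keeping a running maximum of the second-moment estimate.

```python
import numpy as np

def amsgrad_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update; `state` carries m, v and the running max v_hat."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Key difference from Adam: never let the effective second moment shrink.
    state["v_hat"] = np.maximum(state["v_hat"], state["v"])
    return theta - lr * state["m"] / (np.sqrt(state["v_hat"]) + eps)

# Toy example: minimizing f(x) = x^2, whose gradient is 2x.
theta = np.array(5.0)
state = {"m": 0.0, "v": 0.0, "v_hat": 0.0}
for _ in range(200):
    theta = amsgrad_step(theta, 2 * theta, state, lr=0.1)
```

Because `v_hat` is non-decreasing, the effective step size can only shrink over time, which is what restores the convergence guarantee Adam lacks.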
Semi-Pseudo-Label
Mix-FFN is a feedforward layer used in the SegFormer architecture. ViT uses positional encoding (PE) to introduce location information. However, the resolution of the PE is fixed. Therefore, when the test resolution differs from the training one, the positional code needs to be interpolated, and this often leads to dropped accuracy. To alleviate this problem, CPVT uses a $3 \times 3$ Conv together with the PE to implement a data-driven PE. The authors of Mix-FFN argue that positional encoding is actually not necessary for semantic segmentation. Instead, they use Mix-FFN, which considers the effect of zero padding to leak location information, by directly using a $3 \times 3$ Conv in the feed-forward network (FFN). Mix-FFN can be formulated as: $$\mathbf{x}_{\text{out}} = \text{MLP}\left(\text{GELU}\left(\text{Conv}_{3 \times 3}\left(\text{MLP}\left(\mathbf{x}_{\text{in}}\right)\right)\right)\right) + \mathbf{x}_{\text{in}}$$ where $\mathbf{x}_{\text{in}}$ is the feature from a self-attention module. Mix-FFN mixes a $3 \times 3$ convolution and an MLP into each FFN.
Stochastic Weight Averaging is an optimization procedure that averages multiple points along the trajectory of SGD, with a cyclical or constant learning rate. On the one hand it averages weights, but it also has the property that, with a cyclical or constant learning rate, SGD proposals are approximately sampling from the loss surface of the network, leading to stochastic weights and helping to discover broader optima.
Relative Position Encodings are a type of position embeddings for Transformer-based models that attempt to exploit pairwise, relative positional information. Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys: $$e_{ij} = \frac{x_{i}W^{Q}\left(x_{j}W^{K} + a_{ij}^{K}\right)^{T}}{\sqrt{d_{z}}}$$ Here $a_{ij}^{K}$ is an edge representation for the inputs $x_{i}$ and $x_{j}$. The softmax operation remains unchanged from vanilla self-attention. Then relative positional information is supplied again as a sub-component of the values matrix: $$z_{i} = \sum_{j=1}^{n}\alpha_{ij}\left(x_{j}W^{V} + a_{ij}^{V}\right)$$ In other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to keys and values on the fly during attention calculation. Source: Jake Tae. Image Source: [Relative Positional Encoding for Transformers with Linear Complexity](https://www.youtube.com/watch?v=qajudaEHuq8)
PixelShuffle is an operation used in super-resolution models to implement efficient sub-pixel convolutions with a stride of $1/r$. Specifically it rearranges elements in a tensor of shape $\left(*, C \times r^{2}, H, W\right)$ to a tensor of shape $\left(*, C, H \times r, W \times r\right)$. Image Source: Remote Sensing Single-Image Resolution Improvement Using A Deep Gradient-Aware Network with Image-Specific Enhancement
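A NumPy sketch of the rearrangement for a single (C·r², H, W) tensor (the sub-channel ordering follows the common sub-pixel convolution convention and should be treated as illustrative):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    # Interleave the r*r sub-channels into the upscaled spatial grid.
    x = x.transpose(0, 3, 1, 4, 2)  # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

upscaled = pixel_shuffle(np.arange(8 * 3 * 3, dtype=float).reshape(8, 3, 3), r=2)
```

With r=2, eight 3×3 channels become two 6×6 channels: each output pixel block of size 2×2 is filled from four consecutive input channels.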
Deep Equilibrium Models
Deep Equilibrium Models are a new kind of implicit model, where the output of the network is defined as the solution to an "infinite-depth" fixed-point equation. Thanks to this, we can compute the gradient of the output without storing intermediate activations, and therefore with a significantly reduced memory footprint.
Neural Oblivious Decision Ensembles
Neural Oblivious Decision Ensembles (NODE) is a tabular data architecture that consists of differentiable oblivious decision trees (ODT) that are trained end-to-end by backpropagation. The core building block is a Neural Oblivious Decision Ensemble (NODE) layer. The layer is composed of $m$ differentiable oblivious decision trees (ODTs) of equal depth $d$. As an input, all trees get a common vector $x \in \mathbb{R}^{n}$, containing $n$ numeric features. Below we describe a design of a single differentiable ODT. In its essence, an ODT is a decision table that splits the data along $d$ splitting features and compares each feature to a learned threshold. Then, the tree returns one of the $2^{d}$ possible responses, corresponding to the comparisons result. Therefore, each ODT is completely determined by its splitting features $f \in \mathbb{R}^{d}$, splitting thresholds $b \in \mathbb{R}^{d}$ and a $d$-dimensional tensor of responses $R \in \mathbb{R}^{2 \times 2 \times \cdots \times 2}$. In this notation, the tree output is defined as: $$h\left(x\right) = R\left[\mathbb{1}\left(f_{1}\left(x\right) - b_{1}\right), \ldots, \mathbb{1}\left(f_{d}\left(x\right) - b_{d}\right)\right]$$ where $\mathbb{1}\left(\cdot\right)$ denotes the Heaviside function.
Deformable Attention Module is an attention module used in the Deformable DETR architecture, which seeks to overcome one issue with base Transformer attention: that it looks over all possible spatial locations. Inspired by deformable convolution, the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small fixed number of keys for each query, the issues of convergence and feature spatial resolution can be mitigated. Given an input feature map $x \in \mathbb{R}^{C \times H \times W}$, let $q$ index a query element with content feature $z_{q}$ and a 2-d reference point $p_{q}$; the deformable attention feature is calculated by: $$\text{DeformAttn}\left(z_{q}, p_{q}, x\right) = \sum_{m=1}^{M}W_{m}\left[\sum_{k=1}^{K}A_{mqk} \cdot W^{'}_{m}x\left(p_{q} + \Delta p_{mqk}\right)\right]$$ where $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total sampled key number. $\Delta p_{mqk}$ and $A_{mqk}$ denote the sampling offset and attention weight of the $k$-th sampling point in the $m$-th attention head, respectively. The scalar attention weight $A_{mqk}$ lies in the range $\left[0, 1\right]$, normalized by $\sum_{k=1}^{K}A_{mqk} = 1$. $\Delta p_{mqk} \in \mathbb{R}^{2}$ are 2-d real numbers with unconstrained range. As $p_{q} + \Delta p_{mqk}$ is fractional, bilinear interpolation is applied as in Dai et al. (2017) in computing $x\left(p_{q} + \Delta p_{mqk}\right)$. Both $\Delta p_{mqk}$ and $A_{mqk}$ are obtained via linear projection over the query feature $z_{q}$. In implementation, the query feature $z_{q}$ is fed to a linear projection operator of $3MK$ channels, where the first $2MK$ channels encode the sampling offsets $\Delta p_{mqk}$, and the remaining $MK$ channels are fed to a softmax operator to obtain the attention weights $A_{mqk}$.
Generative Adversarial Imitation Learning
Generative Adversarial Imitation Learning presents a new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning.
Slanted Triangular Learning Rates (STLR) is a learning rate schedule which first linearly increases the learning rate and then linearly decays it, as can be seen in the figure to the right. It is a modification of Triangular Learning Rates, with a short increase period and a long decay period.
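A small sketch of the schedule, following the ULMFiT formulation (the parameter names `cut_frac` and `ratio` and the default values are assumptions carried over from that paper):

```python
def slanted_triangular_lr(t, total_steps, max_lr=0.01, cut_frac=0.1, ratio=32):
    """LR at step t: linear warmup for `cut_frac` of training, then linear decay.
    `ratio` controls how much smaller the lowest LR is than max_lr."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                  # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return max_lr * (1 + p * (ratio - 1)) / ratio
```

The peak learning rate `max_lr` is reached exactly at step `cut`, and the schedule falls back to `max_lr / ratio` at the end of training.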
1-bit Adam is a stochastic optimization technique that is a variant of Adam with error-compensated 1-bit compression, based on the finding that Adam's variance term becomes stable at an early stage. First, vanilla Adam is used for a few epochs as a warm-up. After the warm-up stage, the compression stage starts: we stop updating the variance term and use it as a fixed precondition. At the compression stage, we communicate based on the momentum applied with error-compensated 1-bit compression. The momentums are quantized into a 1-bit representation (the sign of each element). Accompanying the vector, a scaling factor is computed so that the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression can reduce the communication cost by $32\times$ and $16\times$ compared to the original float32 and float16 training, respectively.
Differentiable Neural Architecture Search
Dual Multimodal Attention
In the image inpainting task, the mechanism extracts complementary features from the word embedding along two paths, comparing the descriptive text and complementary image areas through reciprocal attention.
Spatially-Adaptive Normalization
SPADE, or Spatially-Adaptive Normalization, is a conditional normalization method for semantic image synthesis. Similar to Batch Normalization, the activation is normalized in a channel-wise manner and then modulated with a learned scale and bias. In SPADE, the mask is first projected onto an embedding space and then convolved to produce the modulation parameters $\gamma$ and $\beta$. Unlike prior conditional normalization methods, $\gamma$ and $\beta$ are not vectors, but tensors with spatial dimensions. The produced $\gamma$ and $\beta$ are multiplied and added to the normalized activation element-wise.
Gradient Sparsification is a technique for distributed training that sparsifies stochastic gradients to reduce the communication cost, with a minor increase in the number of iterations. The key idea behind the sparsification technique is to drop some coordinates of the stochastic gradient and appropriately amplify the remaining coordinates to ensure the unbiasedness of the sparsified stochastic gradient. The sparsification approach can significantly reduce the coding length of the stochastic gradient and only slightly increase its variance.
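A toy sketch of the drop-and-amplify idea (the keep probabilities `p` below are arbitrary; the paper actually chooses them to trade sparsity against variance):

```python
import numpy as np

def sparsify(grad, p, rng):
    """Keep coordinate i with probability p[i] and rescale by 1/p[i],
    so the sparsified gradient is unbiased: E[out] == grad."""
    kept = rng.random(grad.shape) < p
    return np.where(kept, grad / p, 0.0)

rng = np.random.default_rng(0)
grad = np.array([1.0, -2.0, 0.5])
p = np.array([0.5, 0.9, 0.2])
# Averaging many sparsified samples recovers the original gradient.
estimate = np.mean([sparsify(grad, p, rng) for _ in range(20000)], axis=0)
```

Each surviving coordinate is amplified by the reciprocal of its keep probability, which is exactly what makes the estimator unbiased despite the dropped coordinates.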
Linear Warmup is a learning rate schedule where we linearly increase the learning rate from a low rate to a constant rate thereafter. This reduces volatility in the early stages of training. Image Credit: Chengwei Zhang
VGG Loss is a type of content loss introduced in the Perceptual Losses for Real-Time Style Transfer and Super-Resolution framework. It is an alternative to pixel-wise losses; VGG Loss attempts to be closer to perceptual similarity. The VGG loss is based on the ReLU activation layers of the pre-trained 19-layer VGG network. With $\phi_{i,j}$ we indicate the feature map obtained by the $j$-th convolution (after activation) before the $i$-th maxpooling layer within the VGG19 network, which we consider given. We then define the VGG loss as the euclidean distance between the feature representations of a reconstructed image $G_{\theta_{G}}\left(I^{LR}\right)$ and the reference image $I^{HR}$: $$l_{VGG/i,j} = \frac{1}{W_{i,j}H_{i,j}}\sum_{x=1}^{W_{i,j}}\sum_{y=1}^{H_{i,j}}\left(\phi_{i,j}\left(I^{HR}\right)_{x,y} - \phi_{i,j}\left(G_{\theta_{G}}\left(I^{LR}\right)\right)_{x,y}\right)^{2}$$ Here $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the respective feature maps within the VGG network.
Parrot optimizer: Algorithm and applications to medical problems
Stochastic optimization methods have gained significant prominence as effective techniques in contemporary research, addressing complex optimization challenges efficiently. This paper introduces the Parrot Optimizer (PO), an efficient optimization method inspired by key behaviors observed in trained Pyrrhura Molinae parrots. The study features qualitative analysis and comprehensive experiments to showcase the distinct characteristics of the Parrot Optimizer in handling various optimization problems. Performance evaluation involves benchmarking the proposed PO on 35 functions, encompassing classical cases and problems from the IEEE CEC 2022 test sets, and comparing it with eight popular algorithms. The results vividly highlight the competitive advantages of the PO in terms of its exploratory and exploitative traits. Furthermore, parameter sensitivity experiments explore the adaptability of the proposed PO under varying configurations. The developed PO demonstrates effectiveness and superiority when applied to engineering design problems. To further extend the assessment to real-world applications, we included the application of PO to disease diagnosis and medical image segmentation problems, which are highly relevant and significant in the medical field. In conclusion, the findings substantiate that the PO is a promising and competitive algorithm, surpassing some existing algorithms in the literature. The supplementary files and open source codes of the proposed Parrot Optimizer (PO) are available at https://aliasgharheidari.com/PO.html
A Class Attention layer, or CA Layer, is an attention mechanism for vision transformers used in CaiT that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding $x_{\text{class}}$ (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$. Considering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_{q}, W_{k}, W_{v}, W_{o} \in \mathbb{R}^{d \times d}$, and the corresponding biases $b_{q}, b_{k}, b_{v}, b_{o} \in \mathbb{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z = \left[x_{\text{class}}, x_{\text{patches}}\right]$. We then perform the projections: $$Q = W_{q}x_{\text{class}} + b_{q}$$ $$K = W_{k}z + b_{k}$$ $$V = W_{v}z + b_{v}$$ The class-attention weights are given by $$A = \text{Softmax}\left(QK^{T}/\sqrt{d/h}\right)$$ where $QK^{T} \in \mathbb{R}^{h \times 1 \times \left(p+1\right)}$. This attention is involved in the weighted sum $A \times V$ to produce the residual output vector $$\text{out}_{\text{CA}} = W_{o}AV + b_{o}$$ which is in turn added to $x_{\text{class}}$ for subsequent processing.
INFO: An Efficient Optimization Algorithm based on Weighted Mean of Vectors
This study presents the analysis and principle of an innovative optimizer named weIghted meaN oF vectOrs (INFO) to optimize different problems. INFO is a modified weight mean method, whereby the weighted mean idea is employed for a solid structure and updating the vectors’ position using three core procedures: updating rule, vector combining, and a local search. The updating rule stage is based on a mean-based law and convergence acceleration to generate new vectors. The vector combining stage creates a combination of obtained vectors with the updating rule to achieve a promising solution. The updating rule and vector combining steps were improved in INFO to increase the exploration and exploitation capacities. Moreover, the local search stage helps this algorithm escape low-accuracy solutions and improve exploitation and convergence. The performance of INFO was evaluated on 48 mathematical test functions and five constrained engineering test cases, including the optimal design of 10-reservoir and 4-reservoir systems. According to the literature, the results demonstrate that INFO outperforms other basic and advanced methods in terms of exploration and exploitation. In the case of engineering problems, the results indicate that INFO can converge to 0.99% of the global optimum solution. Hence, the INFO algorithm is a promising tool for optimal designs in optimization problems, which stems from the considerable efficiency of this algorithm for optimizing constrained cases. The source codes of the INFO algorithm are publicly available at https://aliasgharheidari.com/INFO.html
Nesterov Accelerated Gradient is a momentum-based SGD optimizer that "looks ahead" to where the parameters will be to calculate the gradient ex post rather than ex ante: $$v_{t} = \gamma v_{t-1} + \eta\nabla f\left(\theta_{t} - \gamma v_{t-1}\right)$$ $$\theta_{t+1} = \theta_{t} - v_{t}$$ Like SGD with momentum, $\gamma$ is usually set to $0.9$; $\gamma$ and $\eta$ are usually less than $1$. The intuition is that the standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient. In contrast, Nesterov momentum first makes a big jump in the direction of the previous accumulated gradient and then measures the gradient where it ends up and makes a correction. The idea is that it is better to correct a mistake after you have made it. Image Source: Geoff Hinton lecture notes
Content-based attention is an attention mechanism based on cosine similarity: $$e_{i} = \cos\left[\mathbf{k}, \mathbf{m}_{i}\right]$$ where $\mathbf{k}$ is a key vector and $\mathbf{m}_{i}$ is the $i$-th memory row. It was utilised in Neural Turing Machines as part of the Addressing Mechanism. We produce a normalized attention weighting by taking a softmax over these attention alignment scores: $$w_{i} = \frac{\exp\left(\beta e_{i}\right)}{\sum_{j}\exp\left(\beta e_{j}\right)}$$ where $\beta$ is a key strength parameter.
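A minimal sketch of the addressing step (the memory layout and the sharpening parameter `beta` follow the NTM convention; the interface is illustrative):

```python
import numpy as np

def content_attention(key, memory, beta=1.0):
    """Softmax over cosine similarities between `key` and each memory row."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    scores = beta * sims
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = content_attention(np.array([1.0, 0.0]), memory, beta=5.0)
```

Larger `beta` sharpens the weighting toward the most similar memory row; `beta=0` gives a uniform distribution.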
Lipschitz Constant Constraint
Bottleneck Attention Module
Park et al. proposed the bottleneck attention module (BAM), aiming to efficiently improve the representational capability of networks. It uses dilated convolution to enlarge the receptive field of the spatial attention sub-module, and builds a bottleneck structure as suggested by ResNet to save computational cost. For a given input feature map $X \in \mathbb{R}^{C \times H \times W}$, BAM infers the channel attention $s_{c} \in \mathbb{R}^{C}$ and spatial attention $s_{s} \in \mathbb{R}^{H \times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\mathbb{R}^{C \times H \times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. Overall, BAM can be written as \begin{align} s_{c} &= \text{BN}\left(W_{2}\left(W_{1}\text{GAP}\left(X\right)+b_{1}\right)+b_{2}\right) \end{align} \begin{align} s_{s} &= \text{BN}\left(\text{Conv}_{2}^{1 \times 1}\left(\text{DC}_{2}^{3\times 3}\left(\text{DC}_{1}^{3 \times 3}\left(\text{Conv}_{1}^{1 \times 1}\left(X\right)\right)\right)\right)\right) \end{align} \begin{align} s &= \sigma\left(\text{Expand}\left(s_{s}\right)+\text{Expand}\left(s_{c}\right)\right) \end{align} \begin{align} Y &= sX+X \end{align} where $W_{i}$, $b_{i}$ denote weights and biases of fully connected layers respectively, $\text{Conv}_{1}^{1 \times 1}$ and $\text{Conv}_{2}^{1 \times 1}$ are convolution layers used for channel reduction, $\text{DC}_{i}^{3 \times 3}$ denotes a dilated convolution with a $3 \times 3$ kernel, applied to utilize contextual information effectively, and $\text{Expand}$ expands the attention maps $s_{s}$ and $s_{c}$ to $\mathbb{R}^{C \times H \times W}$. BAM can emphasize or suppress features in both spatial and channel dimensions, as well as improving the representational power. Dimensional reduction applied to both channel and spatial attention branches enables it to be integrated with any convolutional neural network with little extra computational cost. However, although dilated convolutions enlarge the receptive field effectively, they still fail to capture long-range contextual information as well as encoding cross-domain relationships.
style-based recalibration module
SRM combines style transfer with an attention mechanism. Its main contribution is style pooling, which utilizes both the mean and standard deviation of the input features to improve its capability to capture global information. It also adopts a lightweight channel-wise fully-connected (CFC) layer, in place of the original fully-connected layer, to reduce the computational requirements. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, SRM first collects global information by using style pooling ($\text{SP}$), which combines global average pooling and global standard deviation pooling. Then a channel-wise fully connected ($\text{CFC}$) layer (i.e. fully connected per channel), batch normalization $\text{BN}$ and sigmoid function $\sigma$ are used to provide the attention vector. Finally, as in an SE block, the input features are multiplied by the attention vector. Overall, an SRM can be written as: \begin{align} s = F_{\text{srm}}\left(X, \theta\right) &= \sigma\left(\text{BN}\left(\text{CFC}\left(\text{SP}\left(X\right)\right)\right)\right) \end{align} \begin{align} Y &= sX \end{align} The SRM block improves both squeeze and excitation modules, yet can be added after each residual unit like an SE block.
Extreme Value Machine
Evolved Sign Momentum
The Lion optimizer was discovered by symbolic program search. It is more memory-efficient than most adaptive optimizers as it only keeps track of the momentum. The update of Lion is produced by the sign function.
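A sketch of one Lion step in NumPy (the hyper-parameter defaults are illustrative; the interpolate-then-sign structure is the part that matters):

```python
import numpy as np

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: step by the sign of an interpolated momentum,
    with decoupled weight decay; only `m` persists between steps."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (update + wd * theta)
    m = beta2 * m + (1 - beta2) * grad
    return theta, m

theta, m = lion_step(np.array(1.0), np.array(2.0), np.array(0.0), lr=0.1)
```

Because the update is always ±1 per coordinate (times the learning rate), Lion needs no second-moment statistics, which is the source of its memory savings over Adam.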
Spatial-Channel Token Distillation
The Spatial-Channel Token Distillation method is proposed to improve the spatial and channel mixing from a novel knowledge distillation (KD) perspective. Specifically, it is a KD mechanism for MLP-like vision models, called Spatial-channel Token Distillation (STD), which improves the information mixing in both the spatial and channel dimensions of MLP blocks. Instead of modifying the mixing operations themselves, STD adds spatial and channel tokens to image patches. After forward propagation, the tokens are concatenated for distillation, with the teachers' responses as targets. Each token works as an aggregator of its dimension; their objective is to encourage each mixing operation to extract maximal task-related information from its specific dimension.
COLA is a self-supervised pre-training approach for learning a general-purpose representation of audio. It is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings.
Temporal Adaptive Module
TAM is designed to capture complex temporal relationships both efficiently and flexibly. It adopts an adaptive kernel instead of self-attention to capture global contextual information, with lower time complexity than GLTR. TAM has two branches, a local branch and a global branch. Given the input feature map $X \in \mathbb{R}^{C \times T \times H \times W}$, global spatial average pooling $\text{GAP}$ is first applied to the feature map to ensure TAM has a low computational cost. Then the local branch in TAM employs several 1D convolutions with ReLU nonlinearity across the temporal domain to produce location-sensitive importance maps for enhancing frame-wise features. The local branch can be written as \begin{align} s &= \sigma\left(\text{Conv1D}\left(\delta\left(\text{Conv1D}\left(\text{GAP}\left(X\right)\right)\right)\right)\right) \end{align} \begin{align} X^{1} &= sX \end{align} Unlike the local branch, the global branch is location invariant and focuses on generating a channel-wise adaptive kernel based on global temporal information in each channel. For the $c$-th channel, the kernel can be written as \begin{align} \Theta_{c} = \text{Softmax}\left(\text{FC}_{2}\left(\delta\left(\text{FC}_{1}\left(\text{GAP}\left(X\right)_{c}\right)\right)\right)\right) \end{align} where $\Theta_{c} \in \mathbb{R}^{K}$ and $K$ is the adaptive kernel size. Finally, TAM convolves the adaptive kernel $\Theta$ with $X^{1}$: \begin{align} Y = \Theta \otimes X^{1} \end{align} With the help of the local branch and global branch, TAM can capture the complex temporal structures in video and enhance per-frame features at low computational cost. Due to its flexibility and lightweight design, TAM can be added to any existing 2D CNNs.
Kolen-Pollack Learning
Hou et al. proposed coordinate attention, a novel attention mechanism which embeds positional information into channel attention, so that the network can focus on large important regions at little computational cost. The coordinate attention mechanism has two consecutive steps: coordinate information embedding and coordinate attention generation. First, two spatial extents of pooling kernels encode each channel horizontally and vertically. In the second step, a shared convolutional transformation function is applied to the concatenated outputs of the two pooling layers. Then coordinate attention splits the resulting tensor into two separate tensors to yield attention vectors with the same number of channels along the horizontal and vertical coordinates of the input. This can be written as \begin{align} z^{h} &= \text{GAP}^{h}\left(X\right) \end{align} \begin{align} z^{w} &= \text{GAP}^{w}\left(X\right) \end{align} \begin{align} f &= \delta\left(\text{BN}\left(\text{Conv}_{1}^{1\times 1}\left(\left[z^{h};z^{w}\right]\right)\right)\right) \end{align} \begin{align} f^{h}, f^{w} &= \text{Split}\left(f\right) \end{align} \begin{align} s^{h} &= \sigma\left(\text{Conv}_{h}^{1\times 1}\left(f^{h}\right)\right) \end{align} \begin{align} s^{w} &= \sigma\left(\text{Conv}_{w}^{1\times 1}\left(f^{w}\right)\right) \end{align} \begin{align} Y &= Xs^{h}s^{w} \end{align} where $\text{GAP}^{h}$ and $\text{GAP}^{w}$ denote pooling functions for vertical and horizontal coordinates, and $s^{h}$ and $s^{w}$ represent the corresponding attention weights. Using coordinate attention, the network can accurately obtain the position of a targeted object. This approach has a larger receptive field than BAM and CBAM. Like an SE block, it also models cross-channel relationships, effectively enhancing the expressive power of the learned features. Due to its lightweight design and flexibility, it can be easily used in classical building blocks of mobile networks.
Context Optimization
CoOp, or Context Optimization, is an automated prompt engineering method that avoids manual prompt tuning by modeling context words with continuous vectors that are end-to-end learned from data. The context could be shared among all classes or designed to be class-specific. During training, we simply minimize the prediction error using the cross-entropy loss with respect to the learnable context vectors, while keeping the pre-trained parameters fixed. The gradients can be back-propagated all the way through the text encoder, distilling the rich knowledge encoded in the parameters for learning task-relevant context.
Sigmoid Linear Unit
Sigmoid Linear Units, or SiLUs, are activation functions for neural networks. The activation of the SiLU is computed by the sigmoid function multiplied by its input, or $$\text{SiLU}\left(x\right) = x\sigma\left(x\right)$$ See Gaussian Error Linear Units (GELUs), where the SiLU was originally coined, and Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning and Swish: a Self-Gated Activation Function, where the SiLU was experimented with later.
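Numerically, the definition reduces to a one-line function:

```python
import math

def silu(x):
    """SiLU / Swish-1: x * sigmoid(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))
```

For large positive inputs it approaches the identity, for large negative inputs it approaches 0, and unlike ReLU it is smooth and non-monotonic near the origin.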
MoCo v2 is an improved version of the Momentum Contrast self-supervised learning algorithm. Motivated by the findings presented in the SimCLR paper, authors: - Replace the 1-layer fully connected layer with a 2-layer MLP head with ReLU for the unsupervised training stage. - Include blur augmentation. - Use cosine learning rate schedule. These modifications enable MoCo to outperform the state-of-the-art SimCLR with a smaller batch size and fewer epochs.
Spatial-Reduction Attention, or SRA, is a multi-head attention module used in the Pyramid Vision Transformer architecture which reduces the spatial scale of the key $K$ and value $V$ before the attention operation. This reduces the computational/memory overhead. Details of the SRA in Stage $i$ can be formulated as follows: $$\text{SRA}\left(Q, K, V\right) = \text{Concat}\left(\text{head}_{0}, \ldots, \text{head}_{N_{i}}\right)W^{O}$$ $$\text{head}_{j} = \text{Attention}\left(QW_{j}^{Q}, \text{SR}\left(K\right)W_{j}^{K}, \text{SR}\left(V\right)W_{j}^{V}\right)$$ where $\text{Concat}\left(\cdot\right)$ is the concatenation operation. $W_{j}^{Q} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, $W_{j}^{K} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, $W_{j}^{V} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, and $W^{O} \in \mathbb{R}^{C_{i} \times C_{i}}$ are linear projection parameters. $N_{i}$ is the head number of the attention layer in Stage $i$. Therefore, the dimension of each head (i.e. $d_{\text{head}}$) is equal to $\frac{C_{i}}{N_{i}}$. $\text{SR}\left(\cdot\right)$ is the operation for reducing the spatial dimension of the input sequence ($K$ or $V$), which is written as: $$\text{SR}\left(\mathbf{x}\right) = \text{Norm}\left(\text{Reshape}\left(\mathbf{x}, R_{i}\right)W^{S}\right)$$ Here, $\mathbf{x} \in \mathbb{R}^{\left(H_{i}W_{i}\right) \times C_{i}}$ represents an input sequence, and $R_{i}$ denotes the reduction ratio of the attention layers in Stage $i$. $\text{Reshape}\left(\mathbf{x}, R_{i}\right)$ is an operation of reshaping the input sequence $\mathbf{x}$ to a sequence of size $\frac{H_{i}W_{i}}{R_{i}^{2}} \times \left(R_{i}^{2}C_{i}\right)$. $W^{S} \in \mathbb{R}^{\left(R_{i}^{2}C_{i}\right) \times C_{i}}$ is a linear projection that reduces the dimension of the input sequence to $C_{i}$. $\text{Norm}\left(\cdot\right)$ refers to layer normalization.
Zoneout is a method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks.
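A sketch of one zoneout step on a hidden-state vector (the test-time behavior follows the paper's expectation-based convention; the interface is illustrative):

```python
import numpy as np

def zoneout(h_prev, h_new, rate, rng, training=True):
    """Each unit keeps its previous value with probability `rate`;
    at test time, use the expected mixture of old and new states."""
    if training:
        keep_prev = rng.random(h_new.shape) < rate
        return np.where(keep_prev, h_prev, h_new)
    return rate * h_prev + (1 - rate) * h_new
```

Note the contrast with dropout: a "zoned out" unit copies its previous state forward rather than being zeroed, so gradients can still flow through the identity connection.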
SpatialDropout is a type of dropout for convolutional networks. For a given convolution feature tensor of size $n_{\text{feats}} \times \text{height} \times \text{width}$, we perform only $n_{\text{feats}}$ dropout trials and extend the dropout value across the entire feature map. Therefore, adjacent pixels in the dropped-out feature map are either all 0 (dropped-out) or all active, as illustrated in the figure to the right.
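A sketch for a single (C, H, W) feature tensor (the inverted-dropout scaling of survivors is an assumption here, matching common framework behavior):

```python
import numpy as np

def spatial_dropout(x, rate, rng, training=True):
    """Drop whole feature maps: one Bernoulli trial per channel,
    extended across the entire H x W map; survivors scaled by 1/(1-rate)."""
    if not training or rate == 0.0:
        return x
    keep = rng.random((x.shape[0], 1, 1)) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)

out = spatial_dropout(np.ones((4, 3, 3)), rate=0.5, rng=np.random.default_rng(1))
```

Because the mask is broadcast over the spatial axes, every pixel within a channel shares the same fate, which is exactly the all-zero-or-all-active behavior described above.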