Activation Regularization (AR) is regularization performed on activations as opposed to weights. It is usually used in conjunction with RNNs. It is defined as: $\alpha L_{2}\left(m \circ h_{t}\right)$, where $m$ is a dropout mask used by later parts of the model, $L_{2}$ is the $L_{2}$ norm, $h_{t}$ is the output of an RNN at timestep $t$, and $\alpha$ is a scaling coefficient. When applied to the output of a dense layer, AR penalizes activations that are substantially away from 0, encouraging activations to remain small.
Direct Feedback Alignment
Adaptive Parameter-wise Diagonal Quasi-Newton Method
Group Normalization is a normalization layer that divides channels into groups and normalizes the features within each group. GN does not exploit the batch dimension, and its computation is independent of batch sizes. In the case where the group size is 1, it is equivalent to Instance Normalization. As motivation for the method, many classical features like SIFT and HOG had group-wise features and involved group-wise normalization. For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram. Formally, Group Normalization is defined as: $$\hat{x}_{i} = \frac{1}{\sigma_{i}}\left(x_{i} - \mu_{i}\right), \quad \mu_{i} = \frac{1}{m}\sum_{k \in \mathcal{S}_{i}}x_{k}, \quad \sigma_{i} = \sqrt{\frac{1}{m}\sum_{k \in \mathcal{S}_{i}}\left(x_{k} - \mu_{i}\right)^{2} + \epsilon}$$ Here $x$ is the feature computed by a layer, and $i$ is an index. Formally, a Group Norm layer computes $\mu$ and $\sigma$ in a set $\mathcal{S}_{i}$ defined as: $$\mathcal{S}_{i} = \left\{k \mid k_{N} = i_{N}, \left\lfloor\frac{k_{C}}{C/G}\right\rfloor = \left\lfloor\frac{i_{C}}{C/G}\right\rfloor\right\}$$ Here $G$ is the number of groups, which is a pre-defined hyper-parameter ($G = 32$ by default). $C/G$ is the number of channels per group. $\lfloor\cdot\rfloor$ is the floor operation, and the final term means that the indexes $i$ and $k$ are in the same group of channels, assuming each group of channels are stored in a sequential order along the $C$ axis.
Swapping Assignments between Views
SwAV, or Swapping Assignments Between Views, is a self-supervised learning approach that takes advantage of contrastive methods without requiring pairwise comparisons to be computed. Specifically, it simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, SwAV uses a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view.
A ShuffleNet Block is an image model block that utilises a channel shuffle operation, along with depthwise convolutions, for an efficient architectural design. It was proposed as part of the ShuffleNet architecture. The starting point is the Residual Block unit from ResNets, which is then modified with a pointwise group convolution and a channel shuffle operation.
The Maxout Unit is a generalization of the ReLU and the leaky ReLU functions. It is a piecewise linear function that returns the maximum of the inputs, designed to be used in conjunction with dropout. Both ReLU and leaky ReLU are special cases of Maxout. The main drawback of Maxout is that it is computationally expensive as it doubles the number of parameters for each neuron.
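As a sketch of the idea (assuming the linear pre-activations are already computed; the function name and interface are illustrative), a maxout unit simply takes the maximum over groups of linear pieces:

```python
import numpy as np

def maxout(x, num_pieces=2):
    """Maxout activation: max over groups of `num_pieces` pre-activations.

    x has shape (batch, features) with features divisible by num_pieces;
    in a real layer each piece would come from its own affine transform.
    """
    batch, feats = x.shape
    assert feats % num_pieces == 0, "features must divide evenly into pieces"
    return x.reshape(batch, feats // num_pieces, num_pieces).max(axis=-1)

# ReLU as a special case: pair every pre-activation with a constant zero piece.
x = np.array([[-1.0, 2.0, -3.0]])
paired = np.stack([x, np.zeros_like(x)], axis=-1).reshape(1, -1)
relu_like = maxout(paired, num_pieces=2)  # equals np.maximum(x, 0)
```

The doubled parameter count mentioned above corresponds to each output unit needing `num_pieces` separate affine transforms.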
AMSGrad is a stochastic optimization method that seeks to fix a convergence issue with Adam-based optimizers. AMSGrad uses the maximum of past squared gradients rather than the exponential average to update the parameters: $$m_{t} = \beta_{1}m_{t-1} + \left(1 - \beta_{1}\right)g_{t}$$ $$v_{t} = \beta_{2}v_{t-1} + \left(1 - \beta_{2}\right)g_{t}^{2}$$ $$\hat{v}_{t} = \max\left(\hat{v}_{t-1}, v_{t}\right)$$ $$\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_{t}} + \epsilon}m_{t}$$
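A minimal NumPy sketch of the AMSGrad step (hyper-parameter names and defaults here follow the usual Adam conventions and are illustrative): the only change from Adam is keeping a running maximum of the second-moment estimate.

```python
import numpy as np

def amsgrad_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update; `state` carries m, v and the running max v_hat."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Key difference from Adam: never let the effective second moment shrink.
    state["v_hat"] = np.maximum(state["v_hat"], state["v"])
    return theta - lr * state["m"] / (np.sqrt(state["v_hat"]) + eps)

# Toy example: minimizing f(x) = x^2, whose gradient is 2x.
theta = np.array(5.0)
state = {"m": 0.0, "v": 0.0, "v_hat": 0.0}
for _ in range(200):
    theta = amsgrad_step(theta, 2 * theta, state, lr=0.1)
```

Because `v_hat` is non-decreasing, the effective step size can only shrink over time, which is what restores the convergence guarantee Adam lacks.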
Semi-Pseudo-Label
Mix-FFN is a feedforward layer used in the SegFormer architecture. ViT uses positional encoding (PE) to introduce location information. However, the resolution of the PE is fixed. Therefore, when the test resolution differs from the training one, the positional code needs to be interpolated, and this often leads to dropped accuracy. To alleviate this problem, CPVT uses a $3 \times 3$ Conv together with the PE to implement a data-driven PE. The authors of Mix-FFN argue that positional encoding is actually not necessary for semantic segmentation. Instead, they use Mix-FFN, which considers the effect of zero padding to leak location information, by directly using a $3 \times 3$ Conv in the feed-forward network (FFN). Mix-FFN can be formulated as: $$\mathbf{x}_{\text{out}} = \text{MLP}\left(\text{GELU}\left(\text{Conv}_{3 \times 3}\left(\text{MLP}\left(\mathbf{x}_{\text{in}}\right)\right)\right)\right) + \mathbf{x}_{\text{in}}$$ where $\mathbf{x}_{\text{in}}$ is the feature from a self-attention module. Mix-FFN mixes a $3 \times 3$ convolution and an MLP into each FFN.
Stochastic Weight Averaging is an optimization procedure that averages multiple points along the trajectory of SGD, with a cyclical or constant learning rate. On the one hand it averages weights, but it also has the property that, with a cyclical or constant learning rate, SGD proposals are approximately sampling from the loss surface of the network, leading to stochastic weights and helping to discover broader optima.
Relative Position Encodings are a type of position embeddings for Transformer-based models that attempt to exploit pairwise, relative positional information. Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys: $$e_{ij} = \frac{x_{i}W^{Q}\left(x_{j}W^{K} + a_{ij}^{K}\right)^{T}}{\sqrt{d_{z}}}$$ Here $a_{ij}^{K}$ is an edge representation for the inputs $x_{i}$ and $x_{j}$. The softmax operation remains unchanged from vanilla self-attention. Then relative positional information is supplied again as a sub-component of the values matrix: $$z_{i} = \sum_{j=1}^{n}\alpha_{ij}\left(x_{j}W^{V} + a_{ij}^{V}\right)$$ In other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to keys and values on the fly during attention calculation. Source: Jake Tae. Image Source: [Relative Positional Encoding for Transformers with Linear Complexity](https://www.youtube.com/watch?v=qajudaEHuq8)
PixelShuffle is an operation used in super-resolution models to implement efficient sub-pixel convolutions with a stride of $1/r$. Specifically it rearranges elements in a tensor of shape $\left(*, C \times r^{2}, H, W\right)$ to a tensor of shape $\left(*, C, H \times r, W \times r\right)$. Image Source: Remote Sensing Single-Image Resolution Improvement Using A Deep Gradient-Aware Network with Image-Specific Enhancement
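A NumPy sketch of the rearrangement for a single (C·r², H, W) tensor (the sub-channel ordering follows the common sub-pixel convolution convention and should be treated as illustrative):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    # Interleave the r*r sub-channels into the upscaled spatial grid.
    x = x.transpose(0, 3, 1, 4, 2)  # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

upscaled = pixel_shuffle(np.arange(8 * 3 * 3, dtype=float).reshape(8, 3, 3), r=2)
```

With r=2, eight 3×3 channels become two 6×6 channels: each output pixel block of size 2×2 is filled from four consecutive input channels.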
Deep Equilibrium Models
Deep Equilibrium Models are a new kind of implicit model, where the output of the network is defined as the solution to an "infinite-depth" fixed-point equation. Thanks to this, we can compute the gradient of the output without storing intermediate activations, and therefore with a significantly reduced memory footprint.
Neural Oblivious Decision Ensembles
Neural Oblivious Decision Ensembles (NODE) is a tabular data architecture that consists of differentiable oblivious decision trees (ODT) that are trained end-to-end by backpropagation. The core building block is a Neural Oblivious Decision Ensemble (NODE) layer. The layer is composed of $m$ differentiable oblivious decision trees (ODTs) of equal depth $d$. As an input, all trees get a common vector $x \in \mathbb{R}^{n}$, containing $n$ numeric features. Below we describe a design of a single differentiable ODT. In its essence, an ODT is a decision table that splits the data along $d$ splitting features and compares each feature to a learned threshold. Then, the tree returns one of the $2^{d}$ possible responses, corresponding to the comparisons result. Therefore, each ODT is completely determined by its splitting features $f \in \mathbb{R}^{d}$, splitting thresholds $b \in \mathbb{R}^{d}$ and a $d$-dimensional tensor of responses $R \in \mathbb{R}^{2 \times 2 \times \cdots \times 2}$. In this notation, the tree output is defined as: $$h\left(x\right) = R\left[\mathbb{1}\left(f_{1}\left(x\right) - b_{1}\right), \ldots, \mathbb{1}\left(f_{d}\left(x\right) - b_{d}\right)\right]$$ where $\mathbb{1}\left(\cdot\right)$ denotes the Heaviside function.
Deformable Attention Module is an attention module used in the Deformable DETR architecture, which seeks to overcome one issue with base Transformer attention: that it looks over all possible spatial locations. Inspired by deformable convolution, the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small fixed number of keys for each query, the issues of convergence and feature spatial resolution can be mitigated. Given an input feature map $x \in \mathbb{R}^{C \times H \times W}$, let $q$ index a query element with content feature $z_{q}$ and a 2-d reference point $p_{q}$; the deformable attention feature is calculated by: $$\text{DeformAttn}\left(z_{q}, p_{q}, x\right) = \sum_{m=1}^{M}W_{m}\left[\sum_{k=1}^{K}A_{mqk} \cdot W^{'}_{m}x\left(p_{q} + \Delta p_{mqk}\right)\right]$$ where $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total sampled key number. $\Delta p_{mqk}$ and $A_{mqk}$ denote the sampling offset and attention weight of the $k$-th sampling point in the $m$-th attention head, respectively. The scalar attention weight $A_{mqk}$ lies in the range $\left[0, 1\right]$, normalized by $\sum_{k=1}^{K}A_{mqk} = 1$. $\Delta p_{mqk} \in \mathbb{R}^{2}$ are 2-d real numbers with unconstrained range. As $p_{q} + \Delta p_{mqk}$ is fractional, bilinear interpolation is applied as in Dai et al. (2017) in computing $x\left(p_{q} + \Delta p_{mqk}\right)$. Both $\Delta p_{mqk}$ and $A_{mqk}$ are obtained via linear projection over the query feature $z_{q}$. In implementation, the query feature $z_{q}$ is fed to a linear projection operator of $3MK$ channels, where the first $2MK$ channels encode the sampling offsets $\Delta p_{mqk}$, and the remaining $MK$ channels are fed to a softmax operator to obtain the attention weights $A_{mqk}$.
Generative Adversarial Imitation Learning
Generative Adversarial Imitation Learning presents a new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning.
Slanted Triangular Learning Rates (STLR) is a learning rate schedule which first linearly increases the learning rate and then linearly decays it, as can be seen in the figure to the right. It is a modification of Triangular Learning Rates, with a short increase period and a long decay period.
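A small sketch of the schedule, following the ULMFiT formulation (the parameter names `cut_frac` and `ratio` and the default values are assumptions carried over from that paper):

```python
def slanted_triangular_lr(t, total_steps, max_lr=0.01, cut_frac=0.1, ratio=32):
    """LR at step t: linear warmup for `cut_frac` of training, then linear decay.
    `ratio` controls how much smaller the lowest LR is than max_lr."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                  # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return max_lr * (1 + p * (ratio - 1)) / ratio
```

The peak learning rate `max_lr` is reached exactly at step `cut`, and the schedule falls back to `max_lr / ratio` at the end of training.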
1-bit Adam is a stochastic optimization technique that is a variant of Adam with error-compensated 1-bit compression, based on the finding that Adam's variance term becomes stable at an early stage. First, vanilla Adam is used for a few epochs as a warm-up. After the warm-up stage, the compression stage starts: we stop updating the variance term and use it as a fixed precondition. At the compression stage, we communicate based on the momentum applied with error-compensated 1-bit compression. The momentums are quantized into a 1-bit representation (the sign of each element). Accompanying the vector, a scaling factor is computed so that the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression can reduce the communication cost by $32\times$ and $16\times$ compared to the original float32 and float16 training, respectively.
Differentiable Neural Architecture Search
Dual Multimodal Attention
In the image inpainting task, the mechanism extracts complementary features from the word embedding along two paths, comparing the descriptive text and complementary image areas through reciprocal attention.
Spatially-Adaptive Normalization
SPADE, or Spatially-Adaptive Normalization, is a conditional normalization method for semantic image synthesis. Similar to Batch Normalization, the activation is normalized in a channel-wise manner and then modulated with a learned scale and bias. In SPADE, the mask is first projected onto an embedding space and then convolved to produce the modulation parameters $\gamma$ and $\beta$. Unlike prior conditional normalization methods, $\gamma$ and $\beta$ are not vectors, but tensors with spatial dimensions. The produced $\gamma$ and $\beta$ are multiplied and added to the normalized activation element-wise.
Gradient Sparsification is a technique for distributed training that sparsifies stochastic gradients to reduce the communication cost, with a minor increase in the number of iterations. The key idea behind the sparsification technique is to drop some coordinates of the stochastic gradient and appropriately amplify the remaining coordinates to ensure the unbiasedness of the sparsified stochastic gradient. The sparsification approach can significantly reduce the coding length of the stochastic gradient and only slightly increase its variance.
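A toy sketch of the drop-and-amplify idea (the keep probabilities `p` below are arbitrary; the paper actually chooses them to trade sparsity against variance):

```python
import numpy as np

def sparsify(grad, p, rng):
    """Keep coordinate i with probability p[i] and rescale by 1/p[i],
    so the sparsified gradient is unbiased: E[out] == grad."""
    kept = rng.random(grad.shape) < p
    return np.where(kept, grad / p, 0.0)

rng = np.random.default_rng(0)
grad = np.array([1.0, -2.0, 0.5])
p = np.array([0.5, 0.9, 0.2])
# Averaging many sparsified samples recovers the original gradient.
estimate = np.mean([sparsify(grad, p, rng) for _ in range(20000)], axis=0)
```

Each surviving coordinate is amplified by the reciprocal of its keep probability, which is exactly what makes the estimator unbiased despite the dropped coordinates.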
Linear Warmup is a learning rate schedule where we linearly increase the learning rate from a low rate to a constant rate thereafter. This reduces volatility in the early stages of training. Image Credit: Chengwei Zhang
VGG Loss is a type of content loss introduced in the Perceptual Losses for Real-Time Style Transfer and Super-Resolution framework. It is an alternative to pixel-wise losses; VGG Loss attempts to be closer to perceptual similarity. The VGG loss is based on the ReLU activation layers of the pre-trained 19-layer VGG network. With $\phi_{i,j}$ we indicate the feature map obtained by the $j$-th convolution (after activation) before the $i$-th maxpooling layer within the VGG19 network, which we consider given. We then define the VGG loss as the euclidean distance between the feature representations of a reconstructed image $G_{\theta_{G}}\left(I^{LR}\right)$ and the reference image $I^{HR}$: $$l_{VGG/i,j} = \frac{1}{W_{i,j}H_{i,j}}\sum_{x=1}^{W_{i,j}}\sum_{y=1}^{H_{i,j}}\left(\phi_{i,j}\left(I^{HR}\right)_{x,y} - \phi_{i,j}\left(G_{\theta_{G}}\left(I^{LR}\right)\right)_{x,y}\right)^{2}$$ Here $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the respective feature maps within the VGG network.
Parrot optimizer: Algorithm and applications to medical problems
Stochastic optimization methods have gained significant prominence as effective techniques in contemporary research, addressing complex optimization challenges efficiently. This paper introduces the Parrot Optimizer (PO), an efficient optimization method inspired by key behaviors observed in trained Pyrrhura Molinae parrots. The study features qualitative analysis and comprehensive experiments to showcase the distinct characteristics of the Parrot Optimizer in handling various optimization problems. Performance evaluation involves benchmarking the proposed PO on 35 functions, encompassing classical cases and problems from the IEEE CEC 2022 test sets, and comparing it with eight popular algorithms. The results vividly highlight the competitive advantages of the PO in terms of its exploratory and exploitative traits. Furthermore, parameter sensitivity experiments explore the adaptability of the proposed PO under varying configurations. The developed PO demonstrates effectiveness and superiority when applied to engineering design problems. To further extend the assessment to real-world applications, we included the application of PO to disease diagnosis and medical image segmentation problems, which are highly relevant and significant in the medical field. In conclusion, the findings substantiate that the PO is a promising and competitive algorithm, surpassing some existing algorithms in the literature. The supplementary files and open source codes of the proposed Parrot Optimizer (PO) are available at https://aliasgharheidari.com/PO.html
A Class Attention layer, or CA Layer, is an attention mechanism for vision transformers used in CaiT that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that it relies on the attention between (i) the class embedding $x_{\text{class}}$ (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$. Considering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_{q}, W_{k}, W_{v}, W_{o} \in \mathbb{R}^{d \times d}$, and the corresponding biases $b_{q}, b_{k}, b_{v}, b_{o} \in \mathbb{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z = \left[x_{\text{class}}, x_{\text{patches}}\right]$. We then perform the projections: $$Q = W_{q}x_{\text{class}} + b_{q}$$ $$K = W_{k}z + b_{k}$$ $$V = W_{v}z + b_{v}$$ The class-attention weights are given by $$A = \text{Softmax}\left(QK^{T}/\sqrt{d/h}\right)$$ where $QK^{T} \in \mathbb{R}^{h \times 1 \times \left(p+1\right)}$. This attention is involved in the weighted sum $A \times V$ to produce the residual output vector $$\text{out}_{\text{CA}} = W_{o}AV + b_{o}$$ which is in turn added to $x_{\text{class}}$ for subsequent processing.
INFO: An Efficient Optimization Algorithm based on Weighted Mean of Vectors
This study presents the analysis and principle of an innovative optimizer named weIghted meaN oF vectOrs (INFO) to optimize different problems. INFO is a modified weight mean method, whereby the weighted mean idea is employed for a solid structure and updating the vectors’ position using three core procedures: updating rule, vector combining, and a local search. The updating rule stage is based on a mean-based law and convergence acceleration to generate new vectors. The vector combining stage creates a combination of obtained vectors with the updating rule to achieve a promising solution. The updating rule and vector combining steps were improved in INFO to increase the exploration and exploitation capacities. Moreover, the local search stage helps this algorithm escape low-accuracy solutions and improve exploitation and convergence. The performance of INFO was evaluated on 48 mathematical test functions and five constrained engineering test cases, including the optimal design of 10-reservoir and 4-reservoir systems. According to the literature, the results demonstrate that INFO outperforms other basic and advanced methods in terms of exploration and exploitation. In the case of engineering problems, the results indicate that INFO can converge to 0.99% of the global optimum solution. Hence, the INFO algorithm is a promising tool for optimal designs in optimization problems, which stems from the considerable efficiency of this algorithm for optimizing constrained cases. The source codes of the INFO algorithm are publicly available at https://aliasgharheidari.com/INFO.html
Nesterov Accelerated Gradient is a momentum-based SGD optimizer that "looks ahead" to where the parameters will be to calculate the gradient ex post rather than ex ante: $$v_{t} = \gamma v_{t-1} + \eta\nabla f\left(\theta_{t} - \gamma v_{t-1}\right)$$ $$\theta_{t+1} = \theta_{t} - v_{t}$$ Like SGD with momentum, $\gamma$ is usually set to $0.9$; $\gamma$ and $\eta$ are usually less than $1$. The intuition is that the standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient. In contrast, Nesterov momentum first makes a big jump in the direction of the previous accumulated gradient and then measures the gradient where it ends up and makes a correction. The idea is that it is better to correct a mistake after you have made it. Image Source: Geoff Hinton lecture notes
Content-based attention is an attention mechanism based on cosine similarity: $$e_{i} = \cos\left[\mathbf{k}, \mathbf{m}_{i}\right]$$ where $\mathbf{k}$ is a key vector and $\mathbf{m}_{i}$ is the $i$-th memory row. It was utilised in Neural Turing Machines as part of the Addressing Mechanism. We produce a normalized attention weighting by taking a softmax over these attention alignment scores: $$w_{i} = \frac{\exp\left(\beta e_{i}\right)}{\sum_{j}\exp\left(\beta e_{j}\right)}$$ where $\beta$ is a key strength parameter.
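A minimal sketch of the addressing step (the memory layout and the sharpening parameter `beta` follow the NTM convention; the interface is illustrative):

```python
import numpy as np

def content_attention(key, memory, beta=1.0):
    """Softmax over cosine similarities between `key` and each memory row."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    scores = beta * sims
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = content_attention(np.array([1.0, 0.0]), memory, beta=5.0)
```

Larger `beta` sharpens the weighting toward the most similar memory row; `beta=0` gives a uniform distribution.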
Lipschitz Constant Constraint
Bottleneck Attention Module
Park et al. proposed the bottleneck attention module (BAM), aiming to efficiently improve the representational capability of networks. It uses dilated convolution to enlarge the receptive field of the spatial attention sub-module, and builds a bottleneck structure as suggested by ResNet to save computational cost. For a given input feature map $X \in \mathbb{R}^{C \times H \times W}$, BAM infers the channel attention $s_{c} \in \mathbb{R}^{C}$ and spatial attention $s_{s} \in \mathbb{R}^{H \times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\mathbb{R}^{C \times H \times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. Overall, BAM can be written as \begin{align} s_{c} &= \text{BN}\left(W_{2}\left(W_{1}\text{GAP}\left(X\right)+b_{1}\right)+b_{2}\right) \end{align} \begin{align} s_{s} &= \text{BN}\left(\text{Conv}_{2}^{1 \times 1}\left(\text{DC}_{2}^{3\times 3}\left(\text{DC}_{1}^{3 \times 3}\left(\text{Conv}_{1}^{1 \times 1}\left(X\right)\right)\right)\right)\right) \end{align} \begin{align} s &= \sigma\left(\text{Expand}\left(s_{s}\right)+\text{Expand}\left(s_{c}\right)\right) \end{align} \begin{align} Y &= sX+X \end{align} where $W_{i}$, $b_{i}$ denote weights and biases of fully connected layers respectively, $\text{Conv}_{1}^{1 \times 1}$ and $\text{Conv}_{2}^{1 \times 1}$ are convolution layers used for channel reduction, $\text{DC}_{i}^{3 \times 3}$ denotes a dilated convolution with a $3 \times 3$ kernel, applied to utilize contextual information effectively, and $\text{Expand}$ expands the attention maps $s_{s}$ and $s_{c}$ to $\mathbb{R}^{C \times H \times W}$. BAM can emphasize or suppress features in both spatial and channel dimensions, as well as improving the representational power. Dimensional reduction applied to both channel and spatial attention branches enables it to be integrated with any convolutional neural network with little extra computational cost. However, although dilated convolutions enlarge the receptive field effectively, they still fail to capture long-range contextual information as well as encoding cross-domain relationships.
style-based recalibration module
SRM combines style transfer with an attention mechanism. Its main contribution is style pooling, which utilizes both the mean and standard deviation of the input features to improve its capability to capture global information. It also adopts a lightweight channel-wise fully-connected (CFC) layer, in place of the original fully-connected layer, to reduce the computational requirements. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, SRM first collects global information by using style pooling ($\text{SP}$), which combines global average pooling and global standard deviation pooling. Then a channel-wise fully connected ($\text{CFC}$) layer (i.e. fully connected per channel), batch normalization $\text{BN}$ and sigmoid function $\sigma$ are used to provide the attention vector. Finally, as in an SE block, the input features are multiplied by the attention vector. Overall, an SRM can be written as: \begin{align} s = F_{\text{srm}}\left(X, \theta\right) &= \sigma\left(\text{BN}\left(\text{CFC}\left(\text{SP}\left(X\right)\right)\right)\right) \end{align} \begin{align} Y &= sX \end{align} The SRM block improves both squeeze and excitation modules, yet can be added after each residual unit like an SE block.
Extreme Value Machine
Evolved Sign Momentum
The Lion optimizer was discovered by symbolic program search. It is more memory-efficient than most adaptive optimizers as it only keeps track of the momentum. The update of Lion is produced by the sign function.
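A sketch of one Lion step in NumPy (the hyper-parameter defaults are illustrative; the interpolate-then-sign structure is the part that matters):

```python
import numpy as np

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: step by the sign of an interpolated momentum,
    with decoupled weight decay; only `m` persists between steps."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (update + wd * theta)
    m = beta2 * m + (1 - beta2) * grad
    return theta, m

theta, m = lion_step(np.array(1.0), np.array(2.0), np.array(0.0), lr=0.1)
```

Because the update is always ±1 per coordinate (times the learning rate), Lion needs no second-moment statistics, which is the source of its memory savings over Adam.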
Spatial-Channel Token Distillation
The Spatial-Channel Token Distillation method is proposed to improve the spatial and channel mixing from a novel knowledge distillation (KD) perspective. Specifically, it is a KD mechanism for MLP-like vision models, called Spatial-channel Token Distillation (STD), which improves the information mixing in both the spatial and channel dimensions of MLP blocks. Instead of modifying the mixing operations themselves, STD adds spatial and channel tokens to image patches. After forward propagation, the tokens are concatenated for distillation, with the teachers' responses as targets. Each token works as an aggregator of its dimension; their objective is to encourage each mixing operation to extract maximal task-related information from its specific dimension.
COLA is a self-supervised pre-training approach for learning a general-purpose representation of audio. It is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings.
Temporal Adaptive Module
TAM is designed to capture complex temporal relationships both efficiently and flexibly. It adopts an adaptive kernel instead of self-attention to capture global contextual information, with lower time complexity than GLTR. TAM has two branches, a local branch and a global branch. Given the input feature map $X \in \mathbb{R}^{C \times T \times H \times W}$, global spatial average pooling $\text{GAP}$ is first applied to the feature map to ensure TAM has a low computational cost. Then the local branch in TAM employs several 1D convolutions with ReLU nonlinearity across the temporal domain to produce location-sensitive importance maps for enhancing frame-wise features. The local branch can be written as \begin{align} s &= \sigma\left(\text{Conv1D}\left(\delta\left(\text{Conv1D}\left(\text{GAP}\left(X\right)\right)\right)\right)\right) \end{align} \begin{align} X^{1} &= sX \end{align} Unlike the local branch, the global branch is location invariant and focuses on generating a channel-wise adaptive kernel based on global temporal information in each channel. For the $c$-th channel, the kernel can be written as \begin{align} \Theta_{c} = \text{Softmax}\left(\text{FC}_{2}\left(\delta\left(\text{FC}_{1}\left(\text{GAP}\left(X\right)_{c}\right)\right)\right)\right) \end{align} where $\Theta_{c} \in \mathbb{R}^{K}$ and $K$ is the adaptive kernel size. Finally, TAM convolves the adaptive kernel $\Theta$ with $X^{1}$: \begin{align} Y = \Theta \otimes X^{1} \end{align} With the help of the local branch and global branch, TAM can capture the complex temporal structures in video and enhance per-frame features at low computational cost. Due to its flexibility and lightweight design, TAM can be added to any existing 2D CNNs.
Kolen-Pollack Learning
Hou et al. proposed coordinate attention, a novel attention mechanism which embeds positional information into channel attention, so that the network can focus on large important regions at little computational cost. The coordinate attention mechanism has two consecutive steps: coordinate information embedding and coordinate attention generation. First, two spatial extents of pooling kernels encode each channel horizontally and vertically. In the second step, a shared convolutional transformation function is applied to the concatenated outputs of the two pooling layers. Then coordinate attention splits the resulting tensor into two separate tensors to yield attention vectors with the same number of channels along the horizontal and vertical coordinates of the input. This can be written as \begin{align} z^{h} &= \text{GAP}^{h}\left(X\right) \end{align} \begin{align} z^{w} &= \text{GAP}^{w}\left(X\right) \end{align} \begin{align} f &= \delta\left(\text{BN}\left(\text{Conv}_{1}^{1\times 1}\left(\left[z^{h};z^{w}\right]\right)\right)\right) \end{align} \begin{align} f^{h}, f^{w} &= \text{Split}\left(f\right) \end{align} \begin{align} s^{h} &= \sigma\left(\text{Conv}_{h}^{1\times 1}\left(f^{h}\right)\right) \end{align} \begin{align} s^{w} &= \sigma\left(\text{Conv}_{w}^{1\times 1}\left(f^{w}\right)\right) \end{align} \begin{align} Y &= Xs^{h}s^{w} \end{align} where $\text{GAP}^{h}$ and $\text{GAP}^{w}$ denote pooling functions for vertical and horizontal coordinates, and $s^{h}$ and $s^{w}$ represent the corresponding attention weights. Using coordinate attention, the network can accurately obtain the position of a targeted object. This approach has a larger receptive field than BAM and CBAM. Like an SE block, it also models cross-channel relationships, effectively enhancing the expressive power of the learned features. Due to its lightweight design and flexibility, it can be easily used in classical building blocks of mobile networks.
Context Optimization
CoOp, or Context Optimization, is an automated prompt engineering method that avoids manual prompt tuning by modeling context words with continuous vectors that are end-to-end learned from data. The context could be shared among all classes or designed to be class-specific. During training, we simply minimize the prediction error using the cross-entropy loss with respect to the learnable context vectors, while keeping the pre-trained parameters fixed. The gradients can be back-propagated all the way through the text encoder, distilling the rich knowledge encoded in the parameters for learning task-relevant context.
Sigmoid Linear Unit
Sigmoid Linear Units, or SiLUs, are activation functions for neural networks. The activation of the SiLU is computed by the sigmoid function multiplied by its input, or $$\text{SiLU}\left(x\right) = x\sigma\left(x\right)$$ See Gaussian Error Linear Units (GELUs), where the SiLU was originally coined, and Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning and Swish: a Self-Gated Activation Function, where the SiLU was experimented with later.
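Numerically, the definition reduces to a one-line function:

```python
import math

def silu(x):
    """SiLU / Swish-1: x * sigmoid(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))
```

For large positive inputs it approaches the identity, for large negative inputs it approaches 0, and unlike ReLU it is smooth and non-monotonic near the origin.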
MoCo v2 is an improved version of the Momentum Contrast self-supervised learning algorithm. Motivated by the findings presented in the SimCLR paper, authors: - Replace the 1-layer fully connected layer with a 2-layer MLP head with ReLU for the unsupervised training stage. - Include blur augmentation. - Use cosine learning rate schedule. These modifications enable MoCo to outperform the state-of-the-art SimCLR with a smaller batch size and fewer epochs.
Spatial-Reduction Attention, or SRA, is a multi-head attention module used in the Pyramid Vision Transformer architecture which reduces the spatial scale of the key $K$ and value $V$ before the attention operation. This reduces the computational/memory overhead. Details of the SRA in Stage $i$ can be formulated as follows: $$\text{SRA}\left(Q, K, V\right) = \text{Concat}\left(\text{head}_{0}, \ldots, \text{head}_{N_{i}}\right)W^{O}$$ $$\text{head}_{j} = \text{Attention}\left(QW_{j}^{Q}, \text{SR}\left(K\right)W_{j}^{K}, \text{SR}\left(V\right)W_{j}^{V}\right)$$ where $\text{Concat}\left(\cdot\right)$ is the concatenation operation. $W_{j}^{Q} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, $W_{j}^{K} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, $W_{j}^{V} \in \mathbb{R}^{C_{i} \times d_{\text{head}}}$, and $W^{O} \in \mathbb{R}^{C_{i} \times C_{i}}$ are linear projection parameters. $N_{i}$ is the head number of the attention layer in Stage $i$. Therefore, the dimension of each head (i.e. $d_{\text{head}}$) is equal to $\frac{C_{i}}{N_{i}}$. $\text{SR}\left(\cdot\right)$ is the operation for reducing the spatial dimension of the input sequence ($K$ or $V$), which is written as: $$\text{SR}\left(\mathbf{x}\right) = \text{Norm}\left(\text{Reshape}\left(\mathbf{x}, R_{i}\right)W^{S}\right)$$ Here, $\mathbf{x} \in \mathbb{R}^{\left(H_{i}W_{i}\right) \times C_{i}}$ represents an input sequence, and $R_{i}$ denotes the reduction ratio of the attention layers in Stage $i$. $\text{Reshape}\left(\mathbf{x}, R_{i}\right)$ is an operation of reshaping the input sequence $\mathbf{x}$ to a sequence of size $\frac{H_{i}W_{i}}{R_{i}^{2}} \times \left(R_{i}^{2}C_{i}\right)$. $W^{S} \in \mathbb{R}^{\left(R_{i}^{2}C_{i}\right) \times C_{i}}$ is a linear projection that reduces the dimension of the input sequence to $C_{i}$. $\text{Norm}\left(\cdot\right)$ refers to layer normalization.
Zoneout is a method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks.
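A sketch of one zoneout step on a hidden-state vector (the test-time behavior follows the paper's expectation-based convention; the interface is illustrative):

```python
import numpy as np

def zoneout(h_prev, h_new, rate, rng, training=True):
    """Each unit keeps its previous value with probability `rate`;
    at test time, use the expected mixture of old and new states."""
    if training:
        keep_prev = rng.random(h_new.shape) < rate
        return np.where(keep_prev, h_prev, h_new)
    return rate * h_prev + (1 - rate) * h_new
```

Note the contrast with dropout: a "zoned out" unit copies its previous state forward rather than being zeroed, so gradients can still flow through the identity connection.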
SpatialDropout is a type of dropout for convolutional networks. For a given convolution feature tensor of size $n_{\text{feats}} \times \text{height} \times \text{width}$, we perform only $n_{\text{feats}}$ dropout trials and extend the dropout value across the entire feature map. Therefore, adjacent pixels in the dropped-out feature map are either all 0 (dropped-out) or all active, as illustrated in the figure to the right.
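A sketch for a single (C, H, W) feature tensor (the inverted-dropout scaling of survivors is an assumption here, matching common framework behavior):

```python
import numpy as np

def spatial_dropout(x, rate, rng, training=True):
    """Drop whole feature maps: one Bernoulli trial per channel,
    extended across the entire H x W map; survivors scaled by 1/(1-rate)."""
    if not training or rate == 0.0:
        return x
    keep = rng.random((x.shape[0], 1, 1)) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)

out = spatial_dropout(np.ones((4, 3, 3)), rate=0.5, rng=np.random.default_rng(1))
```

Because the mask is broadcast over the spatial axes, every pixel within a channel shares the same fate, which is exactly the all-zero-or-all-active behavior described above.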