Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

8,725 machine learning methods and techniques


RFE

Rank Flow Embedding

Computer Vision · Introduced 2000 · 13 papers

1cycle

1cycle learning rate scheduling policy

General · Introduced 2000 · 13 papers
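A minimal sketch of a 1cycle-style schedule (the function name, phase fractions, and divisors are illustrative assumptions, not the canonical implementation): ramp linearly from a low rate to a peak, cool back down, then anneal well below the starting rate.

```python
def one_cycle_lr(step, total_steps, lr_max=0.1, div=10.0, final_div=100.0, pct_cycle=0.9):
    """1cycle sketch: lr_max/div -> lr_max -> lr_max/div over the main cycle,
    then annihilate down to lr_max/(div*final_div)."""
    lr_min = lr_max / div
    cycle = int(total_steps * pct_cycle)
    half = cycle // 2
    if step < half:                       # warmup leg
        return lr_min + (lr_max - lr_min) * step / half
    if step < cycle:                      # cooldown leg
        return lr_max - (lr_max - lr_min) * (step - half) / (cycle - half)
    # annihilation phase: decay far below the initial rate
    frac = (step - cycle) / max(1, total_steps - cycle)
    return lr_min - (lr_min - lr_max / (div * final_div)) * frac

schedule = [one_cycle_lr(s, 1000) for s in range(1000)]
```

In the original policy, momentum is cycled inversely to the learning rate over the same phases.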

Inception-B

Inception-B is an image model block used in the Inception-v4 architecture.

Computer Vision · Introduced 2000 · 13 papers

AutoParsimony

Automatic Search for Parsimonious Models

The principle of parsimony, also known as Occam's razor, expresses a preference for the simplest explanation that yields optimal results when several options are available. In other words, the principle favours "the assumption that is both the simplest and contains all the necessary information required to comprehend the experiment at hand." It applies to many everyday scenarios, including predictions made by Data Science models.

It is widely recognized that a less complex model will produce more stable predictions, exhibit greater resilience to noise and disturbances, and be easier to maintain and analyse. Reducing the number of features can also cut costs by requiring fewer sensors, lowering energy consumption, minimizing information acquisition costs, reducing maintenance requirements, and mitigating the need to retrain models when features fluctuate due to noise, outliers, data drift, etc.

The concurrent optimization of hyperparameters (HO) and feature selection (FS) to achieve Parsimonious Model Selection (PMS) is an ongoing area of active research. Nonetheless, selecting appropriate hyperparameters and feature subsets is a challenging combinatorial problem, frequently requiring efficient heuristic methods.

General · Introduced 2000 · 13 papers

Routing Attention

Routing Attention is an attention pattern proposed as part of the Routing Transformer architecture. Each attention module considers a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other words, the current timestep's query is routed to a limited number of context elements through its cluster assignment. This can be contrasted with strided attention patterns and those proposed with the Sparse Transformer. In the figure, the rows represent the outputs while the columns represent the inputs; the different colors represent cluster memberships for the output token.

Natural Language Processing · Introduced 2000 · 13 papers

Inception-C

Inception-C is an image model block used in the Inception-v4 architecture.

Computer Vision · Introduced 2000 · 13 papers

MDETR

MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. It utilizes a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. The network is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation.

Computer Vision · Introduced 2000 · 13 papers

ResNeSt

A ResNeSt is a variant of ResNet that stacks Split-Attention blocks in place of standard residual blocks. The cardinal group representations are concatenated along the channel dimension: $V = \mathrm{Concat}\{V^{1}, V^{2}, \dots, V^{K}\}$. As in standard residual blocks, the final output $Y$ of the Split-Attention block is produced using a shortcut connection: $Y = V + X$, if the input and output feature maps share the same shape. For blocks with a stride, an appropriate transformation $\mathcal{T}$ is applied to the shortcut connection to align the output shapes: $Y = V + \mathcal{T}(X)$. For example, $\mathcal{T}$ can be a strided convolution or a combined convolution-with-pooling.

Computer Vision · Introduced 2000 · 13 papers

Reduction-B

Reduction-B is an image model block used in the Inception-v4 architecture.

Computer Vision · Introduced 2000 · 13 papers

Latent Optimisation

Latent Optimisation is a technique used in generative adversarial networks to refine sample quality. Specifically, it exploits knowledge from the discriminator to refine the latent source $z$. Intuitively, the gradient $\nabla_{z} f(z)$ points in the direction that better satisfies the discriminator, which implies better samples. Therefore, instead of using the randomly sampled $z$, we use the optimised latent $z' = z + \alpha \frac{\partial f(z)}{\partial z}$. Source: LOGAN.

General · Introduced 2000 · 13 papers
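A toy sketch of the update above (the function name and the quadratic stand-in for the discriminator score are illustrative assumptions): take small gradient ascent steps on $z$ against a score that is maximised at $z = 1$.

```python
def latent_step(z, f_grad, alpha=0.9):
    """One latent-optimisation step: nudge z along the score's gradient."""
    return z + alpha * f_grad(z)

# Toy "discriminator score" f(z) = -(z - 1)^2, maximised at z = 1.
f_grad = lambda z: -2.0 * (z - 1.0)

z = 0.0
for _ in range(20):
    z = latent_step(z, f_grad, alpha=0.1)
```

After a few steps the latent converges toward the score's maximiser; in LOGAN the same idea is applied per sample before the generator forward pass.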

Inception-v4

Inception-v4 is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than Inception-v3.

Computer Vision · Introduced 2000 · 13 papers

CoVe

Contextual Word Vectors

CoVe, or Contextualized Word Vectors, uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors. Word embeddings are therefore a function of the entire input sequence. These word embeddings can then be used in downstream tasks by concatenating them with static embeddings, e.g. $\tilde{w} = [\mathrm{GloVe}(w); \mathrm{CoVe}(w)]$, and feeding these in as features for the task-specific models.

Natural Language Processing · Introduced 2000 · 13 papers


SwiGLU

SwiGLU is an activation function which is a variant of GLU. The definition is as follows: $\mathrm{SwiGLU}(x, W, V, b, c) = \mathrm{Swish}_{\beta}(xW + b) \otimes (xV + c)$, where $\mathrm{Swish}_{\beta}(x) = x \, \sigma(\beta x)$ and $\otimes$ denotes elementwise multiplication.

General · Introduced 2000 · 13 papers

MPN

Matrix-power Normalization

General · Introduced 2000 · 13 papers

Random Scaling

Random Scaling is a type of image data augmentation in which we randomly change the scale of the image within a specified range. The Albumentations library provides a generalization of random scaling called Affine: an Affine transform can randomly scale like RandomScale, but may also randomly rotate, translate, and shear the image.

Computer Vision · Introduced 2000 · 13 papers
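A dependency-free sketch of the idea (names and the nearest-neighbour resampling are illustrative assumptions; libraries like Albumentations use proper interpolation):

```python
import random

def random_scale(img, scale_range=(0.75, 1.25), rng=None):
    """Randomly rescale a 2D image (list of rows) by nearest-neighbour sampling."""
    rng = rng or random.Random(0)
    s = rng.uniform(*scale_range)
    h, w = len(img), len(img[0])
    nh, nw = max(1, round(h * s)), max(1, round(w * s))
    # Map each output pixel back to its nearest source pixel.
    return [[img[min(h - 1, int(r / s))][min(w - 1, int(c / s))]
             for c in range(nw)] for r in range(nh)]

img = [[r * 4 + c for c in range(4)] for r in range(4)]
scaled = random_scale(img)
```

The output size varies with the sampled scale, which is why such augmentation is usually followed by a crop or pad back to a fixed size.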

RepVGG

RepVGG is a VGG-style convolutional architecture. It has the following advantages:

- The model has a VGG-like plain (a.k.a. feed-forward) topology without any branches, i.e., every layer takes the output of its only preceding layer as input and feeds its output into its only following layer.
- The model's body uses only 3 × 3 convolutions and ReLU.
- The concrete architecture (including the specific depth and layer widths) is instantiated with no automatic search, manual refinement, compound scaling, or other heavy designs.

Computer Vision · Introduced 2000 · 13 papers

E-Branchformer

E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition

Natural Language Processing · Introduced 2000 · 13 papers

Characteristic Functions

Characteristic Function Estimation for Discrete Probability Distributions

General · Introduced 2000 · 12 papers

SNGAN

Spectrally Normalised GAN

SNGAN, or Spectrally Normalised GAN, is a type of generative adversarial network that uses spectral normalization, a type of weight normalization, to stabilise the training of the discriminator.

Computer Vision · Introduced 2000 · 12 papers
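A pure-Python power-iteration sketch of the spectral normalization step (names are illustrative; real implementations keep a persistent left-singular vector between updates and run a single iteration per training step):

```python
import math

def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of W by power iteration,
    then return W / sigma so its spectral norm is ~1."""
    rows, cols = len(W), len(W[0])
    u = [1.0] * rows
    for _ in range(n_iter):
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # W^T u
        nv = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / nv for x in v]
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # W v
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]
    sigma = sum(u[i] * sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows))
    return [[W[i][j] / sigma for j in range(cols)] for i in range(rows)]

W_sn = spectral_norm([[3.0, 0.0], [0.0, 1.0]])
```

Dividing the discriminator's weight matrices by their largest singular value bounds the Lipschitz constant of each layer, which is what stabilises GAN training here.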

Forward gradient

Forward gradients are unbiased estimators of the gradient of a function $f$, given by $g(\theta) = (\nabla f(\theta) \cdot v)\, v$. Here $v$ is a random vector, which must satisfy $\mathbb{E}[v] = 0$ and $\mathbb{E}[v v^{\top}] = I$ (e.g. i.i.d. components with zero mean and unit variance) in order for $g(\theta)$ to be an unbiased estimator of $\nabla f(\theta)$ for all $\theta$. Forward gradients can be computed with a single jvp (Jacobian-vector product), which enables the use of the forward mode of autodifferentiation instead of the usual reverse mode, which has worse computational characteristics.

General · Introduced 2000 · 12 papers
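The unbiasedness can be checked empirically with a pure-Python sketch (names are illustrative; the closed-form gradient of a toy $f$ stands in here for a true forward-mode jvp, which would never materialise $\nabla f$):

```python
import random

def forward_grad(grad_f, theta, rng):
    """One forward-gradient sample: g = (grad_f(theta) . v) v with v ~ N(0, I)."""
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    g = grad_f(theta)
    jvp = sum(gi * vi for gi, vi in zip(g, v))  # directional derivative, normally one jvp call
    return [jvp * vi for vi in v]

# f(theta) = theta0^2 + 3*theta1 has gradient (2*theta0, 3); averaging recovers it.
grad_f = lambda th: [2.0 * th[0], 3.0]
rng = random.Random(0)
theta = [1.0, 2.0]
n = 20000
avg = [0.0, 0.0]
for _ in range(n):
    g = forward_grad(grad_f, theta, rng)
    avg = [a + x / n for a, x in zip(avg, g)]
```

The estimator is unbiased but has nonzero variance, so it is used with small step sizes or averaged over samples.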

CGNN

Crystal Graph Neural Network

The full architecture of CGNN is presented at CGNN's official site.

Graphs · Introduced 2000 · 12 papers

LayerScale

LayerScale is a method used in vision transformer architectures to help improve training dynamics. It adds a learnable diagonal matrix on the output of each residual block, initialized close to (but not at) 0. Adding this simple layer after each residual block improves the training dynamic, allowing the training of deeper high-capacity image transformers that benefit from depth. Specifically, LayerScale is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar; see Figure (d). The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale multiplies the output of each residual block by a diagonal matrix:

$x'_{l} = x_{l} + \mathrm{diag}(\lambda_{l,1}, \dots, \lambda_{l,d}) \times \mathrm{SA}(\eta(x_{l}))$
$x_{l+1} = x'_{l} + \mathrm{diag}(\lambda'_{l,1}, \dots, \lambda'_{l,d}) \times \mathrm{FFN}(\eta(x'_{l}))$

where the parameters $\lambda_{l,i}$ and $\lambda'_{l,i}$ are learnable weights. The diagonal values are all initialized to a fixed small value $\varepsilon$: $\varepsilon = 0.1$ until depth 18, $\varepsilon = 10^{-5}$ for depth 24, and $\varepsilon = 10^{-6}$ for deeper networks. This formula is akin to other normalization strategies such as ActNorm or LayerNorm, but executed on the output of the residual block. Yet LayerScale seeks a different effect: ActNorm is a data-dependent initialization that calibrates activations so that they have zero mean and unit variance, like BatchNorm. In contrast, LayerScale initializes the diagonal with small values so that the initial contribution of the residual branches to the function implemented by the transformer is small. In that respect the motivation is closer to that of ReZero, SkipInit, Fixup and T-Fixup: to train closer to the identity function and let the network integrate the additional parameters progressively during training. LayerScale offers more diversity in the optimization than adjusting the whole layer by a single learnable scalar as in ReZero/SkipInit, Fixup and T-Fixup.

General · Introduced 2000 · 12 papers
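In code, LayerScale is just a learnable per-channel gain on the residual branch; a dependency-free sketch with assumed names:

```python
class LayerScale:
    """Per-channel scaling of a residual branch: x + diag(gamma) * block(x)."""

    def __init__(self, dim, init_eps=0.1):
        # gamma is learnable in practice; initialised small but nonzero.
        self.gamma = [init_eps] * dim

    def residual(self, x, branch_out):
        """Apply the scaled residual update for one token vector."""
        return [xi + g * bi for xi, g, bi in zip(x, self.gamma, branch_out)]

ls = LayerScale(3, init_eps=0.1)
out = ls.residual([1.0, 2.0, 3.0], [10.0, 10.0, 10.0])
```

Because each channel gets its own gain, the optimizer can grow some residual channels faster than others, which a single ReZero-style scalar cannot do.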

MobileBERT

MobileBERT is a type of inverted-bottleneck BERT that compresses and accelerates the popular BERT model. MobileBERT is a thin version of BERT-Large, equipped with bottleneck structures and a carefully designed balance between self-attention and feed-forward networks. To train MobileBERT, a specially designed teacher model is first trained: an inverted-bottleneck BERT-Large model. Knowledge is then transferred from this teacher to MobileBERT by imitating it layer by layer. Like the original BERT, MobileBERT is task-agnostic; that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning.

Natural Language Processing · Introduced 2000 · 12 papers

HGS

Hunger Games Search

Hunger Games Search (HGS) is a general-purpose population-based optimization technique with a simple structure, special stability features and very competitive performance, designed to solve both constrained and unconstrained problems more effectively. HGS is modelled on the hunger-driven activities and behavioural choices of animals. This dynamic, fitness-wise search method follows a simple concept of "hunger" as the most crucial homeostatic motivation and reason for behaviours, decisions, and actions in the life of all animals, making the process of optimization more understandable and consistent for new users and decision-makers. HGS incorporates the concept of hunger into the search process: an adaptive weight based on hunger is designed and employed to simulate the effect of hunger on each search step. It follows the computationally logical rules (games) utilized by almost all animals; these rival activities and games are adaptively evolved, securing higher chances of survival and food acquisition. The method's main features are its dynamic nature, simple structure, and high performance in terms of convergence and quality of solutions, proving more efficient than many current optimization methods. An implementation of the HGS algorithm is available at https://aliasgharheidari.com/HGS.html.

General · Introduced 2000 · 12 papers

MixConv

Mixed Depthwise Convolution

MixConv, or Mixed Depthwise Convolution, is a type of depthwise convolution that naturally mixes up multiple kernel sizes in a single convolution. It is based on the insight that depthwise convolution applies a single kernel size to all channels, which MixConv overcomes by combining the benefits of multiple kernel sizes. It does this by partitioning channels into groups and applying a different kernel size to each group.

Computer Vision · Introduced 2000 · 12 papers
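A 1D sketch of the channel-group idea (names and the fixed box-filter kernels are illustrative assumptions; real MixConv uses learned 2D depthwise kernels of sizes 3, 5, 7, ...):

```python
def depthwise_conv1d(x, k):
    """'Same'-padded 1D convolution of a single channel with kernel k."""
    pad = len(k) // 2
    xp = [0.0] * pad + x + [0.0] * pad
    return [sum(k[j] * xp[i + j] for j in range(len(k))) for i in range(len(x))]

def mixconv1d(channels, kernel_sizes):
    """MixConv sketch: split channels into groups and apply a different
    kernel size per group (here a simple averaging kernel of that size)."""
    n_groups = len(kernel_sizes)
    group = max(1, len(channels) // n_groups)
    out = []
    for g, ks in enumerate(kernel_sizes):
        k = [1.0 / ks] * ks  # box filter standing in for a learned kernel
        for ch in channels[g * group:(g + 1) * group]:
            out.append(depthwise_conv1d(ch, k))
    return out

channels = [[1.0, 2.0, 3.0, 4.0]] * 4
out = mixconv1d(channels, kernel_sizes=(1, 3))
```

The first group keeps a small receptive field while the second sees a wider context, which is the resolution/efficiency trade-off MixConv mixes within one layer.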

GEE

Generative Emotion Estimator

Natural Language Processing · Introduced 2000 · 12 papers

CvT

Convolutional Vision Transformer

The Convolutional vision Transformer (CvT) is an architecture which incorporates convolutions into the Transformer. The CvT design introduces convolutions to two core sections of the ViT architecture. First, the Transformers are partitioned into multiple stages that form a hierarchical structure of Transformers. The beginning of each stage consists of a convolutional token embedding that performs an overlapping convolution operation with stride on a 2D-reshaped token map (i.e., reshaping flattened token sequences back to the spatial grid), followed by layer normalization. This allows the model to not only capture local information, but also progressively decrease the sequence length while simultaneously increasing the dimension of token features across stages, achieving spatial downsampling while concurrently increasing the number of feature maps, as is performed in CNNs. Second, the linear projection prior to every self-attention block in the Transformer module is replaced with a proposed convolutional projection, which employs an s × s depth-wise separable convolution operation on a 2D-reshaped token map. This allows the model to further capture local spatial context and reduce semantic ambiguity in the attention mechanism. It also permits management of computational complexity, as the stride of the convolution can be used to subsample the key and value matrices to improve efficiency by 4× or more, with minimal degradation of performance.

Computer Vision · Introduced 2000 · 12 papers

Spatially Separable Convolution

A Spatially Separable Convolution decomposes a convolution into two separate operations. In regular convolution, if we have a 3 × 3 kernel, we convolve it directly with the image. Alternatively, we can divide a 3 × 3 kernel into a 3 × 1 kernel and a 1 × 3 kernel; in spatially separable convolution, we first convolve with the 3 × 1 kernel and then with the 1 × 3 kernel. This requires 6 parameters instead of 9, so it is more parameter-efficient (and fewer multiplications are required). Image source: Kunlun Bai.

Computer Vision · Introduced 2000 · 12 papers
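The equivalence can be checked numerically with a pure-Python sketch (names are illustrative; the decomposition is exact only when the 3 × 3 kernel is an outer product of a column and a row):

```python
def conv2d_valid(img, k):
    """Valid-mode 2D correlation of img with kernel k (both lists of rows)."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(k[a][b] * img[r + a][c + b] for a in range(kh) for b in range(kw))
             for c in range(ow)] for r in range(oh)]

# A separable 3x3 kernel is the outer product of a 3x1 and a 1x3 kernel.
col = [1.0, 2.0, 1.0]                        # 3x1 (vertical smoothing)
row = [1.0, 0.0, -1.0]                       # 1x3 (horizontal gradient)
full = [[c * r for r in row] for c in col]   # 3x3: 9 weights vs 3 + 3 = 6

img = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
direct = conv2d_valid(img, full)
two_step = conv2d_valid(conv2d_valid(img, [[c] for c in col]), [row])
```

This particular column/row pair is the classic Sobel decomposition, a familiar example of a spatially separable kernel.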

Stochastic Dueling Network

A Stochastic Dueling Network, or SDN, is an architecture for learning a value function $Q^{\pi}$. The SDN learns both $Q^{\pi}$ and $V^{\pi}$ off-policy while maintaining consistency between the two estimates. At each time step it outputs a stochastic estimate $\tilde{Q}$ of $Q^{\pi}$ and a deterministic estimate $V$ of $V^{\pi}$.

Reinforcement Learning · Introduced 2000 · 12 papers

Auxiliary Batch Normalization

Auxiliary Batch Normalization is a type of regularization used in adversarial training schemes. The idea is that adversarial examples should have separate batch normalization components from the clean examples, as the two have different underlying statistics.

General · Introduced 2000 · 12 papers

Softsign Activation

Softsign is an activation function for neural networks: $f(x) = \frac{x}{1 + |x|}$. Image source: Sefik Ilkin Serengil.

General · Introduced 2000 · 12 papers
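A one-line sketch of the formula above (function name assumed for illustration):

```python
def softsign(x):
    """Softsign activation: x / (1 + |x|), squashing smoothly into (-1, 1)."""
    return x / (1.0 + abs(x))

# Saturates polynomially rather than exponentially like tanh.
vals = [softsign(x) for x in (-10.0, 0.0, 10.0)]
```

Unlike tanh, its tails approach ±1 polynomially, so gradients vanish more slowly for large inputs.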

Global Context Block

A Global Context Block is an image model block for global context modeling. The aim is to combine the benefits of the simplified non-local block, with its effective modeling of long-range dependencies, and the squeeze-excitation block, with its lightweight computation. The Global Context framework consists of: (a) global attention pooling, which adopts a 1x1 convolution and a softmax function to obtain the attention weights, then performs attention pooling to obtain the global context features; (b) feature transform via a 1x1 convolution; (c) feature aggregation, which employs addition to aggregate the global context features onto the features of each position. Taken as a whole, the GC block is a lightweight way to achieve global context modeling.

General · Introduced 2000 · 12 papers

Levenshtein Transformer

The Levenshtein Transformer (LevT) is a type of transformer that aims to address the lack of flexibility of previous decoding models. Notably, in previous frameworks, the length of generated sequences is either fixed or monotonically increased as decoding proceeds. The authors argue this is incompatible with human-level intelligence, where humans can revise, replace, revoke or delete any part of their generated text. Hence, LevT bridges this gap by breaking the hitherto standardized decoding mechanism and replacing it with two basic operations: insertion and deletion. LevT is trained using imitation learning. The resulting model contains two policies, executed in an alternating manner. The authors argue that with this model decoding becomes more flexible. For example, when the decoder is given an empty token, it falls back to a normal sequence generation model. On the other hand, the decoder acts as a refinement model when the initial state is a low-quality generated sequence. One crucial component in the LevT framework is the learning algorithm. The authors leverage the characteristics of insertion and deletion: they are complementary but also adversarial. The algorithm they propose is called "dual policy learning". The idea is that when training one policy (insertion or deletion), the output of its adversary at the previous iteration is used as input. An expert policy, meanwhile, provides a correction signal.

Natural Language Processing · Introduced 2000 · 12 papers

ACER

ACER, or Actor Critic with Experience Replay, is an actor-critic deep reinforcement learning agent with experience replay. It can be seen as an off-policy extension of A3C, where the off-policy estimator is made feasible by:

- Using Retrace Q-value estimation.
- Using truncated importance sampling with bias correction.
- Using a trust region policy optimization method.
- Using a stochastic dueling network architecture.

Reinforcement Learning · Introduced 2000 · 12 papers

FBNet

FBNet is a family of convolutional neural architectures discovered through DNAS neural architecture search. It utilises a basic type of image model block inspired by MobileNetV2 that employs depthwise convolutions and an inverted residual structure (see components).

Computer Vision · Introduced 2000 · 12 papers

SepFormer

SepFormer is a Transformer-based neural network for speech separation. The SepFormer learns short- and long-term dependencies with a multi-scale approach that employs transformers. It is mainly composed of multi-head attention and feed-forward layers. A dual-path framework (introduced by DPRNN) is adopted, and the RNNs are replaced with a multiscale pipeline composed of transformers that learn both short- and long-term dependencies. The dual-path framework mitigates the quadratic complexity of transformers, as the transformers in the dual-path framework process smaller chunks. The model is based on the learned-domain masking approach and employs an encoder, a decoder, and a masking network, as shown in the figure. The encoder is fully convolutional, while the masking network employs two Transformers embedded inside the dual-path processing block. The decoder finally reconstructs the separated signals in the time domain by using the masks predicted by the masking network.

Audio · Introduced 2000 · 12 papers

ReLIC

ReLIC, or Representation Learning via Invariant Causal Mechanisms, is a self-supervised learning objective that enforces invariant prediction of proxy targets across augmentations through an invariance regularizer, which yields improved generalization guarantees. The objective combines a proxy task loss with a Kullback-Leibler (KL) divergence term that penalizes changes in the prediction distribution across augmentations; any distance measure on distributions can be used in place of the KL divergence. Concretely, the proxy task associates to every datapoint its own index as label, which corresponds to the instance discrimination task commonly used in contrastive learning. Pairs of points are used to compute similarity scores, and pairs of augmentations perform a style intervention. Data points are encoded with a neural network; the second view can be encoded either with the same network or with a related target network whose weights are an exponential moving average of the encoder's weights. Representations are compared through a critic, a fully-connected neural network, with similarities passed through a softmax with a temperature parameter. Combining these pieces, representations are learned by minimizing the objective over the full set of data and augmentations, with hyperparameters controlling the number of points used to construct the contrast set and the weighting of the invariance penalty. The figure shows a schematic of the ReLIC objective.

General · Introduced 2000 · 12 papers

SCST

Self-critical Sequence Training

Reinforcement Learning · Introduced 2000 · 12 papers

Style Transfer Module

Modules used in GAN-based style transfer.

Computer Vision · Introduced 2000 · 12 papers

CMCL

Crossmodal Contrastive Learning

CMCL, or Crossmodal Contrastive Learning, is a method for unifying visual and textual representations into the same semantic space based on a large-scale corpus of image collections, text corpora and image-text pairs. CMCL aligns the visual and textual representations and unifies them into the same semantic space based on image-text pairs. As shown in the figure, to facilitate different levels of semantic alignment between vision and language, a series of text rewriting techniques are utilized to improve the diversity of cross-modal information. Specifically, for an image-text pair, various positive examples and hard negative examples can be obtained by rewriting the original caption at different levels. Moreover, to incorporate more background information from the single-modal data, text and image retrieval are also applied to augment each image-text pair with various related texts and images. The positive pairs, negative pairs, and related images and texts are learned jointly by CMCL. In this way, the model can effectively unify different levels of visual and textual representations into the same semantic space, and incorporate more single-modal knowledge so that the modalities enhance each other.

General · Introduced 2000 · 12 papers

FBNet Block

FBNet Block is an image model block used in the FBNet architectures discovered through DNAS neural architecture search. The basic building blocks employed are depthwise convolutions and a residual connection.

General · Introduced 2000 · 12 papers

Florence

Florence is a computer vision foundation model that aims to learn universal visual-language representations which can be adapted to various computer vision tasks: visual question answering, image captioning, and video retrieval, among others. Florence's workflow consists of data curation, unified learning, Transformer architectures and adaptation. Florence is pre-trained in an image-label-description space using unified image-text contrastive learning. It involves a two-tower architecture: a 12-layer Transformer for the language encoder and a Vision Transformer for the image encoder. Two linear projection layers are added on top of the image and language encoders to match the dimensions of image and language features. Compared to previous methods for cross-modal shared representations, Florence expands beyond simple classification and retrieval capabilities to advanced representations that support object-level understanding, multiple modalities, and video.

Computer Vision · Introduced 2000 · 12 papers

RDF2Vec

Graphs · Introduced 2000 · 12 papers

TNT

Transformer in Transformer

The Transformer is a type of self-attention-based neural network originally applied to NLP tasks. Recently, pure transformer-based models have been proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches while ignoring the intrinsic structure information inside each patch. The Transformer-iN-Transformer (TNT) model addresses this by modeling both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embeddings by a linear transformation layer and then added into the patch. By stacking TNT blocks, the TNT model is built for image recognition. Image source: Han et al.

Computer Vision · Introduced 2000 · 12 papers

MixNet

MixNet is a type of convolutional neural network discovered via AutoML that utilises MixConvs instead of regular depthwise convolutions.

Computer Vision · Introduced 2000 · 12 papers

Weight Standardization

Weight Standardization is a normalization technique that smooths the loss landscape by standardizing the weights in convolutional layers. Different from previous normalization methods that focus on activations, WS considers the smoothing effects of weights, beyond just length-direction decoupling. Theoretically, WS reduces the Lipschitz constants of the loss and the gradients; hence, WS smooths the loss landscape and improves training. In Weight Standardization, instead of directly optimizing the loss $\mathcal{L}$ on the original weights $W$, we reparameterize the weights as a function of $W$, i.e. $\hat{W} = \mathrm{WS}(W)$, and optimize the loss on $\hat{W}$ by SGD:

$\hat{W}_{i,j} = \frac{W_{i,j} - \mu_{W_{i,\cdot}}}{\sigma_{W_{i,\cdot}}}$

where $\mu_{W_{i,\cdot}}$ and $\sigma_{W_{i,\cdot}}$ are the mean and standard deviation of the weights of output channel $i$. Similar to Batch Normalization, WS controls the first and second moments of the weights of each output channel individually in convolutional layers. Note that many initialization methods also initialize the weights in similar ways. Different from those methods, WS standardizes the weights in a differentiable way, which aims to normalize gradients during back-propagation. Note that there is no affine transformation on $\hat{W}$: this is because it is assumed that normalization layers such as BN or GN will normalize this convolutional layer again.

General · Introduced 2000 · 12 papers
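A pure-Python sketch of the standardization step (names are illustrative; each row stands for one output channel's flattened weights, and eps guards against zero variance):

```python
import math

def weight_standardize(W, eps=1e-5):
    """Standardize each output channel (row) of W to zero mean, unit variance."""
    out = []
    for row in W:
        n = len(row)
        mu = sum(row) / n
        var = sum((w - mu) ** 2 for w in row) / n
        out.append([(w - mu) / math.sqrt(var + eps) for w in row])
    return out

Ws = weight_standardize([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
```

In a framework this runs inside the forward pass, so gradients flow through the mean and variance, unlike a one-off standardized initialization.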

Network Dissection

Network Dissection is an interpretability method for CNNs that evaluates the alignment between individual hidden units and a set of visual semantic concepts. By identifying the best alignments, units are given human-interpretable labels across a range of objects, parts, scenes, textures, materials, and colors. The measurement of interpretability proceeds in three steps:

- Identify a broad set of human-labeled visual concepts.
- Gather the response of the hidden variables to known concepts.
- Quantify alignment of hidden variable-concept pairs.

General · Introduced 2000 · 11 papers

FLAVA

FLAVA aims at building a single holistic universal model that targets all modalities at once. FLAVA is a language-vision alignment model that learns strong representations from multimodal data (image-text pairs) and unimodal data (unpaired images and text). The model consists of an image encoder transformer to capture unimodal image representations, a text encoder transformer to process unimodal text information, and a multimodal encoder transformer that takes as input the encoded unimodal image and text and integrates their representations for multimodal reasoning. During pretraining, masked image modeling (MIM) and masked language modeling (MLM) losses are applied to the image and text encoders over a single image or a text piece, respectively, while contrastive, masked multimodal modeling (MMM), and image-text matching (ITM) losses are used over paired image-text data. For downstream tasks, classification heads are applied on the outputs from the image, text, and multimodal encoders respectively for visual recognition, language understanding, and multimodal reasoning tasks. It can be applied to a broad scope of tasks from three domains (visual recognition, language understanding, and multimodal reasoning) under a common transformer model architecture.

Computer Vision · Introduced 2000 · 11 papers

Bi-attention

Bilinear Attention

Bi-attention employs the attention-in-attention (AiA) mechanism to capture second-order statistical information: the outer point-wise channel attention vectors are computed from the output of the inner channel attention.

General · Introduced 2000 · 11 papers
Page 13 of 175