8,725 machine learning methods and techniques
Self-Attention Guidance
A Neural Turing Machine is a working memory neural network model. It couples a neural network architecture with external memory resources. The whole architecture is differentiable end-to-end with gradient descent. The models can infer tasks such as copying, sorting and associative recall. A Neural Turing Machine (NTM) architecture contains two basic components: a neural network controller and a memory bank. The figure presents a high-level diagram of the NTM architecture. Like most neural networks, the controller interacts with the external world via input and output vectors. Unlike a standard network, it also interacts with a memory matrix using selective read and write operations. By analogy to the Turing machine we refer to the network outputs that parameterise these operations as “heads.” Every component of the architecture is differentiable. This is achieved by defining 'blurry' read and write operations that interact to a greater or lesser degree with all the elements in memory (rather than addressing a single element, as in a normal Turing machine or digital computer). The degree of blurriness is determined by an attentional “focus” mechanism that constrains each read and write operation to interact with a small portion of the memory, while ignoring the rest. Because interaction with the memory is highly sparse, the NTM is biased towards storing data without interference. The memory location brought into attentional focus is determined by specialised outputs emitted by the heads. These outputs define a normalised weighting over the rows in the memory matrix (referred to as memory “locations”). Each weighting, one per read or write head, defines the degree to which the head reads or writes at each location. A head can thereby attend sharply to the memory at a single location or weakly to the memory at many locations.
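A minimal NumPy sketch (not the authors' code) of the content-based addressing and "blurry" read/write described above; the shapes, key sharpness `beta`, and toy values are illustrative assumptions.

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Content-based addressing: softmax over cosine similarities between the
    head's key vector and each memory row, sharpened by beta.
    memory: (N, M) matrix of N locations, key: (M,), beta: scalar >= 0."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    return w / w.sum()

def blurry_read(memory, weights):
    """'Blurry' read: a weighted sum over all memory rows."""
    return weights @ memory

def blurry_write(memory, weights, erase, add):
    """'Blurry' write: each row is erased and added to in proportion to its weight."""
    memory = memory * (1 - np.outer(weights, erase))
    return memory + np.outer(weights, add)

# toy usage
M = np.random.randn(8, 4)                                  # 8 locations, 4-dim contents
w = content_addressing(M, key=np.array([1., 0., 0., 0.]), beta=5.0)
r = blurry_read(M, w)                                      # read vector
M = blurry_write(M, w, erase=np.full(4, 0.5), add=np.array([0., 1., 0., 0.]))
```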
ShuffleNet V2 Block is an image model block used in the ShuffleNet V2 architecture, where speed is the metric optimized for (instead of indirect ones like FLOPs). It utilizes a simple operator called channel split. At the beginning of each unit, the input of $c$ feature channels is split into two branches with $c'$ and $c - c'$ channels, respectively. Following G3, one branch remains as identity. The other branch consists of three convolutions with the same input and output channels to satisfy G1. The two $1 \times 1$ convolutions are no longer group-wise, unlike the original ShuffleNet. This is partially to follow G2, and partially because the split operation already produces two groups. After convolution, the two branches are concatenated, so the number of channels stays the same (G1). The same “channel shuffle” operation as in ShuffleNet is then used to enable information communication between the two branches. The motivation behind channel split is that alternative architectures, where pointwise group convolutions and bottleneck structures are used, lead to increased memory access cost. Additionally, network fragmentation from group convolutions reduces parallelism (less friendly for GPU), and the element-wise addition operation, while low in FLOPs, has high memory access cost. Channel split is an alternative where we can maintain a large number of equally wide channels (equal width minimizes memory access cost) without dense convolutions or too many groups.
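A minimal PyTorch sketch of a stride-1 unit following the description above; the half/half split, layer widths, and the final shape check are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class ShuffleV2Block(nn.Module):
    """Channel split -> identity branch + (1x1 conv, 3x3 depthwise conv, 1x1 conv)
    branch -> concat -> channel shuffle. `channels` is assumed even."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                     # channel split
        out = torch.cat([x1, self.branch(x2)], dim=1)  # concat keeps channel count
        n, c, h, w = out.shape                         # channel shuffle with 2 groups
        return out.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)

x = torch.randn(1, 64, 32, 32)
print(ShuffleV2Block(64)(x).shape)                     # torch.Size([1, 64, 32, 32])
```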
Neighborhood Attention is a restricted self-attention pattern in which each token's receptive field is limited to its nearest neighboring pixels. It was proposed in the Neighborhood Attention Transformer as an alternative to other local attention mechanisms used in Hierarchical Vision Transformers. NA is similar in concept to stand-alone self-attention (SASA), in that both can be implemented with a raster-scan sliding-window operation over the key-value pairs. However, NA handles corner pixels differently, which helps maintain a fixed receptive field size and increases the number of relative positions. The primary challenge in experimenting with both NA and SASA has been computation: simply extracting key-value pairs for each query is slow, takes up a large amount of memory, and is eventually intractable at scale. NA was therefore implemented through a new CUDA extension to PyTorch, NATTEN.
Slime Mould Algorithm
Slime Mould Algorithm (SMA) is a stochastic optimizer based on the oscillation mode of slime mould in nature. SMA has several new features, with a unique mathematical model that uses adaptive weights to simulate the positive and negative feedback produced by the propagation wave of slime mould (based on a bio-oscillator) as it forms the optimal path for connecting to food, giving the algorithm excellent exploratory ability and exploitation propensity. 🔗 The source code of SMA is publicly available at https://aliasgharheidari.com/SMA.html
Graph Network-based Simulators
Graph Network-Based Simulators is a type of graph neural network that represents the state of a physical system with particles, expressed as nodes in a graph, and computes dynamics via learned message-passing.
WaveGlow is a flow-based generative model that generates audio by sampling from a distribution. Specifically, samples are drawn from a zero-mean spherical Gaussian with the same number of dimensions as the desired output, and those samples are passed through a series of layers that transform this simple distribution into one with the desired audio distribution.
Deep Kernel Learning
Adaptive Feature Pooling pools features from all levels for each proposal in object detection and fuses them for the following prediction. Each proposal is mapped to all feature levels. Following the idea of Mask R-CNN, RoIAlign is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse the feature grids from different levels. The motivation for this technique is that in an FPN proposals are assigned to different feature levels based on their size, which can be suboptimal: proposals with only small size differences may be assigned to different levels, and the importance of features may not be strongly correlated with the level they belong to.
Shifted Softplus is an activation function of the form $\text{ssp}(x) = \ln\left(0.5e^{x} + 0.5\right)$, which SchNet employs as the non-linearity throughout the network in order to obtain a smooth potential energy surface. The shifting ensures that $\text{ssp}(0) = 0$ and improves the convergence of the network. This activation function shows similarity to ELUs, while having infinite order of continuity.
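A tiny sketch of the activation exactly as defined above (equivalently, softplus shifted down by $\ln 2$):

```python
import numpy as np

def shifted_softplus(x):
    """ssp(x) = ln(0.5 * e^x + 0.5); satisfies ssp(0) = 0 and is smooth."""
    return np.log(0.5 * np.exp(x) + 0.5)

print(shifted_softplus(0.0))  # 0.0
```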
Deterministic Policy Gradient
Deterministic Policy Gradient, or DPG, is a policy gradient method for reinforcement learning. Instead of modeling the policy as a probability distribution, DPG considers a deterministic policy $\mu_{\theta}(s)$ and calculates gradients of the expected return with respect to its parameters.
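For reference, the deterministic policy gradient theorem gives the gradient of the performance objective $J(\mu_{\theta})$ as

$$\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_{\theta}\,\mu_{\theta}(s)\,\nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\right]$$

i.e., the expectation over the discounted state distribution $\rho^{\mu}$ of the gradient of the action-value function with respect to actions, chained through the gradient of the deterministic policy.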
AugMix mixes augmented images through linear interpolation. It is therefore similar to Mixup, but instead mixes several augmented versions of the same image.
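A minimal NumPy sketch of AugMix-style mixing, assuming the caller supplies a list of augmentation functions; the fixed chain depth and the default hyper-parameters are illustrative, not the paper's exact settings.

```python
import numpy as np

def augmix(image, augment_ops, k=3, depth=2, alpha=1.0):
    """k augmentation chains of the same image are combined with Dirichlet-sampled
    weights, then interpolated with the original image via a Beta-sampled weight m.
    `augment_ops` is a list of functions image -> image (caller-supplied)."""
    rng = np.random.default_rng()
    w = rng.dirichlet([alpha] * k)          # weights for the k chains
    m = rng.beta(alpha, alpha)              # interpolation weight with the original
    mix = np.zeros_like(image, dtype=np.float32)
    for i in range(k):
        aug = image.astype(np.float32)
        for _ in range(depth):              # apply `depth` random ops in sequence
            op = augment_ops[rng.integers(len(augment_ops))]
            aug = op(aug)
        mix += w[i] * aug
    return m * image.astype(np.float32) + (1 - m) * mix
```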
A Switch FFN is a sparse layer that operates independently on tokens within an input sequence. It is shown in the blue block in the figure. We diagram two tokens ($x_1$ = “More” and $x_2$ = “Parameters” below) being routed (solid lines) across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value (dotted line).
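A minimal PyTorch sketch of this top-1 routing, looping over experts for clarity rather than speed; the layer sizes and expert MLP shape are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Router picks the top-1 expert per token; the chosen expert's output is
    scaled by the router gate value."""
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)   # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 512)                   # e.g. "More", "Parameters"
print(SwitchFFN(512, 2048, n_experts=4)(tokens).shape)
```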
Reformer is a Transformer-based architecture that seeks to make efficiency improvements. Dot-product attention is replaced by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, Reformer uses reversible residual layers instead of the standard residuals, which allows storing activations only once during training instead of $N$ times, where $N$ is the number of layers.
Schrödinger Network
SchNet is an end-to-end deep neural network architecture based on continuous-filter convolutions. It follows the deep tensor neural network framework, i.e. atom-wise representations are constructed by starting from embedding vectors that characterize the atom type before introducing the configuration of the system by a series of interaction blocks.
Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from the sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating state-of-the-art performance compared to recent methods for set-structured data.
MetaFormer is a general architecture abstracted from Transformers by not specifying the token mixer.
FastSpeech 2 is a text-to-speech model that aims to improve upon FastSpeech by better solving the one-to-many mapping problem in TTS, i.e., multiple speech variations corresponding to the same text. It attempts to solve this problem by 1) directly training the model with the ground-truth target instead of the simplified output from a teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, in FastSpeech 2, we extract duration, pitch and energy from the speech waveform and directly take them as conditional inputs in training, and use predicted values in inference. The encoder converts the phoneme embedding sequence into the phoneme hidden sequence, then the variance adaptor adds different variance information such as duration, pitch and energy into the hidden sequence, and finally the mel-spectrogram decoder converts the adapted hidden sequence into the mel-spectrogram sequence in parallel. FastSpeech 2 uses a feed-forward Transformer block, which is a stack of self-attention and 1D-convolution as in FastSpeech, as the basic structure for the encoder and mel-spectrogram decoder.
Attention with Linear Biases
ALiBi, or Attention with Linear Biases, is a positioning method that allows Transformer language models to consume, at inference time, sequences which are longer than the ones they were trained on. ALiBi does this without using actual position embeddings. Instead, when computing the attention between a certain key and query, ALiBi penalizes the attention value that the query can assign to the key depending on how far apart the key and query are. So when a key and query are close by, the penalty is very low, and when they are far away, the penalty is very high. This method was motivated by the simple reasoning that words that are close by matter much more than ones that are far away. This method is as fast as the sinusoidal or absolute embedding methods (the fastest positioning methods there are). It outperforms those methods and rotary embeddings when evaluating sequences that are longer than the ones the model was trained on (this is known as extrapolation).
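A small sketch of how the linear bias can be built, assuming causal (decoder-style) attention and a head count that is a power of two so the paper's geometric slope sequence applies directly:

```python
import torch

def alibi_bias(n_heads, seq_len):
    """For head h with slope m_h, add -m_h * (i - j) to the attention score of
    query i attending to an earlier key j; distant pairs get larger penalties."""
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)   # distance i - j for j <= i
    return -slopes[:, None, None] * dist                # (n_heads, seq_len, seq_len)

# usage: scores = q @ k.transpose(-1, -2) / d**0.5 + alibi_bias(n_heads, L),
# then apply the causal mask and softmax as usual.
```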
Reinforcement Learning from AI Feedback
Meta Face Recognition
Meta Face Recognition (MFR) is a meta-learning face recognition method. MFR synthesizes the source/target domain shift with a meta-optimization objective, which requires the model to learn effective representations not only on synthesized source domains but also on synthesized target domains. Specifically, domain-shift batches are built through a domain-level sampling strategy and back-propagated gradients/meta-gradients are obtained on synthesized source/target domains by optimizing multi-domain distributions. The gradients and meta-gradients are further combined to update the model to improve generalization.
Model Soups
Model soups compress an ensemble of models into a single model by averaging their weights (under certain preconditions, such as the models being fine-tuned from the same pre-trained initialization).
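A minimal sketch of a uniform soup, assuming the checkpoints share an architecture; averaging every state-dict entry (including buffers) is a simplification.

```python
import torch

def uniform_soup(state_dicts):
    """Average the weights of several fine-tuned models and return a new state
    dict loadable into the same architecture."""
    soup = {}
    for key in state_dicts[0]:
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# usage: model.load_state_dict(uniform_soup([torch.load(p) for p in checkpoint_paths]))
```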
Random Erasing is a data augmentation method for training convolutional neural networks (CNNs) which randomly selects a rectangular region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing requires no parameter learning, is easy to implement, and can be integrated with most CNN-based recognition models. Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and can be applied to various vision tasks, such as image classification, object detection, and semantic segmentation. In the Albumentations library, there is a generalization of RandomErasing called CoarseDropout, which allows masking an arbitrary number of rectangular regions. It can be applied to images, segmentation masks, and key points. See the documentation for CoarseDropout.
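A minimal NumPy sketch of the operation; the probability, area, and aspect-ratio ranges are commonly used defaults and are assumptions here, not fixed by the description above.

```python
import numpy as np

def random_erasing(image, p=0.5, area=(0.02, 0.4), aspect=(0.3, 3.3)):
    """With probability p, pick a rectangle whose area and aspect ratio are sampled
    from the given ranges and fill it with random pixel values. `image` is HxWxC."""
    rng = np.random.default_rng()
    if rng.random() > p:
        return image
    h, w, c = image.shape
    for _ in range(10):                              # retry until the rectangle fits
        target_area = rng.uniform(*area) * h * w
        ratio = rng.uniform(*aspect)
        eh = int(round(np.sqrt(target_area * ratio)))
        ew = int(round(np.sqrt(target_area / ratio)))
        if 0 < eh < h and 0 < ew < w:
            top = rng.integers(0, h - eh)
            left = rng.integers(0, w - ew)
            image = image.copy()
            image[top:top + eh, left:left + ew] = rng.integers(0, 256, (eh, ew, c))
            return image
    return image
```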
Dual Path Network
A Dual Path Network (DPN) is a convolutional neural network which presents a new topology of connection paths internally. The intuition is that ResNets enable feature re-usage while DenseNets enable new feature exploration, and both are important for learning good representations. To enjoy the benefits from both path topologies, Dual Path Networks share common features while maintaining the flexibility to explore new features through dual path architectures. Such a dual path architecture can be formulated as:

$$x^{k} = \sum_{t=1}^{k-1} f_{t}^{k}\left(h^{t}\right)$$

$$y^{k} = \sum_{t=1}^{k-1} v_{t}\left(h^{t}\right) = y^{k-1} + \varphi^{k-1}\left(y^{k-1}\right)$$

$$r^{k} = x^{k} + y^{k}$$

$$h^{k} = g^{k}\left(r^{k}\right)$$

where $x^{k}$ and $y^{k}$ denote the extracted information at the $k$-th step from the individual paths, $h^{t}$ is the hidden state at step $t$, $f_{t}^{k}(\cdot)$ and $v_{t}(\cdot)$ are feature learning functions, and $g^{k}(\cdot)$ is a transformation function. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables common feature re-usage. The third equation defines the dual path that integrates them and feeds them to the last transformation function in the last equation.
RealNVP is a generative model that utilises real-valued non-volume preserving (real NVP) transformations for density estimation. The model can perform efficient and exact inference, sampling and log-density estimation of data points.
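A minimal PyTorch sketch of the affine coupling layer at the heart of RealNVP-style flows; the MLP that produces the scale and translation, the tanh bounding of the log-scale, and the even input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Half the dimensions pass through unchanged and parameterize a scale and
    translation applied to the other half. The Jacobian is triangular, so the
    log-determinant is simply the sum of the log-scales."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        # dim assumed even; the net outputs (log_s, t) for the second half
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                      # keep scales bounded for stability
        y2 = x2 * torch.exp(log_s) + t
        return torch.cat([x1, y2], dim=-1), log_s.sum(dim=-1)   # output, log|det J|

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        log_s, t = self.net(y1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=-1)
```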
A Dynamic Memory Network is a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN consists of a number of modules:

- Input Module: The input module encodes raw text inputs from the task into distributed vector representations. The input takes forms like a sentence, a long story, a movie review and so on.
- Question Module: The question module encodes the question of the task into a distributed vector representation. For question answering, the question may be a sentence such as "Where did the author first fly?". The representation is fed into the episodic memory module, and forms the basis, or initial state, upon which the episodic memory module iterates.
- Episodic Memory Module: Given a collection of input representations, the episodic memory module chooses which parts of the inputs to focus on through the attention mechanism. It then produces a "memory" vector representation taking into account the question as well as the previous memory. Each iteration provides the module with newly relevant information about the input. In other words, the module has the ability to retrieve new information, in the form of input representations, which were thought to be irrelevant in previous iterations.
- Answer Module: The answer module generates an answer from the final memory vector of the memory module.
Ensemble clustering, also called consensus clustering, has been attracting much attention in recent years, aiming to combine multiple base clusterings into a better and more robust consensus clustering. Due to its good performance, ensemble clustering plays a vital role in many research areas, such as community detection and bioinformatics.
The Universal Transformer is a generalization of the Transformer architecture. Universal Transformers combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. They also utilise a dynamic per-position halting mechanism.
Class activation guide
Class activation guide is a module which uses weak localization information from the instrument activation maps to guide the verb and target recognition. Image source: Nwoye et al.
LSGAN, or Least Squares GAN, is a type of generative adversarial network that adopts the least squares loss function for the discriminator. Minimizing the objective function of LSGAN yields minimizing the Pearson $\chi^2$ divergence. The objective functions can be defined as:

$$\min_{D} V_{\text{LSGAN}}(D) = \frac{1}{2}\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\left[\left(D(\mathbf{x}) - b\right)^{2}\right] + \frac{1}{2}\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\left(D(G(\mathbf{z})) - a\right)^{2}\right]$$

$$\min_{G} V_{\text{LSGAN}}(G) = \frac{1}{2}\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\left(D(G(\mathbf{z})) - c\right)^{2}\right]$$

where $a$ and $b$ are the labels for fake data and real data, and $c$ denotes the value that $G$ wants $D$ to believe for fake data.
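A small sketch of the two losses above, assuming the discriminator outputs raw (unbounded) values and the common choice $a=0$, $b=c=1$:

```python
import torch

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Least-squares discriminator loss: push D(x) toward the real label b and
    D(G(z)) toward the fake label a."""
    return 0.5 * ((d_real - b) ** 2).mean() + 0.5 * ((d_fake - a) ** 2).mean()

def lsgan_g_loss(d_fake, c=1.0):
    """Least-squares generator loss: push D(G(z)) toward the value c that the
    generator wants the discriminator to believe for fake data."""
    return 0.5 * ((d_fake - c) ** 2).mean()
```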
CheXNet is a 121-layer DenseNet trained on ChestX-ray14 for pneumonia detection.
Self-Cure Network
Self-Cure Network, or SCN, is a method for suppressing uncertainties in large-scale facial expression recognition, preventing deep networks from over-fitting uncertain facial images. Specifically, SCN suppresses the uncertainty from two different aspects: 1) a self-attention mechanism over the mini-batch to weight each training sample with a ranking regularization, and 2) a careful relabeling mechanism to modify the labels of the samples in the lowest-ranked group.
Path Aggregation Network, or PANet, aims to boost information flow in a proposal-based instance segmentation framework. Specifically, the feature hierarchy is enhanced with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and topmost feature. Additionally, adaptive feature pooling is employed, which links feature grid and all feature levels to make useful information in each feature level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction.
TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://www.paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings which are fed to a Transformer encoder, analogously to token features computed from words in NLP.
Population Based Training, or PBT, is an optimization method for finding parameters and hyperparameters, and extends upon parallel search methods and sequential optimisation methods. It leverages information sharing across a population of concurrently running optimisation processes, and allows for online propagation/transfer of parameters and hyperparameters between members of the population based on their performance. Furthermore, unlike most other adaptation schemes, the method is capable of performing online adaptation of hyperparameters -- which can be particularly important in problems with highly non-stationary learning dynamics, such as reinforcement learning settings. PBT is decentralised and asynchronous, although it could also be executed semi-serially or with partial synchrony if there is a binding budget constraint.
Parsing Incrementally for Constrained Auto-Regressive Decoding
Inception-ResNet-v2-B is an image model block for a 17 x 17 grid used in the Inception-ResNet-v2 architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections.
Inception-ResNet-v2 Reduction-B is an image model block used in the Inception-ResNet-v2 architecture.
Enhanced-Multimodal Fuzzy Framework
A BCI MI framework to classify brain signals using a multimodal decision-making phase, with an additional differentiation of the signal.
Inception-ResNet-v2-C is an image model block for an 8 x 8 grid used in the Inception-ResNet-v2 architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections.
Gait Emotion Recognition
We present a novel classifier network called STEP to classify perceived human emotion from gaits, based on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits the gait features to classify the perceived emotion of the human into one of four emotions: happy, sad, angry, or neutral. We train STEP on annotated real-world gait videos, augmented with annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP. We also release a novel dataset (E-Gait), which consists of 4,227 human gaits annotated with perceived emotions along with thousands of synthetic gaits. In practice, STEP can learn the affective features and exhibits classification accuracy of 88% on E-Gait, which is 14-30% more accurate than prior methods.
Neural Additive Model
Neural Additive Models (NAMs) impose restrictions on the structure of neural networks, which yields a family of models that are inherently interpretable while suffering little loss in prediction accuracy when applied to tabular data. Methodologically, NAMs belong to a larger model family called Generalized Additive Models (GAMs). NAMs learn a linear combination of networks that each attend to a single input feature: each shape function $f_i$ in the traditional GAM formulation is parametrized by a neural network. These networks are trained jointly using backpropagation and can learn arbitrarily complex shape functions. Interpreting NAMs is easy, as the impact of a feature on the prediction does not rely on the other features and can be understood by visualizing its corresponding shape function (e.g., plotting $f_i(x_i)$ vs. $x_i$).
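A minimal PyTorch sketch of the additive structure, using plain ReLU MLPs as the per-feature shape functions (the paper's specialized units and regularization are omitted):

```python
import torch
import torch.nn as nn

class NeuralAdditiveModel(nn.Module):
    """One small network (shape function) per input feature; outputs are summed with
    a bias to form the prediction. Plotting each f_i(x_i) against x_i gives the
    model's interpretation for that feature."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.feature_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_features)
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                              # x: (batch, n_features)
        contributions = [net(x[:, i:i + 1]) for i, net in enumerate(self.feature_nets)]
        return torch.cat(contributions, dim=-1).sum(dim=-1) + self.bias

print(NeuralAdditiveModel(5)(torch.randn(8, 5)).shape)   # torch.Size([8])
```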
A (2+1)D Convolution is a type of convolution used for action recognition convolutional neural networks, with a spatiotemporal volume. As opposed to applying a 3D Convolution over the entire volume, which can be computationally expensive and lead to overfitting, a (2+1)D convolution splits computation into two convolutions: a spatial 2D convolution followed by a temporal 1D convolution.
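A small PyTorch sketch of the factorization, assuming 5D video tensors of shape (batch, channels, frames, height, width); in the original formulation the intermediate width is chosen so the parameter count matches the full 3D convolution, whereas here it is left as a plain argument.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A 3D conv with kernel (t, k, k) factorized into a spatial conv with kernel
    (1, k, k) followed by a temporal conv with kernel (t, 1, 1), with a
    non-linearity in between."""
    def __init__(self, in_channels, mid_channels, out_channels, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, mid_channels, (1, k, k), padding=(0, k // 2, k // 2))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels, (t, 1, 1), padding=(t // 2, 0, 0))

    def forward(self, x):                      # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.spatial(x)))

print(Conv2Plus1D(3, 45, 64)(torch.randn(1, 3, 8, 32, 32)).shape)  # (1, 64, 8, 32, 32)
```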
MPNet is a pre-training method for language models that combines masked language modeling (MLM) and permuted language modeling (PLM) in one view. It takes the dependency among the predicted tokens into consideration through permuted language modeling and thus avoids the issue of BERT. On the other hand, it takes the position information of all tokens as input to make the model see the position information of all the tokens, and thus alleviates the position discrepancy of XLNet. The training objective of MPNet is:

$$\mathbb{E}_{z \sim \mathcal{Z}_{n}} \sum_{t=c+1}^{n} \log P\left(x_{z_{t}} \mid x_{z_{<t}}, M_{z_{>c}}; \theta\right)$$

As can be seen, MPNet conditions on $x_{z_{<t}}$ (the tokens preceding the current predicted token $x_{z_{t}}$) rather than only the non-predicted tokens as in MLM; compared with PLM, MPNet takes more information (i.e., the mask symbols $M_{z_{>c}}$ in positions $z_{>c}$) as inputs. Although the objective seems simple, it is challenging to implement the model efficiently. For details, see the paper.
Linformer is a linear Transformer that utilises a linear self-attention mechanism to tackle the self-attention bottleneck with Transformer models. The original scaled dot-product attention is decomposed into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention.
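A single-head sketch of the idea, assuming a fixed maximum sequence length and a projected length k; the parameter names E and F mirror the usual description of the key/value projections but the initialization and sizes here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinformerSelfAttention(nn.Module):
    """Keys and values are projected from length n down to a fixed length k before
    the scaled dot-product, so the attention map is n x k instead of n x n."""
    def __init__(self, d_model, seq_len, k=64):
        super().__init__()
        self.q, self.key, self.v = (nn.Linear(d_model, d_model) for _ in range(3))
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # projects keys
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # projects values

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        q, key, v = self.q(x), self.key(x), self.v(x)
        key = self.E @ key                     # (batch, k, d_model)
        v = self.F @ v                         # (batch, k, d_model)
        attn = F.softmax(q @ key.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                        # (batch, seq_len, d_model)
```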
ProxylessNAS directly learns neural network architectures on the target task and target hardware without any proxy task. Additional contributions include:

- A new path-level pruning perspective for neural architecture search, showing a close connection between NAS and model compression. Memory consumption is reduced by one order of magnitude through path-level binarization.
- A novel gradient-based approach (latency regularization loss) for handling hardware objectives (e.g., latency). Given different hardware platforms (CPU/GPU/mobile), ProxylessNAS enables hardware-aware neural network specialization that is exactly optimized for the target hardware.
PrIme Sample Attention
PrIme Sample Attention (PISA) directs the training of object detection frameworks towards prime samples. These are samples that play a key role in driving the detection performance. The authors define Hierarchical Local Rank (HLR) as a metric of importance. Specifically, they use IoU-HLR to rank positive samples and ScoreHLR to rank negative samples in each mini-batch. This ranking strategy places the positive samples with highest IoUs around each object and the negative samples with highest scores in each cluster to the top of the ranked list and directs the focus of the training process to them via a simple re-weighting scheme. The authors also devise a classification-aware regression loss to jointly optimize the classification and regression branches. Particularly, this loss would suppress those samples with large regression loss, thus reinforcing the attention to prime samples.