Methods

APA

Adaptive Pseudo Augmentation

Res2Net

Res2Net is an image model that employs a variation on bottleneck residual blocks. The motivation is to be able to represent features at multiple scales. This is achieved through a novel building block for CNNs that constructs hierarchical residual-like connections within one single residual block. This represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.

DPT

Dense Prediction Transformer

Dense Prediction Transformers (DPT) are a type of vision transformer for dense prediction tasks. The input image is transformed into tokens (orange) either by extracting non-overlapping patches followed by a linear projection of their flattened representation (DPT-Base and DPT-Large) or by applying a ResNet-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding and a patch-independent readout token (red) is added. The tokens are passed through multiple transformer stages. The tokens are reassembled from different stages into an image-like representation at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample the representations to generate a fine-grained prediction.

Res2Net Block

A Res2Net Block is an image model block that constructs hierarchical residual-like connections within one single residual block. It was proposed as part of the Res2Net CNN architecture. The block represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The filters of channels is replaced with a set of smaller filter groups, each with channels. These smaller filter groups are connected in a hierarchical residual-like style to increase the number of scales that the output features can represent. Specifically, we divide input feature maps into several groups. A group of filters first extracts features from a group of input feature maps. Output features of the previous group are then sent to the next group of filters along with another group of input feature maps. This process repeats several times until all input feature maps are processed. Finally, feature maps from all groups are concatenated and sent to another group of filters to fuse information altogether. Along with any possible path in which input features are transformed to output features, the equivalent receptive field increases whenever it passes a filter, resulting in many equivalent feature scales due to combination effects. One way of thinking of these blocks is that they expose a new dimension, scale, alongside the existing dimensions of depth, width, and cardinality.

GraphsIntroduced 200025 papers

MoNet

Mixture model network

Mixture model network (MoNet) is a general framework allowing to design convolutional deep architectures on non-Euclidean domains such as graphs and manifolds. Image and description from: Geometric deep learning on graphs and manifolds using mixture model CNNs

VisualBERT

VisualBERT aims to reuse self-attention to implicitly align elements of the input text and regions in the input image. Visual embeddings are used to model images where the representations are represented by a bounding region in an image obtained from an object detector. These visual embeddings are constructed by summing three embeddings: 1) visual feature representation, 2) a segment embedding indicate whether it is an image embedding, and 3) position embedding. Essentially, image regions and language are combined with a Transformer to allow self-attention to discover implicit alignments between language and vision. VisualBERT is trained using COCO, which consists of images paired with captions. It is pre-trained using two objectives: masked language modeling objective and sentence-image prediction task. It can then be fine-tuned on different downstream tasks.

GeneralIntroduced 200025 papers

Location Sensitive Attention

Location Sensitive Attention is an attention mechanism that extends the additive attention mechanism to use cumulative attention weights from previous decoder time steps as an additional feature. This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder. Starting with additive attention where is a sequential representation from a BiRNN encoder and is the -th state of a recurrent neural network (e.g. a LSTM or GRU): where and are vectors, and are matrices. We extend this to be location-aware by making it take into account the alignment produced at the previous step. First, we extract vectors for every position of the previous alignment by convolving it with a matrix : These additional vectors are then used by the scoring mechanism :

Dataset Pruning

Dataset pruning is an approach to reduce a large dataset to obtain a small dataset by removing less significant sample.

GeneralIntroduced 200025 papers

R2D2

Recurrent Replay Distributed DQN

Building on the recent successes of distributed training of RL agents, R2D2 is an RL approach that trains a RNN-based RL agents from distributed prioritized experience replay. Using a single network architecture and fixed set of hyperparameters, Recurrent Replay Distributed DQN quadrupled the previous state of the art on Atari-57, and matches the state of the art on DMLab-30. It was the first agent to exceed human-level performance in 52 of the 57 Atari games.

Reinforcement LearningIntroduced 200025 papers

Firefly algorithm

Metaheuristic algorithm

Reinforcement LearningIntroduced 200024 papers

ECO

The Educational Competition Optimizer

In recent research, metaheuristic strategies stand out as powerful tools for complex optimization, capturing widespread attention. This study proposes the Educational Competition Optimizer (ECO), an algorithm created for diverse optimization tasks. ECO draws inspiration from the competitive dynamics observed in real-world educational resource allocation scenarios, harnessing this principle to refine its search process. To further boost its efficiency, the algorithm divides the iterative process into three distinct phases: elementary, middle, and high school. Through this stepwise approach, ECO gradually narrows down the pool of potential solutions, mirroring the gradual competition witnessed within educational systems. This strategic approach ensures a smooth and resourceful transition between ECO's exploration and exploitation phases. The results indicate that ECO attains its peak optimization performance when configured with a population size of 40. Notably, the algorithm's optimization efficacy does not exhibit a strictly linear correlation with population size. To comprehensively evaluate ECO's effectiveness and convergence characteristics, we conducted a rigorous comparative analysis, comparing ECO against nine state-of-the-art metaheuristic algorithms. ECO's remarkable success in efficiently addressing complex optimization problems underscores its potential applicability across diverse real-world domains. The additional resources and open-source code for the proposed ECO can be accessed at https://aliasgharheidari.com/ECO.html and https://github.com/junbolian/ECO.

GeneralIntroduced 200024 papers

ENet Bottleneck

ENet Bottleneck is an image model block used in the ENet semantic segmentation architecture. Each block consists of three convolutional layers: a 1 × 1 projection that reduces the dimensionality, a main convolutional layer, and a 1 × 1 expansion. We place Batch Normalization and PReLU between all convolutions. If the bottleneck is downsampling, a max pooling layer is added to the main branch. Also, the first 1 × 1 projection is replaced with a 2 × 2 convolution with stride 2 in both dimensions. We zero pad the activations, to match the number of feature maps.

ENet

ENet is a semantic segmentation architecture which utilises a compact encoder-decoder architecture. Some design choices include: 1. Using the SegNet approach to downsampling y saving indices of elements chosen in max pooling layers, and using them to produce sparse upsampled maps in the decoder. 2. Early downsampling to optimize the early stages of the network and reduce the cost of processing large input frames. The first two blocks of ENet heavily reduce the input size, and use only a small set of feature maps. 3. Using PReLUs as an activation function 4. Using dilated convolutions 5. Using Spatial Dropout

ENet Dilated Bottleneck

ENet Dilated Bottleneck is an image model block used in the ENet semantic segmentation architecture. It is the same as a regular ENet Bottleneck but employs dilated convolutions instead.

RegNetY

RegNetY is a convolutional network design space with simple, regular models with parameters: depth , initial width , and slope , and generates a different block width for each block . The key restriction for the RegNet types of model is that there is a linear parameterisation of block widths (the design space only contains models with this linear structure): For RegNetX we have additional restrictions: we set (the bottleneck ratio), , and (the width multiplier). For RegNetY we make one change, which is to include Squeeze-and-Excitation blocks.

GeneralIntroduced 200024 papers

Multi-Head Linear Attention

Multi-Head Linear Attention is a type of linear multi-head self-attention module, proposed with the Linformer architecture. The main idea is to add two linear projection matrices when computing key and value. We first project the original -dimensional key and value layers and into -dimensional projected key and value layers. We then compute a dimensional context mapping using scaled-dot product attention: Finally, we compute context embeddings for each head using .

Highway networks

There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on "information highways". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.

GeneralIntroduced 200024 papers

ENet Initial Block

The ENet Initial Block is an image model block used in the ENet semantic segmentation architecture. Max Pooling is performed with non-overlapping 2 × 2 windows, and the convolution has 13 filters, which sums up to 16 feature maps after concatenation. This is heavily inspired by Inception Modules.

Ghost Bottleneck

A Ghost BottleNeck is a skip connection block, similar to the basic residual block in ResNet in which several convolutional layers and shortcuts are integrated, but stacks Ghost Modules instead (two stacked Ghost modules). It was proposed as part of the GhostNet CNN architecture. The first Ghost module acts as an expansion layer increasing the number of channels. The ratio between the number of the output channels and that of the input is referred to as the expansion ratio. The second Ghost module reduces the number of channels to match the shortcut path. Then the shortcut is connected between the inputs and the outputs of these two Ghost modules. The batch normalization (BN) and ReLU nonlinearity are applied after each layer, except that ReLU is not used after the second Ghost module as suggested by MobileNetV2. The Ghost bottleneck described above is for stride=1. As for the case where stride=2, the shortcut path is implemented by a downsampling layer and a depthwise convolution with stride=2 is inserted between the two Ghost modules. In practice, the primary convolution in Ghost module here is pointwise convolution for its efficiency.

GeneralIntroduced 200023 papers

ROME

Rank-One Model Editing

CABiNet

Context Aggregated Bi-lateral Network for Semantic Segmentation

With the increasing demand of autonomous systems, pixelwise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for potential real-time applications. In this paper, we propose Context Aggregation Network, a dual branch convolutional neural network, with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy. Building upon the existing dual branch architectures for high-speed semantic segmentation, we design a high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. We evaluate our method on two semantic segmentation datasets, namely Cityscapes dataset and UAVid dataset. For Cityscapes test set, our model achieves state-of-the-art results with mIOU of 75.9%, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. With regards to UAVid dataset, our proposed network achieves mIOU score of 63.5% with high execution speed (15 FPS).

Selective Search

Selective Search is a region proposal algorithm for object detection tasks. It starts by over-segmenting the image based on intensity of the pixels using a graph-based segmentation method by Felzenszwalb and Huttenlocher. Selective Search then takes these oversegments as initial input and performs the following steps 1. Add all bounding boxes corresponding to segmented parts to the list of regional proposals 2. Group adjacent segments based on similarity 3. Go to step 1 At each iteration, larger segments are formed and added to the list of region proposals. Hence we create region proposals from smaller segments to larger segments in a bottom-up approach. This is what we mean by computing “hierarchical” segmentations using Felzenszwalb and Huttenlocher’s oversegments.

ESPNet

ESPNet is a convolutional neural network for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power.

ARCH

Animatable Reconstruction of Clothed Humans

Animatable Reconstruction of Clothed Humans is an end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features.

Natural Language ProcessingIntroduced 200023 papers

UNITER

UNiversal Image-TExt Representation Learning

UNITER or UNiversal Image-TExt Representation model is a large-scale pre-trained model for joint multimodal embedding. It is pre-trained using four image-text datasets COCO, Visual Genome, Conceptual Captions, and SBU Captions. It can power heterogeneous downstream V+L tasks with joint multimodal embeddings. UNITER takes the visual regions of the image and textual tokens of the sentence as inputs. A faster R-CNN is used in Image Embedder to extract the visual features of each region and a Text Embedder is used to tokenize the input sentence into WordPieces. It proposes WRA via the Optimal Transport to provide more fine-grained alignment between word tokens and image regions that is effective in calculating the minimum cost of transporting the contextualized image embeddings to word embeddings and vice versa. Four pretraining tasks were designed for this model. They are Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). This model is different from the previous models because it uses conditional masking on pre-training tasks.

Cascade Mask R-CNN

Cascade Mask R-CNN extends Cascade R-CNN to instance segmentation, by adding a mask head to the cascade. In the Mask R-CNN, the segmentation branch is inserted in parallel to the detection branch. However, the Cascade R-CNN has multiple detection branches. This raises the questions of 1) where to add the segmentation branch and 2) how many segmentation branches to add. The authors consider three strategies for mask prediction in the Cascade R-CNN. The first two strategies address the first question, adding a single mask prediction head at either the first or last stage of the Cascade R-CNN. Since the instances used to train the segmentation branch are the positives of the detection branch, their number varies in these two strategies. Placing the segmentation head later on the cascade leads to more examples. However, because segmentation is a pixel-wise operation, a large number of highly overlapping instances is not necessarily as helpful as for object detection, which is a patch-based operation. The third strategy addresses the second question, adding a segmentation branch to each cascade stage. This maximizes the diversity of samples used to learn the mask prediction task. At inference time, all three strategies predict the segmentation masks on the patches produced by the final object detection stage, irrespective of the cascade stage on which the segmentation mask is implemented and how many segmentation branches there are.

AudioIntroduced 200023 papers

Tacotron 2

Tacotron2

Tacotron 2 is a neural network architecture for speech synthesis directly from text. It consists of two components: - a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence - a modified version of WaveNet which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames In contrast to the original Tacotron, Tacotron 2 uses simpler building blocks, using vanilla LSTM and convolutional layers in the encoder and decoder instead of CBHG stacks and GRU recurrent layers. Tacotron 2 does not use a “reduction factor”, i.e., each decoder step corresponds to a single spectrogram frame. Location-sensitive attention is used instead of additive attention.

VGAE

Variational Graph Auto Encoder

GraphsIntroduced 200023 papers

Dueling Network

A Dueling Network is a type of Q-Network that has two streams to separately estimate (scalar) state-value and the advantages for each action. Both streams share a common convolutional feature learning module. The two streams are combined via a special aggregating layer to produce an estimate of the state-action value function Q as shown in the figure to the right. The last module uses the following mapping: This formulation is chosen for identifiability so that the advantage function has zero advantage for the chosen action, but instead of a maximum we use an average operator to increase the stability of the optimization.

Reinforcement LearningIntroduced 200023 papers

Neural Architecture Search

Neural Architecture Search (NAS) learns a modular architecture which can be transferred from a small dataset to a large dataset. The method does this by reducing the problem of learning best convolutional architectures to the problem of learning a small convolutional cell. The cell can then be stacked in series to handle larger images and more complex datasets. Note that this refers to the original method referred to as NAS - there is also a broader category of methods called "neural architecture search".

GeneralIntroduced 200023 papers

MobileViT

MobileViT is a vision transformer that is tuned to mobile phone

NICE

Non-linear Independent Component Estimation

NICE, or Non-Linear Independent Components Estimation is a framework for modeling complex high-dimensional densities. It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. The transformation is parameterised so that computing the determinant of the Jacobian and inverse Jacobian is trivial, yet it maintains the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood. The transformation used in NICE is the affine coupling layer without the scale term, known as additive coupling layer:

GeneralIntroduced 200022 papers

DropPath

Just as dropout prevents co-adaptation of activations, DropPath prevents co-adaptation of parallel paths in networks such as FractalNets by randomly dropping operands of the join layers. This discourages the network from using one input path as an anchor and another as a corrective term (a configuration that, if not prevented, is prone to overfitting). Two sampling strategies are: - Local: a join drops each input with fixed probability, but we make sure at least one survives. - Global: a single path is selected for the entire network. We restrict this path to be a single column, thereby promoting individual columns as independently strong predictors.

GhostNet

A GhostNet is a type of convolutional neural network that is built using Ghost modules, which aim to generate more features by using fewer parameters (allowing for greater efficiency). GhostNet mainly consists of a stack of Ghost bottlenecks with the Ghost modules as the building block. The first layer is a standard convolutional layer with 16 filters, then a series of Ghost bottlenecks with gradually increased channels follow. These Ghost bottlenecks are grouped into different stages according to the sizes of their input feature maps. All the Ghost bottlenecks are applied with stride=1 except that the last one in each stage is with stride=2. At last a global average pooling and a convolutional layer are utilized to transform the feature maps to a 1280-dimensional feature vector for final classification. The squeeze and excite (SE) module is also applied to the residual layer in some ghost bottlenecks. In contrast to MobileNetV3, GhostNet does not use hard-swish nonlinearity function due to its large latency.

SequentialIntroduced 200022 papers

CNN BiLSTM

CNN Bidirectional LSTM

A CNN BiLSTM is a hybrid bidirectional LSTM and CNN architecture. In the original formulation applied to named entity recognition, it learns both character-level and word-level features. The CNN component is used to induce the character-level features. For each word the model employs a convolution and a max pooling layer to extract a new feature vector from the per-character feature vectors such as character embeddings and (optionally) character type.

RGCN

Relational Graph Convolution Network

An RGCN, or Relational Graph Convolution Network, is a an application of the GCN framework to modeling relational data, specifically to link prediction and entity classification tasks. See here for an in-depth explanation of RGCNs by DGL.

GraphsIntroduced 200022 papers

Soft-NMS

Non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes on the basis of their scores. The detection box with the maximum score is selected and all other detection boxes with a significant overlap (using a pre-defined threshold) with are suppressed. This process is recursively applied on the remaining boxes. As per the design of the algorithm, if an object lies within the predefined overlap threshold, it leads to a miss. Soft-NMS solves this problem by decaying the detection scores of all other objects as a continuous function of their overlap with M. Hence, no object is eliminated in this process.