The Hard Sigmoid is an activation function used for neural networks of the form: $$f(x) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right)$$ It is a piecewise-linear approximation of the sigmoid that is cheaper to compute. Image Source: Rinat Maksutov
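A minimal NumPy sketch of this piecewise-linear variant (other libraries use slightly different slopes and offsets):

```python
import numpy as np

def hard_sigmoid(x):
    # Piecewise-linear approximation of the sigmoid: max(0, min(1, (x + 1) / 2))
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)
```

Unlike the smooth sigmoid, the output saturates exactly at 0 below x = -1 and at 1 above x = 1.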
Holographic Reduced Representations are a simple mechanism to represent an associative array of key-value pairs in a fixed-size vector. Each individual key-value pair is the same size as the entire associative array; the array is represented by the sum of the pairs. Concretely, consider a complex vector key $r$, which is the same size as the complex vector value $x$. The pair is "bound" together by element-wise complex multiplication $r \circledast x$, which multiplies the moduli and adds the phases of the elements. Given keys $r_1, r_2, r_3$ and input vectors $x_1, x_2, x_3$, the associative array is: $$c = r_1 \circledast x_1 + r_2 \circledast x_2 + r_3 \circledast x_3$$ where we call $c$ a memory trace. Define the key inverse $r^{-1}$ as the element-wise complex reciprocal, which inverts the moduli and negates the phases. To retrieve the item associated with key $r_k$, we multiply the memory trace element-wise by the vector $r_k^{-1}$. For example: $$r_2^{-1} \circledast c = x_2 + r_2^{-1} \circledast (r_1 \circledast x_1 + r_3 \circledast x_3)$$ The product is exactly $x_2$ together with a noise term. If the phases of the elements of the key vector are randomly distributed, the noise term has zero mean. Source: Associative LSTMs
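The binding, superposition, and retrieval steps can be sketched in NumPy. The vector size, the use of unit-modulus keys (so the inverse is just the complex conjugate), and the nearest-stored-vector check are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096  # vector dimension; larger n makes the retrieval noise relatively smaller

def random_key(n):
    # unit-modulus complex key with uniformly random phases
    return np.exp(1j * rng.uniform(-np.pi, np.pi, n))

k1, k2 = random_key(n), random_key(n)
x1 = rng.normal(size=n) + 1j * rng.normal(size=n)
x2 = rng.normal(size=n) + 1j * rng.normal(size=n)

# bind each key-value pair element-wise, then superpose into one trace
trace = k1 * x1 + k2 * x2

# retrieve with the key inverse; for a unit-modulus key this is its conjugate
retrieved = np.conj(k2) * trace  # equals x2 plus a zero-mean noise term
```

Because the noise has zero mean, a cleanup step (comparing the retrieved vector against the stored items) recovers the correct value: `retrieved` is far more similar to `x2` than to `x1`.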
Distributional Generalization is a type of generalization that roughly states that outputs of a classifier at train and test time are close as distributions, as opposed to close in just their average error. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain.
Precise RoI Pooling, or PrRoI Pooling, is a region of interest feature extractor that avoids any quantization of coordinates and has a continuous gradient on bounding box coordinates. Given the feature map $\mathcal{F}$ before RoI/PrRoI Pooling (e.g. from Conv4 in ResNet-50), let $w_{i,j}$ be the feature at one discrete location $(i, j)$ on the feature map. Using bilinear interpolation, the discrete feature map can be considered continuous at any continuous coordinates $(x, y)$: $$f(x, y) = \sum_{i,j} IC(x, y, i, j) \times w_{i,j}$$ where $IC(x, y, i, j) = \max(0, 1 - |x - i|) \times \max(0, 1 - |y - j|)$ is the interpolation coefficient. Then denote a bin of a RoI as $bin = \{(x_1, y_1), (x_2, y_2)\}$, where $(x_1, y_1)$ and $(x_2, y_2)$ are the continuous coordinates of the top-left and bottom-right points, respectively. We perform pooling (e.g. average pooling) given $bin$ and feature map $\mathcal{F}$ by computing a two-order integral: $$\text{PrPool}(bin, \mathcal{F}) = \frac{\int_{y_1}^{y_2}\int_{x_1}^{x_2} f(x, y)\, dx\, dy}{(x_2 - x_1) \times (y_2 - y_1)}$$
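A minimal NumPy sketch of the idea, approximating the two-order integral by densely sampling the bilinearly interpolated map (the real implementation evaluates the integral in closed form; the sampling density here is an illustrative assumption):

```python
import numpy as np

def bilinear(feat, x, y):
    # continuous value at (x, y) via bilinear interpolation of the discrete map
    h, w = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    val = 0.0
    for i in (y0, y0 + 1):
        for j in (x0, x0 + 1):
            if 0 <= i < h and 0 <= j < w:
                ic = max(0.0, 1 - abs(x - j)) * max(0.0, 1 - abs(y - i))
                val += ic * feat[i, j]
    return val

def prroi_avg_pool(feat, x1, y1, x2, y2, samples=64):
    # approximate the double integral of f(x, y) over the bin, divided by its area
    xs = np.linspace(x1, x2, samples)
    ys = np.linspace(y1, y2, samples)
    vals = [bilinear(feat, x, y) for y in ys for x in xs]
    return float(np.mean(vals))
```

Note that the bin coordinates are continuous floats; no rounding to the pixel grid takes place, which is what gives PrRoI Pooling its continuous gradient with respect to the box coordinates.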
Enhanced Blockwise Classification
Traditional methods are based on block-wise regression. This framework, Enhanced Blockwise Classification (EBC), instead aims to classify the count value within each block into several pre-defined bins. The enhancement comes from three aspects: discretization policy, label correction and loss function. Note that the original block-wise classification concept was introduced by Liu et al. in Counting Objects by Blockwise Classification.
MODNet is a light-weight matting objective decomposition network that can process portrait matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. To overcome the domain shift problem, MODNet introduces a self-supervised strategy based on sub-objective consistency (SOC) and a one-frame delay trick to smooth the results when applying MODNet to portrait video sequences. Given an input image $I$, MODNet predicts human semantics $s_p$, boundary details $d_p$, and the final alpha matte $\alpha_p$ through three interdependent branches, $S$, $D$ and $F$, which are constrained by specific supervisions generated from the ground truth matte $\alpha_g$. Since the decomposed sub-objectives are correlated and help strengthen each other, we can optimize MODNet end-to-end.
Grid R-CNN is an object detection framework, where the traditional regression formulation is replaced by a grid point guided localization mechanism. Grid R-CNN divides the object bounding box region into grids and employs a fully convolutional network (FCN) to predict the locations of grid points. Owing to the position-sensitive property of fully convolutional architectures, Grid R-CNN maintains explicit spatial information, and grid point locations can be obtained at the pixel level. When a certain number of grid points at specified locations are known, the corresponding bounding box is definitely determined. Guided by the grid points, Grid R-CNN can determine more accurate object bounding boxes than regression methods, which lack the guidance of explicit spatial information.
Spatial Attention Module (SAM) is a feature extraction module for object detection used in ThunderNet. The ThunderNet SAM explicitly re-weights the feature map before RoI warping over the spatial dimensions. The key idea of SAM is to use the knowledge from RPN to refine the feature distribution of the feature map. RPN is trained to recognize foreground regions under the supervision of ground truths. Therefore, the intermediate features in RPN can be used to distinguish foreground features from background features. SAM accepts two inputs: the intermediate feature map $\mathcal{F}^{RPN}$ from RPN and the thin feature map $\mathcal{F}^{CEM}$ from the Context Enhancement Module. The output of SAM is defined as: $$\mathcal{F}^{SAM} = \mathcal{F}^{CEM} \cdot \sigma(\mathcal{T}(\mathcal{F}^{RPN}))$$ Here $\mathcal{T}(\cdot)$ is a dimension transformation to match the number of channels in both feature maps. The sigmoid function is used to constrain the values within $[0, 1]$. At last, $\mathcal{F}^{CEM}$ is re-weighted by the generated feature map for better feature distribution. For computational efficiency, we simply apply a 1×1 convolution as $\mathcal{T}(\cdot)$, so the computational cost of SAM is negligible. The Figure to the right shows the structure of SAM. SAM has two functions. The first one is to refine the feature distribution by strengthening foreground features and suppressing background features. The second one is to stabilize the training of RPN, as SAM enables extra gradient flow from the R-CNN subnet to RPN. As a result, RPN receives additional supervision from the R-CNN subnet, which helps the training of RPN.
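A minimal NumPy sketch of the re-weighting step, with the 1×1 convolution written as a per-pixel channel-mixing matrix `w` (shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sam(f_cem, f_rpn, w):
    """f_cem: (C, H, W) thin feature map from CEM; f_rpn: (C_rpn, H, W) RPN
    features; w: (C, C_rpn) weights of the 1x1 conv matching channel counts."""
    t = np.einsum('ck,khw->chw', w, f_rpn)  # 1x1 convolution over channels
    attn = sigmoid(t)                       # attention values in (0, 1)
    return f_cem * attn                     # spatial re-weighting of f_cem
```

With zero weights the attention map is uniformly 0.5, i.e. no region is preferred; training shapes `w` so that foreground locations receive weights near 1 and background locations near 0.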
A TridentNet Block is a feature extractor used in object detection models. Instead of feeding in multi-scale inputs like the image pyramid, in a TridentNet block we adapt the backbone network for different scales. These blocks create multiple scale-specific feature maps. With the help of dilated convolutions, different branches of trident blocks have the same network structure and share the same parameters yet have different receptive fields. Furthermore, to avoid training objects with extreme scales, a scale-aware training scheme is employed to make each branch specific to a given scale range matching its receptive field. Weight sharing is used to prevent overfitting.
BriVL is a two-tower pre-training model proposed within the cross-modal contrastive learning framework. A cross-modal pre-training model is defined based on the image-text retrieval task. The main goal is thus to learn two encoders that can embed image and text samples into the same space for effective image-text retrieval. To enforce such cross-modal embedding learning, contrastive learning with the InfoNCE loss is introduced into the BriVL model. Given a text embedding, the learning objective aims to find the best image embedding from a batch of image embeddings. Similarly, for a given image embedding, the learning objective is to find the best text embedding from a batch of text embeddings. The pre-training model learns a cross-modal embedding space by jointly training the image and text encoders to maximize the cosine similarity of the image and text embeddings of the true pair for each sample in the batch while minimizing the cosine similarity of the embeddings of the other incorrect pairs.
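The symmetric InfoNCE objective described above can be sketched in NumPy; the batch size, embedding dimension and temperature value below are illustrative assumptions:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch: matched (i, i) image-text pairs are
    positives; all other pairs in the batch serve as negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # cosine similarities, scaled
    labels = np.arange(len(img))

    def ce(lg):
        # cross-entropy with the diagonal (true pair) as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return (ce(logits) + ce(logits.T)) / 2
```

The loss is low when each image embedding is most similar to its paired text embedding, and high when the pairing is scrambled.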
Short-Term Dense Concatenate
STDC, or Short-Term Dense Concatenate, is a module for semantic segmentation that extracts deep features with a scalable receptive field and multi-scale information. It aims to remove structural redundancy in the BiSeNet architecture; specifically, BiSeNet adds an extra path to encode spatial information, which can be time-consuming. Instead, STDC gradually reduces the dimension of feature maps and uses their aggregation for image representation. Response maps from multiple continuous layers are concatenated, each of which encodes the input image/feature at different scales and receptive fields, leading to a multi-scale feature representation. To speed up, the filter size of layers is gradually reduced with negligible loss in segmentation performance.
A Neural Cache, or a Continuous Cache, is a module for language modelling which stores previous hidden states in memory cells. They are then used as keys to retrieve their corresponding word, that is, the next word. There is no transformation applied to the storage during writing and reading. More formally, it exploits the hidden representations to define a probability distribution over the words in the cache. As illustrated in the Figure, the cache stores pairs $(h_i, x_{i+1})$ of a hidden representation and the word which was generated based on this representation (the vector $h_i$ encodes the history $x_1, \dots, x_i$). At time $t$, we then define a probability distribution over words stored in the cache based on the stored hidden representations and the current one $h_t$ as: $$p_{cache}(w \mid h_{1..t}, x_{1..t}) \propto \sum_{i=1}^{t-1} \mathbb{1}\{w = x_{i+1}\} \exp(\theta\, h_t^\top h_i)$$ where the scalar $\theta$ is a parameter which controls the flatness of the distribution. When $\theta$ is equal to zero, the probability distribution over the history is uniform, and the model is equivalent to a unigram cache model.
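A minimal NumPy sketch of the cache distribution; the vocabulary handling and the variable names are illustrative:

```python
import numpy as np

def cache_distribution(hiddens, words, h_t, theta, vocab_size):
    """hiddens: stored hidden states h_1..h_{t-1} (each of shape (d,));
    words: the word generated right after each stored state;
    theta: scalar controlling the flatness of the distribution."""
    scores = np.array([theta * np.dot(h_t, h_i) for h_i in hiddens])
    weights = np.exp(scores - scores.max())  # softmax over cache slots
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for w_i, p_i in zip(words, weights):
        p[w_i] += p_i  # the same word may occupy several cache slots
    return p
```

Setting `theta = 0` makes every cache slot equally likely, recovering the unigram cache model mentioned above.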
Cross-encoder Reranking
The focal self-attention is built to make Transformer layers scalable to high-resolution inputs. Instead of attending to all tokens at fine grain, the approach attends to fine-grain tokens only locally, but to the summarized ones globally. As such, it can cover as many regions as standard self-attention but with much less cost. An image is first partitioned into patches, resulting in visual tokens. A patch embedding layer, consisting of a convolutional layer whose filter and stride have the same size, then projects the patches into hidden features. This spatial feature map is then passed to four stages of focal Transformer blocks. Each focal Transformer block consists of multiple focal Transformer layers. Patch embedding layers are used in between to reduce the spatial size of the feature map by a factor of 2, while the feature dimension is increased by a factor of 2.
Graph Contrastive Coding is a self-supervised graph neural network pre-training framework to capture the universal network topological properties across multiple networks. GCC's pre-training task is designed as subgraph instance discrimination in and across networks and leverages contrastive learning to empower graph neural networks to learn the intrinsic and transferable structural representations.
Mixture Normalization is a normalization technique that relies on an approximation of the probability density function of the internal representations. Any continuous distribution can be approximated with arbitrary precision using a Gaussian Mixture Model (GMM). Hence, instead of computing one set of statistical measures from the entire population (of instances in the mini-batch) as Batch Normalization does, Mixture Normalization works on sub-populations which can be identified by disentangling modes of the distribution, estimated via GMM. While BN can only scale and/or shift the whole underlying probability density function, Mixture Normalization operates like a soft piecewise normalizing transform, capable of completely re-structuring the data distribution by independently scaling and/or shifting individual modes of the distribution.
Feedback Memory is a type of attention module used in the Feedback Transformer architecture. It allows a transformer to use the most abstract representations from the past directly as inputs for the current timestep. This means that the model does not form its representation in parallel, but sequentially token by token. More precisely, we replace the context inputs to attention modules with memory vectors that are computed over the past, i.e.: $$x_t^{l+1} = \text{Attn}(x_t^l, \{m_{t-\tau}, \dots, m_{t-1}\})$$ where a memory vector $m_t$ is computed by summing the representations of each layer at the $t$-th time step: $$m_t = \sum_{l=0}^{L} \text{softmax}(w)_l\, x_t^l$$ where $w^l$ are learnable scalar parameters. Here $x_t^0$ corresponds to token embeddings. The weighting of different layers by a softmax output gives the model more flexibility as it can average them or select one of them. This modification of the self-attention input adapts the computation of the Transformer from parallel to sequential, summarized in the Figure. Indeed, it gives layer $l$ the ability to formulate its representation based on past representations from any layer $l'$, while in a standard Transformer this is only true for $l' < l$. This change can be viewed as exposing all previous computations to all future computations, providing better representations of the input. Such capacity would allow much shallower models to capture the same level of abstraction as a deeper architecture.
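The memory-vector computation can be sketched in NumPy (the layer count and dimensions below are illustrative):

```python
import numpy as np

def memory_vector(layer_states, w):
    """layer_states: (L+1, d) array of representations x_t^0..x_t^L at one
    time step, where x_t^0 is the token embedding; w: (L+1,) learnable scalars."""
    sw = np.exp(w - w.max())
    sw /= sw.sum()          # softmax weights over layers
    return sw @ layer_states  # weighted average of the layer representations
```

With equal scalars the memory averages all layers; a single large scalar makes the softmax select one layer, which is the flexibility the text refers to.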
DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement
Documents often exhibit various forms of degradation, which make them hard to read and substantially deteriorate the performance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement Generative Adversarial Networks (DE-GAN) that uses conditional GANs (cGANs) to restore severely degraded document images. To the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We demonstrate that, in different tasks (document clean up, binarization, deblurring and watermark removal), DE-GAN can produce an enhanced version of the degraded document with high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The obtained results on a wide variety of degradations reveal the flexibility of the proposed model to be exploited in other document enhancement problems.
TD-Gammon is a game-learning architecture for playing backgammon. It involves the use of a learning algorithm and a feedforward neural network. Credit: Temporal Difference Learning and TD-Gammon
ThunderNet is a two-stage object detection model. The design of ThunderNet aims at the computationally expensive structures in state-of-the-art two-stage detectors. The backbone utilises a ShuffleNetV2 inspired network called SNet designed for object detection. In the detection part, ThunderNet follows the detection head design in Light-Head R-CNN, and further compresses the RPN and R-CNN subnet. To eliminate the performance degradation induced by small backbones and small feature maps, ThunderNet uses two new efficient architecture blocks, Context Enhancement Module (CEM) and Spatial Attention Module (SAM). CEM combines the feature maps from multiple scales to leverage local and global context information, while SAM uses the information learned in RPN to refine the feature distribution in RoI warping.
Sensor Dropout or SensD
A method that randomly masks out all features coming from a specific sensor in multi-sensor models for Earth observation. Depending on the fusion strategy, the masking can be done at the input, feature or decision level.
Multiscale Attention ViT with Late fusion
Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained with aligned image-text pairs, capable of performing targeted detection using human-understandable natural language text queries. It utilizes multi-scale image features and uses deformable convolutions with late multi-modal fusion. The authors demonstrate the excellent ability of MAVL as a class-agnostic object detector when queried using general human-understandable natural language commands, such as "all objects" and "all entities".
Inverse Q-Learning
Inverse Q-Learning (IQ-Learn) is a simple, stable and data-efficient framework for Imitation Learning (IL) that directly learns soft Q-functions from expert data. IQ-Learn enables non-adversarial imitation learning, working in both offline and online IL settings. It is performant even with very sparse expert data, and scales to complex image-based environments, surpassing prior methods by more than 3x. It is very simple to implement, requiring 15 lines of code on top of existing RL methods. <span class="description-source">Source: IQ-Learn: Inverse soft Q-Learning for Imitation</span>
SKNet is a type of convolutional neural network that employs selective kernel units, with selective kernel convolutions, in its architecture. This allows for a type of attention where the network can learn to attend to different receptive fields.
Accordion is a gradient communication scheduling algorithm that is generic across models while imposing low computational overheads. Accordion inspects the change in the gradient norms to detect critical regimes and adjusts the communication schedule dynamically. Accordion works for adjusting either the gradient compression rate or the batch size, without additional parameter tuning.
Rank-based loss
Local Relation Network
The Local Relation Network (LR-Net) is a network built with local relation layers, which together act as an image feature extractor. This feature extractor adaptively determines aggregation weights based on the compositional relationship of local pixel pairs.
Pansharpening Network
We propose a deep network architecture for the pansharpening problem called PanNet. We incorporate domain-specific knowledge to design our PanNet architecture by focusing on the two aims of the pan-sharpening problem: spectral and spatial preservation. For spectral preservation, we add up-sampled multispectral images to the network output, which directly propagates the spectral information to the reconstructed image. To preserve the spatial structure, we train our network parameters in the high-pass filtering domain rather than the image domain. We show that the trained network generalizes well to images from different satellites without needing retraining. Experiments show significant improvement over state-of-the-art methods visually and in terms of standard quality metrics.
Low Variance Regularization
This method introduces a novel unlabeled debiasing technique that reduces the bias of Transformer-based language models on downstream classification tasks. The authors use the classes as a metric for regularization and penalize the network if the embeddings produced by the model are far from each other. By doing so, the authors claim to reduce the domain shift caused by unwanted attribute information, resulting in fairer embeddings.
Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model
Position-Sensitive RoIAlign is a position-sensitive version of RoIAlign - i.e. it performs selective alignment, allowing the network to learn position-sensitive region-of-interest alignment.
Parametric UMAP extends UMAP, a non-parametric graph-based dimensionality reduction algorithm, by replacing its second step with a parametric optimization over neural network weights, learning a parametric relationship between data and embedding.
Implicit Graph Contrastive Learning
MUSIQ, or Multi-scale Image Quality Transformer, is a Transformer-based model for multi-scale image quality assessment. It processes native resolution images with varying sizes and aspect ratios. In MUSIQ, we construct a multi-scale image representation as input, including the native resolution image and its ARP resized variants. Each image is split into fixed-size patches which are embedded by a patch encoding module (blue boxes). To capture 2D structure of the image and handle images of varying aspect ratios, the spatial embedding is encoded by hashing the patch position to within a grid of learnable embeddings (red boxes). Scale Embedding (green boxes) is introduced to capture scale information. The Transformer encoder takes the input tokens and performs multi-head self-attention. To predict the image quality, MUSIQ follows a common strategy in Transformers to add an [CLS] token to the sequence to represent the whole multi-scale input and the corresponding Transformer output is used as the final representation.
UNIMO is a multi-modal pre-training architecture that can effectively adapt to both single-modal and multi-modal understanding and generation tasks. UNIMO learns visual representations and textual representations simultaneously, and unifies them into the same semantic space via cross-modal contrastive learning (CMCL) based on a large-scale corpus of image collections, text corpus and image-text pairs. The CMCL aligns the visual and textual representations and unifies them into the same semantic space based on image-text pairs.
Discriminative and Generative Network
Meta-augmentation helps generate more varied tasks for a single example in meta-learning. It can be distinguished from data augmentation in classic machine learning as follows. For data augmentation in classical machine learning, the aim is to generate more varied examples, within a single task. Meta-augmentation has the exact opposite aim: we wish to generate more varied tasks, for a single example, to force the learner to quickly learn a new task from feedback. In meta-augmentation, adding randomness discourages the base learner and model from learning trivial solutions that do not generalize to new tasks.
Adaptive Nesterov Momentum
CRF-RNN is a formulation of a CRF as a Recurrent Neural Network. Specifically it formulates mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks.
Boundary-Aware Segmentation Network
BASNet, or Boundary-Aware Segmentation Network, is an image segmentation architecture comprising a predict-refine architecture and a hybrid loss, for highly accurate image segmentation. The predict-refine architecture consists of a densely supervised encoder-decoder network and a residual refinement module, which are respectively used to predict and refine a segmentation probability map. The hybrid loss is a combination of the binary cross entropy, structural similarity and intersection-over-union losses, which guide the network to learn three-level (i.e., pixel-, patch- and map-level) hierarchy representations.
Hierarchical Transferability Calibration Network
Hierarchical Transferability Calibration Network (HTCN) is an adaptive object detector that hierarchically (local-region/image/instance) calibrates the transferability of feature representations for harmonizing transferability and discriminability. The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the complementary effect between the instance-level feature and the global context information for the instance-level feature alignment; (3) local feature masks that calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment.
Teacher-Tutor-Student Knowledge Distillation is a method for image virtual try-on models. It treats fake images produced by the parser-based method as "tutor knowledge", where the artifacts can be corrected by real "teacher knowledge", which is extracted from the real person images in a self-supervised way. Other than using real images as supervisions, knowledge distillation is formulated in the try-on problem as distilling the appearance flows between the person image and the garment image, enabling the finding of dense correspondences between them to produce high-quality results.
Decorrelated Batch Normalization (DBN) is a normalization technique which not just centers and scales activations but whitens them. ZCA whitening instead of PCA whitening is employed since PCA whitening causes a problem called stochastic axis swapping, which is detrimental to learning.
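A minimal NumPy sketch of the ZCA whitening step, contrasted in the comments with PCA whitening (`eps` is an illustrative numerical-stability constant):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Whiten rows of X (n_samples, n_features) with ZCA: W = U diag(1/sqrt(s)) U^T.
    PCA whitening would instead use W = diag(1/sqrt(s)) U^T, which rotates the data
    into the eigenbasis; because eigenvector ordering can change between mini-batches,
    that rotation causes the stochastic axis swapping mentioned above.
    ZCA rotates back with U, keeping the axes aligned with the input."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    s, U = np.linalg.eigh(cov)
    W = U @ np.diag(1.0 / np.sqrt(s + eps)) @ U.T
    return Xc @ W
```

After whitening, the features are centered, unit-variance and mutually decorrelated, i.e. the sample covariance is (approximately) the identity.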
SimCLRv2 is a semi-supervised learning method for learning from few labeled examples while making best use of a large amount of unlabeled data. It is a modification of a recently proposed contrastive learning framework, SimCLR. It improves upon it in three major ways: 1. To fully leverage the power of general pre-training, larger ResNet models are explored. Unlike SimCLR and other previous work, whose largest model is ResNet-50 (4×), SimCLRv2 trains models that are deeper but less wide. The largest model trained is a 152-layer ResNet with 3× wider channels and selective kernels (SK), a channel-wise attention mechanism that improves the parameter efficiency of the network. By scaling up the model from ResNet-50 to ResNet-152 (3×+SK), a 29% relative improvement is obtained in top-1 accuracy when fine-tuned on 1% of labeled examples. 2. The capacity of the non-linear network (a.k.a. projection head) is increased, by making it deeper. Furthermore, instead of throwing the projection head away entirely after pre-training as in SimCLR, fine-tuning occurs from a middle layer. This small change yields a significant improvement for both linear evaluation and fine-tuning with only a few labeled examples. Compared to SimCLR with a 2-layer projection head, using a 3-layer projection head and fine-tuning from the 1st layer of the projection head results in as much as a 14% relative improvement in top-1 accuracy when fine-tuned on 1% of labeled examples. 3. The memory mechanism of MoCo v2 is incorporated, which designates a memory network (with a moving average of weights for stabilization) whose output will be buffered as negative examples. Since training is based on large mini-batches which already supply many contrasting negative examples, this change yields an improvement of ∼1% for linear evaluation as well as when fine-tuning on 1% of labeled examples.
RotNet is a self-supervision approach that relies on predicting image rotations as the pretext task in order to learn image representations.
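A minimal NumPy sketch of the rotation pretext task, generating (rotated image, rotation class) training pairs; the four-way rotation labels follow the RotNet setup, while the function and variable names are illustrative:

```python
import numpy as np

def make_rotation_task(images, rng):
    """Given a batch of square images (N, H, W), return rotated copies and the
    rotation class in {0: 0 deg, 1: 90 deg, 2: 180 deg, 3: 270 deg} as the
    self-supervised pretext label."""
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k) for img, k in zip(images, labels)])
    return rotated, labels
```

A classifier trained to predict the label from the rotated image must learn the canonical orientation of objects, which is what makes the learned representations useful for downstream tasks.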
Fastformer is a type of Transformer which uses additive attention as a building block. Instead of modeling the pair-wise interactions between tokens, additive attention is used to model global contexts, and then each token representation is further transformed based on its interaction with the global context representations.
ComplEx with N3 Regularizer
A ComplEx model trained with a nuclear 3-norm (N3) regularizer.
Pyramid Vision Transformer v2
Pyramid Vision Transformer v2 (PVTv2) is a type of Vision Transformer for detection and segmentation tasks. It improves on PVTv1 through several design improvements: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear complexity attention layers that are orthogonal to the PVTv1 framework.
Local Patch Interaction, or LPI, is a module used for the XCiT layer to enable explicit communication across patches. LPI consists of two depth-wise 3×3 convolutional layers with Batch Normalization and GELU non-linearity in between. Due to its depth-wise structure, the LPI block has a negligible overhead in terms of parameters, as well as a limited overhead in terms of throughput and memory usage during inference.