8,725 machine learning methods and techniques
Matrix Non-Maximum Suppression
Matrix NMS, or Matrix Non-Maximum Suppression, performs non-maximum suppression with parallel matrix operations in one shot. It is motivated by Soft-NMS, which decays the other detection scores as a monotonically decreasing function $f(\text{iou})$ of their overlaps. By decaying the scores according to IoUs recursively, higher-IoU detections are eliminated with a minimum score threshold. However, this process is sequential, like traditional Greedy NMS, and cannot be implemented in parallel. Matrix NMS views the process from another perspective by considering how a predicted mask $m_j$ is suppressed. For $m_j$, its decay factor is affected by: (a) the penalty of each prediction $m_i$ on $m_j$ ($s_i > s_j$), where $s_i$ and $s_j$ are the confidence scores; and (b) the probability of $m_i$ being suppressed. For (a), the penalty of each prediction $m_i$ on $m_j$ can be computed directly as $f(\text{iou}_{i,j})$. For (b), the probability of $m_i$ being suppressed is harder to compute. However, it usually correlates positively with the IoUs, so it is approximated by the most overlapped prediction on $m_i$ as $f(\text{iou}_{\cdot,i}) = \min_{\forall s_k > s_i} f(\text{iou}_{k,i})$. The final decay factor becomes $\text{decay}_j = \min_{\forall s_i > s_j} \frac{f(\text{iou}_{i,j})}{f(\text{iou}_{\cdot,i})}$, and the updated score is computed by $s_j = s_j \cdot \text{decay}_j$. The authors consider the two simplest decreasing functions: linear, $f(\text{iou}_{i,j}) = 1 - \text{iou}_{i,j}$, and Gaussian, $f(\text{iou}_{i,j}) = \exp\left(-\text{iou}_{i,j}^2 / \sigma\right)$.
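The one-shot score decay described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `matrix_nms`, the default `sigma`, and the assumption that a symmetric pairwise IoU matrix has already been computed are all choices made for this example.

```python
import numpy as np

def matrix_nms(ious, scores, kernel="gaussian", sigma=0.5):
    """Decay all scores in one shot.

    ious   : (N, N) symmetric pairwise IoU matrix (assumed precomputed)
    scores : (N,) confidence scores
    """
    order = np.argsort(-scores)                      # highest score first
    # After sorting, entry (i, j) with i < j pairs a higher-scoring i with j
    iou = np.triu(ious[np.ix_(order, order)], k=1)
    # Largest IoU of each prediction i with any higher-scoring prediction
    compensate = iou.max(axis=0)[:, None]
    if kernel == "gaussian":
        decay = np.exp(-(iou ** 2 - compensate ** 2) / sigma)
    else:  # linear kernel; assumes no pairwise IoU is exactly 1
        decay = (1.0 - iou) / (1.0 - compensate)
    decay = decay.min(axis=0)                        # strongest suppression per prediction
    decayed = np.empty_like(scores, dtype=float)
    decayed[order] = scores[order] * decay           # restore the input ordering
    return decayed
```

Predictions whose decayed score falls below a chosen threshold would then be discarded; the threshold itself is left to the caller.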
MobileViTv2 is a vision transformer tuned for mobile devices. It introduces a separable self-attention method that reduces cost compared with MobileViT.
Meta Pseudo Labels is a semi-supervised learning method that uses a teacher network to generate pseudo labels on unlabeled data to teach a student network. The teacher receives feedback from the student to inform the teacher to generate better pseudo labels. This feedback signal is used as a reward to train the teacher throughout the course of the student’s learning.
A Multiscale Dilated Convolution Block is an Inception-style convolutional block motivated by the ideas that image features naturally occur at multiple scales, that a network’s expressivity is proportional to the range of functions it can represent divided by its total number of parameters, and by the desire to efficiently expand a network’s receptive field. The Multiscale Dilated Convolution (MDC) block applies a single filter at multiple dilation factors, then performs a weighted elementwise sum of each dilated filter’s output, allowing the network to simultaneously learn a set of features and the relevant scales at which those features occur with a minimal increase in parameters. This also rapidly expands the network’s receptive field without requiring an increase in depth or the number of parameters.
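The idea of sharing one filter across several dilation factors can be illustrated with a 1-D NumPy sketch. The helper names `dilated_conv1d` and `mdc_block`, the zero padding, and the restriction to odd-length filters are assumptions of this example; the actual block operates on image feature maps with learned scale weights.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Same-size 1-D correlation of x with an odd-length filter w, zero padded."""
    half = (len(w) // 2) * dilation
    padded = np.pad(x, half)
    out = np.zeros(len(x))
    for k, wk in enumerate(w):
        # Tap k reads the input at offset (k - len(w) // 2) * dilation
        out += wk * padded[k * dilation : k * dilation + len(x)]
    return out

def mdc_block(x, w, dilations, scale_weights):
    """Weighted elementwise sum of one shared filter applied at several dilations."""
    return sum(a * dilated_conv1d(x, w, d) for a, d in zip(scale_weights, dilations))
```

Only the per-scale weights are added on top of the single filter `w`, which is why the parameter count grows so little with the number of scales.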
Graph Recurrent Imputation Network
GBST
GBST, or Gradient-based Subword Tokenization Module, is a soft gradient-based subword tokenization module that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns a position-wise soft selection over them by scoring each candidate with a block scoring network. In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which enables easy inspection of lexical representations and is more efficient than other byte-based models.
Minibatch Discrimination is a discriminative technique for generative adversarial networks where we discriminate between whole minibatches of samples rather than between individual samples. This is intended to avoid collapse of the generator.
Pathology Language and Image Pre-Training
Pathology Language and Image Pre-Training (PLIP) is a vision-and-language foundation model created by fine-tuning CLIP on pathology images.
Co-Scale Conv-attentional Image Transformer
Co-Scale Conv-Attentional Image Transformer (CoaT) is a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities.
Scattering Transform
A wavelet scattering transform computes a translation invariant representation, which is stable to deformation, using a deep convolution network architecture. It computes non-linear invariants with modulus and averaging pooling functions. It helps to eliminate the image variability due to translation and is stable to deformations. Image source: Bruna and Mallat
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results show the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.
TextGrad is a powerful framework that builds "automatic differentiation" via text. TextGrad implements backpropagation through text feedback provided by LLMs, building strongly on the gradient metaphor.
Spectral Clustering
Spectral clustering aims to partition the data points into $k$ clusters using the spectrum of the graph Laplacian. Given a dataset $X$ with $n$ data points, the spectral clustering algorithm first constructs a similarity matrix $W$, where $w_{ij}$ indicates the similarity between data points $x_i$ and $x_j$ via a similarity measure. Let $L = D - W$, where $L$ is called the graph Laplacian and $D$ is a diagonal matrix with $d_{ii} = \sum_{j} w_{ij}$. The objective function of spectral clustering can be formulated based on the graph Laplacian as follows: \begin{equation} \label{eq:SCobj} \min_{U} \operatorname{tr}\left(U^{T} L U\right) \quad \text{s.t.} \quad U^{T} U = I, \end{equation} where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The rows of the matrix $U$ are the low-dimensional embedding of the original data points. Generally, spectral clustering computes $U$ as the bottom $k$ eigenvectors of $L$, and finally applies $k$-means on $U$ to obtain the clustering results.
Large-scale Spectral Clustering
To capture the relationship between all data points in $X$, conventional spectral clustering must construct an $n \times n$ similarity matrix, which costs $O(n^{2}d)$ time and $O(n^{2})$ memory for $d$-dimensional data and is not feasible for large-scale clustering tasks. Instead of a full similarity matrix, many accelerated spectral clustering methods use a similarity sub-matrix that represents each data point by the cross-similarity between data points and a set of representative data points (i.e., landmarks) via some similarity measure, as \begin{equation} \label{eq: cross-similarity} B = \Phi(X,R), \end{equation} where $R$ is a set of $p$ landmarks ($p \ll n$) with the same dimension as $X$, $\Phi$ indicates a similarity measure, and $B$ is the $n \times p$ similarity sub-matrix representing $X$ with respect to $R$. For large-scale spectral clustering using such a similarity matrix, a symmetric similarity matrix $W$ can be designed as \begin{equation} \label{eq: WusedB } W=\left[\begin{array}{ll} \mathbf{0} & B \\ B^{T} & \mathbf{0} \end{array}\right]. \end{equation} The size of the matrix $W$ is $(n+p) \times (n+p)$. Taking advantage of the bipartite structure, some fast eigen-decomposition methods can then be used to obtain the spectral embedding. Finally, $k$-means is conducted on the embedding to obtain the clustering results. The clustering result is directly related to the quality of $B$, which consists of the similarities between data points and landmarks. Thus, the performance of landmark selection is crucial to the clustering result.
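One common way to exploit the bipartite structure is to take a singular value decomposition of the degree-normalized cross-similarity matrix instead of eigen-decomposing the full symmetric matrix. The NumPy sketch below illustrates that route under stated assumptions: the Gaussian similarity, the `gamma` parameter, the degree normalization, and the function name `landmark_spectral_embedding` are choices made for this example and are not prescribed by the text above. Landmark selection and the final k-means step are left out.

```python
import numpy as np

def landmark_spectral_embedding(X, R, k, gamma=1.0):
    """Spectral embedding of X from its cross-similarity to landmarks R.

    X : (n, d) data points, R : (p, d) landmarks, k : embedding dimension.
    """
    # Cross-similarity between points and landmarks (Gaussian kernel is an assumed choice)
    sq_dist = ((X[:, None, :] - R[None, :, :]) ** 2).sum(axis=-1)
    B = np.exp(-gamma * sq_dist)                       # shape (n, p)
    # Node degrees of the bipartite graph built from B
    deg_points = B.sum(axis=1)
    deg_landmarks = B.sum(axis=0)
    B_norm = B / np.sqrt(deg_points)[:, None] / np.sqrt(deg_landmarks)[None, :]
    # SVD of the (n, p) matrix stands in for eigen-decomposition of the (n+p, n+p) matrix
    U, s, _ = np.linalg.svd(B_norm, full_matrices=False)
    return U[:, :k], s[:k]
```

The rows of the returned matrix can then be clustered with any k-means implementation. The leading singular vectors of the normalized sub-matrix correspond to the bottom eigenvectors of the normalized Laplacian of the bipartite graph, which is why the largest singular values are kept here.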
The Lovasz-Softmax loss is a loss function for multiclass semantic segmentation that incorporates the softmax operation in the Lovasz extension. The Lovasz extension is a means by which we can achieve direct optimization of the mean intersection-over-union loss in neural networks.
Recently, dense connections have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. In particular, DenseNet, which connects each layer to every other layer in a feed-forward fashion, has shown impressive performance in natural image classification tasks. We propose HyperDenseNet, a 3-D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between the pairs of layers within the same path but also between those across different paths. This contrasts with the existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. Therefore, the proposed network has total freedom to learn more complex combinations between the modalities, within and in-between all the levels of abstraction, which significantly enriches the learned representation. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on six-month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of feature re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning.
Dynamic Range Activator
Recursive functions with heteroscedasticity and sparse, high-variance target distributions introduce considerable complexity that makes their accurate modeling with neural networks a difficult task. A main property of recursive maps (e.g., the factorial function) is their dramatic growth and drop. Learning this recursive behavior requires not only fitting high-frequency patterns within a bounded region but also successfully extrapolating those patterns beyond that region. In time series prediction tasks, capturing even periodic behavior is a challenge. Various methods have been employed to model periodic patterns effectively. However, these approaches typically deal with uni-modal data that also exhibit relatively low variance in both In-Distribution (ID) and Out-Of-Distribution (OOD) regions, and they do not generalize well to recursive problems with the high variance observed in our context. Thus, to enable Transformers to capture such behavior and perform proper inference for multi-modal recursive problems, we enhance them by introducing the Dynamic Range Activator (DRA). The DRA is designed to handle the recursive and factorial growth properties inherent in enumerative problems with minimal computational overhead and can be integrated into existing neural networks without requiring significant architectural changes. DRA integrates both harmonic and hyperbolic components as follows, \begin{equation} \mathrm{DRA}(x) := x + a \sin^2\left(\frac{x}{b}\right) + c \cos(bx) + d \tanh(bx) \,, \end{equation} where $a$, $b$, $c$, $d$ are learnable parameters. It allows the function to simultaneously model periodic data (through sine and cosine) and rapid growth or attenuation (through the hyperbolic tangent).
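The activation formula can be written directly as a scalar Python function. This is a sketch for illustration only: the default values of `a`, `b`, `c`, `d` are placeholders chosen here, whereas in a network they would be learnable parameters.

```python
import math

def dra(x, a=1.0, b=1.0, c=1.0, d=1.0):
    """DRA(x) = x + a*sin^2(x/b) + c*cos(b*x) + d*tanh(b*x).

    The defaults for a, b, c, d are illustrative placeholders, not values from the source.
    """
    harmonic = a * math.sin(x / b) ** 2 + c * math.cos(b * x)   # periodic part
    hyperbolic = d * math.tanh(b * x)                           # saturating growth/attenuation part
    return x + harmonic + hyperbolic
```

Because the harmonic and hyperbolic terms are bounded, the output stays within `|a| + |c| + |d|` of the identity, so the linear term carries the extrapolation outside the training range.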
Bi-Directional Graph Convolutional Network
2D Discrete Wavelet Transform
Generalized Focal Loss (GFL) is a loss function for object detection that combines Quality Focal Loss and Distribution Focal Loss into a general form.
An XCiT Layer is the main building block of the XCiT architecture, which uses a cross-covariance attention operator as its principal operation. The XCiT layer consists of three main blocks, each preceded by LayerNorm and followed by a residual connection: (i) the core cross-covariance attention (XCA) operation, (ii) the local patch interaction (LPI) module, and (iii) a feed-forward network (FFN). By transposing the query-key interaction, the computational complexity of XCA is linear in the number of data elements N, rather than quadratic as in conventional self-attention.
Spatial and Channel SE Blocks
To aggregate global spatial information, an SE block applies global pooling to the feature map. However, it ignores pixel-wise spatial information, which is important in dense prediction tasks. Therefore, Roy et al. proposed spatial and channel SE blocks (scSE). Like BAM, spatial SE blocks are used, complementing SE blocks, to provide spatial attention weights to focus on important regions. Given the input feature map $X$, two parallel modules, spatial SE and channel SE, are applied to feature maps to encode spatial and channel information respectively. The channel SE module is an ordinary SE block, while the spatial SE module adopts $1 \times 1$ convolution for spatial squeezing. The outputs from the two modules are fused. The overall process can be written as \begin{align} s_c &= \sigma(W_{2}\, \delta(W_{1}\, \text{GAP}(X))) \\ X_{\text{chn}} &= s_c X \\ s_s &= \sigma(\text{Conv}^{1\times 1}(X)) \\ X_{\text{spa}} &= s_s X \\ Y &= f(X_{\text{spa}}, X_{\text{chn}}) \end{align} where $f$ denotes the fusion function, which can be maximum, addition, multiplication or concatenation. The proposed scSE block combines channel and spatial attention to enhance features as well as capturing pixel-wise spatial information. Segmentation tasks are greatly benefited as a result. The integration of an scSE block in F-CNNs makes a consistent improvement in semantic segmentation at negligible extra cost.
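The two parallel gates and their fusion can be sketched in NumPy for a single feature map. Several details here are assumptions of the example rather than statements from the text: the inner nonlinearity is taken to be ReLU, biases are omitted, the 1x1 convolution is reduced to a per-channel weight vector `w_conv`, and the function name `scse` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scse(X, W1, W2, w_conv, fuse=np.maximum):
    """X: (C, H, W) feature map; W1: (C//r, C); W2: (C, C//r); w_conv: (C,)."""
    # Channel SE: global average pooling -> FC -> ReLU (assumed) -> FC -> sigmoid
    gap = X.mean(axis=(1, 2))                          # (C,)
    s_c = sigmoid(W2 @ np.maximum(W1 @ gap, 0.0))      # (C,) channel gates
    X_chn = s_c[:, None, None] * X
    # Spatial SE: 1x1 convolution down to one channel -> sigmoid
    s_s = sigmoid(np.tensordot(w_conv, X, axes=1))     # (H, W) spatial gates
    X_spa = s_s[None, :, :] * X
    # Fusion: elementwise maximum by default; np.add or np.multiply also work
    return fuse(X_spa, X_chn)
```

Concatenation as a fusion function would change the output channel count, so it is not covered by the `fuse` argument in this sketch.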
Siamese U-Net model with a pre-trained ResNet34 architecture as an encoder for data efficient Change Detection
Social-STGCNN is a method for human trajectory prediction. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects.
Symbolic rule learning methods find regularities in data that can be expressed in the form of 'if-then' rules based on symbolic representations of the data.
Composed Video Retrieval
The composed video retrieval (CoVR) task is a new task, where the goal is to find a video that matches both a query image and a query text. The query image represents a visual concept that the user is interested in, and the query text specifies how the concept should be modified or refined. For example, given an image of a fountain and the text "during show at night", the CoVR task is to retrieve a video that shows the fountain at night with a show.
Big-Little Modules are blocks for image models that have two branches, one representing a block from a deep model and the other a less deep counterpart. They were proposed as part of the BigLittle-Net architecture. The two branches, known as Big-Branch (more layers and channels at low resolution) and Little-Branch (fewer layers and channels at high resolution), are fused with a linear combination with unit weights.
VisuoSpatial Foresight
VisuoSpatial Foresight is a method for robotic fabric manipulation that leverages a combination of RGB and depth information to learn goal conditioned fabric manipulation policies for a variety of long horizon tasks.
Self-adaptive Training is a training algorithm that dynamically corrects problematic training labels by model predictions to improve generalization of deep learning for potentially corrupted training data. Accumulated predictions are used to augment the training dynamics. The use of an exponential-moving-average scheme alleviates the instability issue of model predictions, smooths out the training target during the training process and enables the algorithm to completely change the training labels if necessary.
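The exponential-moving-average update of the training targets can be sketched as below. The function name `update_targets` and the momentum value `alpha=0.9` are assumptions made for this example; the targets are assumed to start as the one-hot (possibly noisy) labels.

```python
import numpy as np

def update_targets(targets, probs, alpha=0.9):
    """One EMA step: blend the current soft targets with the model's predicted probabilities.

    targets : (n, num_classes) current soft targets
    probs   : (n, num_classes) model predictions for the same samples
    alpha   : momentum (illustrative value, not taken from the source)
    """
    return alpha * targets + (1.0 - alpha) * probs

# Usage: a sample labelled as class 0 that the model consistently assigns to class 1
targets = np.array([[1.0, 0.0]])
probs = np.array([[0.1, 0.9]])
for _ in range(20):
    targets = update_targets(targets, probs)
```

Each step moves the target only slightly, which smooths out unstable predictions, yet repeated agreement from the model can eventually flip the label entirely.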
Pixel Recurrent Neural Network
PixelRNNs are generative neural networks that sequentially predict the pixels in an image along the two spatial dimensions. They model the discrete probability of the raw pixel values and encode the complete set of dependencies in the image. Variants include the Row LSTM and the Diagonal BiLSTM, which scale more easily to larger datasets. Pixel values are treated as discrete random variables by using a softmax layer in the conditional distributions. Masked convolutions are employed to allow PixelRNNs to model full dependencies between the color channels.
Supporting Clustering with Contrastive Learning
SCCL, or Supporting Clustering with Contrastive Learning, is a framework to leverage contrastive learning to promote better separation in unsupervised clustering. It combines the top-down clustering with the bottom-up instance-wise contrastive learning to achieve better inter-cluster distance and intra-cluster distance. During training, we jointly optimize a clustering loss over the original data instances and an instance-wise contrastive loss over the associated augmented pairs.
Cross-Covariance Image Transformers, or XCiT, is a type of vision transformer that aims to combine the accuracy of conventional transformers with the scalability of convolutional architectures. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. The authors propose a “transposed” version of self-attention called cross-covariance attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.
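A single-head NumPy sketch of the transposed attention follows. It is a simplified illustration: the L2 normalization of queries and keys over the token axis, the fixed temperature `tau` (a learnable quantity in practice), the softmax axis, and the function name `xca` are assumptions of this example, and multi-head splitting and projections are omitted.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def xca(Q, K, V, tau=1.0):
    """Q, K, V: (N, d) matrices for N tokens with d channels. Returns an (N, d) output."""
    # Normalize each feature channel over the token axis (assumed for this sketch)
    Qn = Q / np.linalg.norm(Q, axis=0, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=0, keepdims=True)
    # (d, d) channel-to-channel attention map; its size does not depend on N
    attn = softmax(Kn.T @ Qn / tau, axis=0)
    # Each output channel is a convex combination of the value channels
    return V @ attn
```

The attention map is `d x d`, so both products cost on the order of `N * d^2` operations, which is linear in the number of tokens.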
Revision Network is a style transfer module that aims to revise the rough stylized image by generating a residual details image, while the final stylized image is generated by combining the residual details image and the rough stylized image. This procedure ensures that the distribution of the global style pattern in the rough stylized image is properly kept. Meanwhile, learning to revise local style patterns with a residual details image is easier for the Revision Network. As shown in the Figure, the Revision Network is designed as a simple yet effective encoder-decoder architecture, with only one down-sampling and one up-sampling layer. Further, a patch discriminator $D$ is used to help the Revision Network capture fine patch textures under an adversarial learning setting. The patch discriminator is defined following SinGAN, where $D$ has 5 convolution layers and 32 hidden channels. A relatively shallow $D$ is chosen to (1) avoid overfitting, since there is only one style image, and (2) control the receptive field to ensure $D$ can only capture local patterns.
LayerDrop is a form of structured dropout for Transformer models which has a regularization effect during training and allows for efficient pruning at inference time. It randomly drops layers from the Transformer during training; at inference, layers are pruned according to an "every other" strategy, where pruning with a rate $p$ means dropping the layers at depth $d$ such that $d \equiv 0 \pmod{\lfloor 1/p \rfloor}$.
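The "every other" pruning rule can be made concrete with a few lines of Python. This sketch assumes the rule is that a layer at depth d is dropped when d is divisible by floor(1/p), and that depths are counted from 1; the function name `layers_to_keep` is hypothetical.

```python
import math

def layers_to_keep(num_layers, p):
    """Depths that survive 'every other' pruning at rate p (depths are 1-indexed by assumption)."""
    step = math.floor(1 / p)
    # A layer is dropped when its depth is a multiple of floor(1/p)
    return [d for d in range(1, num_layers + 1) if d % step != 0]
```

For instance, `layers_to_keep(6, 0.5)` keeps depths 1, 3 and 5, so half of a six-layer model is removed without retraining.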
EmbraceNet: A robust deep learning architecture for multimodal classification
Deformable RoI Pooling adds an offset to each bin position in the regular bin partition of the RoI Pooling. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.
DeepLabv2 is an architecture for semantic segmentation that builds on DeepLab with an atrous spatial pyramid pooling (ASPP) scheme. Here, parallel dilated convolutions with different rates are applied to the input feature map and then fused together. As objects of the same class can have different sizes in the image, ASPP helps to account for different object sizes.
ZeRO-Offload is a sharded data parallel method for distributed training. It exploits both CPU memory and compute for offloading, while offering a clear path towards efficiently scaling on multiple GPUs by working with ZeRO-powered data parallelism. The symbiosis allows ZeRO-Offload to maintain a single copy of the optimizer states on the CPU memory regardless of the data parallel degree. Furthermore, it keeps the aggregate communication volume between GPU and CPU, as well as the aggregate CPU computation a constant regardless of data parallelism, allowing ZeRO-Offload to effectively utilize the linear increase in CPU compute with the increase in the data parallelism degree.
Asynchronous Proximal Policy Optimization
BigBiGAN is a type of BiGAN with a BigGAN image generator. The authors initially used ResNet as a baseline for the encoder followed by a 4-layer MLP with skip connections, but they experimented with RevNets and found they outperformed with increased network width, so opted for this type of encoder for the final architecture.
Graph-to-Tree MWP Solver
Gated Adaptive Network for Deep Automated Learning of Features
We propose a novel high-performance, interpretable, and parameter- and computation-efficient deep learning architecture for tabular data, Gated Adaptive Network for Deep Automated Learning of Features (GANDALF). GANDALF relies on a new tabular processing unit with a gating mechanism and in-built feature selection called Gated Feature Learning Unit (GFLU) as a feature representation learning unit. We demonstrate that GANDALF outperforms or stays on par with SOTA approaches like XGBoost, SAINT, FT-Transformers, etc. by experiments on multiple established public benchmarks. We have made the code available at github.com/manujosephv/pytorchtabular under the MIT License.
Virtual Data Augmentation, or VDA, is a framework for robustly fine-tuning pre-trained language models. Based on the original token embeddings, a multinomial mixture for augmenting virtual data is constructed, where a masked language model guarantees the semantic relevance and the Gaussian noise provides the augmentation diversity. Furthermore, a regularized training strategy is proposed to balance the two aspects.
FoveaBox is an anchor-free framework for object detection. Instead of using predefined anchors to enumerate possible locations, scales and aspect ratios for the search of the objects, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image. It is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs per-pixel classification on the backbone’s output; the second subnet performs bounding box prediction for the corresponding position.
MNN
Mobile Neural Network (MNN) is a mobile inference engine tailored to mobile applications. The contributions of MNN include: (1) presenting a mechanism called pre-inference that manages to conduct runtime optimization; (2) delivering thorough kernel optimization on operators to achieve optimal computation performance; (3) introducing a backend abstraction module which enables hybrid scheduling and keeps the engine lightweight.
DouZero is an AI system for the card game DouDizhu that enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. The Q-network of DouZero consists of an LSTM to encode historical actions and six layers of MLP with hidden dimension of 512. The network predicts a value for a given state-action pair based on the concatenated representation of action and state.
PP-OCR is an OCR system that consists of three parts: text detection, detected box rectification, and text recognition. The purpose of text detection is to locate the text area in the image. In PP-OCR, Differentiable Binarization (DB), which is based on a simple segmentation network, is used as the text detector. For text recognition, PP-OCR uses CRNN, which integrates feature extraction and sequence modeling and adopts the Connectionist Temporal Classification (CTC) loss to avoid inconsistency between prediction and label.
Anti-Alias Downsampling (AA) aims to improve the shift-equivariance of deep networks. Max-pooling is inherently composed of two operations: the first densely evaluates the max operator, and the second is naive subsampling. AA inserts a low-pass filter between them to achieve practical anti-aliasing, and the same idea applies to any existing strided layer such as strided convolution. The amount of smoothing can be adjusted by changing the blur kernel size, where a larger kernel results in increased blur.
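The three-stage decomposition (dense max, blur, subsample) can be sketched on a 1-D signal with NumPy. The 3-tap binomial kernel, the edge padding, and the function name `blur_pool_max_1d` are choices made for this illustration, not details fixed by the description above.

```python
import numpy as np

def blur_pool_max_1d(x, window=2, stride=2):
    """Anti-aliased max-pooling on a 1-D signal: dense max -> blur -> subsample."""
    # 1) Evaluate the max operator densely (stride 1)
    dense = np.array([x[i:i + window].max() for i in range(len(x) - window + 1)])
    # 2) Low-pass filter; the 3-tap binomial kernel is an assumed choice
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    padded = np.pad(dense, 1, mode="edge")
    blurred = np.convolve(padded, kernel, mode="valid")
    # 3) Naive subsampling
    return blurred[::stride]
```

Skipping step 2 recovers ordinary max-pooling with overlapping windows; widening the kernel increases the blur at the cost of detail.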
Simulation as Augmentation
SimAug, or Simulation as Augmentation, is a data augmentation method for trajectory prediction. It augments the representation such that it is robust to the variances in semantic scenes and camera views. First, to deal with the gap between real and synthetic semantic scene, it represents each training trajectory by high-level scene semantic segmentation features, and defends the model from adversarial examples generated by whitebox attack methods. Second, to overcome the changes in camera views, it generates multiple views for the same trajectory, and encourages the model to focus on the “hardest” view to which the model has learned. The classification loss is adopted and the view with the highest loss is favored during training. Finally, the augmented trajectory is computed as a convex combination of the trajectories generated in previous steps. The trajectory prediction model is built on a multi-scale representation and the final model is trained to minimize the empirical vicinal risk over the distribution of augmented trajectories.
Mirror Descent Policy Optimization
Mirror Descent Policy Optimization (MDPO) is a policy gradient algorithm based on the idea of iteratively solving a trust-region problem that minimizes a sum of two terms: a linearization of the standard RL objective function and a proximity term that restricts two consecutive updates to be close to each other. It is based on Mirror Descent, which is a general trust region method that attempts to keep consecutive iterates close to each other.
Visual-Linguistic BERT
VL-BERT is pre-trained on a large-scale image-caption dataset (Conceptual Captions) together with text-only corpora, and it can be fine-tuned to fit most visual-linguistic downstream tasks. Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual content by adding a new type of visual feature embedding to the input feature embeddings. VL-BERT takes both visual and linguistic elements as input, represented as regions-of-interest (RoIs) in images and subwords in input sentences. Four different types of embeddings are used to represent each input: token embedding, visual feature embedding, segment embedding, and sequence position embedding. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.