Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

8,725 machine learning methods and techniques

All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

Matrix NMS

Matrix Non-Maximum Suppression

Matrix NMS, or Matrix Non-Maximum Suppression, performs non-maximum suppression with parallel matrix operations in one shot. It is motivated by Soft-NMS, which decays the other detection scores as a monotonically decreasing function of their overlaps. By recursively decaying the scores according to IoUs, higher-IoU detections are eliminated with a minimum score threshold. However, this process is sequential, like traditional greedy NMS, and cannot be parallelized. Matrix NMS views this process from another perspective by considering how a predicted mask $m_j$ gets suppressed. For $m_j$, its decay factor is affected by: (a) the penalty of each prediction $m_i$ on $m_j$ (with $s_i > s_j$, where $s_i$ and $s_j$ are the confidence scores); and (b) the probability of $m_i$ being suppressed. For (a), the penalty of each prediction $m_i$ on $m_j$ is easily computed as $f(\text{iou}_{i,j})$. For (b), the probability of $m_i$ being suppressed is not so elegant to compute; however, it usually correlates positively with the IoUs, so it is directly approximated by the most overlapped prediction on $m_i$ as $f(\text{iou}_{\cdot,i}) = \min_{\forall s_k > s_i} f(\text{iou}_{k,i})$. To this end, the final decay factor becomes
\begin{equation}
\text{decay}_j = \min_{\forall s_i > s_j} \frac{f(\text{iou}_{i,j})}{f(\text{iou}_{\cdot,i})},
\end{equation}
and the updated score is computed by $s_j = s_j \cdot \text{decay}_j$. The authors consider the two simplest decremented functions: linear, $f(\text{iou}_{i,j}) = 1 - \text{iou}_{i,j}$, and Gaussian, $f(\text{iou}_{i,j}) = \exp\left(-\frac{\text{iou}_{i,j}^2}{\sigma}\right)$.
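Because the decay of every score depends only on the pairwise IoU matrix, the whole computation reduces to a few matrix operations. A minimal NumPy sketch (function name and the toy inputs in the usage below are illustrative, not from the paper):

```python
import numpy as np

def matrix_nms(scores, ious, sigma=2.0, kernel="gaussian"):
    """One-shot score decay. `scores` is assumed sorted in decreasing order;
    `ious` is the symmetric (N, N) IoU matrix between the N predictions."""
    n = len(scores)
    ious = np.triu(ious, k=1)                  # keep only pairs (i, j) with s_i >= s_j
    # f(iou_{., i}): IoU with the most-overlapped higher-scoring prediction on i,
    # approximating the probability of i itself being suppressed.
    cmax = np.tile(ious.max(axis=0), (n, 1)).T
    if kernel == "gaussian":
        decay = np.exp(-(ious ** 2 - cmax ** 2) / sigma)
    else:                                      # linear kernel
        decay = (1.0 - ious) / (1.0 - cmax)
    # decay_j = min_i f(iou_ij) / f(iou_.i), taken column-wise in parallel
    return scores * decay.min(axis=0)
```

A score threshold is then applied to the decayed scores to drop suppressed predictions, with no sequential loop anywhere.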

Computer Vision · Introduced 2000 · 5 papers

MobileViTv2

MobileViTv2 is a vision transformer tuned to mobile devices. It introduces a separable self-attention method to reduce the computational cost relative to MobileViT.

Computer Vision · Introduced 2000 · 5 papers

Meta Pseudo Labels

Meta Pseudo Labels is a semi-supervised learning method that uses a teacher network to generate pseudo labels on unlabeled data to teach a student network. The teacher receives feedback on the student's performance and uses it to generate better pseudo labels. This feedback signal serves as a reward for training the teacher throughout the course of the student's learning.

General · Introduced 2000 · 5 papers

Multiscale Dilated Convolution Block

A Multiscale Dilated Convolution Block is an Inception-style convolutional block motivated by the ideas that image features naturally occur at multiple scales, that a network’s expressivity is proportional to the range of functions it can represent divided by its total number of parameters, and by the desire to efficiently expand a network’s receptive field. The Multiscale Dilated Convolution (MDC) block applies a single filter at multiple dilation factors, then performs a weighted elementwise sum of each dilated filter’s output, allowing the network to simultaneously learn a set of features and the relevant scales at which those features occur with a minimal increase in parameters. This also rapidly expands the network’s receptive field without requiring an increase in depth or the number of parameters.
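As a concrete sketch of the idea, the block below applies one shared kernel at several dilation factors via a naive "same"-padded cross-correlation and takes a weighted elementwise sum of the outputs; in the real block the kernel and the per-dilation weights are learned (the uniform weights here are a placeholder):

```python
import numpy as np

def dilated_conv2d(x, k, d):
    """Naive 'same'-padded 2D cross-correlation of x with kernel k at dilation d."""
    kh, kw = k.shape
    ph, pw = d * (kh - 1) // 2, d * (kw - 1) // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i * d:i * d + x.shape[0], j * d:j * d + x.shape[1]]
    return out

def mdc_block(x, kernel, dilations=(1, 2, 4), weights=None):
    """Apply ONE kernel at multiple dilation factors, then take a weighted
    elementwise sum, so features and their relevant scales are learned jointly."""
    if weights is None:
        weights = np.full(len(dilations), 1.0 / len(dilations))
    return sum(w * dilated_conv2d(x, kernel, d)
               for w, d in zip(weights, dilations))
```

Because the same kernel is reused at every dilation, the receptive field grows with the largest dilation factor while the parameter count stays that of a single filter.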

Computer Vision · Introduced 2000 · 5 papers

GRIN

Graph Recurrent Imputation Network

Sequential · Introduced 2000 · 5 papers

Gradient-Based Subword Tokenization

GBST

GBST, or Gradient-Based Subword Tokenization, is a soft, gradient-based subword tokenization module that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns a position-wise soft selection over them by scoring each block with a block scoring network. In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which enables easy inspection of lexical representations, and is more efficient than other byte-based models.

Natural Language Processing · Introduced 2000 · 5 papers

Minibatch Discrimination

Minibatch Discrimination is a discriminative technique for generative adversarial networks where we discriminate between whole minibatches of samples rather than between individual samples. This is intended to avoid collapse of the generator.
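The standard formulation projects each sample's features through a learned tensor, compares the projections across the whole minibatch with an L1 kernel, and appends the resulting statistics to the features. A NumPy sketch (the tensor `T` would be learned; here it is just passed in):

```python
import numpy as np

def minibatch_discrimination(feats, T):
    """feats: (N, A) per-sample features; T: (A, B, C) learned tensor.
    Returns (N, A + B): features with minibatch similarity statistics appended."""
    M = np.einsum("na,abc->nbc", feats, T)                # (N, B, C) projections
    # L1 distance between every pair of samples, per output row b
    dists = np.abs(M[:, None] - M[None, :]).sum(axis=3)   # (N, N, B)
    o = np.exp(-dists).sum(axis=1)                        # (N, B) closeness to the batch
    return np.concatenate([feats, o], axis=1)
```

Because each sample's appended statistics depend on every other sample in the batch, the discriminator can detect a generator that collapses to near-identical outputs.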

Computer Vision · Introduced 2000 · 5 papers

PLIP

Pathology Language and Image Pre-Training

Pathology Language and Image Pre-Training (PLIP) is a vision-and-language foundation model created by fine-tuning CLIP on pathology images.

Computer Vision · Introduced 2000 · 5 papers

CoaT

Co-Scale Conv-attentional Image Transformer

Co-Scale Conv-Attentional Image Transformer (CoaT) is a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities.

Computer Vision · Introduced 2000 · 5 papers

ScatNet

Scattering Transform

A wavelet scattering transform computes a translation invariant representation, which is stable to deformation, using a deep convolution network architecture. It computes non-linear invariants with modulus and averaging pooling functions. It helps to eliminate the image variability due to translation and is stable to deformations. Image source: Bruna and Mallat

Computer Vision · Introduced 2000 · 5 papers

InternVideo

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.

Computer Vision · Introduced 2000 · 5 papers

TextGrad

TextGrad is a powerful framework that builds "automatic differentiation" via text. TextGrad implements backpropagation through text feedback provided by LLMs, strongly building on the gradient metaphor.

General · Introduced 2000 · 5 papers

Large-scale spectral clustering

Spectral Clustering. Spectral clustering aims to partition data points into clusters using the spectrum of the graph Laplacian. Given a dataset $X$ with $n$ data points, a spectral clustering algorithm first constructs a similarity matrix $S \in \mathbb{R}^{n \times n}$, where $S_{ij}$ indicates the similarity between data points $x_i$ and $x_j$ under a similarity metric. Let $L = D - S$, where $L$ is called the graph Laplacian and $D$ is a diagonal matrix with $D_{ii} = \sum_{j} S_{ij}$. The objective function of spectral clustering can be formulated based on the graph Laplacian as follows:
\begin{equation} \label{eq:SCobj}
{\min_{U} \operatorname{tr}\left({U}^{T} {L} {U}\right)}, \quad {\text{s.t.} \quad {U}^{T} {{U}={I}}},
\end{equation}
where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The rows of the matrix $U \in \mathbb{R}^{n \times k}$ are the low-dimensional embeddings of the original data points. Generally, spectral clustering computes $U$ as the bottom $k$ eigenvectors of $L$, and finally applies $k$-means on $U$ to obtain the clustering results.
Large-scale Spectral Clustering. To capture the relationships between all data points in $X$, an $n \times n$ similarity matrix must be constructed in conventional spectral clustering, which costs $O(n^2)$ time and memory and is not feasible for large-scale clustering tasks. Instead of a full similarity matrix, many accelerated spectral clustering methods use a similarity sub-matrix that represents each data point by its cross-similarity to a set of representative data points (i.e., landmarks):
\begin{equation} \label{eq:cross-similarity}
B = \Phi(X, R),
\end{equation}
where $R$ ($m \ll n$) is a set of landmarks with the same dimension as $X$, $\Phi$ indicates a similarity metric, and $B \in \mathbb{R}^{n \times m}$ is the similarity sub-matrix representing $X$ with respect to $R$. For large-scale spectral clustering using such a similarity matrix, a symmetric similarity matrix can be designed as
\begin{equation} \label{eq:WusedB}
W=\left[\begin{array}{ll} \mathbf{0} & B \\ B^{T} & \mathbf{0} \end{array}\right],
\end{equation}
whose size is $(n+m) \times (n+m)$.
Taking advantage of this bipartite structure, fast eigendecomposition methods can then be used to obtain the spectral embedding. Finally, $k$-means is conducted on the embedding to obtain the clustering results. The clustering result is directly related to the quality of $B$, which consists of the similarities between data points and landmarks; the quality of landmark selection is therefore crucial to the clustering result.
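To make this concrete, here is a small NumPy sketch of the landmark-based embedding. The Gaussian similarity and the SVD route to the bipartite-graph embedding are illustrative choices (the top singular vectors of the degree-normalised $B$ give the spectral embedding of the bipartite graph $W$ without ever forming $W$):

```python
import numpy as np

def landmark_spectral_embedding(X, landmarks, k, gamma=1.0):
    """X: (n, dim) data; landmarks: (m, dim) representatives; returns (n, k).
    k-means on the returned embedding would give the final clusters."""
    d2 = ((X[:, None] - landmarks[None, :]) ** 2).sum(-1)
    B = np.exp(-gamma * d2)                        # (n, m) cross-similarity B = Phi(X, R)
    d_row, d_col = B.sum(1), B.sum(0)              # degrees of the two bipartite sides
    Bn = B / np.sqrt(d_row)[:, None] / np.sqrt(d_col)[None, :]
    # SVD of the small (n, m) matrix replaces eigendecomposition of (n+m, n+m) W
    U, s, Vt = np.linalg.svd(Bn, full_matrices=False)
    return U[:, :k]                                # n-point spectral embedding
```

The cost is dominated by the $O(nm)$ similarity computation and an SVD of an $n \times m$ matrix, instead of $O(n^2)$ for the full similarity graph.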

General · Introduced 2000 · 5 papers

Lovasz-Softmax

The Lovasz-Softmax loss is a loss function for multiclass semantic segmentation that incorporates the softmax operation in the Lovasz extension. The Lovasz extension is a means by which we can achieve direct optimization of the mean intersection-over-union loss in neural networks.
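A compact NumPy sketch of the flat (per-pixel) version: errors per class are sorted in decreasing order and dotted with the gradient of the Lovasz extension of the Jaccard loss (function names follow common open-source implementations, not an official API):

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss w.r.t. sorted errors."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]       # first differences
    return jaccard

def lovasz_softmax_flat(probs, labels, classes):
    """probs: (P, C) softmax outputs; labels: (P,) integer ground truth."""
    losses = []
    for c in classes:
        fg = (labels == c).astype(float)
        errors = np.abs(fg - probs[:, c])
        order = np.argsort(-errors)                # decreasing errors
        losses.append(np.dot(errors[order], lovasz_grad(fg[order])))
    return np.mean(losses)
```

Because the loss is a (piecewise-linear) extension of the Jaccard index itself, minimising it optimises mean IoU directly rather than a cross-entropy surrogate.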

General · Introduced 2000 · 5 papers

HyperDenseNet

Recently, dense connections have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. In particular, DenseNet, which connects each layer to every other layer in a feed-forward fashion, has shown impressive performance in natural image classification tasks. We propose HyperDenseNet, a 3-D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between the pairs of layers within the same path but also between those across different paths. This contrasts with the existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. Therefore, the proposed network has total freedom to learn more complex combinations between the modalities, within and in-between all the levels of abstraction, which increases significantly the learning representation. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on six month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of features re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning.

Computer Vision · Introduced 2000 · 5 papers

DRA

Dynamic Range Activator

Recursive functions with heteroscedastic, sparse, and high-variance target distributions introduce huge complexity that makes their accurate modeling with neural networks a difficult task. A main property of recursive maps (e.g. the factorial function) is their dramatic growth and drop. Learning this recursive behavior requires not only fitting high-frequency patterns within a bounded region but also successfully extrapolating those patterns beyond that region. In time series prediction tasks, capturing periodic behavior is a challenge. Various methods have been employed to model periodic patterns effectively. However, these approaches typically deal with uni-modal data that also exhibit relatively low variance in both In-Distribution (ID) and Out-Of-Distribution (OOD) regions, and they do not generalize well to recursive problems with the high variance observed in this context. Thus, to enable Transformers to capture such behavior and perform proper inference for multi-modal recursive problems, the authors enhance them by introducing the Dynamic Range Activator (DRA). The DRA is designed to handle the recursive and factorial growth properties inherent in enumerative problems with minimal computational overhead, and it can be integrated into existing neural networks without requiring significant architectural changes. DRA integrates both harmonic and hyperbolic components as follows:
\begin{equation}
\mathrm{DRA}(x) := x + a \sin^2\left(\frac{x}{b}\right) + c \cos(bx) + d \tanh(bx) \,,
\end{equation}
where $a$, $b$, $c$, and $d$ are learnable parameters. This allows the function to simultaneously model periodic data (through the sine and cosine terms) and rapid growth or attenuation (through the hyperbolic tangent term).
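The activation itself is a one-liner. A NumPy sketch, with arbitrary illustrative values standing in for the learnable parameters $a$, $b$, $c$, $d$:

```python
import numpy as np

def dra(x, a=1.0, b=2.0, c=0.5, d=0.5):
    """Dynamic Range Activator: identity plus harmonic terms (periodicity)
    plus a tanh term (rapid growth/attenuation). a, b, c, d are learnable
    in the paper; the defaults here are placeholders."""
    return x + a * np.sin(x / b) ** 2 + c * np.cos(b * x) + d * np.tanh(b * x)
```

Setting $a = c = d = 0$ recovers the identity, so the activation can fall back to a plain residual path where the extra terms are not needed.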

General · Introduced 2000 · 4 papers

BiGCN

Bi-Directional Graph Convolutional Network

Graphs · Introduced 2000 · 4 papers

2D DWT

2D Discrete Wavelet Transform

Computer Vision · Introduced 2000 · 4 papers

Generalized Focal Loss

Generalized Focal Loss (GFL) is a loss function for object detection that combines Quality Focal Loss and Distribution Focal Loss into a general form.

General · Introduced 2000 · 4 papers

XCiT Layer

An XCiT Layer is the main building block of the XCiT architecture, which uses a cross-covariance attention (XCA) operator as its principal operation. The XCiT layer consists of three main blocks, each preceded by LayerNorm and followed by a residual connection: (i) the core cross-covariance attention (XCA) operation, (ii) the local patch interaction (LPI) module, and (iii) a feed-forward network (FFN). By transposing the query-key interaction, the computational complexity of XCA is linear in the number of data elements N, rather than quadratic as in conventional self-attention.

Computer Vision · Introduced 2000 · 4 papers

scSE

Spatial and Channel SE Blocks

To aggregate global spatial information, an SE block applies global pooling to the feature map. However, it ignores pixel-wise spatial information, which is important in dense prediction tasks. Therefore, Roy et al. proposed spatial and channel SE blocks (scSE). Like BAM, spatial SE blocks are used to complement SE blocks, providing spatial attention weights that focus on important regions. Given an input feature map $X$, two parallel modules, spatial SE and channel SE, are applied to encode spatial and channel information respectively. The channel SE module is an ordinary SE block, while the spatial SE module adopts a $1 \times 1$ convolution for spatial squeezing. The outputs from the two modules are fused. The overall process can be written as
\begin{align}
s_c &= \sigma (W_{2}\, \delta (W_{1}\,\text{GAP}(X))) \\
X_{\text{chn}} &= s_c \odot X \\
s_s &= \sigma(\text{Conv}^{1\times 1}(X)) \\
X_{\text{spa}} &= s_s \odot X \\
Y &= f(X_{\text{spa}}, X_{\text{chn}})
\end{align}
where $f$ denotes the fusion function, which can be maximum, addition, multiplication or concatenation. The scSE block combines channel and spatial attention to enhance features while also capturing pixel-wise spatial information, and segmentation tasks benefit greatly as a result. Integrating an scSE block into F-CNNs yields a consistent improvement in semantic segmentation at negligible extra cost.
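A NumPy sketch of the two parallel branches (weights would be learned; here they are passed in, and the $1\times 1$ convolution is reduced to a per-channel weight vector):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scse(X, W1, W2, w_spatial, fusion=np.maximum):
    """X: (C, H, W) feature map.
    Channel SE: GAP -> W1 -> ReLU -> W2 -> sigmoid gate per channel.
    Spatial SE: 1x1 conv (vector w_spatial) -> sigmoid gate per pixel."""
    z = X.mean(axis=(1, 2))                              # global average pooling, (C,)
    sc = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))           # channel attention, (C,)
    X_chn = sc[:, None, None] * X
    ss = sigmoid(np.einsum("c,chw->hw", w_spatial, X))   # spatial attention, (H, W)
    X_spa = ss[None] * X
    return fusion(X_spa, X_chn)                          # max fusion by default
```

Swapping `fusion` for `np.add` or `np.multiply` gives the other fusion variants mentioned above.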

General · Introduced 2000 · 4 papers

Siamese U-Net

A Siamese U-Net model with a pre-trained ResNet34 encoder, used for data-efficient change detection.

Computer Vision · Introduced 2000 · 4 papers

Social-STGCNN

Social-STGCNN is a method for human trajectory prediction. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects.

Computer Vision · Introduced 2000 · 4 papers

Symbolic rule learning

Symbolic rule learning methods find regularities in data that can be expressed in the form of 'if-then' rules based on symbolic representations of the data.

General · Introduced 2000 · 4 papers

CoVR

Composed Video Retrieval

The composed video retrieval (CoVR) task is a new task in which the goal is to find a video that matches both a query image and a query text. The query image represents a visual concept that the user is interested in, and the query text specifies how the concept should be modified or refined. For example, given an image of a fountain and the text "during show at night", the CoVR task is to retrieve a video that shows the fountain at night with a show.

Computer Vision · Introduced 2000 · 4 papers

Big-Little Module

Big-Little Modules are blocks for image models with two branches, each of which represents a separate block from a deep model or a less deep counterpart. They were proposed as part of the BigLittle-Net architecture. The two branches are fused with a linear combination and unit weights. These branches are known as the Big-Branch (more layers and channels, at low resolution) and the Little-Branch (fewer layers and channels, at high resolution).

Computer Vision · Introduced 2000 · 4 papers

VSF

VisuoSpatial Foresight

VisuoSpatial Foresight is a method for robotic fabric manipulation that leverages a combination of RGB and depth information to learn goal conditioned fabric manipulation policies for a variety of long horizon tasks.

General · Introduced 2000 · 4 papers

Self-adaptive Training

Self-adaptive Training is a training algorithm that dynamically corrects problematic training labels using model predictions, to improve the generalization of deep learning models on potentially corrupted training data. Accumulated predictions are used to augment the training dynamics. The use of an exponential-moving-average scheme alleviates the instability of model predictions, smooths out the training target during the training process, and enables the algorithm to completely change the training labels if necessary.
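The core update is a simple exponential moving average of the soft targets toward the model's predictions. A minimal sketch (variable names and the fixed prediction `p` are illustrative; in training, `p` would change each epoch):

```python
import numpy as np

def update_targets(targets, probs, alpha=0.9):
    """One EMA step: accumulated predictions gradually replace noisy labels."""
    return alpha * targets + (1 - alpha) * probs

t = np.array([1.0, 0.0])   # possibly corrupted one-hot training label
p = np.array([0.1, 0.9])   # model's (here fixed) confident prediction
for _ in range(100):
    t = update_targets(t, p)
# t has drifted from the given label toward the model's prediction:
# the corrupted label has effectively been corrected
```

Because `alpha` is close to 1, a single noisy prediction barely moves the target, while a consistently different prediction eventually replaces the label entirely.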

General · Introduced 2000 · 4 papers

PixelRNN

Pixel Recurrent Neural Network

PixelRNNs are generative neural networks that sequentially predict the pixels in an image along the two spatial dimensions. They model the discrete probability of the raw pixel values and encode the complete set of dependencies in the image. Variants include the Row LSTM and the Diagonal BiLSTM, which scale more easily to larger datasets. Pixel values are treated as discrete random variables by using a softmax layer in the conditional distributions. Masked convolutions are employed to allow PixelRNNs to model full dependencies between the color channels.

Computer Vision · Introduced 2000 · 4 papers

SCCL

Supporting Clustering with Contrastive Learning

SCCL, or Supporting Clustering with Contrastive Learning, is a framework that leverages contrastive learning to promote better separation in unsupervised clustering. It combines top-down clustering with bottom-up instance-wise contrastive learning to achieve better inter-cluster and intra-cluster distances. During training, a clustering loss over the original data instances and an instance-wise contrastive loss over the associated augmented pairs are jointly optimized.

General · Introduced 2000 · 4 papers

XCiT

Cross-Covariance Image Transformers, or XCiT, is a type of vision transformer that aims to combine the accuracy of conventional transformers with the scalability of convolutional architectures. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. The authors propose a "transposed" version of self-attention called cross-covariance attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.
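The key point is that the attention map lives in channel space ($d \times d$) rather than token space ($N \times N$). A single-head NumPy sketch (head splitting and the learnable temperature of the full method are omitted; the normalisation axis and output arrangement here are one plausible reading, not the exact official implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def xca(X, Wq, Wk, Wv, tau=1.0):
    """Cross-covariance attention. X: (N, d) tokens; Wq/Wk/Wv: (d, d).
    The attention map is (d, d), so cost is linear in the token count N."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Qn = Q / np.linalg.norm(Q, axis=0, keepdims=True)   # L2-normalise over tokens
    Kn = K / np.linalg.norm(K, axis=0, keepdims=True)
    A = softmax((Qn.T @ Kn) / tau, axis=-1)             # (d, d) channel-mixing map
    return V @ A.T                                      # (N, d) output tokens
```

Doubling the number of tokens only doubles the cost of the matrix products, since the softmax is always over a fixed $d \times d$ map.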

Computer Vision · Introduced 2000 · 4 papers

Revision Network

Revision Network is a style transfer module that revises a rough stylized image by generating a residual details image; the final stylized image is produced by combining the residual details image with the rough stylized image. This procedure ensures that the distribution of global style patterns in the rough stylized image is properly kept, while learning to revise local style patterns via the residual details image is easier for the Revision Network. The Revision Network is designed as a simple yet effective encoder-decoder architecture, with only one down-sampling and one up-sampling layer. Further, a patch discriminator is used to help the Revision Network capture fine patch textures in an adversarial learning setting. The patch discriminator is defined following SinGAN, with 5 convolution layers and 32 hidden channels. A relatively shallow discriminator is chosen to (1) avoid overfitting, since there is only one style image, and (2) control the receptive field so that the discriminator can only capture local patterns.

Computer Vision · Introduced 2000 · 4 papers

LayerDrop

LayerDrop is a form of structured dropout for Transformer models which has a regularization effect during training and allows for efficient pruning at inference time. It randomly drops layers from the Transformer; under the "every other" strategy, pruning with a rate $p$ means dropping the layers at depth $d$ such that $d \equiv 0 \pmod{\lfloor 1/p \rfloor}$.
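Both halves of the method fit in a few lines. A sketch in plain Python (function names are illustrative; in a real Transformer, "skipping a layer" is well-defined because of the residual connections):

```python
import random

def layerdrop_forward(x, layers, p_drop, training=True, rng=random):
    """Training: apply a layer stack, independently skipping each layer
    with probability p_drop."""
    for layer in layers:
        if training and rng.random() < p_drop:
            continue                    # drop the whole layer this step
        x = layer(x)
    return x

def every_other_prune(layers, p):
    """Inference: keep only layers whose depth d (counted from 1) does NOT
    satisfy d % round(1/p) == 0 -- the 'every other' pruning strategy."""
    stride = round(1 / p)
    return [layer for d, layer in enumerate(layers, start=1) if d % stride != 0]
```

Because training already exposed the network to every such sub-network, the pruned model needs no fine-tuning.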

General · Introduced 2000 · 4 papers

EmbraceNet

EmbraceNet: A robust deep learning architecture for multimodal classification

Computer Vision · Introduced 2000 · 4 papers

Deformable RoI Pooling

Deformable RoI Pooling adds an offset to each bin position in the regular bin partition of RoI Pooling. As in deformable convolution, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.

Computer Vision · Introduced 2000 · 4 papers

DeepLabv2

DeepLabv2 is an architecture for semantic segmentation that builds on DeepLab with an atrous spatial pyramid pooling (ASPP) scheme: parallel dilated convolutions with different rates are applied to the input feature map and then fused together. As objects of the same class can have different sizes in the image, ASPP helps account for different object sizes.

Computer Vision · Introduced 2000 · 4 papers

ZeRO-Offload

ZeRO-Offload is a sharded data parallel method for distributed training. It exploits both CPU memory and compute for offloading, while offering a clear path towards efficiently scaling on multiple GPUs by working with ZeRO-powered data parallelism. The symbiosis allows ZeRO-Offload to maintain a single copy of the optimizer states on the CPU memory regardless of the data parallel degree. Furthermore, it keeps the aggregate communication volume between GPU and CPU, as well as the aggregate CPU computation a constant regardless of data parallelism, allowing ZeRO-Offload to effectively utilize the linear increase in CPU compute with the increase in the data parallelism degree.

General · Introduced 2000 · 4 papers

APPO

Asynchronous Proximal Policy Optimization

Reinforcement Learning · Introduced 2000 · 4 papers

BigBiGAN

BigBiGAN is a type of BiGAN with a BigGAN image generator. The authors initially used ResNet as a baseline for the encoder followed by a 4-layer MLP with skip connections, but they experimented with RevNets and found they outperformed with increased network width, so opted for this type of encoder for the final architecture.

General · Introduced 2000 · 4 papers

Graph2Tree

Graph-to-Tree MWP Solver

Sequential · Introduced 2000 · 4 papers

GANDALF

Gated Adaptive Network for Deep Automated Learning of Features

We propose a novel high-performance, interpretable, and parameter- and computation-efficient deep learning architecture for tabular data, Gated Adaptive Network for Deep Automated Learning of Features (GANDALF). GANDALF relies on a new tabular processing unit with a gating mechanism and in-built feature selection, called the Gated Feature Learning Unit (GFLU), as a feature representation learning unit. We demonstrate that GANDALF outperforms or stays at par with SOTA approaches like XGBoost, SAINT, FT-Transformers, etc. in experiments on multiple established public benchmarks. We have made the code available at github.com/manujosephv/pytorchtabular under the MIT License.

General · Introduced 2000 · 4 papers

Virtual Data Augmentation

Virtual Data Augmentation, or VDA, is a framework for robustly fine-tuning pre-trained language models. Based on the original token embeddings, a multinomial mixture for augmenting virtual data is constructed, where a masked language model guarantees semantic relevance and Gaussian noise provides augmentation diversity. Furthermore, a regularized training strategy is proposed to balance the two aspects.

General · Introduced 2000 · 4 papers

FoveaBox

FoveaBox is an anchor-free framework for object detection. Instead of using predefined anchors to enumerate possible locations, scales and aspect ratios in the search for objects, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image. FoveaBox is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over the entire input image and is an off-the-shelf convolutional network. The first subnet performs per-pixel classification on the backbone's output; the second subnet performs bounding box prediction for the corresponding position.

Computer Vision · Introduced 2000 · 4 papers

Mobile Neural Network

MNN

Mobile Neural Network (MNN) is a mobile inference engine tailored to mobile applications. The contributions of MNN include: (1) presenting a mechanism called pre-inference that manages to conduct runtime optimization; (2) delivering thorough kernel optimization on operators to achieve optimal computation performance; (3) introducing backend abstraction module which enables hybrid scheduling and keeps the engine lightweight.

General · Introduced 2000 · 4 papers

DouZero

DouZero is an AI system for the card game DouDizhu that enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. The Q-network of DouZero consists of an LSTM to encode historical actions and six layers of MLP with hidden dimension of 512. The network predicts a value for a given state-action pair based on the concatenated representation of action and state.

Reinforcement Learning · Introduced 2000 · 4 papers

PP-OCR

PP-OCR is an OCR system that consists of three parts: text detection, detected-box rectification, and text recognition. The purpose of text detection is to locate the text area in the image. In PP-OCR, Differentiable Binarization (DB), based on a simple segmentation network, is used as the text detector. The text recognizer, CRNN, integrates feature extraction and sequence modeling, and adopts the Connectionist Temporal Classification (CTC) loss to avoid the inconsistency between prediction and label.

Computer Vision · Introduced 2000 · 4 papers

Anti-Alias Downsampling

Anti-Alias Downsampling (AA) aims to improve the shift-equivariance of deep networks. Max-pooling is inherently composed of two operations: the first densely evaluates the max operator, and the second naively subsamples. AA inserts a low-pass filter between them to achieve practical anti-aliasing in any existing strided layer, such as a strided convolution. The smoothing factor can be adjusted by changing the blur kernel's filter size, where a larger filter size results in increased blur.
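The blur-then-subsample step can be sketched in a few lines of NumPy, using the binomial filter coefficients (e.g. [1, 2, 1] for size 3) commonly used for the blur kernel; the dense max evaluation would happen upstream of this function:

```python
import numpy as np
from math import comb

def blur_pool(x, stride=2, filt_size=3):
    """Low-pass blur with a normalised binomial kernel, then naive subsampling.
    x: (H, W) single-channel feature map."""
    f = np.array([comb(filt_size - 1, i) for i in range(filt_size)], float)
    k = np.outer(f, f) / (f.sum() ** 2)        # separable 2D blur kernel, sums to 1
    p = filt_size // 2
    xp = np.pad(x, p, mode="reflect")
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(filt_size):
        for j in range(filt_size):
            out += k[i, j] * xp[i:i + H, j:j + W]
    return out[::stride, ::stride]             # subsample after blurring
```

Without the blur, a one-pixel shift of a sharp edge can completely change which samples survive the stride; blurring first makes the subsampled signal vary smoothly under such shifts.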

Computer Vision · Introduced 2000 · 4 papers

SimAug

Simulation as Augmentation

SimAug, or Simulation as Augmentation, is a data augmentation method for trajectory prediction. It augments the representation such that it is robust to variances in semantic scenes and camera views. First, to deal with the gap between real and synthetic semantic scenes, it represents each training trajectory by high-level scene semantic segmentation features, and defends the model against adversarial examples generated by whitebox attack methods. Second, to overcome changes in camera views, it generates multiple views for the same trajectory and encourages the model to focus on the "hardest" view from which to learn: the classification loss is adopted and the view with the highest loss is favored during training. Finally, the augmented trajectory is computed as a convex combination of the trajectories generated in the previous steps. The trajectory prediction model is built on a multi-scale representation, and the final model is trained to minimize the empirical vicinal risk over the distribution of augmented trajectories.

Computer Vision · Introduced 2000 · 4 papers

MDPO

Mirror Descent Policy Optimization

Mirror Descent Policy Optimization (MDPO) is a policy gradient algorithm based on the idea of iteratively solving a trust-region problem that minimizes a sum of two terms: a linearization of the standard RL objective function and a proximity term that restricts two consecutive updates to be close to each other. It is based on Mirror Descent, which is a general trust region method that attempts to keep consecutive iterates close to each other.

Reinforcement Learning · Introduced 2000 · 4 papers

VL-BERT

Visual-Linguistic BERT

VL-BERT is pre-trained on a large-scale image-captions dataset together with a text-only corpus. The inputs to the model are either words from the input sentences or regions-of-interest (RoIs) from input images. It can be fine-tuned to fit most visual-linguistic downstream tasks. Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual content, with a new type of visual feature embedding added to the input feature embeddings. VL-BERT takes both visual and linguistic elements as input, represented as RoIs in images and subwords in input sentences. Four different types of embeddings are used to represent each input: token embedding, visual feature embedding, segment embedding, and sequence position embedding. VL-BERT is pre-trained using Conceptual Captions and text-only datasets. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.

Computer Vision · Introduced 2000 · 4 papers
Page 19 of 175