Movement Pruning is a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running model; in contrast, movement pruning derives importance from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process.
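To make the intuition concrete, here is a minimal PyTorch sketch (not the paper's exact algorithm, which learns importance scores with a straight-through estimator): it accumulates the movement score $-W \cdot \partial L / \partial W$ over SGD steps and keeps the top-scoring weights. The loss, sizes, and learning rate are illustrative only.

```python
import torch

torch.manual_seed(0)
w = torch.randn(16, requires_grad=True)
target = torch.randn(16)
scores = torch.zeros(16)

for _ in range(100):
    loss = ((w - target) ** 2).mean()   # stand-in loss; any differentiable loss works
    loss.backward()
    with torch.no_grad():
        scores += -w * w.grad           # movement score: positive when a weight
                                        # moves away from zero under SGD
        w -= 0.1 * w.grad               # plain SGD step
        w.grad.zero_()

mask = scores >= scores.topk(8).values.min()   # keep the 8 highest-scoring weights
pruned = w.detach() * mask
```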
Hierarchical Average Precision training for Pertinent ImagE Retrieval
Compressed Memory is a secondary FIFO memory component proposed as part of the Compressive Transformer model. The Compressive Transformer keeps a fine-grained memory of past activations, which are then compressed into coarser compressed memories. For compression functions, the authors consider: (1) max/mean pooling, where the kernel and stride are set to the compression rate $c$; (2) 1D convolution, also with kernel and stride set to $c$; (3) dilated convolutions; (4) most-used, where the memories are sorted by their average attention (usage) and the most-used ones are preserved.
Gradient-Based Decision Tree Ensembles
Residual Shuffle-Exchange Network
Residual Shuffle-Exchange Network is an efficient alternative to models using an attention mechanism that allows the modelling of long-range dependencies in sequences in O(n log n) time. This model achieved state-of-the-art performance on the MusicNet dataset for music transcription while being able to run inference on a single GPU fast enough to be suitable for real-time audio processing.
Deep Extreme Cut
DEXTR, or Deep Extreme Cut, obtains an object segmentation from its four extreme points: the left-most, right-most, top, and bottom pixels. The annotated extreme points are given as a guiding signal to the input of the network. To this end, we create a heatmap with activations in the regions of the extreme points: we center a 2D Gaussian around each of the points to create a single heatmap. The heatmap is concatenated with the RGB channels of the input image to form a 4-channel input for the CNN. In order to focus on the object of interest, the input is cropped by the bounding box formed from the extreme point annotations. To include context in the resulting crop, we relax the tight bounding box by several pixels. After this pre-processing step, which comes exclusively from the extreme clicks, the input consists of an RGB crop including an object, plus its extreme points.

ResNet-101 is chosen as the backbone of the architecture. We remove the fully connected layers as well as the max pooling layers in the last two stages to preserve acceptable output resolution for dense prediction, and we introduce atrous convolutions in the last two stages to maintain the same receptive field. After the last ResNet-101 stage, we introduce a pyramid scene parsing module to aggregate global context to the final feature map. The output of the CNN is a probability map representing whether a pixel belongs to the object that we want to segment or not. The CNN is trained to minimize the standard cross entropy loss, which takes into account that different classes occur with different frequency in a dataset.
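As an illustration of the input construction, the following sketch (with a made-up crop size, click coordinates, and Gaussian width) builds the extreme-point heatmap and the 4-channel input described above:

```python
import numpy as np

def extreme_point_heatmap(shape, points, sigma=10.0):
    """Single-channel heatmap with a 2D Gaussian centred on each
    extreme point (left-most, right-most, top, bottom)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros((h, w), dtype=np.float32)
    for (px, py) in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)   # merge the four Gaussians into one map
    return heatmap

# Hypothetical crop and four clicked extreme points (x, y) inside it.
rgb_crop = np.random.rand(256, 256, 3).astype(np.float32)
points = [(12, 130), (240, 120), (128, 8), (130, 250)]

heatmap = extreme_point_heatmap(rgb_crop.shape[:2], points)
four_channel_input = np.concatenate([rgb_crop, heatmap[..., None]], axis=-1)
```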
Contrastive Multiview Coding (CMC) is a self-supervised learning approach, based on CPC, that learns representations capturing information shared between multiple sensory views. The core idea is to set an anchor view, sample positive and negative data points from the other view, and maximise agreement between positive pairs when learning from two views. Contrastive learning is used to build the embedding.
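A common way to implement this agreement objective is an InfoNCE-style loss between the two views; the sketch below (batch size, embedding size, and temperature are arbitrary) treats matching indices across views as positive pairs:

```python
import torch
import torch.nn.functional as F

def multiview_infonce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07):
    """Treat z1[i] as the anchor; z2[i] is its positive and all other
    rows of z2 are negatives (one direction of the symmetric loss)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau              # (N, N) similarity matrix
    labels = torch.arange(z1.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: embeddings of the same batch seen from two views.
z_view1, z_view2 = torch.randn(32, 128), torch.randn(32, 128)
loss = multiview_infonce(z_view1, z_view2) + multiview_infonce(z_view2, z_view1)
```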
Nearest-Neighbor Contrastive Learning of Visual Representations
Singular Value Decomposition Parameterization
Charformer is a type of Transformer model that learns a subword tokenization end-to-end as part of the model. Specifically, it uses GBST (gradient-based subword tokenization), which automatically learns latent subword representations from characters in a data-driven fashion. Following GBST, the soft subword sequence is passed through Transformer layers.
Kernel Density Matrices
Kernel density matrices provide a simpler yet effective mechanism for representing joint probability distributions of both continuous and discrete random variables. This abstraction allows the construction of differentiable models for density estimation, inference, and sampling, and enables their integration into end-to-end deep neural models.
In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switch its text encoder for a pretrained multilingual text encoder, XLM-R, and align both language and image representations with a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations on a wide range of tasks. We set new state-of-the-art performances on a number of tasks including ImageNet-CN, Flickr30k-CN, and COCO-CN. Further, we obtain performances very close to CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.
Efficient Channel Attention
An ECA block has a similar formulation to an SE block, including a squeeze module for aggregating global spatial information and an efficient excitation module for modeling cross-channel interaction. Instead of indirect correspondence, an ECA block only considers direct interaction between each channel and its k-nearest neighbors to control model complexity. Overall, the formulation of an ECA block is: \begin{align} s = F_{\text{eca}}(X, \theta) & = \sigma (\text{Conv1D}(\text{GAP}(X))) \end{align} \begin{align} Y & = s X \end{align} where $\text{Conv1D}(\cdot)$ denotes 1D convolution with a kernel of shape $k$ across the channel domain, to model local cross-channel interaction. The parameter $k$ decides the coverage of interaction, and in ECA the kernel size $k$ is adaptively determined from the channel dimensionality $C$ instead of by manual tuning via cross-validation: \begin{equation} k = \psi(C) = \left | \frac{\log_2(C)}{\gamma}+\frac{b}{\gamma}\right |_{\text{odd}} \end{equation} where $\gamma$ and $b$ are hyperparameters and $\left| x \right|_{\text{odd}}$ indicates the nearest odd number to $x$. Compared to SENet, ECANet has an improved excitation module, and provides an efficient and effective block which can readily be incorporated into various CNNs.
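A compact PyTorch rendering of the block above, with the adaptive odd kernel size computed from $C$, $\gamma$, and $b$ (defaults $\gamma = 2$, $b = 1$ as in the paper):

```python
import math
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv over channels -> sigmoid."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1   # nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                           # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                      # squeeze: global average pooling
        s = self.conv(s.unsqueeze(1)).squeeze(1)    # local cross-channel interaction
        s = torch.sigmoid(s)
        return x * s[:, :, None, None]              # recalibrate the channels

y = ECABlock(64)(torch.randn(2, 64, 32, 32))
```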
Involution is an atomic operation for deep neural networks that inverts the design principles of convolution. Involution kernels are distinct in the spatial extent but shared across channels. If involution kernels were parameterized as fixed-sized matrices like convolution kernels and updated using the back-propagation algorithm, the learned kernels would be unable to transfer between input images with variable resolutions; instead, the kernels are generated conditioned on the input. The authors argue for two benefits of involution over convolution: (i) involution can summarize the context in a wider spatial arrangement, thereby overcoming the difficulty of modeling long-range interactions; (ii) involution can adaptively allocate weights over different positions, so as to prioritize the most informative visual elements in the spatial domain.
DetNet is a backbone convolutional neural network for object detection. Different from traditional pre-trained models for ImageNet classification, DetNet maintains the spatial resolution of the features even though extra stages are included. DetNet attempts to stay efficient by employing a low complexity dilated bottleneck structure.
Optimal Transport Modeling
Tanh Exponential Activation Function
Lightweight or mobile neural networks used for real-time computer vision tasks contain fewer parameters than normal networks, which leads to constrained performance. In this work, we propose a novel activation function named the Tanh Exponential Activation Function (TanhExp), which can significantly improve the performance of these networks on image classification tasks. The definition of TanhExp is $f(x) = x \tanh(e^x)$. We demonstrate the simplicity, efficiency, and robustness of TanhExp on various datasets and network models, and TanhExp outperforms its counterparts in both convergence speed and accuracy. Its behaviour also remains stable even with noise added and the dataset altered. We show that without increasing the size of the network, the capacity of lightweight neural networks can be enhanced by TanhExp with only a few training epochs and no extra parameters added.
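TanhExp is a one-liner in most frameworks; a PyTorch version for reference:

```python
import torch

def tanh_exp(x: torch.Tensor) -> torch.Tensor:
    """TanhExp activation: f(x) = x * tanh(exp(x))."""
    return x * torch.tanh(torch.exp(x))

x = torch.linspace(-4, 4, steps=9)
print(tanh_exp(x))   # smooth, near-identity for large positive x
```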
Strip Pooling Network
Spatial pooling usually operates on a small region, which limits its capability to capture long-range dependencies and focus on distant regions. To overcome this, Hou et al. proposed strip pooling, a novel pooling method capable of encoding long-range context in either the horizontal or vertical spatial dimension. Strip pooling has two branches, for horizontal and vertical strip pooling. The horizontal branch first pools the input feature map in the horizontal direction: \begin{align} y^1 = \text{GAP}^w (X) \end{align} Then a 1D convolution with kernel size 3 is applied to $y^1$ to capture the relationship between different rows and channels. The result is expanded along the pooled dimension to make the output consistent with the input shape: \begin{align} y_h = \text{Expand}(\text{Conv1D}(y^1)) \end{align} Vertical strip pooling is performed analogously. Finally, the outputs of the two branches are fused using element-wise summation to produce the attention map: \begin{align} s &= \sigma(\text{Conv}^{1\times 1}(y_{v} + y_{h})) \end{align} \begin{align} Y &= s X \end{align} The strip pooling module (SPM) is further developed into the mixed pooling module (MPM). Both consider spatial and channel relationships to overcome the locality of convolutional neural networks. SPNet achieves state-of-the-art results on several complex semantic segmentation benchmarks.
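A simplified PyTorch sketch of the strip pooling attention branch described above (the published module includes additional convolutions and the mixed pooling variant):

```python
import torch
import torch.nn as nn

class StripPooling(nn.Module):
    """Horizontal + vertical strip pooling fused into a spatial attention map."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_h = nn.Conv1d(channels, channels, 3, padding=1)
        self.conv_v = nn.Conv1d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                              # x: (N, C, H, W)
        n, c, h, w = x.shape
        y_h = self.conv_h(x.mean(dim=3))                # pool over width  -> (N, C, H)
        y_v = self.conv_v(x.mean(dim=2))                # pool over height -> (N, C, W)
        y_h = y_h.unsqueeze(3).expand(n, c, h, w)       # expand back to input shape
        y_v = y_v.unsqueeze(2).expand(n, c, h, w)
        s = torch.sigmoid(self.fuse(y_h + y_v))         # fused attention map
        return x * s

out = StripPooling(32)(torch.randn(2, 32, 16, 16))
```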
A Cyclical Learning Rate Policy combines a linear learning rate decay with warm restarts.
VL-T5 is a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation. The model learns to generate labels in text based on the visual and textual inputs. In contrast to other existing methods, the framework unifies tasks as generating text labels conditioned on multimodal inputs, allowing the model to tackle vision-and-language tasks with a unified text-generation objective. The model uses text prefixes to adapt to different tasks.
Elastic Margin Loss for Deep Face Recognition
RTMDet: An Empirical Study of Designing Real-Time Object Detectors
Gaussian Mixture Variational Autoencoder
GMVAE, or Gaussian Mixture Variational Autoencoder, is a stochastic regularization layer for transformers. A GMVAE layer is trained using a 700-dimensional internal representation of the first MLP layer. For every output from the first MLP layer, the GMVAE layer first computes a latent low-dimensional representation sampling from the GMVAE posterior distribution to then provide at the output a reconstruction sampled from a generative model.
Libra R-CNN is an object detection model that seeks to achieve a balanced training procedure. The authors' motivation is that training in past detectors has suffered from imbalance, which generally occurs at three levels: sample level, feature level, and objective level. To mitigate the adverse effects, Libra R-CNN integrates three novel components: IoU-balanced sampling, a balanced feature pyramid, and balanced L1 loss, for reducing the imbalance at the sample, feature, and objective level, respectively.
CondConv, or Conditionally Parameterized Convolutions, are a type of convolution which learns specialized convolutional kernels for each example. In particular, we parameterize the convolutional kernels in a CondConv layer as a linear combination of $n$ experts, $(\alpha_1 W_1 + \cdots + \alpha_n W_n) * x$, where the routing weights $\alpha_i = r_i(x)$ are functions of the input learned through gradient descent. To efficiently increase the capacity of a CondConv layer, developers can increase the number of experts. This can be more computationally efficient than increasing the size of the convolutional kernel itself, because the convolutional kernel is applied at many different positions within the input, while the experts are combined only once per input.
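The following PyTorch sketch (expert count, initialization, and sigmoid routing chosen for illustration) shows the per-example kernel combination, using a grouped convolution to batch the example-specific kernels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Per-example kernel as a routed linear combination of expert kernels."""
    def __init__(self, in_ch, out_ch, k, num_experts=4):
        super().__init__()
        self.experts = nn.Parameter(
            torch.randn(num_experts, out_ch, in_ch, k, k) * 0.02)
        self.router = nn.Linear(in_ch, num_experts)
        self.k = k

    def forward(self, x):                                # x: (N, C, H, W)
        n = x.size(0)
        alpha = torch.sigmoid(self.router(x.mean(dim=(2, 3))))    # (N, E) routing
        w = torch.einsum('ne,eoiuv->noiuv', alpha, self.experts)  # per-example kernels
        out = F.conv2d(x.reshape(1, -1, *x.shape[2:]),            # fold batch into groups
                       w.reshape(-1, *w.shape[2:]),
                       padding=self.k // 2, groups=n)
        return out.reshape(n, -1, *out.shape[2:])

y = CondConv2d(16, 32, 3)(torch.randn(2, 16, 8, 8))      # -> (2, 32, 8, 8)
```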
A Contractive Autoencoder is an autoencoder that adds a penalty term to the classical reconstruction cost function. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. This penalty term results in a localized space contraction which in turn yields robust features on the activation layer. The penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold.
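For a sigmoid encoder $h = s(Wx + b)$, the Frobenius norm of the Jacobian has a cheap closed form, which the sketch below (hypothetical layer sizes and penalty weight) adds to the reconstruction loss:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())
decoder = nn.Linear(64, 784)

def contractive_loss(x, lam=1e-4):
    h = encoder(x)
    recon = decoder(h)
    mse = ((recon - x) ** 2).mean()
    # For a sigmoid encoder, ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
    W = encoder[0].weight                    # (64, 784)
    dh = (h * (1 - h)) ** 2                  # (N, 64) squared sigmoid derivative
    jacobian_fro2 = (dh @ (W ** 2).sum(dim=1)).mean()
    return mse + lam * jacobian_fro2

x = torch.rand(8, 784)
loss = contractive_loss(x)
loss.backward()
```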
SNIPER is a multi-scale training approach for instance-level recognition tasks like object detection and instance-level segmentation. Instead of processing all pixels in an image pyramid, SNIPER selectively processes context regions around the ground-truth objects (a.k.a. chips). This can help to speed up multi-scale training as it operates on low-resolution chips. Due to its memory-efficient design, SNIPER can benefit from Batch Normalization during training, and it makes larger batch sizes possible for instance-level recognition tasks on a single GPU.
Prompt Gradient Alignment
MeshGraphNet is a framework for learning mesh-based simulations using graph neural networks. The model can be trained to pass messages on a mesh graph and to adapt the mesh discretization during forward simulation. The model uses an Encode-Process-Decode architecture trained with one-step supervision, and can be applied iteratively to generate long trajectories at inference time. The encoder transforms the input mesh into a graph, adding extra world-space edges. The processor performs several rounds of message passing along mesh edges and world edges, updating all node and edge embeddings. The decoder extracts the acceleration for each node, which is used to update the mesh and produce the state at the next time step.
Macaw is a generative question-answering (QA) system that is built on UnifiedQA, itself built on T5. Macaw has three interesting features. First, it often produces high-quality answers to questions far outside the domain it was trained on, sometimes surprisingly so. Second, Macaw allows different permutations (“angles”) of inputs and outputs to be used. For example, we can give it a question and get an answer; or give it an answer and get a question; or give it a question and answer and get a set of multiple-choice (MC) options for that question. This multi-angle QA capability allows versatility in the way Macaw can be used, including recursively using outputs as new inputs to the system. Finally, Macaw also generates explanations as an optional output (or even input) element.
The NVAE Encoder Residual Cell is a residual connection block used in the NVAE architecture for the encoder. It applies two series of BN-Swish-Conv layers without changing the number of channels.
Hierarchical Multi-Task Learning
Multi-task learning (MTL) introduces an inductive bias based on a priori relations between tasks: the trainable model is compelled to learn more general dependencies by using these relations as an important data feature. Hierarchical MTL, in which different tasks are attached at different levels of the deep neural network, provides a more effective inductive bias than “flat” MTL. Hierarchical MTL also helps to mitigate the vanishing gradient problem in deep learning.
A scalable second-order optimization algorithm for deep learning. Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, which involve second derivatives and/or second-order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory, and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad) that, along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
Residual Normal Distributions are used to help the optimization of VAEs, preventing optimization from entering an unstable region. This can happen due to sharp gradients caused in situations where the encoder and decoder produce distributions far away from each other. The residual distribution parameterizes the approximate posterior $q(z^i \mid z_{<i}, x)$ relative to the prior $p(z^i \mid z_{<i})$. Let $p(z^i \mid z_{<i}) := \mathcal{N}(\mu_i(z_{<i}), \sigma_i(z_{<i}))$ be a Normal distribution for the $i$-th variable in the prior. Define $q(z^i \mid z_{<i}, x) := \mathcal{N}(\mu_i(z_{<i}) + \Delta\mu_i(z_{<i}, x),\ \sigma_i(z_{<i}) \cdot \Delta\sigma_i(z_{<i}, x))$, where $\Delta\mu_i$ and $\Delta\sigma_i$ are the relative location and scale of the approximate posterior with respect to the prior. With this parameterization, when the prior moves, the approximate posterior moves accordingly, if not changed.
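A small sketch of this parameterization and the resulting KL term, which depends only on the relative parameters (tensor shapes are illustrative):

```python
import torch

def residual_normal(mu_p, log_sig_p, delta_mu, log_delta_sig):
    """Posterior defined relative to the prior N(mu_p, sig_p):
       q = N(mu_p + delta_mu, sig_p * delta_sig).
    If the prior moves, the posterior follows when the deltas are unchanged."""
    sig_p, delta_sig = log_sig_p.exp(), log_delta_sig.exp()
    mu_q, sig_q = mu_p + delta_mu, sig_p * delta_sig
    # KL(q || p) in terms of the relative location and scale only:
    kl = 0.5 * (delta_mu ** 2 / sig_p ** 2 + delta_sig ** 2
                - 2 * log_delta_sig - 1)
    return mu_q, sig_q, kl

mu_q, sig_q, kl = residual_normal(torch.zeros(4), torch.zeros(4),
                                  torch.randn(4) * 0.1, torch.zeros(4))
```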
ConvBERT is a modification of the BERT architecture which uses span-based dynamic convolution to replace some self-attention heads and directly model local dependencies. Specifically, a new mixed attention module, which leverages the advantages of convolution to better capture local dependency, replaces the self-attention modules in BERT. Additionally, a new span-based dynamic convolution operation utilizes multiple input tokens to dynamically generate the convolution kernel. Lastly, ConvBERT also incorporates some new model designs, including bottleneck attention and a grouped linear operator for the feed-forward module (reducing the number of parameters).
Temporal Jittering is a method used in deep learning for video, where multiple training clips are sampled from each video with random start times at every epoch.
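A minimal sampling helper illustrating the idea (names and clip counts are arbitrary):

```python
import random

def sample_clips(video_len: int, clip_len: int, clips_per_video: int):
    """Temporal jittering: draw several clips per video with random
    start frames, re-drawn independently at every epoch."""
    starts = [random.randint(0, video_len - clip_len)
              for _ in range(clips_per_video)]
    return [(s, s + clip_len) for s in starts]

print(sample_clips(video_len=300, clip_len=16, clips_per_video=3))
```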
Fawkes is an image cloaking system that helps individuals inoculate their images against unauthorized facial recognition models. Fawkes achieves this by helping users add imperceptible pixel-level changes ("cloaks") to their own photos before releasing them. When used to train facial recognition models, these "cloaked" images produce functional models that consistently cause normal images of the user to be misidentified.
Neo-fuzzy-neuron
Neo-fuzzy-neuron is a type of artificial neural network that combines the characteristics of both fuzzy logic and neural networks. It uses a fuzzy inference system to model non-linear relationships between inputs and outputs, and a feedforward neural network to learn the parameters of the fuzzy system. The combination of these two approaches provides a flexible and powerful tool for solving a wide range of problems in areas such as pattern recognition, control, and prediction.
Feature Pyramid Grid
Feature Pyramid Grids, or FPG, is a deep multi-pathway feature pyramid that represents the feature scale-space as a regular grid of parallel bottom-up pathways fused by multi-directional lateral connections. It connects the backbone features of a ConvNet with a regular structure of parallel pyramid pathways, fused by four types of multi-directional lateral connections: AcrossSame, AcrossUp, AcrossDown, and AcrossSkip. AcrossSkip connections are direct, while all other types use convolution and ReLU layers. At a high level, FPG is a deep generalization of FPN from one to multiple pathways under a dense lateral connectivity structure.
A LAPGAN, or Laplacian Generative Adversarial Network, is a type of generative adversarial network that has a Laplacian pyramid representation. In the sampling procedure following training, we have a set of generative convnet models $\{G_0, \ldots, G_K\}$, each of which captures the distribution of coefficients for natural images at a different level of the Laplacian pyramid. Sampling an image is akin to a reconstruction procedure, except that the generative models are used to produce the coefficients $\tilde{h}_k$: the recurrence starts by setting $\tilde{I}_{K+1} = 0$ and using the model at the final level to generate a residual image using noise vector $z_K$: $\tilde{I}_K = G_K(z_K)$. Models at all levels except the final are conditional generative models that take an upsampled version $u(\tilde{I}_{k+1})$ of the current image as a conditioning variable, in addition to the noise vector $z_k$: $\tilde{I}_k = u(\tilde{I}_{k+1}) + G_k(z_k, u(\tilde{I}_{k+1}))$.

The generative models $\{G_0, \ldots, G_K\}$ are trained using the CGAN approach at each level of the pyramid. Specifically, we construct a Laplacian pyramid from each training image $I$. At each level $k$ we make a stochastic choice (with equal probability) to either (i) construct the coefficients $h_k$ using the standard Laplacian pyramid coefficient generation procedure, or (ii) generate them using $G_k$: $\tilde{h}_k = G_k(z_k, l_k)$, where $l_k = u(I_{k+1})$ is the upsampled low-pass image that also conditions the discriminator $D_k$, which must distinguish the real coefficients $h_k$ from the generated $\tilde{h}_k$. At the final level, $\tilde{h}_K = G_K(z_K)$ and $D_K$ takes $h_K$ or $\tilde{h}_K$ as input. Breaking the generation into successive refinements is the key idea. We give up any “global” notion of fidelity; an attempt is never made to train a network to discriminate between the output of a cascade and a real image and instead the focus is on making each step plausible.
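A structural sketch of the sampling recurrence, with toy stand-ins for the trained generators to show the coarse-to-fine data flow:

```python
import torch
import torch.nn.functional as F

def lapgan_sample(generators, noises, upsample):
    """generators is ordered [G_0, ..., G_K] (fine to coarse)."""
    K = len(generators) - 1
    image = generators[K](noises[K])               # coarsest level: I_K = G_K(z_K)
    for k in range(K - 1, -1, -1):                 # refine from coarse to fine
        up = upsample(image)                       # u(I_{k+1})
        image = up + generators[k](noises[k], up)  # I_k = u(I_{k+1}) + G_k(z_k, u(...))
    return image

# Toy usage with stand-in callables just to show shapes and flow.
ups = lambda img: F.interpolate(img, scale_factor=2)
G2 = lambda z: z                                   # unconditional 8x8 base "generator"
G1 = lambda z, c: 0.1 * z                          # residual "generator" at 16x16
G0 = lambda z, c: 0.1 * z                          # residual "generator" at 32x32
zs = [torch.randn(1, 3, 32, 32), torch.randn(1, 3, 16, 16), torch.randn(1, 3, 8, 8)]
sample = lapgan_sample([G0, G1, G2], zs, ups)      # -> (1, 3, 32, 32)
```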
SpineNet is a convolutional neural network backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
Beneš Block with Residual Switch Units
The Beneš block is a computation-efficient alternative to dense attention, enabling the modelling of long-range dependencies in O(n log n) time. In comparison, the dense attention commonly used in Transformers has O(n^2) complexity. In music, dependencies occur on several scales, including a coarse scale that requires processing very long sequences. Beneš blocks have been used in Residual Shuffle-Exchange Networks to achieve state-of-the-art results in music transcription. A Beneš block has a ‘receptive field’ covering the whole sequence and no bottleneck. These properties hold for dense attention but have not been shown for many sparse attention and dilated convolutional architectures.
Harris Hawks optimization
HHO is a popular swarm-based, gradient-free optimization algorithm with several active and time-varying phases of exploration and exploitation. The algorithm was first published in the journal Future Generation Computer Systems (FGCS) in 2019, and has since gained increasing attention among researchers due to its flexible structure, high performance, and high-quality results. The main logic of the HHO method is designed based on the cooperative behaviour and chasing styles of Harris' hawks in nature, called the "surprise pounce". Currently, there are many suggestions for enhancing the functionality of HHO, and several enhanced variants have appeared in leading Elsevier and IEEE Transactions journals. From the algorithmic behaviour viewpoint, HHO has several effective features:

- The escaping energy parameter has a dynamic, randomized, time-varying nature, which can further improve and harmonize the exploratory and exploitative patterns of HHO. This factor also helps HHO conduct a smooth transition between exploration and exploitation.
- Different exploration mechanisms with respect to the average location of hawks can increase the exploratory trends of HHO during initial iterations.
- Diverse LF-based patterns with short-length jumps enrich the exploitative behaviours of HHO when directing a local search.
- The progressive selection scheme lets search agents progressively advance their position and only select a better position, which can improve the quality of solutions and the intensification powers of HHO throughout the optimization procedure.
- HHO evaluates a series of searching strategies and then selects the best movement step. This feature also has a constructive influence on the exploitation inclinations of HHO.
- The randomized jump strength can assist candidate solutions in harmonizing the exploration and exploitation leanings.
- The application of adaptive and time-varying components allows HHO to handle difficulties of a feature space, including locally optimal solutions, multi-modality, and deceptive optima.

The source code of HHO is publicly available at https://aliasgharheidari.com/HHO.html
The NVAE Generative Residual Cell is a skip connection block used as part of the NVAE architecture for the generator. The residual cell expands the number of channels by a factor $E$ before applying the depthwise separable convolution, and then maps it back to $C$ channels. The design motivation was to help model long-range correlations in the data by increasing the receptive field of the network, which explains the expanding path, but also the use of depthwise convolutions to keep a handle on the parameter count.
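A rough PyTorch sketch of the expand-depthwise-contract pattern (the real cell also uses squeeze-and-excitation and NVAE's specific normalization choices; the expansion factor here is a placeholder):

```python
import torch
import torch.nn as nn

class NVAEGenResidualCell(nn.Module):
    """Expand channels E times, depthwise conv, contract back to C channels."""
    def __init__(self, c: int, expansion: int = 6):
        super().__init__()
        ec = c * expansion
        self.block = nn.Sequential(
            nn.BatchNorm2d(c),
            nn.Conv2d(c, ec, 1),                          # expand channels E times
            nn.BatchNorm2d(ec), nn.SiLU(),
            nn.Conv2d(ec, ec, 5, padding=2, groups=ec),   # depthwise 5x5 conv
            nn.BatchNorm2d(ec), nn.SiLU(),
            nn.Conv2d(ec, c, 1),                          # map back to C channels
            nn.BatchNorm2d(c),
        )

    def forward(self, x):
        return x + self.block(x)                          # residual skip connection

y = NVAEGenResidualCell(32)(torch.randn(2, 32, 8, 8))
```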
Height-driven Attention Network
Height-driven Attention Network, or HANet, is a general add-on module for improving semantic segmentation of urban-scene images. It selectively emphasizes informative features or classes according to the vertical position of a pixel. In urban-scene images, the pixel-wise class distributions differ significantly between horizontally segmented sections. Urban-scene images thus have their own distinct characteristics, yet most semantic segmentation networks do not reflect these unique attributes in their architecture. The proposed architecture incorporates the capability to exploit these attributes to handle urban-scene datasets effectively.
PULSE is a self-supervised photo upsampling algorithm. Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the downscaling loss, which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, the authors aim to restrict the search space to guarantee realistic outputs.
Voxel RoI Pooling is an RoI feature extractor that extracts RoI features directly from voxel features for further refinement. It starts by dividing a region proposal into regular sub-voxels, taking the center point as the grid point of the corresponding sub-voxel. Since the feature volumes are extremely sparse (non-empty voxels account for only a small fraction of the space), max pooling over the features of each sub-voxel cannot be applied directly. Instead, features from neighboring voxels are integrated into the grid points for feature extraction. Specifically, given a grid point $g_i$, voxel query is first exploited to group a set of neighboring voxels $\Gamma_i = \{v_1, \ldots, v_K\}$. Then the neighboring voxel features are aggregated with a PointNet module as: \begin{equation} \eta_i = \max_{k=1,\ldots,K} \left\{ \Phi\left( \left[ v_k - g_i;\ \phi_k \right] \right) \right\} \end{equation} where $v_k - g_i$ represents the relative coordinates, $\phi_k$ is the voxel feature of $v_k$, and $\Phi(\cdot)$ indicates an MLP. The max pooling operation is performed along the channels to obtain the aggregated feature vector $\eta_i$. In particular, Voxel RoI pooling is exploited to extract voxel features from the 3D feature volumes of the last two stages of the backbone network, and for each stage, two Manhattan distance thresholds are set to group voxels at multiple scales. Finally, the aggregated features pooled from different stages and scales are concatenated to obtain the RoI features.
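A minimal sketch of the per-grid-point aggregation formula (MLP width, feature size, and neighbor count are made up):

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(3 + 16, 32), nn.ReLU())   # Phi: MLP over [rel; feat]

def aggregate_grid_point(grid_point, voxel_centers, voxel_feats):
    """PointNet-style aggregation at one grid point: MLP over
    [relative coords; voxel feature], then channel-wise max pooling."""
    rel = voxel_centers - grid_point                 # (K, 3) relative coordinates
    h = mlp(torch.cat([rel, voxel_feats], dim=1))    # (K, 32)
    return h.max(dim=0).values                       # (32,) aggregated feature eta_i

feat = aggregate_grid_point(torch.zeros(3), torch.randn(5, 3), torch.randn(5, 16))
```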
Channel-wise Cross Attention is a module for semantic segmentation used in the UCTransNet architecture. It is used to fuse features of inconsistent semantics between the Channel Transformer and the U-Net decoder: it guides the channel and information filtration of the Transformer features and eliminates the ambiguity with the decoder features. Mathematically, we take the $i$-th level Transformer output $O_i$ and the $i$-th level decoder feature map $D_i$ as the inputs of Channel-wise Cross Attention. Spatial squeeze is performed by a global average pooling (GAP) layer, producing a vector $\mathcal{G}(X)$ with its $k$-th channel $\mathcal{G}(X)_k = \frac{1}{H \times W}\sum_{h=1}^{H}\sum_{w=1}^{W} X_k(h, w)$. We use this operation to embed the global spatial information and then generate the attention mask: \begin{equation} M_i = L_1 \cdot \mathcal{G}(O_i) + L_2 \cdot \mathcal{G}(D_i) \end{equation} where $L_1$ and $L_2$ are the weights of two Linear layers followed by the ReLU operator $\delta(\cdot)$. This operation encodes the channel-wise dependencies. Following ECA-Net, which empirically showed that avoiding dimensionality reduction is important for learning channel attention, the authors use a single Linear layer and sigmoid function to build the channel attention map. The resultant vector is used to recalibrate or excite $O_i$ to $\hat{O}_i = \sigma(M_i) \cdot O_i$, where the activation $\sigma(M_i)$ indicates the importance of each channel. Finally, the masked $\hat{O}_i$ is concatenated with the up-sampled features of the $i$-th level decoder.
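A rough PyTorch sketch of the squeeze, mask, excite, and concatenate steps described above (layer shapes are illustrative, and the exact composition in UCTransNet may differ in detail):

```python
import torch
import torch.nn as nn

class ChannelCrossAttention(nn.Module):
    """GAP-squeeze both inputs, one Linear per stream (no dimensionality
    reduction, following ECA-Net), sigmoid mask excites the Transformer
    features before concatenation with the decoder features."""
    def __init__(self, channels: int):
        super().__init__()
        self.l1 = nn.Linear(channels, channels)
        self.l2 = nn.Linear(channels, channels)

    def forward(self, o, d):            # o: Transformer feats, d: decoder feats
        m = self.l1(o.mean(dim=(2, 3))) + self.l2(d.mean(dim=(2, 3)))
        mask = torch.sigmoid(m)[:, :, None, None]
        o_hat = o * mask                # recalibrated Transformer features
        return torch.cat([o_hat, d], dim=1)

out = ChannelCrossAttention(64)(torch.randn(2, 64, 16, 16),
                                torch.randn(2, 64, 16, 16))
```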