Deformable DETR is an object detection method that mitigates the slow convergence and high complexity issues of DETR. It combines the best of the sparse spatial sampling of deformable convolution and the relation modeling capability of Transformers. Specifically, it introduces a deformable attention module, which attends to a small set of sampling locations as a pre-filter for prominent key elements out of all the feature map pixels. The module can be naturally extended to aggregating multi-scale features without the help of FPN.
InfoGAN is a type of generative adversarial network that modifies the GAN objective to encourage it to learn interpretable and meaningful representations. This is done by maximizing the mutual information between a fixed small subset of the GAN’s noise variables and the observations. Formally, InfoGAN is defined as a minimax game with a variational regularization of mutual information and a hyperparameter $\lambda$: $$\min_{G, Q} \max_{D} V_{\text{InfoGAN}}(D, G, Q) = V(D, G) - \lambda L_I(G, Q)$$ where $Q(c \mid x)$ is an auxiliary distribution that approximates the posterior $P(c \mid x)$ - the probability of the latent code $c$ given the data $x$ - and $L_I(G, Q)$ is the variational lower bound of the mutual information between the latent code and the observations. In the practical implementation, there is another fully-connected layer to output parameters for the conditional distribution $Q(c \mid x)$ (negligible computation on top of regular GAN structures). $Q$ is represented with a softmax non-linearity for a categorical latent code. For a continuous latent code, the authors assume a factored Gaussian.
Umbrella Reinforcement Learning
A computationally efficient approach for solving hard nonlinear problems of reinforcement learning (RL). It combines umbrella sampling, from computational physics/chemistry, with optimal control methods. The approach is realized with neural networks, using policy gradient. In computational efficiency and universality of implementation, it outperforms available state-of-the-art algorithms on hard RL problems with sparse rewards, state traps and a lack of terminal states. The proposed approach uses an ensemble of simultaneously acting agents, with a modified reward that includes the ensemble entropy, yielding an optimal exploration-exploitation balance.
R-CNN, or Regions with CNN Features, is an object detection model that applies high-capacity CNNs to bottom-up region proposals in order to localize and segment objects. It uses selective search to identify a number of bounding-box object region candidates (“regions of interest”), and then extracts features from each region independently for classification.
SRGAN Residual Block is a residual block used in the SRGAN generator for image super-resolution. It is similar to standard residual blocks, although it uses a PReLU activation function to help training (preventing sparse gradients during GAN training).
Nesterov Accelerated Gradient is a momentum-based SGD optimizer that "looks ahead" to where the parameters will be to calculate the gradient ex post rather than ex ante: $$v_t = \beta v_{t-1} + \eta \nabla_\theta J(\theta_{t-1} - \beta v_{t-1})$$ $$\theta_t = \theta_{t-1} - v_t$$ As with SGD with momentum, the momentum coefficient $\beta$ is usually set to a value around $0.9$. The intuition is that the standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient. In contrast, Nesterov momentum first makes a big jump in the direction of the previous accumulated gradient and then measures the gradient where it ends up and makes a correction. The idea being that it is better to correct a mistake after you have made it. Image Source: Geoff Hinton lecture notes
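As a minimal sketch of the look-ahead update described above (function and hyperparameter names are illustrative, not from the lecture notes):

```python
def nesterov_step(theta, velocity, grad_fn, lr=0.1, momentum=0.9):
    """One Nesterov update: evaluate the gradient at the look-ahead point."""
    lookahead = theta - momentum * velocity                    # big jump along old momentum
    velocity = momentum * velocity + lr * grad_fn(lookahead)   # correction measured there
    return theta - velocity, velocity

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.0
grad = lambda x: 2.0 * x
x, v = 5.0, 0.0
for _ in range(100):
    x, v = nesterov_step(x, v, grad)
# x has converged close to the minimum at 0
```

The only difference from classical momentum is where `grad_fn` is evaluated: at the look-ahead point rather than at the current parameters.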
Cascade R-CNN is an object detection architecture that seeks to address the problem of performance degrading at increased IoU thresholds (due to overfitting during training and an inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses). It is a multi-stage extension of R-CNN, where detector stages deeper into the cascade are sequentially more selective against close false positives. The cascade of R-CNN stages is trained sequentially, using the output of one stage to train the next. This is motivated by the observation that the output IoU of a regressor is almost invariably better than the input IoU. Cascade R-CNN does not aim to mine hard negatives. Instead, by adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage. When operating in this manner, a sequence of detectors adapted to increasingly higher IoUs can beat the overfitting problem, and thus be effectively trained. At inference, the same cascade procedure is applied. The progressively improved hypotheses are better matched to the increasing detector quality at each stage.
V-trace is an off-policy actor-critic reinforcement learning algorithm that helps tackle the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $(x_t, a_t, r_t)_{t=s}^{t=s+n}$ generated by the actor following some policy $\mu$. We can define the $n$-steps V-trace target for $V(x_s)$, our value approximation at state $x_s$, as: $$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \delta_t V$$ where $\delta_t V = \rho_t\left(r_t + \gamma V(x_{t+1}) - V(x_t)\right)$ is a temporal difference for $V$, and $\rho_t = \min\left(\bar{\rho}, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right)$ and $c_i = \min\left(\bar{c}, \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\right)$ are truncated importance sampling weights. We assume that the truncation levels are such that $\bar{\rho} \geq \bar{c}$.
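The target above can be computed with a backward recursion over the trajectory. A rough sketch, assuming pre-computed values, rewards, and importance ratios (names are illustrative):

```python
def vtrace_targets(values, rewards, is_ratios, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for every state in a finite trajectory.

    values:    V(x_0..x_T), length T+1 (last entry used as bootstrap)
    rewards:   r_0..r_{T-1}, length T
    is_ratios: pi(a_t|x_t)/mu(a_t|x_t) at each step, length T
    """
    T = len(rewards)
    rhos = [min(rho_bar, r) for r in is_ratios]   # truncated weight on the TD error
    cs = [min(c_bar, r) for r in is_ratios]       # truncated "trace-cutting" weight
    deltas = [rhos[t] * (rewards[t] + gamma * values[t + 1] - values[t])
              for t in range(T)]
    targets = list(values)
    acc = 0.0
    for t in reversed(range(T)):                  # backward accumulation of the sum
        acc = deltas[t] + gamma * cs[t] * acc
        targets[t] = values[t] + acc
    return targets
```

Sanity check: when all importance ratios are 1 (on-policy) and the truncation levels are 1, the target reduces to the ordinary n-step Bellman target.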
HiFi-GAN is a generative adversarial network for speech synthesis. HiFi-GAN consists of one generator and two discriminators: multi-scale and multi-period discriminators. The generator and discriminators are trained adversarially, along with two additional losses for improving training stability and model performance. The generator is a fully convolutional neural network. It uses a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms. Every transposed convolution is followed by a multi-receptive field fusion (MRF) module. For the discriminator, a multi-period discriminator (MPD) is used consisting of several sub-discriminators each handling a portion of periodic signals of input audio. Additionally, to capture consecutive patterns and long-term dependencies, the multi-scale discriminator (MSD) proposed in MelGAN is used, which consecutively evaluates audio samples at different levels.
Content-based attention is an attention mechanism based on cosine similarity: $$e_{ij} = \cos(q_i, k_j) = \frac{q_i \cdot k_j}{\|q_i\|\,\|k_j\|}$$ It was utilised in Neural Turing Machines as part of the Addressing Mechanism. We produce a normalized attention weighting by taking a softmax over these attention alignment scores.
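A minimal sketch of the cosine-similarity scoring plus softmax normalization (the `beta` argument plays the role of the NTM key-strength parameter; all names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def content_attention(query, keys, beta=1.0):
    """Softmax over (optionally sharpened) cosine similarities query-vs-keys."""
    scores = [beta * cosine(query, k) for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The key identical to the query receives the largest weight
weights = content_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The weights sum to 1 and concentrate on keys whose direction matches the query.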
Bottleneck Attention Module
Park et al. proposed the bottleneck attention module (BAM), aiming to efficiently improve the representational capability of networks. It uses dilated convolution to enlarge the receptive field of the spatial attention sub-module, and builds a bottleneck structure as suggested by ResNet to save computational cost. For a given input feature map $X \in \mathbb{R}^{C \times H \times W}$, BAM infers the channel attention $s_c \in \mathbb{R}^{C}$ and spatial attention $s_s \in \mathbb{R}^{H \times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\mathbb{R}^{C \times H \times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. Overall, BAM can be written as \begin{align} s_c &= \text{BN}(W_2(W_1\text{GAP}(X)+b_1)+b_2) \end{align} \begin{align} s_s &= \text{BN}(\text{Conv}_2^{1 \times 1}(\text{DC}_2^{3\times 3}(\text{DC}_1^{3 \times 3}(\text{Conv}_1^{1 \times 1}(X))))) \end{align} \begin{align} s &= \sigma(\text{Expand}(s_s)+\text{Expand}(s_c)) \end{align} \begin{align} Y &= s \otimes X+X \end{align} where $W_i$, $b_i$ denote weights and biases of fully connected layers respectively, $\text{Conv}_1$ and $\text{Conv}_2$ are convolution layers used for channel reduction, $\text{DC}_i$ denotes a dilated convolution with a $3 \times 3$ kernel, applied to utilize contextual information effectively, and $\text{Expand}$ expands the attention maps $s_s$ and $s_c$ to $\mathbb{R}^{C \times H \times W}$. BAM can emphasize or suppress features in both spatial and channel dimensions, as well as improving the representational power. Dimensional reduction applied to both channel and spatial attention branches enables it to be integrated with any convolutional neural network with little extra computational cost. However, although dilated convolutions enlarge the receptive field effectively, the module still fails to capture long-range contextual information as well as to encode cross-domain relationships.
Style-based Recalibration Module
SRM combines style transfer with an attention mechanism. Its main contribution is style pooling, which utilizes both the mean and standard deviation of the input features to improve its capability to capture global information. It also adopts a lightweight channel-wise fully-connected (CFC) layer, in place of the original fully-connected layer, to reduce the computational requirements. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, SRM first collects global information by using style pooling ($\text{SP}(\cdot)$), which combines global average pooling and global standard deviation pooling. Then a channel-wise fully connected ($\text{CFC}(\cdot)$) layer (i.e. fully connected per channel), batch normalization and a sigmoid function are used to provide the attention vector. Finally, as in an SE block, the input features are multiplied by the attention vector. Overall, an SRM can be written as: \begin{align} s = F_\text{srm}(X, \theta) & = \sigma (\text{BN}(\text{CFC}(\text{SP}(X)))) \end{align} \begin{align} Y & = s \otimes X \end{align} The SRM block improves both the squeeze and excitation modules, yet can be added after each residual unit like an SE block.
Difficulty-Aware Rejection Tuning
🎯 DART-Math Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving 📝 Paper@arXiv | 🤗 Datasets&Models@HF | 🐱 Code@GitHub 🐦 Thread@X(Twitter) | 🐶 中文博客@知乎 | 📊 Leaderboard@PapersWithCode | 📑 BibTeX Datasets: DART-Math DART-Math datasets are the state-of-the-art and data-efficient open-source instruction tuning datasets for mathematical reasoning. DART-Math-Hard contains ~585k mathematical QA pair samples constructed by applying DARS-Prop2Diff to the query set from the MATH and GSM8K training sets, and achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, opposite to vanilla rejection sampling. Performance produced by DART-Math-Hard is usually but not necessarily slightly better (~1% absolutely) than DART-Math-Uniform, which contains ~591k samples constructed by applying DARS-Uniform. Comparison between Mathematical Instruction Tuning Datasets Most previous datasets are constructed with ChatGPT, and many of them are not open-source, especially the best-performing ones.
| Math SFT Dataset | # of Samples | MATH | GSM8K | College | Synthesis Agent(s) | Open-Source |
| :---------------- | -----------: | ---: | ----: | ------: | :------------------ | :---------: |
| WizardMath | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| MetaMathQA | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | ✓ |
| MMIQC | 2294k | 37.4 | 75.4 | 28.5 | GPT-4+GPT-3.5+Human | ✓ |
| Orca-Math | 200k | -- | -- | -- | GPT-4 | ✓ |
| Xwin-Math-V1.1 | 1440k | 45.5 | 84.9 | 27.6 | GPT-4 | ✗ |
| KPMath-Plus | 1576k | 46.8 | 82.1 | -- | GPT-4 | ✗ |
| MathScaleQA | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| DART-Math-Uniform | 591k | 43.5 | 82.6 | 26.9 | DeepSeekMath-7B-RL | ✓ |
| DART-Math-Hard | 585k | 45.5 | 81.1 | 29.4 | DeepSeekMath-7B-RL | ✓ |

<sup>MATH and GSM8K are in-domain, while College (Math) is out-of-domain. Performance here is of models fine-tuned from Mistral-7B, except for Xwin-Math-V1.1, which is based on Llama2-7B. Bold/italic means the best/second-best score here.</sup>

Dataset Construction: DARS - Difficulty-Aware Rejection Sampling
Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Motivated by the observation above, we propose Difficulty-Aware Rejection Sampling (DARS) to collect more responses for more difficult queries.
Specifically, we introduce two strategies to increase the number of correct responses for difficult queries: 1) Uniform, which involves sampling responses for each query until each query accumulates $k_u$ correct responses, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset; 2) Prop2Diff, where we continue sampling responses until the number of correct responses for each query is proportional to its difficulty score. The most challenging queries will receive $k_p$ responses, where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works that demonstrate difficult samples can be more effective to enhance model capabilities (Sorscher et al., 2022; Liu et al., 2024b). See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff. Citation If you find our data, model or code useful for your work, please kindly cite our paper:

```latex
@article{tong2024dartmath,
  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
  year={2024},
  eprint={2407.13690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.13690},
}
```
Extreme Value Machine
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
Evolved Sign Momentum
The Lion optimizer was discovered by symbolic program search. It is more memory-efficient than most adaptive optimizers because it only keeps track of the momentum. The update of Lion is produced by the sign function.
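A rough sketch of the sign-based update, following the rule reported in the Lion paper (the step direction is the sign of an interpolation between the momentum and the current gradient; all names and default hyperparameters here are illustrative):

```python
def lion_step(theta, m, grad, lr=0.1, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update on parameter/momentum lists of equal length."""
    sign = lambda x: (x > 0) - (x < 0)
    # Direction: sign of the beta1-interpolation of momentum and gradient
    update = [sign(beta1 * mi + (1 - beta1) * gi) for mi, gi in zip(m, grad)]
    # Parameter step with decoupled weight decay
    theta = [t - lr * (u + wd * t) for t, u in zip(theta, update)]
    # Momentum is the only optimizer state kept between steps
    m = [beta2 * mi + (1 - beta2) * gi for mi, gi in zip(m, grad)]
    return theta, m

theta, m = lion_step([1.0], [0.0], [2.0])
```

Because every coordinate of the update is ±1 (or 0), the step size is controlled entirely by the learning rate, and the only persistent state is the momentum vector.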
Region-based Fully Convolutional Network
Region-based Fully Convolutional Networks, or R-FCNs, are a type of region-based object detector. In contrast to previous region-based object detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, R-FCN is fully convolutional with almost all computation shared on the entire image. To achieve this, R-FCN utilises position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.
Spatial-Channel Token Distillation
The Spatial-Channel Token Distillation method is proposed to improve the spatial and channel mixing from a novel knowledge distillation (KD) perspective. To be specific, we design a special KD mechanism for MLP-like Vision Models called Spatial-channel Token Distillation (STD), which improves the information mixing in both the spatial and channel dimensions of MLP blocks. Instead of modifying the mixing operations themselves, STD adds spatial and channel tokens to image patches. After forward propagation, the tokens are concatenated for distillation with the teachers’ responses as targets. Each token works as an aggregator of its dimension. Their objective is to encourage each mixing operation to extract maximal task-related information from its specific dimension.
CodeT5 is a Transformer-based model for code understanding and generation based on the T5 architecture. It utilizes an identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code. Specifically, the denoising Seq2Seq objective of T5 is extended with two identifier tagging and prediction tasks to enable the model to better leverage the token type information from programming languages, which are the identifiers assigned by developers. To improve the natural language-programming language alignment, a bimodal dual learning objective is used for a bidirectional conversion between natural language and programming language.
COLA is a self-supervised pre-training approach for learning a general-purpose representation of audio. It is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings.
Temporal Adaptive Module
TAM is designed to capture complex temporal relationships both efficiently and flexibly. It adopts an adaptive kernel instead of self-attention to capture global contextual information, with lower time complexity than GLTR. TAM has two branches, a local branch and a global branch. Given the input feature map $X \in \mathbb{R}^{C \times T \times H \times W}$, global spatial average pooling is first applied to the feature map to ensure TAM has a low computational cost. Then the local branch in TAM employs several 1D convolutions with ReLU nonlinearity across the temporal domain to produce location-sensitive importance maps for enhancing frame-wise features. The local branch can be written as \begin{align} s &= \sigma(\text{Conv1D}(\delta(\text{Conv1D}(\text{GAP}(X))))) \end{align} \begin{align} X^1 &= s \otimes X \end{align} Unlike the local branch, the global branch is location invariant and focuses on generating a channel-wise adaptive kernel based on global temporal information in each channel. For the $c$-th channel, the kernel can be written as \begin{align} \Theta_c = \text{Softmax}(\text{FC}_2(\delta(\text{FC}_1(\text{GAP}(X)_c)))) \end{align} where $\Theta_c \in \mathbb{R}^{K}$ and $K$ is the adaptive kernel size. Finally, TAM convolves the adaptive kernel $\Theta$ with $X^1$: \begin{align} Y = \Theta \otimes X^1 \end{align} With the help of the local branch and global branch, TAM can capture the complex temporal structures in video and enhance per-frame features at low computational cost. Due to its flexibility and lightweight design, TAM can be added to any existing 2D CNNs.
Kollen-Pollack Learning
Retrace is an off-policy Q-value estimation algorithm which has guaranteed convergence for a target policy $\pi$ and behaviour policy $\beta$. With off-policy rollouts for TD learning, we must use importance sampling for the update: $$\Delta Q^{\text{imp}}(S_t, A_t) = \gamma^{t} \prod_{1 \leq \tau \leq t} \frac{\pi(A_\tau \mid S_\tau)}{\beta(A_\tau \mid S_\tau)} \, \delta_t$$ This product term can lead to high variance, so Retrace modifies $\Delta Q$ to have importance weights truncated by no more than a constant $c$: $$\Delta Q^{\text{ret}}(S_t, A_t) = \gamma^{t} \prod_{1 \leq \tau \leq t} \min\left(c, \frac{\pi(A_\tau \mid S_\tau)}{\beta(A_\tau \mid S_\tau)}\right) \delta_t$$
Hou et al. proposed coordinate attention, a novel attention mechanism which embeds positional information into channel attention, so that the network can focus on large important regions at little computational cost. The coordinate attention mechanism has two consecutive steps, coordinate information embedding and coordinate attention generation. First, two spatial extents of pooling kernels encode each channel horizontally and vertically. In the second step, a shared convolutional transformation function is applied to the concatenated outputs of the two pooling layers. Then coordinate attention splits the resulting tensor into two separate tensors to yield attention vectors with the same number of channels as the input along the horizontal and vertical coordinates. This can be written as \begin{align} z^h &= \text{GAP}^h(X) \end{align} \begin{align} z^w &= \text{GAP}^w(X) \end{align} \begin{align} f &= \delta(\text{BN}(\text{Conv}_1^{1\times 1}([z^h;z^w]))) \end{align} \begin{align} f^h, f^w &= \text{Split}(f) \end{align} \begin{align} s^h &= \sigma(\text{Conv}_h^{1\times 1}(f^h)) \end{align} \begin{align} s^w &= \sigma(\text{Conv}_w^{1\times 1}(f^w)) \end{align} \begin{align} Y &= X s^h s^w \end{align} where $\text{GAP}^h$ and $\text{GAP}^w$ denote pooling functions for vertical and horizontal coordinates, and $s^h$ and $s^w$ represent the corresponding attention weights. Using coordinate attention, the network can accurately obtain the position of a targeted object. This approach has a larger receptive field than BAM and CBAM. Like an SE block, it also models cross-channel relationships, effectively enhancing the expressive power of the learned features. Due to its lightweight design and flexibility, it can be easily used in classical building blocks of mobile networks.
Vision-and-Language BERT
Vision-and-Language BERT (ViLBERT) is a BERT-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
A Ghost Module is an image block for convolutional neural networks that aims to generate more features by using fewer parameters. Specifically, an ordinary convolutional layer in deep neural networks is split into two parts. The first part involves ordinary convolutions but their total number is controlled. Given the intrinsic feature maps from the first part, a series of simple linear operations are applied for generating more feature maps. Given the widely existing redundancy in intermediate feature maps calculated by mainstream CNNs, ghost modules aim to reduce it. In practice, given the input data $X \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input data, respectively, the operation of an arbitrary convolutional layer for producing $n$ feature maps can be formulated as $$Y = X * f + b$$ where $*$ is the convolution operation, $b$ is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map with $n$ channels, and $f \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution filters in this layer. In addition, $h'$ and $w'$ are the height and width of the output data, and $k \times k$ is the kernel size of the convolution filters $f$, respectively. During this convolution procedure, the required number of FLOPs can be calculated as $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$, which is often as large as hundreds of thousands since the number of filters $n$ and the channel number $c$ are generally very large (e.g. 256 or 512). Here, the number of parameters (in $f$ and $b$) to be optimized is explicitly determined by the dimensions of input and output feature maps. The output feature maps of convolutional layers often contain much redundancy, and some of them could be similar to each other. We point out that it is unnecessary to generate these redundant feature maps one by one with a large number of FLOPs and parameters. Suppose that the output feature maps are ghosts of a handful of intrinsic feature maps with some cheap transformations. These intrinsic feature maps are often of smaller size and produced by ordinary convolution filters.
Specifically, $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h' \times w' \times m}$ are generated using a primary convolution: $$Y' = X * f'$$ where $f' \in \mathbb{R}^{c \times k \times k \times m}$ is the utilized filters, $m \leq n$, and the bias term is omitted for simplicity. The hyper-parameters such as filter size, stride and padding are the same as those in the ordinary convolution to keep the spatial size (i.e. $h'$ and $w'$) of the output feature maps consistent. To further obtain the desired $n$ feature maps, we apply a series of cheap linear operations on each intrinsic feature in $Y'$ to generate $s$ ghost features according to the following function: $$y_{ij} = \Phi_{i,j}(y'_i), \quad \forall i = 1, \dots, m, \quad j = 1, \dots, s$$ where $y'_i$ is the $i$-th intrinsic feature map in $Y'$, and $\Phi_{i,j}$ in the above function is the $j$-th (except the last one) linear operation for generating the $j$-th ghost feature map $y_{ij}$; that is to say, $y'_i$ can have one or more ghost feature maps $\{y_{ij}\}_{j=1}^{s}$. The last $\Phi_{i,s}$ is the identity mapping for preserving the intrinsic feature maps. We can obtain $n = m \cdot s$ feature maps as the output data of a Ghost module. Note that the linear operations $\Phi$ operate on each channel, whose computational cost is much less than that of the ordinary convolution. In practice, there could be several different linear operations in a Ghost module, e.g. $3 \times 3$ and $5 \times 5$ linear kernels, which will be analyzed in the experiment part.
WaveGAN is a generative adversarial network for unsupervised synthesis of raw-waveform audio (as opposed to image-like spectrograms). The WaveGAN architecture is based on DCGAN. The DCGAN generator uses the transposed convolution operation to iteratively upsample low-resolution feature maps into a high-resolution image. WaveGAN modifies this transposed convolution operation to widen its receptive field, using longer one-dimensional filters of length 25 instead of two-dimensional filters of size 5x5, and upsampling by a factor of 4 instead of 2 at each layer. The discriminator is modified in a similar way, using length-25 filters in one dimension and increasing stride from 2 to 4. These changes result in WaveGAN having the same number of parameters, numerical operations, and output dimensionality as DCGAN. An additional layer is added afterwards to allow for more audio samples. Further changes include: 1. Flattening 2D convolutions into 1D (e.g. 5x5 2D conv becomes length-25 1D). 2. Increasing the stride factor for all convolutions (e.g. stride 2x2 becomes stride 4). 3. Removing batch normalization from the generator and discriminator. 4. Training using the WGAN-GP strategy.
Phase Shuffle is a technique for removing pitched noise artifacts that come from using transposed convolutions in audio generation models. Phase shuffle is an operation with hyperparameter $n$. It randomly perturbs the phase of each layer’s activations by $-n$ to $n$ samples before input to the next layer. In the original application in WaveGAN, the authors only apply phase shuffle to the discriminator, as the latent vector already provides the generator a mechanism to manipulate the phase of a resultant waveform. Intuitively speaking, phase shuffle makes the discriminator’s job more challenging by requiring invariance to the phase of the input waveform.
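A rough sketch of the operation on a single 1-D activation sequence, assuming reflection padding at the boundary (as used in WaveGAN); names are illustrative:

```python
import random

def phase_shuffle(x, n, rng=random.Random(0)):
    """Shift a 1-D activation sequence by a random offset in [-n, n],
    reflection-padding the edge that falls off so the length is preserved."""
    k = rng.randint(-n, n)
    if k == 0:
        return list(x)
    if k > 0:                         # shift right: mirror the left edge
        pad = list(x[1:k + 1])[::-1]
        return pad + list(x[:-k])
    pad = list(x[k - 1:-1])[::-1]     # shift left: mirror the right edge
    return list(x[-k:]) + pad
```

The output always has the same length as the input, so the operation can be dropped between any two convolutional layers of the discriminator.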
Context Optimization
CoOp, or Context Optimization, is an automated prompt engineering method that avoids manual prompt tuning by modeling context words with continuous vectors that are end-to-end learned from data. The context could be shared among all classes or designed to be class-specific. During training, we simply minimize the prediction error using the cross-entropy loss with respect to the learnable context vectors, while keeping the pre-trained parameters fixed. The gradients can be back-propagated all the way through the text encoder, distilling the rich knowledge encoded in the parameters for learning task-relevant context.
Sigmoid Linear Unit
Sigmoid Linear Units, or SiLUs, are activation functions for neural networks. The activation of the SiLU is computed by the sigmoid function multiplied by its input: $\text{silu}(x) = x \cdot \sigma(x)$. See Gaussian Error Linear Units (GELUs), where the SiLU was originally coined, and see Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning and Swish: a Self-Gated Activation Function, where the SiLU was experimented with later.
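The definition above is a one-liner:

```python
import math

def silu(x):
    """SiLU: the input multiplied by its logistic sigmoid, x * sigma(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))
```

For large positive inputs the function approaches the identity, for large negative inputs it approaches zero, and unlike ReLU it is smooth and non-monotonic (it dips slightly below zero for moderate negative inputs).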
Beta-VAE is a type of variational autoencoder that seeks to discover disentangled latent factors. It modifies VAEs with an adjustable hyperparameter $\beta$ that balances latent channel capacity and independence constraints with reconstruction accuracy. The idea is to maximize the probability of generating the real data while keeping the distance between the real and estimated distributions small, under a threshold $\epsilon$. We can use the Karush-Kuhn-Tucker conditions to write this as a single equation: $$\mathcal{F}(\theta, \phi, \beta; x, z) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta\left(D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) - \epsilon\right)$$ where the KKT multiplier $\beta$ is the regularization coefficient that constrains the capacity of the latent channel $z$ and puts implicit independence pressure on the learnt posterior due to the isotropic nature of the Gaussian prior $p(z)$. We write this again using the complementary slackness assumption to get the Beta-VAE formulation: $$\mathcal{F}(\theta, \phi, \beta; x, z) \geq \mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
MoCo v2 is an improved version of the Momentum Contrast self-supervised learning algorithm. Motivated by the findings presented in the SimCLR paper, authors: - Replace the 1-layer fully connected layer with a 2-layer MLP head with ReLU for the unsupervised training stage. - Include blur augmentation. - Use cosine learning rate schedule. These modifications enable MoCo to outperform the state-of-the-art SimCLR with a smaller batch size and fewer epochs.
$n$-step Returns are used for value function estimation in reinforcement learning. Specifically, for $n$ steps we can write the complete return as: $$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$ We can then write an $n$-step backup, in the style of TD learning, as: $$V(S_t) \leftarrow V(S_t) + \alpha\left(G_{t:t+n} - V(S_t)\right)$$ Multi-step returns often lead to faster learning with a suitably tuned $n$. Image Credit: Sutton and Barto, Reinforcement Learning
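A minimal sketch of the target and backup (names illustrative; `values[t+n]` is assumed to exist, i.e. the bootstrap state is inside the stored trajectory):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n}
                  + gamma^n * V(S_{t+n})."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * values[t + n]

def n_step_backup(values, t, g, alpha=0.5):
    """TD-style update of V(S_t) toward the n-step target g."""
    values[t] += alpha * (g - values[t])
    return values

# Three undiscounted rewards of 1 plus a bootstrap value of 10
g = n_step_return([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 10.0], t=0, n=3, gamma=1.0)
```

With `n=1` this reduces to the ordinary one-step TD target; letting `n` cover the whole episode recovers the Monte Carlo return.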
Capsule Network
A Capsule Network is a type of artificial neural network that can be used to better model hierarchical relationships. The approach attempts to more closely mimic biological neural organization.
Spatial-Reduction Attention, or SRA, is a multi-head attention module used in the Pyramid Vision Transformer architecture which reduces the spatial scale of the key $K$ and value $V$ before the attention operation. This reduces the computational/memory overhead. Details of the SRA in Stage $i$ can be formulated as follows: $$\text{SRA}(Q, K, V) = \text{Concat}(\text{head}_0, \dots, \text{head}_{N_i})W^O$$ $$\text{head}_j = \text{Attention}(QW_j^Q, \text{SR}(K)W_j^K, \text{SR}(V)W_j^V)$$ where Concat is the concatenation operation. $W_j^Q$, $W_j^K$, $W_j^V$, and $W^O$ are linear projection parameters. $N_i$ is the head number of the attention layer in Stage $i$. Therefore, the dimension of each head (i.e. $d_{\text{head}}$) is equal to $\frac{C_i}{N_i}$. $\text{SR}(\cdot)$ is the operation for reducing the spatial dimension of the input sequence ($K$ or $V$), which is written as: $$\text{SR}(x) = \text{Norm}(\text{Reshape}(x, R_i)W^S)$$ Here, $x \in \mathbb{R}^{(H_iW_i) \times C_i}$ represents an input sequence, and $R_i$ denotes the reduction ratio of the attention layers in Stage $i$. $\text{Reshape}(x, R_i)$ is an operation of reshaping the input sequence $x$ to a sequence of size $\frac{H_iW_i}{R_i^2} \times (R_i^2 C_i)$. $W^S$ is a linear projection that reduces the dimension of the input sequence to $C_i$. $\text{Norm}(\cdot)$ refers to layer normalization.
RotatE is a method for generating graph embeddings which is able to model and infer various relation patterns including: symmetry/antisymmetry, inversion, and composition. Specifically, the RotatE model defines each relation as a rotation from the source entity to the target entity in the complex vector space. The RotatE model is trained using a self-adversarial negative sampling technique.
Zoneout is a method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks.
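The per-unit keep-or-update rule can be sketched as follows (a training-time sketch; function and argument names are illustrative):

```python
import random

def zoneout(h_prev, h_new, z=0.15, rng=random.Random(0)):
    """Per hidden unit: keep the previous timestep's value with probability z,
    otherwise take the freshly computed value (train-time behaviour)."""
    return [hp if rng.random() < z else hn for hp, hn in zip(h_prev, h_new)]
```

Unlike dropout, a "zoned-out" unit is not zeroed: it carries its previous state forward unchanged, which is what keeps gradient and state information flowing through time. At test time the RNN simply uses the new values (or, as with dropout, an expectation-matching rule).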
A Selective Kernel Convolution is a convolution that enables neurons to adaptively adjust their receptive field (RF) sizes among multiple kernels with different kernel sizes. Specifically, the SK convolution has three operators: Split, Fuse and Select. Multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer.
Guided Language to Image Diffusion for Generation and Editing
GLIDE is a generative model based on text-guided diffusion models for more photorealistic image generation. Guided diffusion is applied to text-conditional image synthesis and the model is able to handle free-form prompts. The diffusion model uses a text encoder to condition on natural language descriptions. The model is provided with editing capabilities in addition to zero-shot generation, allowing for iterative improvement of model samples to match more complex prompts. The model is fine-tuned to perform image inpainting.
Stacked Hourglass Networks are a type of convolutional neural network for pose estimation. They are based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.
Reduction-A is an image model block used in the Inception-v4 architecture.
SpatialDropout is a type of dropout for convolutional networks. For a given convolution feature tensor of size $n_\text{feats} \times \text{height} \times \text{width}$, we perform only $n_\text{feats}$ dropout trials and extend the dropout value across the entire feature map. Therefore, adjacent pixels in the dropped-out feature map are either all 0 (dropped-out) or all active, as illustrated in the figure to the right.
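A minimal train-time sketch over nested lists, with one Bernoulli trial per channel and inverted-dropout scaling (an assumption here; the original formulation does not prescribe the scaling convention):

```python
import random

def spatial_dropout(x, p=0.5, rng=random.Random(0)):
    """x has shape [channels][height][width] as nested lists.
    Each channel is dropped entirely or kept entirely (one trial per channel)."""
    keep = 1.0 - p
    out = []
    for fmap in x:
        if rng.random() < keep:
            # keep the whole map, rescaled so the expectation is unchanged
            out.append([[v / keep for v in row] for row in fmap])
        else:
            # zero the whole map: adjacent pixels are dropped together
            out.append([[0.0 for _ in row] for row in fmap])
    return out

out = spatial_dropout([[[1.0, 1.0], [1.0, 1.0]]] * 4, p=0.5)
```

Note the all-or-nothing structure: within any single channel of the output, every pixel shares the same fate.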
Orthogonal Regularization is a regularization technique for convolutional neural networks, introduced with generative modelling as the task in mind. Orthogonality is argued to be a desirable quality in ConvNet filters, partially because multiplication by an orthogonal matrix leaves the norm of the original matrix unchanged. This property is valuable in deep or recurrent networks, where repeated matrix multiplication can result in signals vanishing or exploding. To try to maintain orthogonality throughout training, Orthogonal Regularization encourages weights to be orthogonal by pushing them towards the nearest orthogonal manifold. The objective function is augmented with the cost: $$\mathcal{L}_{ortho} = \sum\left(\left|W W^{T} - I\right|\right)$$ where $\sum$ indicates a sum across all filter banks, $W$ is a filter bank, and $I$ is the identity matrix.
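A rough sketch of the penalty for one filter bank, using a squared Frobenius norm of $WW^T - I$ as the concrete choice of norm (an assumption; the paper's $|\cdot|$ could equally be read as an entrywise L1 norm):

```python
def orthogonal_penalty(W):
    """Penalty measuring how far W W^T is from the identity.
    W is a list of rows, each row a filter flattened to a vector."""
    n = len(W)
    penalty = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(W[i], W[j]))  # (W W^T)[i][j]
            target = 1.0 if i == j else 0.0               # I[i][j]
            penalty += (dot - target) ** 2
    return penalty
```

The penalty is zero exactly when the filters are orthonormal, so adding it (scaled by a coefficient) to the training loss nudges each filter bank toward the orthogonal manifold.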
Causal convolutions are a type of convolution used for temporal data which ensures the model cannot violate the ordering in which we model the data: the prediction emitted by the model at timestep $t$ cannot depend on any of the future timesteps $t+1, t+2, \dots$. For images, the equivalent of a causal convolution is a masked convolution, which can be implemented by constructing a mask tensor and doing an element-wise multiplication of this mask with the convolution kernel before applying it. For 1-D data such as audio, one can more easily implement this by shifting the output of a normal convolution by a few timesteps.
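The 1-D shifting trick above amounts to left-padding the input by `kernel_size - 1` zeros, so that each output sample only sees the present and the past. A minimal sketch (names illustrative):

```python
def causal_conv1d(x, kernel):
    """1-D causal convolution: output[t] = sum_j kernel[j] * x[t - j],
    with zeros assumed before the start of the sequence."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)      # left zero-padding, no right padding
    return [sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
            for t in range(len(x))]
```

Because only past-side padding is used, the output has the same length as the input and `output[t]` never touches `x[t+1], x[t+2], ...`, which is the defining property of the causal layer.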