The Truncation Trick is a latent sampling procedure for generative adversarial networks, where we sample from a truncated normal distribution (values that fall outside a range are resampled to fall inside that range). The original implementation was in Megapixel Size Image Creation with GAN. In BigGAN, the authors find this provides a boost to the Inception Score and FID.
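The resampling step can be sketched in a few lines of NumPy (the threshold value here is illustrative; BigGAN treats it as a tunable trade-off between fidelity and variety):

```python
import numpy as np

def truncated_normal(shape, threshold=1.0, rng=None):
    """Sample z ~ N(0, 1) and resample any value with |z| > threshold."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.standard_normal(shape)
    while True:
        outside = np.abs(z) > threshold
        if not outside.any():
            return z
        z[outside] = rng.standard_normal(outside.sum())  # resample offenders

z = truncated_normal((4, 128), threshold=0.5)  # a batch of latent vectors
```

Smaller thresholds trade sample variety for individual sample quality.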
A Projection Discriminator is a type of discriminator for generative adversarial networks. It is motivated by a probabilistic model in which the distribution of the conditional variable y given x is a discrete or uni-modal continuous distribution. If we look at the optimal solution for the loss function in the vanilla GAN, we can decompose it into the sum of two log-likelihood ratios: f*(x, y) = log[q(y|x)/p(y|x)] + log[q(x)/p(x)] = r(y|x) + r(x). We can model the log-likelihood ratios r(y|x) and r(x) by parametric functions. If we make a standing assumption that p(y|x) and q(y|x) are simple distributions, like those that are Gaussian or discrete log-linear on the feature space, then a parametrization of the following form becomes natural: f(x, y; θ) = y^T V φ(x; θ_Φ) + ψ(φ(x; θ_Φ); θ_Ψ), where V is the embedding matrix of y, φ is a vector output function of x, and ψ is a scalar function of the same φ(x; θ_Φ) that appears in f. The learned parameters θ = {V, θ_Φ, θ_Ψ} are trained to optimize the adversarial loss. This model of the discriminator is the projection.
Two Time-scale Update Rule
The Two Time-scale Update Rule (TTUR) is an update rule for generative adversarial networks trained with stochastic gradient descent. TTUR uses an individual learning rate for the discriminator and the generator. The main premise is that the discriminator converges to a local minimum when the generator is fixed. If the generator changes slowly enough, then the discriminator still converges, since the generator perturbations are small. Besides ensuring convergence, the performance may also improve, since the discriminator must first learn new patterns before they are transferred to the generator. In contrast, a generator which is overly fast drives the discriminator steadily into new regions without capturing its gathered information.
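A minimal sketch of the idea: two players updated with separate learning rates (the toy scalar losses and the specific rate values here are illustrative, not from the paper):

```python
def sgd_step(param, grad, lr):
    """One plain SGD update."""
    return param - lr * grad

# TTUR: the discriminator gets its own (here, larger) learning rate.
lr_disc, lr_gen = 4e-4, 1e-4

# One alternating update on toy losses L_D(d) = (d - 1)^2 and L_G(g) = g^2.
d, g = 0.0, 2.0
d = sgd_step(d, 2 * (d - 1), lr_disc)  # discriminator step
g = sgd_step(g, 2 * g, lr_gen)         # generator step
```

In practice the same pattern is realized by constructing two optimizers with different learning rates, one per network.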
DropBlock is a structured form of dropout directed at regularizing convolutional networks. In DropBlock, units in a contiguous region of a feature map are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data.
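A simplified single-channel sketch of the idea in NumPy (the real method works per feature-map channel and normalizes the kept activations; the seed-probability formula here is a simplification of the paper's γ):

```python
import numpy as np

def dropblock(x, block_size, drop_prob, rng):
    """Zero out contiguous square regions of a 2-D feature map."""
    h, w = x.shape
    gamma = drop_prob / block_size ** 2      # seed probability per position
    seeds = rng.random((h, w)) < gamma
    mask = np.ones_like(x)
    half = block_size // 2
    for i, j in zip(*np.nonzero(seeds)):
        # zero a block_size x block_size square centred on each seed
        mask[max(i - half, 0):i + half + 1, max(j - half, 0):j + half + 1] = 0.0
    return x * mask

rng = np.random.default_rng(0)
fmap = np.ones((16, 16))
out = dropblock(fmap, block_size=3, drop_prob=0.3, rng=rng)
```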
Off-Diagonal Orthogonal Regularization is a modified form of orthogonal regularization originally used in BigGAN. The original orthogonal regularization is known to be limiting, so the authors explore several variants designed to relax the constraint while still imparting the desired smoothness to the models. They opt for a modification where they remove the diagonal terms from the regularization, which minimizes the pairwise cosine similarity between filters without constraining their norm: R_β(W) = β ‖W^T W ⊙ (1 − I)‖²_F, where 1 denotes a matrix with all elements set to 1 and I is the identity. The authors sweep β values and select 10⁻⁴.
Approximate Bayesian Computation
A class of methods in Bayesian statistics where the posterior distribution is approximated via a rejection scheme over simulations, because the likelihood function is intractable. Different parameters are sampled and simulated; a distance function then measures the quality of each simulation against real observations. Only simulations whose distance falls below a certain threshold are accepted. Image source: Kulkarni et al.
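The rejection scheme can be sketched directly (the Gaussian model, uniform prior, summary statistic, and threshold below are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(3.0, 1.0, size=200)     # "real" data, unknown mean

def simulate(theta, n, rng):
    return rng.normal(theta, 1.0, size=n)     # simulator for parameter theta

def distance(sim, obs):
    return abs(sim.mean() - obs.mean())       # summary-statistic distance

accepted = []
for _ in range(5000):
    theta = rng.uniform(0.0, 6.0)             # sample from the prior
    sim = simulate(theta, observed.size, rng)
    if distance(sim, observed) < 0.1:         # accept below the threshold
        accepted.append(theta)

posterior_mean = float(np.mean(accepted))     # approximate posterior mean
```

The accepted parameters form a sample from an approximation to the posterior; tightening the threshold sharpens the approximation at the cost of more rejections.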
Correlation Alignment for Deep Domain Adaptation
Feedback Alignment
Xavier Initialization, or Glorot Initialization, is an initialization scheme for neural networks. Biases are initialized to 0 and the weights W_ij at each layer are initialized as: W_ij ~ U[−√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})], where U is a uniform distribution, n_j is the size of the previous layer (the number of columns in W), and n_{j+1} is the size of the current layer.
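A sketch of the scheme in NumPy, following the uniform form above:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]."""
    rng = np.random.default_rng(0) if rng is None else rng
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(256, 128)   # weights for a 256 -> 128 layer
b = np.zeros(128)              # biases initialized to zero
```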
Parameterized ReLU
A Parametric Rectified Linear Unit, or PReLU, is an activation function that generalizes the traditional rectified unit with a learnable slope for negative values. Formally: f(y_i) = y_i if y_i > 0, and f(y_i) = a_i y_i otherwise, where a_i is a learned coefficient. The intuition is that different layers may require different types of nonlinearity. Indeed, the authors find in experiments with convolutional neural networks that PReLUs for the initial layers have more positive slopes, i.e. closer to linear. Since the filters of the first layers are Gabor-like filters such as edge or texture detectors, this shows a circumstance where positive and negative responses of filters are respected. In contrast, the authors find deeper layers have smaller coefficients, suggesting the model becomes more discriminative at later layers (while it wants to retain more information at earlier layers).
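The function itself is a one-liner; in a real network the slope `a` would be a learned per-channel parameter rather than the fixed value used here:

```python
import numpy as np

def prelu(x, a):
    """f(x) = x for x > 0, a * x otherwise; a is a learned slope."""
    return np.where(x > 0, x, a * x)

out = prelu(np.array([-2.0, -0.5, 0.0, 1.5]), a=0.25)
```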
Target Policy Smoothing is a regularization strategy for the value function in reinforcement learning. Deterministic policies can overfit to narrow peaks in the value estimate, making them highly susceptible to function approximation error and increasing the variance of the target. To reduce this variance, target policy smoothing adds a small amount of random noise to the target policy and averages over mini-batches, approximating a SARSA-like expectation/integral. The modified target update is: y = r + γ Q_{θ'}(s', π_{φ'}(s') + ε), with ε ~ clip(N(0, σ), −c, c), where the added noise is clipped to keep the target close to the original action. The outcome is an algorithm reminiscent of Expected SARSA, where the value estimate is instead learned off-policy and the noise added to the target policy is chosen independently of the exploration policy. The value estimate learned is with respect to a noisy policy defined by the parameter σ.
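The noise-and-clip step can be sketched as follows (the stand-in `tanh` policy, the action bound of 1, and the σ and c values are illustrative; TD3 uses σ = 0.2 and c = 0.5 by default):

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_target_action(actor_target, next_state, sigma=0.2, c=0.5):
    """a' = pi(s') + clip(eps, -c, c), eps ~ N(0, sigma)."""
    a = actor_target(next_state)
    noise = np.clip(rng.normal(0.0, sigma, size=a.shape), -c, c)
    return np.clip(a + noise, -1.0, 1.0)   # keep the action in its valid range

actor = lambda s: np.tanh(s)               # stand-in target policy
a = smoothed_target_action(actor, np.array([0.3, -1.2]))
```

The smoothed action `a` is then fed to the target critic when forming the bootstrap target y.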
Contrastive Predictive Coding (CPC) learns self-supervised representations by predicting the future in latent space using powerful autoregressive models. The model uses a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful for predicting future samples. First, a non-linear encoder g_enc maps the input sequence of observations x_t to a sequence of latent representations z_t = g_enc(x_t), potentially with a lower temporal resolution. Next, an autoregressive model g_ar summarizes all z_{≤t} in the latent space and produces a context latent representation c_t = g_ar(z_{≤t}). A density ratio is modelled which preserves the mutual information between x_{t+k} and c_t as follows: f_k(x_{t+k}, c_t) ∝ p(x_{t+k} | c_t) / p(x_{t+k}), where ∝ stands for 'proportional to' (i.e. up to a multiplicative constant). Note that the density ratio f can be unnormalized (it does not have to integrate to 1). The authors use a simple log-bilinear model: f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t). Any type of encoder and autoregressive model can be used. An example the authors opt for is strided convolutional layers with residual blocks and GRUs. The encoder and autoregressive models are trained to minimize an InfoNCE loss (see components).
Exponential Decay is a learning rate schedule where we decay the learning rate over iterations using an exponential function: α_t = α_0 · e^{−kt}, where α_0 is the initial learning rate, k is the decay rate, and t is the iteration number. Image Credit: Suki Lau
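As a sketch of the schedule (the α_0 and k values here are illustrative):

```python
import math

def exponential_decay(lr0, k, t):
    """alpha_t = alpha_0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

lrs = [exponential_decay(0.1, 0.05, t) for t in range(0, 100, 10)]
```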
\begin{equation} DiceLoss\left( y, \overline{p} \right) = 1 - \dfrac{\left( 2y\overline{p} + 1 \right)} {\left( y+\overline{p} + 1 \right)} \end{equation}
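The formula above can be sketched in NumPy for a batch of predictions (here `y` is a binary ground-truth mask and `p` the predicted probabilities; the +1 terms act as smoothing so the loss is defined for empty masks):

```python
import numpy as np

def dice_loss(y, p):
    """1 - (2 * sum(y*p) + 1) / (sum(y) + sum(p) + 1)."""
    num = 2.0 * np.sum(y * p) + 1.0
    den = np.sum(y) + np.sum(p) + 1.0
    return 1.0 - num / den

y = np.array([1.0, 1.0, 0.0, 0.0])     # ground-truth mask
p = np.array([0.9, 0.8, 0.1, 0.0])     # predicted probabilities
loss = dice_loss(y, p)
```

A perfect prediction (p equal to y) gives a loss of exactly 0.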
Self-Organizing Map
The Self-Organizing Map (SOM), also commonly known as a Kohonen network (Kohonen 1982, Kohonen 2001), is a computational method for the visualization and analysis of high-dimensional data, especially experimentally acquired information. Extracted from Scholarpedia. Sources: Image: Scholarpedia. Paper: Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982). Book: Self-Organizing Maps.
Minimum Description Length
Minimum Description Length provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution. Extracted from Scholarpedia. Sources: Paper: J. Rissanen (1978) Modeling by the shortest data description. Automatica 14, 465–471. Book: P. D. Grünwald (2007) The Minimum Description Length Principle, MIT Press, June 2007, 570 pages.
ReLU6 is a modification of the rectified linear unit where we limit the activation to a maximum size of 6. The motivation is increased robustness when used with low-precision computation. Image Credit: PyTorch
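The function is a simple clamp:

```python
import numpy as np

def relu6(x):
    """min(max(x, 0), 6): ReLU capped at 6."""
    return np.minimum(np.maximum(x, 0.0), 6.0)

out = relu6(np.array([-3.0, 2.0, 8.0]))
```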
Additive Angular Margin Loss
ArcFace, or Additive Angular Margin Loss, is a loss function used in face recognition tasks. The softmax is traditionally used in these tasks. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations. The ArcFace loss transforms the logits as W_j^T x_i = ‖W_j‖ ‖x_i‖ cos θ_j, where θ_j is the angle between the weight W_j and the feature x_i. The individual weight is fixed as ‖W_j‖ = 1 by normalization. The embedding feature is fixed as ‖x_i‖ = 1 by normalization and re-scaled to s. The normalisation step on features and weights makes the predictions depend only on the angle between the feature and the weight. The learned embedding features are thus distributed on a hypersphere with a radius of s. Finally, an additive angular margin penalty m is added between x_i and W_{y_i} to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty in the normalised hypersphere, the method is named ArcFace. The authors select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively. As the figure shows, the softmax loss provides roughly separable feature embeddings but produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss clearly enforces a more evident gap between the nearest classes. Other alternatives to enforce intra-class compactness and inter-class distance include Supervised Contrastive Learning.
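The logit transformation can be sketched in NumPy (the s and m defaults follow common ArcFace settings; the tiny feature/weight matrices are purely illustrative, and a real implementation would feed these logits into a softmax cross-entropy loss):

```python
import numpy as np

def arcface_logits(x, W, y, s=64.0, m=0.5):
    """Normalize features and class weights, add angular margin m to the target class."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # ||x_i|| = 1
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # ||W_j|| = 1 (columns)
    cos = np.clip(x @ W, -1.0, 1.0)
    theta = np.arccos(cos)
    theta[np.arange(len(y)), y] += m                    # additive angular margin
    return s * np.cos(theta)                            # re-scaled logits

x = np.array([[1.0, 0.0]])                  # one 2-D feature
W = np.array([[1.0, 0.0],                   # class-0 weight
              [0.0, 1.0]])                  # class-1 weight (as columns)
logits = arcface_logits(x, W, y=np.array([0]))
```

The margin shrinks the target-class logit, forcing the feature to sit closer (in angle) to its class weight before the loss is satisfied.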
Surrogate Lagrangian Relaxation
Adversarial Model Perturbation
Adversarial Model Perturbation (AMP) is based on the understanding that flat local minima of the empirical risk cause the model to generalize better. AMP improves generalization by minimizing the AMP loss, which is obtained from the empirical risk by applying a worst-case norm-bounded perturbation to each point in the parameter space.
A Highway Network is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on "information highways". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions.
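The gating can be sketched for a single layer as y = H(x)·T(x) + x·(1 − T(x)), where T is the transform gate (the tanh transform and the weight shapes below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = H(x) * T(x) + x * (1 - T(x)); T gates between transform and carry."""
    h = np.tanh(x @ W_h + b_h)       # transform path H(x)
    t = sigmoid(x @ W_t + b_t)       # transform gate T(x)
    return h * t + x * (1.0 - t)

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal((2, d))
W_h, W_t = rng.standard_normal((d, d)), rng.standard_normal((d, d))
# A strongly negative transform bias closes the gate, so y ~= x (the "highway").
y = highway_layer(x, W_h, np.zeros(d), W_t, -20.0 * np.ones(d))
```

Biasing the gate toward carry behaviour at initialization is what lets very deep stacks of these layers train with plain SGD.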
Hard Swish is a type of activation function based on Swish, but replaces the computationally expensive sigmoid with a piecewise linear analogue: h-swish(x) = x · ReLU6(x + 3) / 6.
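The formula above as a sketch in NumPy:

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hard_swish(x):
    """x * ReLU6(x + 3) / 6 -- a piecewise-linear stand-in for x * sigmoid(x)."""
    return x * relu6(x + 3.0) / 6.0

out = hard_swish(np.array([-4.0, 0.0, 4.0]))
```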
Features Explanation Method
L1 Regularization is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the L1 norm of the weights: E(w) + λ‖w‖₁, where λ is a value determining the strength of the penalty. In contrast to weight decay, L1 regularization promotes sparsity; i.e. some parameters have an optimal value of zero. Image Source: Wikipedia
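The penalized objective is a one-liner (the λ value is illustrative):

```python
import numpy as np

def l1_penalized_loss(primary_loss, w, lam):
    """E(w) + lam * ||w||_1."""
    return primary_loss + lam * np.sum(np.abs(w))

w = np.array([1.0, -2.0, 0.0])
total = l1_penalized_loss(1.0, w, lam=0.1)   # 1.0 + 0.1 * (1 + 2 + 0)
```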
Weight Normalization is a normalization method for training neural networks. It is inspired by batch normalization, but it is a deterministic method that does not share batch normalization's property of adding noise to the gradients. It reparameterizes each k-dimensional weight vector w in terms of a parameter vector v and a scalar parameter g, and performs stochastic gradient descent with respect to those parameters instead. Weight vectors are expressed in terms of the new parameters using: w = (g / ‖v‖) v, where v is a k-dimensional vector, g is a scalar, and ‖v‖ denotes the Euclidean norm of v. This reparameterization has the effect of fixing the Euclidean norm of the weight vector w: we now have ‖w‖ = g, independent of the parameters v.
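The reparameterization itself is straightforward:

```python
import numpy as np

def weight_norm(v, g):
    """w = (g / ||v||) * v, so that ||w|| == g regardless of v."""
    return (g / np.linalg.norm(v)) * v

v = np.array([3.0, 4.0])       # ||v|| = 5
w = weight_norm(v, g=2.0)      # direction from v, norm from g
```

In training, gradients are taken with respect to v and g rather than w directly.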
FixMatch is an algorithm that first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image. Description and image credit: FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
DropConnect generalizes Dropout by randomly dropping the weights rather than the activations. DropConnect is similar to Dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights W, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage. Note that this is not equivalent to setting W to be a fixed sparse matrix during training. For a DropConnect layer, the output is given as: r = a((M ⊙ W) v). Here r is the output of a layer, v is the input to a layer, W are the weight parameters, a is an activation function, and M is a binary matrix encoding the connection information, where M_ij ∼ Bernoulli(p). Each element of the mask M is drawn independently for each example during training, essentially instantiating a different connectivity for each example seen. Additionally, the biases are also masked out during training.
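A sketch of one forward pass (using ReLU as the activation a; a real implementation would also mask the bias and redraw M per example):

```python
import numpy as np

def dropconnect_layer(W, v, p, rng):
    """r = a((M * W) v), with M_ij ~ Bernoulli(p) and a = ReLU."""
    M = rng.random(W.shape) < p        # keep mask on the weights
    return np.maximum((M * W) @ v, 0.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
v = rng.standard_normal(3)
r = dropconnect_layer(W, v, p=0.5, rng=rng)
```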
Procrustes
SAGA is a method in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem.
Adaptive Label Smoothing
Chain-of-thought prompting
Chain-of-thought prompts contain a series of intermediate reasoning steps, and they are shown to significantly improve the ability of large language models to perform certain tasks that involve complex reasoning (e.g., arithmetic, commonsense reasoning, symbolic reasoning, etc.)
The Huber loss function describes the penalty incurred by an estimation procedure f. Huber (1964) defines the loss function piecewise: \begin{equation} L_{\delta}(a) = \begin{cases} \frac{1}{2} a^{2} & \text{for } |a| \leq \delta, \\ \delta \left( |a| - \frac{1}{2}\delta \right) & \text{otherwise.} \end{cases} \end{equation} This function is quadratic for small values of a and linear for large values, with equal values and slopes of the two sections at the points where |a| = δ. The variable a often refers to the residuals, that is, to the difference between the observed and predicted values, a = y − f(x), so the former can be expanded to: \begin{equation} L_{\delta}(y, f(x)) = \begin{cases} \frac{1}{2} (y - f(x))^{2} & \text{for } |y - f(x)| \leq \delta, \\ \delta \left( |y - f(x)| - \frac{1}{2}\delta \right) & \text{otherwise.} \end{cases} \end{equation} The Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated; it thus "smooths out" the former's corner at the origin.
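The piecewise definition above translates directly into NumPy:

```python
import numpy as np

def huber(a, delta=1.0):
    """0.5 * a^2 for |a| <= delta, else delta * (|a| - 0.5 * delta)."""
    abs_a = np.abs(a)
    quadratic = 0.5 * a ** 2
    linear = delta * (abs_a - 0.5 * delta)
    return np.where(abs_a <= delta, quadratic, linear)

small = huber(np.array([0.5]))          # quadratic regime
large = huber(np.array([3.0]))          # linear regime
```

At |a| = δ the two branches meet with equal value and slope, which is what makes the loss smooth.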
Layer-wise Adaptive Rate Scaling, or LARS, is a large batch optimization technique. There are two notable differences between LARS and other adaptive algorithms such as Adam or RMSProp: first, LARS uses a separate learning rate for each layer and not for each weight. And second, the magnitude of the update is controlled with respect to the weight norm for better control of training speed.
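A sketch of the layer-wise scaling (simplified: no momentum or weight decay, and the trust coefficient value is illustrative):

```python
import numpy as np

def lars_update(w, grad, base_lr, trust_coeff=0.001):
    """Scale the step for this layer by ||w|| / ||grad||."""
    local_lr = trust_coeff * np.linalg.norm(w) / (np.linalg.norm(grad) + 1e-9)
    return w - base_lr * local_lr * grad

w = np.array([3.0, 4.0])           # one layer's weights, ||w|| = 5
g = np.array([0.0, 2.0])           # its gradient, ||g|| = 2
w_new = lars_update(w, g, base_lr=1.0)
```

Because the local rate depends on the layer's own weight and gradient norms, layers with small weights or large gradients take proportionally smaller steps.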
Elastic Weight Consolidation
A method to overcome catastrophic forgetting in neural networks during continual learning.
Adaptive Softmax is a speedup technique for the computation of probability distributions over words. The adaptive softmax is inspired by the class-based hierarchical softmax, where the word classes are built to minimize the computation time. Adaptive softmax achieves efficiency by explicitly taking into account the computation time of matrix-multiplication on parallel systems and combining it with a few important observations, namely keeping a shortlist of frequent words in the root node and reducing the capacity of rare words.
Barlow Twins is a self-supervised learning method that applies redundancy-reduction — a principle first proposed in neuroscience — to self-supervised learning. The objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly, it benefits from very high-dimensional output vectors.
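The objective can be sketched in NumPy (the λ weighting here follows the paper's reported default of 5e-3, and the tiny batch is illustrative):

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Push the cross-correlation of two embedding batches toward the identity."""
    n = z_a.shape[0]
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-9)   # normalize per dimension
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-9)
    c = (z_a.T @ z_b) / n                              # cross-correlation matrix
    on_diag = np.sum((np.diagonal(c) - 1.0) ** 2)      # invariance term
    off_diag = np.sum((c - np.diag(np.diagonal(c))) ** 2)  # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 4))
loss_same = barlow_twins_loss(z, z)   # identical views: near-zero invariance term
```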
Local SGD is a distributed training technique that runs SGD independently in parallel on different workers and averages the model parameters across workers only once in a while.
Artemisinin Optimization based on Malaria Therapy: Algorithm and Applications to Medical Image Segmentation
This study proposes an efficient metaheuristic algorithm called the Artemisinin Optimization (AO) algorithm. This algorithm draws inspiration from the process of artemisinin medicine therapy for malaria, which involves the comprehensive eradication of malarial parasites within the human body. AO comprises three optimization stages: a comprehensive elimination phase simulating global exploration, a local clearance phase for local exploitation, and a post-consolidation phase to enhance the algorithm's ability to escape local optima. In the experiments, this paper conducts a qualitative analysis of AO, explaining its characteristics in searching for the optimal solution. AO is then tested on the classical IEEE CEC 2014 and the latest IEEE CEC 2022 benchmark function sets to assess its adaptability. Comparative analyses are conducted against eight well-established algorithms and eight high-performance improved algorithms. Statistical analyses of convergence curves and qualitative metrics revealed AO's robust competitiveness. Lastly, AO is incorporated into breast cancer pathology image segmentation applications. Using 15 authentic medical images at six threshold levels, AO's segmentation performance is compared against eight distinguished algorithms. Experimental results demonstrated AO's superiority over the contrast algorithms in terms of image segmentation accuracy, Feature Similarity Index (FSIM), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM). These results emphasize AO's efficiency and its potential in real-world optimization applications. The source codes
Early exiting using confidence measures
Exit whenever the model is confident enough, allowing early exits from hidden layers.
CBHG is a building block used in the Tacotron text-to-speech model. It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit (BiGRU). The module is used to extract representations from sequences. The input sequence is first convolved with K sets of 1-D convolutional filters, where the k-th set contains filters of width k (i.e. k = 1, 2, …, K). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to K-grams). The convolution outputs are stacked together and further max pooled along time to increase local invariances. A stride of 1 is used to preserve the original time resolution. The processed sequence is further passed to a few fixed-width 1-D convolutions, whose outputs are added to the original input sequence via residual connections. Batch normalization is used for all convolutional layers. The convolution outputs are fed into a multi-layer highway network to extract high-level features. Finally, a bidirectional GRU RNN is stacked on top to extract sequential features from both forward and backward context.
Wasserstein Gradient Penalty Loss, or WGAN-GP Loss, is a loss used for generative adversarial networks that augments the Wasserstein loss with a gradient norm penalty for random samples x̂ to achieve Lipschitz continuity: L = E_{x̃∼P_g}[D(x̃)] − E_{x∼P_r}[D(x)] + λ E_{x̂∼P_x̂}[(‖∇_{x̂} D(x̂)‖₂ − 1)²], where x̂ is sampled uniformly along straight lines between pairs of real and generated samples. It was introduced as part of the WGAN-GP overall model.
Embedding Dropout is equivalent to performing dropout on the embedding matrix at a word level, where the dropout is broadcast across all of the word vector's embedding. The remaining non-dropped-out word embeddings are scaled by 1/(1 − p_e), where p_e is the probability of embedding dropout. As the dropout occurs on the embedding matrix that is used for a full forward and backward pass, this means that all occurrences of a specific word will disappear within that pass, equivalent to performing variational dropout on the connection between the one-hot embedding and the embedding lookup. Source: Merity et al., Regularizing and Optimizing LSTM Language Models
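A sketch of the operation on the embedding matrix itself (one keep/drop decision per word row, broadcast across the embedding dimension; the matrix shape and drop probability are illustrative):

```python
import numpy as np

def embedding_dropout(emb, p, rng):
    """Zero entire word rows with probability p; scale survivors by 1/(1-p)."""
    keep = rng.random((emb.shape[0], 1)) >= p      # one decision per word
    return emb * keep / (1.0 - p)

rng = np.random.default_rng(0)
emb = np.ones((10, 3))                 # 10 words, 3-dim embeddings
dropped = embedding_dropout(emb, p=0.4, rng=rng)
```

Because whole rows are zeroed, every occurrence of a dropped word vanishes for that forward/backward pass.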
Weight Tying improves the performance of language models by tying (sharing) the weights of the embedding and softmax layers. This method also massively reduces the total number of parameters in the language models that it is applied to. Language models are typically comprised of an embedding layer, followed by a number of Transformer or LSTM layers, which are finally followed by a softmax layer. Embedding layers learn word representations, such that similar words (in meaning) are represented by vectors that are near each other (in cosine distance). [Press & Wolf, 2016] showed that the softmax matrix, in which every word also has a vector representation, also exhibits this property. This leads them to propose to share the softmax and embedding matrices, which is done today in nearly all language models. This method was independently introduced by Press & Wolf, 2016 and Inan et al, 2016. Additionally, the Press & Wolf paper proposes Three-way Weight Tying, a method for NMT models in which the embedding matrix for the source language, the embedding matrix for the target language, and the softmax matrix for the target language are all tied. That method has been adopted by the Attention Is All You Need model and many other neural machine translation models.
Probability Guided Maxout
A regularization criterion that, differently from dropout and its variants, is deterministic rather than random. It is grounded in the empirical evidence that feature descriptors with larger L2-norm and highly-active nodes are strongly correlated with confident class predictions. Thus, the criterion guides towards dropping a percentage of the most active nodes of the descriptors, proportionally to the estimated class probability.
Activation Normalization is a type of normalization used for flow-based generative models; specifically, it was introduced in the GLOW architecture. An ActNorm layer performs an affine transformation of the activations using a scale and bias parameter per channel, similar to batch normalization. These parameters are initialized such that the post-actnorm activations per channel have zero mean and unit variance given an initial minibatch of data. This is a form of data-dependent initialization. After initialization, the scale and bias are treated as regular trainable parameters that are independent of the data.