29 machine learning methods and techniques
WaveNet is an audio generative model based on the PixelCNN architecture. In order to deal with long-range temporal dependencies needed for raw audio generation, architectures are developed based on dilated causal convolutions, which exhibit very large receptive fields. The joint probability of a waveform is factorised as a product of conditional probabilities as follows: Each audio sample is therefore conditioned on the samples at all previous timesteps.
The Griffin-Lim Algorithm (GLA) is a phase reconstruction method based on the redundancy of the short-time Fourier transform. It promotes the consistency of a spectrogram by iterating two projections, where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of STFT is retained. GLA is based only on the consistency and does not take any prior knowledge about the target signal into account. This algorithm expects to recover a complex-valued spectrogram, which is consistent and maintains the given amplitude , by the following alternative projection procedure: where is a complex-valued spectrogram updated through the iteration, is the metric projection onto a set , and is the iteration index. Here, is the set of consistent spectrograms, and is the set of spectrograms whose amplitude is the same as the given one. The metric projections onto these sets and are given by: where represents STFT, is the pseudo inverse of STFT (iSTFT), and are element-wise multiplication and division, respectively, and division by zero is replaced by zero. GLA is obtained as an algorithm for the following optimization problem: where is the Frobenius norm. This equation minimizes the energy of the inconsistent components under the constraint on amplitude which must be equal to the given one. Although GLA has been widely utilized because of its simplicity, GLA often involves many iterations until it converges to a certain spectrogram and results in low reconstruction quality. This is because the cost function only requires the consistency, and the characteristics of the target signal are not taken into account.
WaveGAN is a generative adversarial network for unsupervised synthesis of raw-waveform audio (as opposed to image-like spectrograms). The WaveGAN architecture is based off DCGAN. The DCGAN generator uses the transposed convolution operation to iteratively upsample low-resolution feature maps into a high-resolution image. WaveGAN modifies this transposed convolution operation to widen its receptive field, using a longer one-dimensional filters of length 25 instead of two-dimensional filters of size 5x5, and upsampling by a factor of 4 instead of 2 at each layer. The discriminator is modified in a similar way, using length-25 filters in one dimension and increasing stride from 2 to 4. These changes result in WaveGAN having the same number of parameters, numerical operations, and output dimensionality as DCGAN. An additional layer is added afterwards to allow for more audio samples. Further changes include: 1. Flattening 2D convolutions into 1D (e.g. 5x5 2D conv becomes length-25 1D). 2. Increasing the stride factor for all convolutions (e.g. stride 2x2 becomes stride 4). 3. Removing batch normalization from the generator and discriminator. 4. Training using the WGAN-GP strategy.
Phase Shuffle is a technique for removing pitched noise artifacts that come from using transposed convolutions in audio generation models. Phase shuffle is an operation with hyperparameter . It randomly perturbs the phase of each layer’s activations by − to samples before input to the next layer. In the original application in WaveGAN, the authors only apply phase shuffle to the discriminator, as the latent vector already provides the generator a mechanism to manipulate the phase of a resultant waveform. Intuitively speaking, phase shuffle makes the discriminator’s job more challenging by requiring invariance to the phase of the input waveform.
End-to-End Neural Diarization
End-to-End Neural Diarization is a neural network for speaker diarization in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, the speaker diarization problem is formulated as a multi-label classification problem and a permutation-free objective function is introduced to directly minimize diarization errors. The EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, the model can be adapted to real conversations.
Tacotron2
Tacotron 2 is a neural network architecture for speech synthesis directly from text. It consists of two components: - a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence - a modified version of WaveNet which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames In contrast to the original Tacotron, Tacotron 2 uses simpler building blocks, using vanilla LSTM and convolutional layers in the encoder and decoder instead of CBHG stacks and GRU recurrent layers. Tacotron 2 does not use a “reduction factor”, i.e., each decoder step corresponds to a single spectrogram frame. Location-sensitive attention is used instead of additive attention.
WaveGlow is a flow-based generative model that generates audio by sampling from a distribution. Specifically samples are taken from a zero mean spherical Gaussian with the same number of dimensions as our desired output, and those samples are put through a series of layers that transforms the simple distribution to one which has the desired distribution.
FastSpeech2 is a text-to-speech model that aims to improve upon FastSpeech by better solving the one-to-many mapping problem in TTS, i.e., multiple speech variations corresponding to the same text. It attempts to solve this problem by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, in FastSpeech 2, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference. The encoder converts the phoneme embedding sequence into the phoneme hidden sequence, and then the variance adaptor adds different variance information such as duration, pitch and energy into the hidden sequence, finally the mel-spectrogram decoder converts the adapted hidden sequence into mel-spectrogram sequence in parallel. FastSpeech 2 uses a feed-forward Transformer block, which is a stack of self-attention and 1D-convolution as in FastSpeech, as the basic structure for the encoder and mel-spectrogram decoder.
Iterative Pseudo-Labeling
Iterative Pseudo-Labeling (IPL) is a semi-supervised algorithm for speech recognition which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine tunes an existing model at each iteration using both labeled data and a subset of unlabeled data.
MelGAN is a non-autoregressive feed-forward convolutional architecture to perform audio waveform generation in a GAN setup. The architecture is a fully convolutional feed-forward network with mel-spectrogram as input and raw waveform as output. Since the mel-spectrogram is at a 256× lower temporal resolution, the authors use a stack of transposed convolutional layers to upsample the input sequence. Each transposed convolutional layer is followed by a stack of residual blocks with dilated convolutions. Unlike traditional GANs, the MelGAN generator does not use a global noise vector as input. To deal with 'checkerboard artifacts' in audio, instead of using PhaseShuffle, MelGAN uses kernel-size as a multiple of stride. Weight normalization is used for normalization. A window-based discriminator, similar to a PatchGAN is used for the discriminator.
Differentiable Digital Signal Processing
SepFormer is Transformer-based neural network for speech separation. The SepFormer learns short and long-term dependencies with a multi-scale approach that employs transformers. It is mainly composed of multi-head attention and feed-forward layers. A dual-path framework (introduced by DPRNN) is adopted and RNNs are replaced with a multiscale pipeline composed of transformers that learn both short and long-term dependencies. The dual-path framework enables the mitigation of the quadratic complexity of transformers, as transformers in the dual-path framework process smaller chunks. The model is based on the learned-domain masking approach and employs an encoder, a decoder, and a masking network, as shown in the figure. The encoder is fully convolutional, while the decoder employs two Transformers embedded inside the dual-path processing block. The decoder finally reconstructs the separated signals in the time domain by using the masks predicted by the masking network.
FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the Figure. It is based on FastSpeech and composed mainly of two feed-forward Transformer (FFTr) stacks. The first one operates in the resolution of input tokens, the second one in the resolution of the output frames. Let be the sequence of input lexical units, and be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation . The hidden representation is used to make predictions about the duration and average pitch of every character with a 1-D CNN where and . Next, the pitch is projected to match the dimensionality of the hidden representation and added to . The resulting sum is discretely upsampled and passed to the output FFTr, which produces the output mel-spectrogram sequence Ground truth and are used during training, and predicted and are used during inference. The model optimizes mean-squared error (MSE) between the predicted and ground-truth modalities
Glow-TTS is a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech. The model is directly trained to maximize the log-likelihood of speech with the alignment. Enforcing hard monotonic alignments helps enable robust TTS, which generalizes to long utterances, and employing flows enables fast, diverse, and controllable speech synthesis.
WaveGrad is a conditional model for waveform generation through estimating gradients of the data density. This model is built on the prior work on score matching and diffusion probabilistic models. It starts from Gaussian white noise and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad is non-autoregressive, and requires only a constant number of generation steps during inference. It can use as few as 6 iterations to generate high fidelity audio samples.
WaveGrad DBlocks are used to downsample the temporal dimension of noisy waveform in WaveGrad. They are similar to UBlocks except that only one residual block is included. The dilation factors are 1, 2, 4 in the main branch. Orthogonal initialization is used.
Bridge-net is an audio model block used in the ClariNet text-to-speech architecture. Bridge-net maps frame-level hidden representation to sample-level through several convolution blocks and transposed convolution layers interleaved with softsign non-linearities.
Residual Shuffle-Exchange Network
Residual Shuffle-Exchange Network is an efficient alternative to models using an attention mechanism that allows the modelling of long-range dependencies in sequences in O(n log n) time. This model achieved state-of-the-art performance on the MusicNet dataset for music transcription while being able to run inference on a single GPU fast enough to be suitable for real-time audio processing.
Beneš Block with Residual Switch Units
The Beneš block is a computation-efficient alternative to dense attention, enabling the modelling of long-range dependencies in O(n log n) time. In comparison, dense attention which is commonly used in Transformers has O(n^2) complexity. In music, dependencies occur on several scales, including on a coarse scale which requires processing very long sequences. Beneš blocks have been used in Residual Shuffle-Exchange Networks to achieve state-of-the-art results in music transcription. Beneš blocks have a ‘receptive field’ of the size of the whole sequence, and it has no bottleneck. These properties hold for dense attention but have not been shown for many sparse attention and dilated convolutional architectures.
VoiceFilter-Lite is a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. In this architecture, the voice filtering model operates as a frame-by-frame frontend signal processor to enhance the features consumed by the speech recognizer, without reconstructing audio signals from the features. The key contributions are (1) A system to perform speech separation directly on ASR input features; (2) An asymmetric loss function to penalize oversuppression during training, to make the model harmless under various acoustic environments, (3) An adaptive suppression strength mechanism to adapt to different noise conditions.
Phase Gradient Heap Integration
Z. Průša, P. Balazs and P. L. Søndergaard, "A Noniterative Method for Reconstruction of Phase From STFT Magnitude," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1154-1164, May 2017, doi: 10.1109/TASLP.2017.2678166. Abstract: A noniterative method for the reconstruction of the short-time fourier transform (STFT) phase from the magnitude is presented. The method is based on the direct relationship between the partial derivatives of the phase and the logarithm of the magnitude of the un-sampled STFT with respect to the Gaussian window. Although the theory holds in the continuous setting only, the experiments show that the algorithm performs well even in the discretized setting (discrete Gabor transform) with low redundancy using the sampled Gaussian window, the truncated Gaussian window and even other compactly supported windows such as the Hann window. Due to the noniterative nature, the algorithm is very fast and it is suitable for long audio signals. Moreover, solutions of iterative phase reconstruction algorithms can be improved considerably by initializing them with the phase estimate provided by the present algorithm. We present an extensive comparison with the state-of-the-art algorithms in a reproducible manner. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7890450&isnumber=7895265
AccoMontage is a model for accompaniment arrangement, a type of music generation task involving intertwined constraints of melody, harmony, texture, and music structure. AccoMontage generates piano accompaniments for folk/pop songs based on a lead sheet (i.e. a melody with chord progression). It first retrieves phrase montages from a database while recombining them structurally using dynamic programming. Second, chords of the retrieved phrases are manipulated to match the lead sheet via style transfer. Lastly, the system offers controls over the generation process. In contrast to pure deep learning approaches, AccoMontage uses a hybrid pathway, in which rule-based optimization and deep learning are both leveraged.
Directional Sparse FIltering
Please enter a description about the method here
SpecGAN is a generative adversarial network method for spectrogram-based, frequency-domain audio generation. The problem is suited for GANs designed for image generation. The model can be approximately inverted. To process audio into suitable spectrograms, the authors perform the short-time Fourier transform with 16 ms windows and 8ms stride, resulting in 128 frequency bins, linearly spaced from 0 to 8 kHz. They take the magnitude of the resultant spectra and scale amplitude values logarithmically to better-align with human perception. They then normalize each frequency bin to have zero mean and unit variance. They clip the spectra to standard deviations and rescale to . They then use the DCGAN approach on the result spectra.
Auditory Cortex ResNet
The Auditory Cortex ResNet, briefly AUCO ResNet, is proposed and tested. It is a deep neural network architecture especially designed for audio classification trained end-to-end. It is inspired by the architectural organization of rat's auditory cortex, containing also innovations 2 and 3. The network outperforms the state-of-the-art accuracies on a reference audio benchmark dataset without any kind of preprocessing, imbalanced data handling and, most importantly, any kind of data augmentation.
FRILL is a non-semantic speech embedding model trained via knowledge distillation that is fast enough to be run in real-time on a mobile device. The fastest model runs at 0.9 ms, which is 300x faster than TRILL and 25x faster than TRILL-distilled.
FastSpeech 2s is a text-to-speech model that abandons mel-spectrograms as intermediate output completely and directly generates speech waveform from text during inference. In other words there is no cascaded mel-spectrogram generation (acoustic model) and waveform generation (vocoder). FastSpeech 2s generates waveform conditioning on intermediate hidden, which makes it more compact in inference by discarding the mel-spectrogram decoder. Two main design changes are made to the waveform decoder. First, considering that the phase information is difficult to predict using a variance predictor, adversarial training is used in the waveform decoder to force it to implicitly recover the phase information by itself. Secondly, the mel-spectrogram decoder of FastSpeech 2 is leveraged, which is trained on the full text sequence to help on the text feature extraction. As shown in the Figure, the waveform decoder is based on the structure of WaveNet including non-causal convolutions and gated activation. The waveform decoder takes a sliced hidden sequence corresponding to a short audio clip as input and upsamples it with transposed 1D-convolution to match the length of audio clip. The discriminator in the adversarial training adopts the same structure in Parallel WaveGAN, which consists of ten layers of non-causal dilated 1-D convolutions with leaky ReLU activation function. The waveform decoder is optimized by the multi-resolution STFT loss and the LSGAN discriminator loss following Parallel WaveGAN. In inference, the mel-spectrogram decoder is discarded and only the waveform decoder is used to synthesize speech audio.