Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


WaveRNN

Sequential · Introduced 2018 · 26 papers
Source Paper

Description

WaveRNN is a single-layer recurrent neural network for audio generation that is designed to efficiently predict 16-bit raw audio samples.

The overall computation in the WaveRNN is as follows (biases omitted for brevity):

$$\mathbf{x}_t = \left[\mathbf{c}_{t-1}, \mathbf{f}_{t-1}, \mathbf{c}_t\right]$$

$$\mathbf{u}_t = \sigma\left(\mathbf{R}_u \mathbf{h}_{t-1} + \mathbf{I}^{*}_u \mathbf{x}_t\right)$$

$$\mathbf{r}_t = \sigma\left(\mathbf{R}_r \mathbf{h}_{t-1} + \mathbf{I}^{*}_r \mathbf{x}_t\right)$$

$$\mathbf{e}_t = \tau\left(\mathbf{r}_t \odot \left(\mathbf{R}_e \mathbf{h}_{t-1}\right) + \mathbf{I}^{*}_e \mathbf{x}_t\right)$$

$$\mathbf{h}_t = \mathbf{u}_t \cdot \mathbf{h}_{t-1} + \left(1 - \mathbf{u}_t\right) \cdot \mathbf{e}_t$$

$$\mathbf{y}_c, \mathbf{y}_f = \text{split}\left(\mathbf{h}_t\right)$$

$$P\left(\mathbf{c}_t\right) = \text{softmax}\left(\mathbf{O}_2\,\text{relu}\left(\mathbf{O}_1 \mathbf{y}_c\right)\right)$$

$$P\left(\mathbf{f}_t\right) = \text{softmax}\left(\mathbf{O}_4\,\text{relu}\left(\mathbf{O}_3 \mathbf{y}_f\right)\right)$$

where the $*$ indicates a masked matrix: the last coarse input $\mathbf{c}_t$ is only connected to the fine part of the states $\mathbf{u}_t$, $\mathbf{r}_t$, $\mathbf{e}_t$ and $\mathbf{h}_t$, and thus only affects the fine output $\mathbf{y}_f$. The coarse and fine parts $\mathbf{c}_t$ and $\mathbf{f}_t$ are encoded as scalars in $[0, 255]$ and scaled to the interval $[-1, 1]$. The matrix $\mathbf{R}$ formed from the matrices $\mathbf{R}_u$, $\mathbf{R}_r$ and $\mathbf{R}_e$ is computed as a single matrix-vector product to produce the contributions to all three gates $\mathbf{u}_t$, $\mathbf{r}_t$ and $\mathbf{e}_t$ (a variant of the GRU cell). $\sigma$ and $\tau$ are the standard sigmoid and tanh non-linearities.
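The recurrence above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it omits the mask on the $\mathbf{I}^{*}$ matrices, the fused single matrix-vector product for $\mathbf{R}$, and all biases, and every name and dimension below is an assumption chosen for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def wavernn_step(c_prev, f_prev, c_t, h_prev, params):
    """One (unmasked, bias-free) WaveRNN step; inputs are scalars in [-1, 1].

    Illustrative sketch only: the real model masks the I* matrices so the
    current coarse input c_t reaches only the fine half of the state.
    """
    R_u, R_r, R_e, I_u, I_r, I_e, O1, O2, O3, O4 = params
    x_t = np.array([c_prev, f_prev, c_t])          # x_t = [c_{t-1}, f_{t-1}, c_t]
    u = sigmoid(R_u @ h_prev + I_u @ x_t)          # update gate
    r = sigmoid(R_r @ h_prev + I_r @ x_t)          # reset gate
    e = np.tanh(r * (R_e @ h_prev) + I_e @ x_t)    # candidate state (tau = tanh)
    h = u * h_prev + (1.0 - u) * e                 # GRU-style state blend
    y_c, y_f = np.split(h, 2)                      # coarse / fine halves
    P_c = softmax(O2 @ np.maximum(O1 @ y_c, 0.0))  # P(c_t) over 256 coarse values
    P_f = softmax(O4 @ np.maximum(O3 @ y_f, 0.0))  # P(f_t) over 256 fine values
    return h, P_c, P_f
```

During sampling, $\mathbf{c}_t$ would be drawn from $P(\mathbf{c}_t)$ first and then fed back in before computing $P(\mathbf{f}_t)$; the masking is what makes that two-stage prediction consistent within a single cell evaluation.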

Each part feeds into a softmax layer over the corresponding 8 bits, and the prediction of the 8 fine bits is conditioned on the 8 coarse bits. The resulting dual softmax layer allows efficient prediction of 16-bit samples using two small output spaces ($2^8$ values each) instead of a single large output space (with $2^{16}$ values).
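The coarse/fine split described above is just the high and low bytes of the 16-bit sample. A small sketch of the encoding (function names are illustrative, not from the paper):

```python
def encode(sample_u16):
    """Split a 16-bit sample into 8-bit coarse and fine parts, plus their
    [-1, 1]-scaled versions used as network inputs."""
    coarse = sample_u16 >> 8   # top 8 bits, a scalar in [0, 255]
    fine = sample_u16 & 0xFF   # bottom 8 bits, a scalar in [0, 255]
    to_unit = lambda v: 2.0 * v / 255.0 - 1.0  # scale [0, 255] -> [-1, 1]
    return coarse, fine, to_unit(coarse), to_unit(fine)

def decode(coarse, fine):
    """Recombine the two 8-bit parts into the original 16-bit sample."""
    return (coarse << 8) | fine
```

Predicting each byte with its own 256-way softmax means the two output layers together cover $2 \times 2^8 = 512$ logits rather than the $2^{16} = 65{,}536$ a single softmax over raw samples would need.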

Papers Using This Method

- Exploratory Evaluation of Speech Content Masking (2024-01-08)
- An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023 (2023-07-03)
- Evince the artifacts of Spoof Speech by blending Vocal Tract and Voice Source Features (2022-12-05)
- SIMD-size aware weight regularization for fast neural vocoding on CPU (2022-11-02)
- Perfectly Secure Steganography Using Minimum Entropy Coupling (2022-10-24)
- Adaptive re-calibration of channel-wise features for Adversarial Audio Classification (2022-10-21)
- WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration (2022-10-03)
- R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS (2022-06-30)
- NatiQ: An End-to-end Text-to-Speech System for Arabic (2022-06-15)
- VocBench: A Neural Vocoder Benchmark for Speech Synthesis (2021-12-06)
- On-device neural speech synthesis (2021-09-17)
- Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction (2021-05-20)
- High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling (2021-05-20)
- Enhancing into the codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders (2021-02-12)
- FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge (2020-11-25)
- TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis (2020-11-24)
- Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis (2020-11-10)
- Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion (2020-08-13)
- Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions (2020-08-09)
- Audiovisual Speech Synthesis using Tacotron2 (2020-08-03)