Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Tacotron 2


Audio · Introduced 2017 · 23 papers
Source Paper

Description

Tacotron 2 is a neural network architecture for speech synthesis directly from text. It consists of two components:

  • a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence
  • a modified version of WaveNet which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames
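The two-stage pipeline above can be sketched at the level of tensor shapes. Everything below is an illustrative stand-in, not Tacotron 2's actual configuration: the function names are hypothetical, and the ~4 frames per character and 256-sample hop are placeholder values (the real stages are trained neural networks).

```python
import numpy as np

def feature_prediction_network(char_ids):
    # Stage 1 (stand-in): the seq-to-seq network with attention maps a
    # character sequence to mel spectrogram frames. This sketch only models
    # the shapes: ~4 frames per input character, 80 mel channels.
    return np.zeros((4 * len(char_ids), 80))

def wavenet_vocoder(mel, hop_length=256):
    # Stage 2 (stand-in): the modified WaveNet generates time-domain samples
    # conditioned on the predicted mel frames; here, hop_length samples
    # per frame.
    return np.zeros(mel.shape[0] * hop_length)

char_ids = np.arange(12)                    # toy encoding of a 12-character input
mel = feature_prediction_network(char_ids)  # (48, 80) mel spectrogram
audio = wavenet_vocoder(mel)                # (12288,) waveform samples
```

The key design point is the interface between the stages: the mel spectrogram is the only thing passed from the feature predictor to the vocoder, which lets the two networks be trained separately.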

In contrast to the original Tacotron, Tacotron 2 uses simpler building blocks: vanilla LSTM and convolutional layers in the encoder and decoder replace the CBHG stacks and GRU recurrent layers. Tacotron 2 does not use a “reduction factor”, i.e., each decoder step corresponds to a single spectrogram frame. Location-sensitive attention replaces the additive attention of the original.
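Location-sensitive attention extends additive (Bahdanau-style) attention with a term computed by convolving the previous step's attention weights, which encourages the alignment to move forward monotonically through the input. A minimal numpy sketch of one attention step, with hypothetical names and toy dimensions (a real implementation would use learned parameters and batched tensors):

```python
import numpy as np

def location_sensitive_attention(query, keys, prev_align, W, V, U, v, kernel):
    """One step of location-sensitive attention (shape-level sketch).

    query:      (d_q,)   current decoder state
    keys:       (T, d_k) encoder outputs, one per input character
    prev_align: (T,)     attention weights from the previous decoder step
    """
    # Location features: convolve the previous alignment so the mechanism
    # "knows" where it attended last.
    loc = np.convolve(prev_align, kernel, mode="same")             # (T,)
    # Additive attention energies, extended with the location term.
    energies = np.tanh(query @ W + keys @ V + np.outer(loc, U)) @ v
    align = np.exp(energies - energies.max())
    align /= align.sum()                                           # softmax over input positions
    context = align @ keys                                         # (d_k,) weighted encoder summary
    return context, align

rng = np.random.default_rng(0)
T, d_q, d_k, d_a = 6, 4, 5, 3
context, align = location_sensitive_attention(
    rng.normal(size=d_q), rng.normal(size=(T, d_k)),
    np.eye(T)[0],                        # previous step attended to position 0
    rng.normal(size=(d_q, d_a)), rng.normal(size=(d_k, d_a)),
    rng.normal(size=d_a), rng.normal(size=d_a),
    np.array([0.2, 0.6, 0.2]),           # small location filter
)
```

The context vector is then fed to the decoder LSTM to predict the next mel frame, and `align` becomes `prev_align` for the following step.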

Papers Using This Method

  • Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems (2024-09-04)
  • An overview of text-to-speech systems and media applications (2023-10-22)
  • Energy-Based Models For Speech Synthesis (2023-10-19)
  • Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration (2023-05-25)
  • ArmanTTS single-speaker Persian dataset (2023-04-07)
  • Facial Landmark Predictions with Applications to Metaverse (2022-09-29)
  • Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention (2022-01-25)
  • ITAcotron 2: Transfering English Speech Synthesis Architectures and Speech Features to Italian (2021-11-01)
  • Neural Sequence-to-Sequence Speech Synthesis Using a Hidden Semi-Markov Model Based Structured Attention Mechanism (2021-08-31)
  • Neural HMMs are all you need (for high-quality attention-free TTS) (2021-08-30)
  • Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis (2021-06-15)
  • VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention (2021-02-12)
  • Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (2021-01-01)
  • Using previous acoustic context to improve Text-to-Speech synthesis (2020-12-07)
  • Learning Speaker Embedding from Text-to-Speech (2020-10-21)
  • Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (2020-10-08)
  • SpeedySpeech: Efficient Neural Speech Synthesis (2020-08-09)
  • One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech (2020-08-03)
  • Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020-05-12)
  • Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (2020-02-06)