Description
Tacotron is an end-to-end generative text-to-speech model that takes a character sequence as input and outputs the corresponding spectrogram. Its backbone is a seq2seq model with attention. The figure depicts the model, which comprises an encoder, an attention-based decoder, and a post-processing net. At a high level, the model takes characters as input and produces spectrogram frames, which are then converted to waveforms.
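The encoder, attention, and decoder data flow described above can be illustrated with a small toy sketch. This is not the Tacotron implementation: the function names (`encode`, `attend`, `decode`), the random character vectors, and the vector size are all assumptions made for illustration, and plain dot-product attention is swapped in for brevity. The post-processing net and the spectrogram-to-waveform step are omitted.

```python
import math
import random


def encode(text, dim=4, seed=0):
    """Toy encoder: map each character to a fixed random vector.

    Stand-in for learned character embeddings plus the encoder network.
    """
    rng = random.Random(seed)
    table = {}
    states = []
    for ch in text:
        if ch not in table:
            table[ch] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        states.append(table[ch])
    return states


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention (a simplification chosen for this sketch).

    Returns the attention weights over the encoder states and the
    resulting context vector (the weighted sum of those states).
    """
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, states))
        for i in range(dim)
    ]
    return weights, context


def decode(states, n_frames):
    """Toy autoregressive decoder.

    Each output "frame" is just the attention context, and the previous
    frame is fed back as the next query. A real decoder would use
    recurrent layers and predict spectrogram frames.
    """
    query = [0.0] * len(states[0])
    frames = []
    for _ in range(n_frames):
        _, context = attend(query, states)
        frames.append(context)
        query = context
    return frames


states = encode("hello")             # one encoder state per character
frames = decode(states, n_frames=3)  # three toy output frames
print(len(states), len(frames), len(frames[0]))  # prints "5 3 4"
```

The point of the sketch is the shape of the computation: the encoder yields one state per input character, and at every decoder step the attention weights form a distribution over those states, so the output length is decoupled from the input length.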
Papers Using This Method
Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech (2024-10-29)
Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach (2024-09-10)
Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems (2024-09-04)
Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation (2024-04-03)
An overview of text-to-speech systems and media applications (2023-10-22)
Energy-Based Models For Speech Synthesis (2023-10-19)
The DeepZen Speech Synthesis System for Blizzard Challenge 2023 (2023-08-30)
Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration (2023-05-25)
A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers (2023-04-16)
ArmanTTS single-speaker Persian dataset (2023-04-07)
Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language (2022-12-16)
Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features (2022-11-01)
Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation (2022-10-31)
Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments (2022-10-31)
Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS (2022-10-24)
Facial Landmark Predictions with Applications to Metaverse (2022-09-29)
Self-supervised learning for robust voice cloning (2022-04-07)
Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis (2022-02-16)
Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention (2022-01-25)
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis (2021-11-19)