Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


FastSpeech 2

Audio · Introduced 2020 · 20 papers
Source Paper

Description

FastSpeech 2 is a text-to-speech model that improves on FastSpeech by better addressing the one-to-many mapping problem in TTS, i.e., multiple possible speech variations corresponding to the same text. It tackles this problem by 1) training the model directly on the ground-truth target instead of the simplified output from a teacher model, and 2) introducing more variation information of speech (e.g., pitch, energy, and more accurate duration) as conditional inputs. Specifically, FastSpeech 2 extracts duration, pitch, and energy from the speech waveform, takes them directly as conditional inputs during training, and uses predicted values during inference.
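The duration conditioning above can be sketched in a few lines. The toy code below is illustrative only: the real model uses learned predictors and quantized embedding lookups for pitch and energy, not raw value addition, and all function names here are hypothetical.

```python
# Toy sketch of FastSpeech 2-style variance conditioning (illustrative only;
# the actual model uses learned predictors and embeddings, not this code).

def length_regulate(hidden, durations):
    """Repeat each phoneme hidden vector by its duration in frames
    (ground-truth durations in training, predicted ones at inference),
    so the sequence length matches the mel-spectrogram length."""
    out = []
    for h, d in zip(hidden, durations):
        out.extend([h] * d)
    return out

def add_variance(hidden, pitch, energy):
    """Condition each frame-level hidden vector on pitch and energy
    (a stand-in for the embedding lookups used in the paper)."""
    return [[x + p + e for x in h] for h, p, e in zip(hidden, pitch, energy)]

# Toy phoneme hidden sequence: 3 phonemes, dim-2 vectors.
phoneme_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # frames per phoneme (from forced alignment in training)
frames = length_regulate(phoneme_hidden, durations)
assert len(frames) == sum(durations)  # one hidden vector per mel frame
```

Expanding by durations up front is what lets the decoder generate all mel frames in parallel, instead of autoregressively deciding when to stop each phoneme.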

The encoder converts the phoneme embedding sequence into a phoneme hidden sequence. The variance adaptor then adds variance information such as duration, pitch, and energy to the hidden sequence. Finally, the mel-spectrogram decoder converts the adapted hidden sequence into a mel-spectrogram sequence in parallel. As in FastSpeech, both the encoder and the mel-spectrogram decoder use the feed-forward Transformer block, a stack of self-attention and 1D convolution, as their basic structure.
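The feed-forward Transformer block can be sketched as self-attention followed by a 1D convolution over the time axis, each with a residual connection. This is a minimal NumPy sketch under simplifying assumptions (single head, no learned projections, fixed depthwise kernel, layer norm omitted), not the paper's implementation:

```python
import numpy as np

def self_attention(x):
    # Single-head scaled dot-product self-attention without learned
    # query/key/value projections (a simplification of the real block).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def conv1d(x, kernel):
    # Depthwise 1-D convolution over the time axis with 'same' padding.
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([sum(kernel[i] * xp[t + i] for i in range(k))
                     for t in range(x.shape[0])])

def fft_block(x, kernel=(0.25, 0.5, 0.25)):
    # Feed-forward Transformer block: self-attention then 1-D convolution,
    # each wrapped in a residual connection (layer norm omitted for brevity).
    x = x + self_attention(x)
    return x + conv1d(x, kernel)

x = np.random.randn(5, 8)   # 5 time steps, hidden dim 8
y = fft_block(x)
assert y.shape == x.shape   # whole sequence processed in parallel
```

Because the block contains no recurrence, every time step is computed at once, which is what makes FastSpeech-style synthesis fast relative to autoregressive TTS models.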

Papers Using This Method

AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis (2025-04-12)
AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation (2024-12-13)
Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems (2024-09-04)
AttentionStitch: How Attention Solves the Speech Editing Problem (2024-03-05)
Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors (2023-10-25)
Energy-Based Models For Speech Synthesis (2023-10-19)
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation (2023-10-11)
Towards Robust FastSpeech 2 by Modelling Residual Multimodality (2023-06-02)
The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech (2023-06-01)
LibriS2S: A German-English Speech-to-Speech Translation Corpus (2022-04-22)
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech (2022-03-31)
ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis (2022-03-20)
Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus (2021-12-20)
Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data (2021-11-15)
PortaSpeech: Portable and High-Quality Generative Text-to-Speech (2021-09-30)
One TTS Alignment To Rule Them All (2021-08-23)
Digital Einstein Experience: Fast Text-to-Speech for Conversational AI (2021-07-21)
Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech (2021-03-06)
Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators (2020-10-27)
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (2020-06-08)