Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


FastPitch

Audio · Introduced 2020 · 9 papers
Source Paper

Description

FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture, shown in the figure of the source paper, is composed mainly of two feed-forward Transformer (FFTr) stacks: the first operates at the resolution of input tokens, the second at the resolution of output frames. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be the sequence of input lexical units and $\mathbf{y} = (y_1, \ldots, y_t)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $\mathbf{h} = \operatorname{FFTr}(\mathbf{x})$, which is used to predict the duration and average pitch of every character with a 1-D CNN:

$$\hat{\mathbf{d}} = \operatorname{DurationPredictor}(\mathbf{h}), \qquad \hat{\mathbf{p}} = \operatorname{PitchPredictor}(\mathbf{h})$$

where $\hat{\mathbf{d}} \in \mathbb{N}^{n}$ and $\hat{\mathbf{p}} \in \mathbb{R}^{n}$. Next, the pitch is projected to match the dimensionality of the hidden representation $\mathbf{h} \in \mathbb{R}^{n \times d}$ and added to $\mathbf{h}$. The resulting sum $\mathbf{g}$ is discretely upsampled and passed to the output FFTr, which produces the output mel-spectrogram sequence:

$$\mathbf{g} = \mathbf{h} + \operatorname{PitchEmbedding}(\mathbf{p})$$

$$\hat{\mathbf{y}} = \operatorname{FFTr}\left(\left[\underbrace{g_1, \ldots, g_1}_{d_1}, \ldots, \underbrace{g_n, \ldots, g_n}_{d_n}\right]\right)$$
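The pitch-conditioning and discrete-upsampling steps can be sketched in a few lines of NumPy with toy shapes. Note the assumptions: a fixed linear projection stands in for the learned 1-D convolutional `PitchEmbedding`, and the durations are given rather than predicted — this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3                          # 4 input tokens, hidden size 3 (toy values)
h = rng.standard_normal((n, d))      # hidden representation from the first FFTr
p = rng.standard_normal(n)           # one (average) pitch value per token
durations = np.array([2, 1, 3, 2])   # d_i: number of output frames per token

# Project pitch up to the hidden dimensionality and add it to h.
# A fixed random projection stands in for the learned PitchEmbedding.
w_emb = rng.standard_normal((1, d))
g = h + p[:, None] @ w_emb           # g keeps the (n, d) shape of h

# Discrete upsampling: repeat g_i for d_i frames before the output FFTr.
g_up = np.repeat(g, durations, axis=0)
print(g_up.shape)                    # (8, 3): one row per target mel frame
```

`np.repeat` with an array of counts performs exactly the bracketed repetition in the equation above: each row $g_i$ appears $d_i$ times, so the upsampled sequence has $\sum_i d_i$ rows, the length of the target spectrogram.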

Ground-truth $\mathbf{p}$ and $\mathbf{d}$ are used during training, and the predicted $\hat{\mathbf{p}}$ and $\hat{\mathbf{d}}$ are used during inference. The model minimizes the mean-squared error (MSE) between the predicted and ground-truth quantities:

$$\mathcal{L} = \|\hat{\mathbf{y}} - \mathbf{y}\|_2^2 + \alpha\,\|\hat{\mathbf{p}} - \mathbf{p}\|_2^2 + \gamma\,\|\hat{\mathbf{d}} - \mathbf{d}\|_2^2$$
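The composite loss translates directly into code. A minimal sketch, assuming squared L2 norms as written above; the default weights `alpha` and `gamma` here are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def fastpitch_loss(y_hat, y, p_hat, p, d_hat, d, alpha=0.1, gamma=0.1):
    """Composite FastPitch training loss (sketch).

    alpha and gamma weight the pitch and duration terms relative to the
    spectrogram term; the defaults here are illustrative only.
    """
    sq = lambda a, b: np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return sq(y_hat, y) + alpha * sq(p_hat, p) + gamma * sq(d_hat, d)

# Toy check: identical predictions give zero loss.
y = np.ones((8, 3))
p = np.zeros(4)
d = np.array([2.0, 1.0, 3.0, 2.0])
print(fastpitch_loss(y, y, p, p, d, d))   # 0.0
```

Because all three terms are plain squared errors, the whole objective stays differentiable with respect to the spectrogram, pitch, and duration predictions, which is what allows the predictors to be trained jointly with the two FFTr stacks.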

Papers Using This Method

- Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch (2024-10-09)
- Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation (2024-03-07)
- Incremental FastPitch: Chunk-based High Quality Text to Speech (2024-01-03)
- Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning (2023-11-07)
- Towards Building Text-To-Speech Systems for the Next Billion Users (2022-11-17)
- Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch (2022-04-12)
- Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows (2022-03-03)
- Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech (2021-10-14)
- FastPitch: Parallel Text-to-speech with Pitch Prediction (2020-06-11)