FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the figure. It is based on FastSpeech and consists mainly of two feed-forward Transformer (FFTr) stacks: the first operates at the resolution of input tokens, the second at the resolution of output frames. Let $x = (x_1, \dots, x_n)$ be the sequence of input lexical units and $y = (y_1, \dots, y_t)$ the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $h = \text{FFTr}(x)$, which is used to predict the duration and average pitch of every character with a 1-D CNN:
$$\hat{d} = \text{DurationPredictor}(h), \qquad \hat{p} = \text{PitchPredictor}(h),$$
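The paper specifies only that each predictor is a 1-D CNN producing one scalar per input token. A minimal NumPy sketch of such a predictor follows; the conv-ReLU-linear layout, kernel size, and all weight shapes are illustrative assumptions, not the published architecture.

```python
import numpy as np

def conv1d(x, w, b):
    """1-D convolution along the token axis with 'same' zero padding.
    x: (n, d_in), w: (k, d_in, d_out), b: (d_out,) -> (n, d_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[2]))
    for i in range(x.shape[0]):
        window = xp[i:i + k]                      # (k, d_in)
        out[i] = np.einsum('kd,kdo->o', window, w) + b
    return out

def predictor(h, w1, b1, w2, b2):
    """Hypothetical predictor: 1-D conv -> ReLU -> linear to one scalar per token."""
    a = np.maximum(conv1d(h, w1, b1), 0.0)        # (n, hidden)
    return (a @ w2 + b2).squeeze(-1)              # (n,) one value per token

rng = np.random.default_rng(0)
n, d, hidden, k = 6, 8, 16, 3                     # toy sizes, chosen arbitrarily
h = rng.normal(size=(n, d))                       # hidden representation from the first FFTr
w1 = rng.normal(size=(k, d, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.normal(size=(hidden, 1)) * 0.1
b2 = np.zeros(1)
p_hat = predictor(h, w1, b1, w2, b2)
print(p_hat.shape)                                # one pitch value per input token
```

The same module shape serves for both the duration and pitch heads; only the training target differs.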
where $\hat{d} \in \mathbb{N}^n$ and $\hat{p} \in \mathbb{R}^n$. Next, the pitch is projected to match the dimensionality of the hidden representation $h \in \mathbb{R}^{n \times d}$ and added to $h$. The resulting sum $g$ is discretely upsampled according to the durations (every $g_i$ is repeated $d_i$ times) and passed to the output FFTr, which produces the output mel-spectrogram sequence $\hat{y}$.
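The discrete upsampling step can be sketched in one line with NumPy (the toy shapes and duration values below are arbitrary):

```python
import numpy as np

# Summed hidden states g: n = 4 tokens, d = 3 channels.
g = np.arange(12, dtype=float).reshape(4, 3)
# Integer durations: how many output frames each token should span.
d_hat = np.array([2, 1, 3, 2])

# Repeat every g_i along the time axis d_i times,
# giving t = sum(d_hat) = 8 frame-level vectors for the output FFTr.
upsampled = np.repeat(g, d_hat, axis=0)
print(upsampled.shape)  # (8, 3)
```

Because the durations are integers and the repetition is a plain gather, this step stays fully parallel and differentiable with respect to $g$.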
Ground-truth $p$ and $d$ are used during training, while the predicted $\hat{p}$ and $\hat{d}$ are used during inference. The model optimizes the mean-squared error (MSE) between the predicted and ground-truth modalities.
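A minimal sketch of such a combined objective is below; the per-term weights `alpha` and `beta` are placeholder assumptions (the source does not state how the three MSE terms are balanced).

```python
import numpy as np

def total_loss(y_hat, y, p_hat, p, d_hat, d, alpha=0.1, beta=0.1):
    """Sum of MSE terms over mel frames, pitch, and durations.
    alpha/beta weightings are illustrative, not the paper's values."""
    mel_loss = np.mean((y_hat - y) ** 2)      # spectrogram reconstruction
    pitch_loss = np.mean((p_hat - p) ** 2)    # per-token average pitch
    dur_loss = np.mean((d_hat - d) ** 2)      # per-token duration
    return mel_loss + alpha * pitch_loss + beta * dur_loss
```

For example, with perfect predictions every term vanishes and the loss is zero; each term otherwise penalizes its own modality independently.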