FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the figure. It is based on FastSpeech and consists mainly of two feed-forward Transformer (FFTr) stacks: the first operates at the resolution of input tokens, the second at the resolution of output frames. Let $x = (x_1, \dots, x_n)$ be the sequence of input lexical units and $y = (y_1, \dots, y_t)$ the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $h = \text{FFTr}(x)$, which is used to predict the duration and average pitch of every character with a 1-D CNN:
$$\hat{d} = \text{DurationPredictor}(h), \qquad \hat{p} = \text{PitchPredictor}(h),$$
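The paper specifies only that each predictor is a 1-D CNN producing one scalar per input token. A minimal NumPy sketch of such a predictor follows; the conv-ReLU-linear layout, kernel size, and all weight shapes are illustrative assumptions, not the published architecture.

```python
import numpy as np

def conv1d(x, w, b):
    """1-D convolution along the token axis with 'same' zero padding.
    x: (n, d_in), w: (k, d_in, d_out), b: (d_out,) -> (n, d_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[2]))
    for i in range(x.shape[0]):
        window = xp[i:i + k]                      # (k, d_in)
        out[i] = np.einsum('kd,kdo->o', window, w) + b
    return out

def predictor(h, w1, b1, w2, b2):
    """Hypothetical predictor: 1-D conv -> ReLU -> linear to one scalar per token."""
    a = np.maximum(conv1d(h, w1, b1), 0.0)        # (n, hidden)
    return (a @ w2 + b2).squeeze(-1)              # (n,) one value per token

rng = np.random.default_rng(0)
n, d, hidden, k = 6, 8, 16, 3                     # toy sizes, chosen arbitrarily
h = rng.normal(size=(n, d))                       # hidden representation from the first FFTr
w1 = rng.normal(size=(k, d, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.normal(size=(hidden, 1)) * 0.1
b2 = np.zeros(1)
p_hat = predictor(h, w1, b1, w2, b2)
print(p_hat.shape)                                # one pitch value per input token
```

The same module shape serves for both the duration and pitch heads; only the training target differs.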
where $\hat{d} \in \mathbb{N}^n$ and $\hat{p} \in \mathbb{R}^n$. Next, the pitch is projected to match the dimensionality of the hidden representation $h \in \mathbb{R}^{n \times d}$ and added to $h$. The resulting sum $g$ is discretely upsampled according to the durations (every $g_i$ is repeated $d_i$ times) and passed to the output FFTr, which produces the output mel-spectrogram sequence $\hat{y}$.
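The discrete upsampling step can be sketched in one line with NumPy (the toy shapes and duration values below are arbitrary):

```python
import numpy as np

# Summed hidden states g: n = 4 tokens, d = 3 channels.
g = np.arange(12, dtype=float).reshape(4, 3)
# Integer durations: how many output frames each token should span.
d_hat = np.array([2, 1, 3, 2])

# Repeat every g_i along the time axis d_i times,
# giving t = sum(d_hat) = 8 frame-level vectors for the output FFTr.
upsampled = np.repeat(g, d_hat, axis=0)
print(upsampled.shape)  # (8, 3)
```

Because the durations are integers and the repetition is a plain gather, this step stays fully parallel and differentiable with respect to $g$.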
Ground-truth $p$ and $d$ are used during training, while the predicted $\hat{p}$ and $\hat{d}$ are used during inference. The model optimizes the mean-squared error (MSE) between the predicted and ground-truth modalities.
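A minimal sketch of such a combined objective is below; the per-term weights `alpha` and `beta` are placeholder assumptions (the source does not state how the three MSE terms are balanced).

```python
import numpy as np

def total_loss(y_hat, y, p_hat, p, d_hat, d, alpha=0.1, beta=0.1):
    """Sum of MSE terms over mel frames, pitch, and durations.
    alpha/beta weightings are illustrative, not the paper's values."""
    mel_loss = np.mean((y_hat - y) ** 2)      # spectrogram reconstruction
    pitch_loss = np.mean((p_hat - p) ** 2)    # per-token average pitch
    dur_loss = np.mean((d_hat - d) ** 2)      # per-token duration
    return mel_loss + alpha * pitch_loss + beta * dur_loss
```

For example, with perfect predictions every term vanishes and the loss is zero; each term otherwise penalizes its own modality independently.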