Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

CBHG

General · Introduced 2017 · 65 papers
Source Paper

Description

CBHG is a building block used in the Tacotron text-to-speech model. It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit (BiGRU).

The module is used to extract representations from sequences. The input sequence is first convolved with K sets of 1-D convolutional filters, where the k-th set contains C_k filters of width k (i.e. k = 1, 2, …, K). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to K-grams). The convolution outputs are stacked together and further max pooled along time to increase local invariances. A stride of 1 is used to preserve the original time resolution. The processed sequence is further passed to a few fixed-width 1-D convolutions, whose outputs are added with the original input sequence via residual connections. Batch normalization is used for all convolutional layers. The convolution outputs are fed into a multi-layer highway network to extract high-level features. Finally, a bidirectional GRU RNN is stacked on top to extract sequential features from both forward and backward context.
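The pipeline above (conv bank → max pool → projections with a residual → highway network → BiGRU) can be sketched as a PyTorch module. This is a minimal illustration, assuming PyTorch; the default sizes and the two projection widths are illustrative placeholders, not the exact Tacotron hyperparameters.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """One highway layer: gated mix of a transform and the identity."""
    def __init__(self, size):
        super().__init__()
        self.H = nn.Linear(size, size)  # transform path
        self.T = nn.Linear(size, size)  # gate path
    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * torch.relu(self.H(x)) + (1 - t) * x

class CBHG(nn.Module):
    def __init__(self, in_dim=128, K=16, proj_dim=128, highway_layers=4):
        super().__init__()
        # Bank of K conv sets: the k-th conv has kernel width k
        # (models unigrams, bigrams, ..., K-grams). BatchNorm on every conv.
        self.bank = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(in_dim, in_dim, kernel_size=k, padding=k // 2),
                nn.BatchNorm1d(in_dim),
                nn.ReLU(),
            )
            for k in range(1, K + 1)
        )
        # Max pooling along time with stride 1 preserves time resolution.
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
        # A few fixed-width 1-D conv projections; the last one maps back
        # to in_dim so the residual connection with the input type-checks.
        self.proj1 = nn.Sequential(
            nn.Conv1d(K * in_dim, proj_dim, kernel_size=3, padding=1),
            nn.BatchNorm1d(proj_dim),
            nn.ReLU(),
        )
        self.proj2 = nn.Sequential(
            nn.Conv1d(proj_dim, in_dim, kernel_size=3, padding=1),
            nn.BatchNorm1d(in_dim),
        )
        self.highways = nn.Sequential(
            *[Highway(in_dim) for _ in range(highway_layers)]
        )
        self.gru = nn.GRU(in_dim, in_dim, batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, time, in_dim)
        T = x.size(1)
        y = x.transpose(1, 2)        # (batch, in_dim, time) for Conv1d
        # Stack the K conv-bank outputs along the channel axis,
        # trimming the extra frame even-width kernels produce.
        y = torch.cat([conv(y)[:, :, :T] for conv in self.bank], dim=1)
        y = self.pool(y)[:, :, :T]
        y = self.proj2(self.proj1(y))
        y = y.transpose(1, 2) + x    # residual connection with the input
        y = self.highways(y)
        out, _ = self.gru(y)         # (batch, time, 2 * in_dim)
        return out
```

Because the BiGRU concatenates forward and backward states, the output feature dimension is twice the input dimension, which is why downstream Tacotron components size their inputs accordingly.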

Papers Using This Method

Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech (2024-10-29)
Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach (2024-09-10)
Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems (2024-09-04)
Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation (2024-04-03)
An overview of text-to-speech systems and media applications (2023-10-22)
Energy-Based Models For Speech Synthesis (2023-10-19)
The DeepZen Speech Synthesis System for Blizzard Challenge 2023 (2023-08-30)
Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration (2023-05-25)
A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers (2023-04-16)
ArmanTTS single-speaker Persian dataset (2023-04-07)
Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language (2022-12-16)
Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features (2022-11-01)
Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation (2022-10-31)
Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments (2022-10-31)
Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS (2022-10-24)
Facial Landmark Predictions with Applications to Metaverse (2022-09-29)
Self-supervised learning for robust voice cloning (2022-04-07)
Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis (2022-02-16)
Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention (2022-01-25)
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis (2021-11-19)