Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages
Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro
2024-01-24 · Voice Cloning
Abstract
In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by additionally training on 5 minutes of target-speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
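The submissions described above share a two-stage inference pipeline: an acoustic model (RAD-MMM for the few-shot tracks, P-Flow for the zero-shot track) predicts a mel spectrogram, and a HiFi-GAN vocoder converts it to a waveform. The sketch below illustrates that flow for the zero-shot case, where a short reference clip supplies the target speaker's identity. It is a minimal illustration only; the module interfaces (`acoustic_model`, `vocoder`, the `text`/`speech_prompt` arguments) are hypothetical placeholders and not the authors' released API.

```python
import torch

def synthesize(text: str,
               reference_wav: torch.Tensor,
               acoustic_model: torch.nn.Module,
               vocoder: torch.nn.Module) -> torch.Tensor:
    """Zero-shot TTS sketch: clone the speaker heard in `reference_wav` for `text`.

    Assumed (hypothetical) interfaces:
      acoustic_model(text=..., speech_prompt=...) -> mel spectrogram [1, n_mels, T]
      vocoder(mel) -> waveform [1, samples]  (e.g. a HiFi-GAN generator)
    """
    with torch.no_grad():
        # Stage 1: text plus a few seconds of speech prompt -> mel spectrogram.
        # In a zero-shot setup, the prompt conveys the target speaker identity.
        mel = acoustic_model(text=text, speech_prompt=reference_wav)

        # Stage 2: mel spectrogram -> waveform via the neural vocoder.
        audio = vocoder(mel)

    return audio.squeeze(0)
```

In the few-shot tracks, the same pipeline applies, except the acoustic model is first fine-tuned on the 5 minutes of target-speaker data rather than conditioned on a prompt at inference time.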