Abstract
This work investigates the performance of voice adaptation models for Swiss German dialects, i.e., translating Standard German text into Swiss German dialect speech. To this end, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5,000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it performs well in human and automated evaluations, reaching CMOS scores of up to -0.28 and SMOS scores of 3.8, and that it correctly renders the desired dialect. Our work is a step towards adapting voice cloning technology to underrepresented languages.
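As a reading aid for the reported scores, the following is a minimal sketch of how CMOS and SMOS values are commonly aggregated from listener ratings. It is not the authors' evaluation code. The rating scales (comparative ratings from -3 to +3, similarity ratings from 1 to 5), the function names, and the sample ratings are all assumptions made for illustration.

```python
from statistics import mean


def cmos(comparative_ratings):
    """Comparative MOS: mean of per-listener preference ratings.

    Assumed scale: -3 (system much worse than reference) to
    +3 (system much better). A value near 0 means parity.
    """
    if not all(-3 <= r <= 3 for r in comparative_ratings):
        raise ValueError("comparative ratings must lie in [-3, 3]")
    return mean(comparative_ratings)


def smos(similarity_ratings):
    """Similarity MOS: mean of speaker-similarity ratings.

    Assumed scale: 1 (different speaker) to 5 (same speaker).
    """
    if not all(1 <= r <= 5 for r in similarity_ratings):
        raise ValueError("similarity ratings must lie in [1, 5]")
    return mean(similarity_ratings)


# Hypothetical ratings, invented purely for illustration.
example_cmos = cmos([-1, 0, 0, 1, -1])
example_smos = smos([4, 4, 3, 5, 4])
```

Under these assumed scales, a slightly negative CMOS such as the reported -0.28 would indicate that listeners rated the synthesized speech only marginally below the reference it was compared against.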