Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units

Gallil Maimon, Yossi Adi

2022-12-19 | Rhythm, Voice Conversion

Abstract

We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre, and ignore people's unique speaking style (prosody). The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train. All conversion modules are trained only on reconstruction-like tasks and are thus suitable for any-to-many VC with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that DISSC significantly outperforms the evaluated baselines. Code and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/.
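
The abstract does not spell out how rhythm is handled once speech is encoded as discrete units. One common way to expose rhythm in unit-based pipelines is to collapse consecutive repeated units into (unit, duration) pairs, so that content and timing can be modified separately. The sketch below illustrates only that general idea on a toy sequence; the function names, the unit IDs, and the replacement durations are hypothetical and are not taken from the DISSC code.

```python
from itertools import groupby


def dedup_units(units):
    """Collapse runs of identical units into (unit, duration_in_frames) pairs."""
    return [(unit, sum(1 for _ in run)) for unit, run in groupby(units)]


def expand_units(pairs):
    """Inverse of dedup_units: repeat each unit for its duration."""
    frames = []
    for unit, duration in pairs:
        frames.extend([unit] * duration)
    return frames


# Toy frame-level unit sequence (hypothetical unit IDs).
frames = [12, 12, 12, 7, 7, 31, 31, 31, 31, 12]

pairs = dedup_units(frames)
content = [unit for unit, _ in pairs]            # what is said
durations = [duration for _, duration in pairs]  # how long each unit lasts

# Assumption: a rhythm-conversion module would supply new durations for the
# same content; here they are simply hard-coded for illustration.
new_durations = [2, 3, 2, 2]
converted = expand_units(zip(content, new_durations))
```

Expanding the original pairs reproduces the input sequence exactly, which is what makes a reconstruction-style training objective possible for the timing component in a setup like this.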

Results

Task                  Dataset  Metric                    Value  Model
Voice Conversion      VCTK     Phone Length Error (PLE)  0.023  DISSC
Voice Conversion      VCTK     Total Length Error (TLE)  0.832  DISSC
Voice Conversion      VCTK     Word Length Error (WLE)   0.056  DISSC
2D Classification     VCTK     Phone Length Error (PLE)  0.023  DISSC
2D Classification     VCTK     Total Length Error (TLE)  0.832  DISSC
2D Classification     VCTK     Word Length Error (WLE)   0.056  DISSC
1 Image, 2*2 Stitchi  VCTK     Phone Length Error (PLE)  0.023  DISSC
1 Image, 2*2 Stitchi  VCTK     Total Length Error (TLE)  0.832  DISSC
1 Image, 2*2 Stitchi  VCTK     Word Length Error (WLE)   0.056  DISSC

Related Papers

Exploring Adapter Design Tradeoffs for Low Resource Music Generation (2025-06-26)
CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment (2025-06-25)
Let Your Video Listen to Your Music! (2025-06-23)
From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training (2025-06-20)
DanceChat: Large Language Model-Guided Music-to-Dance Generation (2025-06-12)
RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding (2025-06-12)
Training-Free Voice Conversion with Factorized Optimal Transport (2025-06-11)
Rhythm Features for Speaker Identification (2025-06-07)