TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Low-Latency Real-Time Non-Parallel Voice Conversion based ...

Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction

Patrick Lumban Tobing, Tomoki Toda

2021-05-20Voice Conversion
PaperPDFCodeCode(official)

Abstract

This paper presents a low-latency real-time (LLRT) non-parallel voice conversion (VC) framework based on cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a robust non-parallel multispeaker spectral model, which utilizes a speaker-independent latent space and a speaker-dependent code to generate reconstructed/converted spectral features given the spectral features of an input speaker. On the other hand, MWDLP is an efficient and a high-quality neural vocoder that can handle multispeaker data and generate speech waveform for LLRT applications with CPU. To accommodate LLRT constraint with CPU, we propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral features and is built with a sparse network architecture. Further, to improve the modeling performance, we also propose a novel fine-tuning procedure that refines the frame-rate CycleVAE network by utilizing the waveform loss from the MWDLP network. The experimental results demonstrate that the proposed framework achieves high-performance VC, while allowing for LLRT usage with a single-core of $2.1$--$2.7$ GHz CPU on a real-time factor of $0.87$--$0.95$, including input/output, feature extraction, on a frame shift of $10$ ms, a window length of $27.5$ ms, and $2$ lookup frames.

Related Papers

RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding2025-06-12Training-Free Voice Conversion with Factorized Optimal Transport2025-06-11CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition2025-06-06Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion2025-06-04StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion2025-06-03Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech2025-06-02SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction2025-06-02LinearVC: Linear transformations of self-supervised features through the lens of voice conversion2025-06-02