Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Cross-modal information fusion for voice spoofing detection

Junxiao Xue, Hao Zhou, Huawei Song, Bin Wu, Lei Shi

Published: 2023-02-01 (journal)
Tasks: Voice Anti-spoofing · Automatic Speech Recognition · Speaker Verification · Fake Voice Detection · Speech Synthesis
Links: Paper · PDF · Code

Abstract

In recent years, speaker verification systems have been deployed in many production scenarios. Unfortunately, they remain vulnerable to various kinds of spoofing attacks, such as speech synthesis attacks and replay attacks. Researchers have proposed many defenses, but existing methods focus only on speech features. Recent studies have found that speech carries a large amount of face-related information: a speaker's gender, age, mouth shape, and other attributes can be inferred from voice alone, and this information can help distinguish spoofing attacks. Inspired by this observation, we propose a generalized framework named GACMNet and instantiate two different models to cope with different attack scenarios. The framework is divided into four phases: data pre-processing, feature extraction, feature fusion, and classification. Specifically, it consists of two branches: one extracts face features from speech with a convolutional neural network, while the other extracts speech features with a densely connected network. Furthermore, we design a global attention-based information fusion mechanism to weight the importance of each part of the features. Our solution proves effective in two large scenarios. Compared with existing methods, our model improves the tandem detection cost function (t-DCF) and equal error rate (EER) scores by 9% and 11%, respectively, in the logical access scenario, and improves the EER score by 10% in the physical access scenario.
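The abstract describes a global attention-based mechanism that fuses the embeddings from the two branches (face features and speech features). The paper's exact formulation is not reproduced here, so the following is a minimal numpy sketch of one plausible form: each branch embedding gets a scalar attention score, the scores are softmax-normalized, and the fused vector is the resulting convex combination. The names `face_feat`, `speech_feat`, and the learned score vector `w` are hypothetical placeholders.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def global_attention_fusion(face_feat, speech_feat, w):
    # stack the two branch embeddings into shape (2, d)
    feats = np.stack([face_feat, speech_feat])
    # one scalar attention score per branch via the score vector w
    scores = feats @ w                 # shape (2,)
    alpha = softmax(scores)            # branch weights, sum to 1
    # attention-weighted sum of the two embeddings -> shape (d,)
    return (alpha[:, None] * feats).sum(axis=0)

rng = np.random.default_rng(0)
d = 8
fused = global_attention_fusion(rng.normal(size=d),
                                rng.normal(size=d),
                                rng.normal(size=d))
print(fused.shape)  # (8,)
```

In this sketch the attention operates at the branch level; the paper's mechanism may instead weight individual feature dimensions or channels, which would replace the scalar score with a per-dimension score map.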

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks (2025-07-17)
- WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
- VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
- Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
- A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)