


Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

2024-06-25 | Synthetic Speech Detection, Audio Deepfake Detection

Abstract

Recent synthetic speech detectors that leverage the Transformer model outperform their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that, with only 0.03M additional parameters, the TCM module outperforms the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.

Results

Task                                 Dataset        Metric    Value (%)  Model
3D Reconstruction                    ASVspoof 2021  21DF EER  2.14       TCM-Add
3D Reconstruction                    ASVspoof 2021  21LA EER  2.99       TCM-Add
Speaker Verification                 ASVspoof 2021  21DF EER  2.14       TCM-Add
Speaker Verification                 ASVspoof 2021  21LA EER  2.99       TCM-Add
3D                                   ASVspoof 2021  21DF EER  2.14       TCM-Add
3D                                   ASVspoof 2021  21LA EER  2.99       TCM-Add
DeepFake Detection                   ASVspoof 2021  21DF EER  2.14       TCM-Add
DeepFake Detection                   ASVspoof 2021  21LA EER  2.99       TCM-Add
3D Shape Reconstruction from Videos  ASVspoof 2021  21DF EER  2.14       TCM-Add
3D Shape Reconstruction from Videos  ASVspoof 2021  21LA EER  2.99       TCM-Add
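
The EER (equal error rate) figures in the Results table are the operating point at which the false acceptance rate (spoofed speech accepted) equals the false rejection rate (bona fide speech rejected). The following is a minimal sketch of how such a value could be computed from detector scores. It is not the official ASVspoof evaluation code: the function name `compute_eer`, the toy scores, and the convention that a higher score means "more likely bona fide" are assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for two sets of detector scores.

    Assumption: a higher score means the utterance is more likely bona fide.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold (sorted, unique).
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide utterances scoring below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances scoring at or above it.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and
    # report their average as the EER.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return eer, thresholds[idx]


# Toy scores (hypothetical, not taken from the paper).
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
print(f"EER = {eer * 100:.2f}% at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```

With real evaluation sets the two error curves rarely cross exactly at an observed score, so evaluation toolkits usually interpolate between neighbouring thresholds; the nearest-point average used here is a simplification.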

Related Papers

IndieFake Dataset: A Benchmark Dataset for Audio Deepfake Detection (2025-06-23)
Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models (2025-06-17)
A Data-Driven Diffusion-based Approach for Audio Deepfake Explanations (2025-06-03)
Are Mamba-based Audio Foundation Models the Best Fit for Non-Verbal Emotion Recognition? (2025-06-02)
Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations (2025-06-01)
Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection (2025-05-30)
Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes (2025-05-29)
EnvSDD: Benchmarking Environmental Sound Deepfake Detection (2025-05-25)