Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection

Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang

Published: 2023-10-19 · Tasks: DeepFake Detection, Face Swapping

Abstract

Forged content shared widely on social media platforms is a major social problem that requires increased regulation and poses new challenges to the research community. The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos utilizes only the visual modality or only the audio modality. While some methods in the literature exploit both audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multi-modal deepfake datasets involving both acoustic and visual manipulations. Moreover, these existing methods are mostly based on CNNs and suffer from low detection accuracy. Inspired by the recent success of Transformers in various fields, and to address the challenges posed by deepfake technology, in this paper we propose an Audio-Visual Transformer-based Ensemble Network (AVTENet), a framework that considers both acoustic and visual manipulation to achieve effective video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multi-modal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that our best model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset.
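The abstract describes several single-modality and audio-visual transformer "experts" that reach a consensus in prediction. As a rough illustration only, the sketch below shows one common way such a consensus can be computed: score-level fusion, where each expert's fake probability is averaged and the mean is thresholded. The expert names and probability values are hypothetical placeholders, not the actual AVTENet experts or their fusion scheme.

```python
# Hedged sketch of score-level ensemble fusion, one plausible way
# multiple experts could "reach a consensus in prediction".
# All numbers below are invented for illustration.

def consensus(expert_probs, threshold=0.5):
    """Average per-expert fake probabilities and threshold the mean."""
    mean_p = sum(expert_probs) / len(expert_probs)
    label = "fake" if mean_p >= threshold else "real"
    return label, mean_p

# Hypothetical fake probabilities from a video-only, an audio-only,
# and an audio-visual expert for a single test clip:
video_p, audio_p, av_p = 0.91, 0.40, 0.78
label, score = consensus([video_p, audio_p, av_p])
print(label, round(score, 3))  # the averaged score decides the label
```

Hard majority voting over per-expert labels is an equally common alternative; soft fusion like the above retains each expert's confidence instead of discarding it at the vote.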

Results

Task               | Dataset     | Metric       | Value | Model
DeepFake Detection | FakeAVCeleb | Accuracy (%) | 98.57 | AVTENet

Related Papers

SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection (2025-07-07)
Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection (2025-07-03)
DDL: A Dataset for Interpretable Deepfake Detection and Localization in Real-World Scenarios (2025-06-29)
Post-training for Deepfake Speech Detection (2025-06-26)
Pay Less Attention to Deceptive Artifacts: Robust Detection of Compressed Deepfakes on Online Social Networks (2025-06-25)
IndieFake Dataset: A Benchmark Dataset for Audio Deepfake Detection (2025-06-23)