Papers With Code 2

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang

2023-11-05 · Self-Supervised Learning · DeepFake Detection · Face Swapping · Video Forensics
Paper · PDF

Abstract

Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. Manipulation of either modality (visual or audio) can only be discovered reliably by multimodal models that exploit both streams of information simultaneously. Previous methods mainly adopt unimodal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multimodal self-supervised learning (SSL) feature extractor that exploits inconsistency between the audio and visual modalities for multimodal video forgery detection. We use the transformer-based, SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor, and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT extracts visual features only from the lip region, we also adopt another transformer-based video model to exploit full-face features and capture the spatial and temporal artifacts introduced during deepfake generation. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
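The abstract describes running a multi-scale temporal convolutional network over fused audio-visual features to capture temporal correlation between the modalities. The sketch below is a hypothetical illustration of that general pattern only (it is not the authors' code, and the kernel weights, kernel sizes, and fusion scheme are assumptions): parallel temporal convolutions at several kernel sizes over a (time, channels) feature sequence, concatenated along the channel axis.

```python
# Hypothetical sketch, NOT the authors' implementation: a multi-scale
# temporal convolution block over fused audio-visual features.
import numpy as np

def conv1d_same(x, kernel):
    """Per-channel 1-D temporal convolution with 'same' padding.
    x: (T, C) feature sequence; kernel: (K,) weights shared across channels."""
    K = len(kernel)
    pad = K // 2
    xp = np.pad(x, ((pad, K - 1 - pad), (0, 0)))
    T, _ = x.shape
    out = np.zeros_like(x, dtype=float)
    for t in range(T):
        # Weighted sum of a K-frame temporal window, per channel.
        out[t] = np.tensordot(kernel, xp[t:t + K], axes=(0, 0))
    return out

def multiscale_temporal_block(features, kernel_sizes=(3, 5, 7)):
    """Run parallel temporal convolutions at several kernel sizes and
    concatenate the branch outputs along the channel axis."""
    branches = []
    for k in kernel_sizes:
        kern = np.ones(k) / k  # placeholder weights (moving average)
        branches.append(np.maximum(conv1d_same(features, kern), 0.0))  # ReLU
    return np.concatenate(branches, axis=1)

# Toy "fused audio-visual" sequence: 10 frames, 4 feature channels.
fused = np.random.default_rng(0).normal(size=(10, 4))
out = multiscale_temporal_block(fused)
print(out.shape)  # (10, 12): 3 branches x 4 channels each
```

In a trained model the per-branch kernels would be learned (and typically followed by normalization and a classifier head); the fixed moving-average weights here only stand in to keep the sketch runnable.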

Results

Task                 Dataset       Metric         Value   Model
DeepFake Detection   FakeAVCeleb   Accuracy (%)   99.29   AV-Lip-Sync+

Related Papers

A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection (2025-07-07)
Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection (2025-07-03)
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model (2025-07-01)