Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization

Zhixi Cai, Kalin Stefanov, Abhinav Dhall, Munawar Hayat

2022-04-13

Benchmarking, DeepFake Detection, Temporal Forgery Localization

Abstract

Due to its high societal impact, deepfake detection is receiving active attention in the computer vision community. Most deepfake detection methods rely on identity, facial attributes, and adversarial perturbation-based spatio-temporal modifications applied to the whole video or at random locations, while keeping the meaning of the content intact. However, a sophisticated deepfake may manipulate only a small segment of the video or audio, through which the meaning of the content can be, for example, completely inverted from a sentiment perspective. We introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF), explicitly designed for the task of learning temporal forgery localization. Specifically, the content-driven audio-visual manipulations are performed strategically to change the sentiment polarity of the whole video. Our baseline method for benchmarking the proposed dataset is a 3DCNN model, termed Boundary Aware Temporal Forgery Detection (BA-TFD), which is guided via contrastive, boundary matching, and frame classification loss functions. Our extensive quantitative and qualitative analysis demonstrates the proposed method's strong performance on the temporal forgery localization and deepfake detection tasks.
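To make the task concrete, the following is a minimal sketch of what temporal forgery localization asks for: turning per-frame fake scores into time segments and comparing a predicted segment with a ground-truth one by temporal intersection-over-union. This is not the BA-TFD method or the paper's evaluation code. The function names, the 0.5 threshold, and the example scores are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds); hypothetical type alias


def frames_to_segments(fake_probs: List[float], fps: float,
                       threshold: float = 0.5) -> List[Segment]:
    """Group consecutive frames scoring >= threshold into time segments.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    segments: List[Segment] = []
    start = None  # index of the first frame of the currently open segment
    for i, p in enumerate(fake_probs):
        if p >= threshold and start is None:
            start = i  # a fake run begins here
        elif p < threshold and start is not None:
            segments.append((start / fps, i / fps))  # run ended before frame i
            start = None
    if start is not None:
        # The run extends to the end of the clip.
        segments.append((start / fps, len(fake_probs) / fps))
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time segments (0.0 if they do not overlap)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Made-up per-frame scores for an 8-frame clip at 1 frame per second.
    scores = [0.1, 0.2, 0.9, 0.8, 0.95, 0.1, 0.1, 0.7]
    predicted = frames_to_segments(scores, fps=1.0)
    print(predicted)
    print(temporal_iou(predicted[0], (3.0, 6.0)))
```

A detector that only outputs one real/fake label per video cannot be scored this way, which is why a dataset with segment-level annotations is needed for this task.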

Results

Task	Dataset	Metric	Value	Model
DeepFake Detection	LAV-DF	AUC	0.99	BA-TFD
