Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MARLIN: Masked Autoencoder for facial video Representation LearnINg

Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, Munawar Hayat

2022-11-12 | CVPR 2023

Tasks: Emotion Classification, Action Classification, Representation Learning, Attribute, Sentiment Analysis, Facial Attribute Classification, DeepFake Detection, Facial Expression Recognition, Face Swapping, Facial Expression Recognition (FER), Multimodal Sentiment Analysis, Unconstrained Lip-synchronization

Abstract

This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from the densely masked facial regions, which mainly include eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), LS (29.36% gain for Frechet Inception Distance), and even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN.
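
The masked-reconstruction idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain uniform random masking rather than the facial-region-based masking the abstract describes, and the function names (`patch_grid`, `random_mask`, `masked_mse`) as well as the clip size, patch size, and 0.9 mask ratio are assumptions chosen only for illustration.

```python
import random


def patch_grid(frames, height, width, t_patch, patch):
    """List the (time, row, col) index of every spatio-temporal patch in a clip."""
    return [
        (t, r, c)
        for t in range(frames // t_patch)
        for r in range(height // patch)
        for c in range(width // patch)
    ]


def random_mask(patches, mask_ratio, seed=0):
    """Split patch indices into (visible, masked) by uniform random sampling.

    Assumption: uniform sampling stands in for a face-aware masking strategy.
    """
    rng = random.Random(seed)
    shuffled = list(patches)
    rng.shuffle(shuffled)
    n_masked = int(round(len(shuffled) * mask_ratio))
    return sorted(shuffled[n_masked:]), sorted(shuffled[:n_masked])


def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked patches only.

    `pred` and `target` map a patch index to a list of pixel values.
    """
    total, count = 0.0, 0
    for idx in masked:
        for p, t in zip(pred[idx], target[idx]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical clip: 16 frames of 224x224, patches of 2 frames x 16 x 16 pixels.
patches = patch_grid(16, 224, 224, t_patch=2, patch=16)
visible, masked = random_mask(patches, mask_ratio=0.9)
print(len(patches), len(visible), len(masked))
```

In a masked autoencoder of this kind, only the visible patches would be passed to the encoder, and the reconstruction loss would be evaluated on the masked ones, which is what `masked_mse` mimics on toy data.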

Results

Task | Dataset | Metric | Value | Model
Facial Recognition and Modelling | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
Facial Recognition and Modelling | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
Facial Recognition and Modelling | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
Facial Recognition and Modelling | CelebV-HQ | AUC | 0.9561 | MARLIN
Facial Recognition and Modelling | CelebV-HQ | Accuracy | 93.9 | MARLIN
Image Generation | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
Image Generation | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
Image Generation | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
3D Reconstruction | FaceForensics++ | AUC | 0.9377 | MARLIN (ViT-L)
3D Reconstruction | FaceForensics++ | AUC | 0.9305 | MARLIN (ViT-B)
3D Reconstruction | FaceForensics++ | AUC | 0.8863 | MARLIN (ViT-S)
Video | CelebV-HQ | AUC | 0.9406 | MARLIN
Video | CelebV-HQ | Accuracy | 95.48 | MARLIN
Sentiment Analysis | CMU-MOSEI | Accuracy | 74.83 | MARLIN (ViT-L)
Sentiment Analysis | CMU-MOSEI | Accuracy | 73.7 | MARLIN (ViT-B)
Sentiment Analysis | CMU-MOSEI | Accuracy | 72.69 | MARLIN (ViT-S)
Talking Head Generation | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
Talking Head Generation | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
Talking Head Generation | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
Face Generation | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
Face Generation | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
Face Generation | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
Text Classification | CMU-MOSEI | Accuracy | 80.63 | MARLIN (ViT-L)
Text Classification | CMU-MOSEI | Accuracy | 80.6 | MARLIN (ViT-B)
Text Classification | CMU-MOSEI | Accuracy | 80.38 | MARLIN (ViT-S)
Face Reconstruction | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
Face Reconstruction | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
Face Reconstruction | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
Face Reconstruction | CelebV-HQ | AUC | 0.9561 | MARLIN
Face Reconstruction | CelebV-HQ | Accuracy | 93.9 | MARLIN
3D | FaceForensics++ | AUC | 0.9377 | MARLIN (ViT-L)
3D | FaceForensics++ | AUC | 0.9305 | MARLIN (ViT-B)
3D | FaceForensics++ | AUC | 0.8863 | MARLIN (ViT-S)
3D | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
3D | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
3D | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
3D | CelebV-HQ | AUC | 0.9561 | MARLIN
3D | CelebV-HQ | Accuracy | 93.9 | MARLIN
DeepFake Detection | FaceForensics++ | AUC | 0.9377 | MARLIN (ViT-L)
DeepFake Detection | FaceForensics++ | AUC | 0.9305 | MARLIN (ViT-B)
DeepFake Detection | FaceForensics++ | AUC | 0.8863 | MARLIN (ViT-S)
3D Face Modelling | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
3D Face Modelling | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
3D Face Modelling | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
3D Face Modelling | CelebV-HQ | AUC | 0.9561 | MARLIN
3D Face Modelling | CelebV-HQ | Accuracy | 93.9 | MARLIN
3D Face Reconstruction | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
3D Face Reconstruction | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
3D Face Reconstruction | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
3D Face Reconstruction | CelebV-HQ | AUC | 0.9561 | MARLIN
3D Face Reconstruction | CelebV-HQ | Accuracy | 93.9 | MARLIN
Emotion Classification | CMU-MOSEI | Accuracy | 80.63 | MARLIN (ViT-L)
Emotion Classification | CMU-MOSEI | Accuracy | 80.6 | MARLIN (ViT-B)
Emotion Classification | CMU-MOSEI | Accuracy | 80.38 | MARLIN (ViT-S)
Classification | CMU-MOSEI | Accuracy | 80.63 | MARLIN (ViT-L)
Classification | CMU-MOSEI | Accuracy | 80.6 | MARLIN (ViT-B)
Classification | CMU-MOSEI | Accuracy | 80.38 | MARLIN (ViT-S)
10-shot image generation | LRS2 | FID | 3.452 | Wav2Lip + ViT + MARLIN
10-shot image generation | LRS2 | LSE-C | 5.528 | Wav2Lip + ViT + MARLIN
10-shot image generation | LRS2 | LSE-D | 7.127 | Wav2Lip + ViT + MARLIN
3D Shape Reconstruction from Videos | FaceForensics++ | AUC | 0.9377 | MARLIN (ViT-L)
3D Shape Reconstruction from Videos | FaceForensics++ | AUC | 0.9305 | MARLIN (ViT-B)
3D Shape Reconstruction from Videos | FaceForensics++ | AUC | 0.8863 | MARLIN (ViT-S)

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)