MARLIN: Masked Autoencoder for facial video Representation LearnINg

Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, Munawar Hayat

2022-11-12 | CVPR 2023

Tasks: Emotion Classification, Action Classification, Representation Learning, Attribute, Sentiment Analysis, Facial Attribute Classification, DeepFake Detection, Facial Expression Recognition (FER), Face Swapping, Multimodal Sentiment Analysis, Unconstrained Lip-synchronization

Abstract

This paper proposes a self-supervised approach to learn universal facial representations from videos that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor that performs consistently well across downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), and LS (29.36% gain for Fréchet Inception Distance), even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN.
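The masked-reconstruction idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the official repository for that); it is a generic masked-autoencoder objective written in NumPy under several assumptions: the patch size, the mask ratio, the `priority` weights that stand in for "facial regions are masked more often", and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

def patchify(video, t=2, p=16):
    """Split a (T, H, W, C) clip into flattened spatio-temporal patches."""
    T, H, W, C = video.shape
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # (T/t, H/p, W/p, t, p, p, C)
    return x.reshape(-1, t * p * p * C)

def sample_mask(priority, mask_ratio, rng):
    """Choose which patches to hide; higher priority means more likely hidden."""
    n = priority.size
    n_mask = int(round(mask_ratio * n))
    prob = priority / priority.sum()
    hidden = rng.choice(n, size=n_mask, replace=False, p=prob)
    mask = np.zeros(n, dtype=bool)
    mask[hidden] = True
    return mask

def masked_mse(target, prediction, mask):
    """Reconstruction error measured only on the hidden patches."""
    return float(np.mean((prediction[mask] - target[mask]) ** 2))

# Toy clip: 4 frames of 32x32 RGB, giving 2 * 2 * 2 = 8 patches.
rng = np.random.default_rng(0)
clip = rng.random((4, 32, 32, 3))
patches = patchify(clip)

# Hypothetical priority map: pretend patches 2..5 cover eyes, nose, and mouth.
priority = np.ones(patches.shape[0])
priority[2:6] = 3.0
mask = sample_mask(priority, mask_ratio=0.75, rng=rng)

# Stand-in "decoder": predict every hidden patch as the mean visible patch.
prediction = np.tile(patches[~mask].mean(axis=0), (patches.shape[0], 1))
loss = masked_mse(patches, prediction, mask)
```

In a real model the stand-in prediction would come from an encoder that sees only the visible patches and a decoder that fills in the hidden ones; the loss is computed on the hidden patches alone, which is what forces the encoder to learn facial structure rather than copy its input.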

Results

Task	Dataset	Metric	Value	Model
Facial Recognition and Modelling	LRS2	FID	3.452	Wav2Lip + ViT + MARLIN
Facial Recognition and Modelling	LRS2	LSE-C	5.528	Wav2Lip + ViT + MARLIN
Facial Recognition and Modelling	LRS2	LSE-D	7.127	Wav2Lip + ViT + MARLIN
Facial Recognition and Modelling	CelebV-HQ	AUC	0.9561	MARLIN
Facial Recognition and Modelling	CelebV-HQ	Accuracy	93.9	MARLIN
Video	CelebV-HQ	AUC	0.9406	MARLIN
Video	CelebV-HQ	Accuracy	95.48	MARLIN
Sentiment Analysis	CMU-MOSEI	Accuracy	74.83	MARLIN (ViT-L)
Sentiment Analysis	CMU-MOSEI	Accuracy	73.7	MARLIN (ViT-B)
Sentiment Analysis	CMU-MOSEI	Accuracy	72.69	MARLIN (ViT-S)
DeepFake Detection	FaceForensics++	AUC	0.9377	MARLIN (ViT-L)
DeepFake Detection	FaceForensics++	AUC	0.9305	MARLIN (ViT-B)
DeepFake Detection	FaceForensics++	AUC	0.8863	MARLIN (ViT-S)
Emotion Classification	CMU-MOSEI	Accuracy	80.63	MARLIN (ViT-L)
Emotion Classification	CMU-MOSEI	Accuracy	80.6	MARLIN (ViT-B)
Emotion Classification	CMU-MOSEI	Accuracy	80.38	MARLIN (ViT-S)
