VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

Date: 2021-04-22
Venue: NeurIPS 2021
Tasks (12): Video Retrieval, Image Classification, Action Classification, Audio Classification, Zero-Shot Video Retrieval, Self-Supervised Learning, Text to Video Retrieval, General Classification, Action Recognition, Retrieval, Action Recognition In Videos, Temporal Action Localization

Abstract

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks. Especially, VATT's vision Transformer achieves the top-1 accuracy of 82.1% on Kinetics-400, 83.6% on Kinetics-600, 72.7% on Kinetics-700, and 41.1% on Moments in Time, new records while avoiding supervised pre-training. Transferring to image classification leads to 78.7% top-1 accuracy on ImageNet compared to 64.7% by training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition by achieving the mAP of 39.4% on AudioSet without any supervised pre-training. VATT's source code is publicly available.
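
The abstract says VATT is trained with multimodal contrastive losses but does not give their exact form. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE-style loss between paired video and audio embeddings in NumPy. The function names, the `temperature` value, and the choice of a symmetric loss are assumptions made for this example, not details taken from the paper or its released code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of video_emb and row i of audio_emb are assumed to come from the
    same clip (the positive pair); every other row in the batch is treated
    as a negative. temperature=0.07 is an assumed, illustrative value.
    """
    v = l2_normalize(np.asarray(video_emb, dtype=float))
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    logits = v @ a.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])         # positives sit on the diagonal
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> audio
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)
```

With correctly paired embeddings the loss approaches zero, and it grows when the pairing is broken, which is the signal that pulls matching modalities together in a shared embedding space.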

Results

Task	Dataset	Metric	Value	Model
Video	MiT	Top-1 Accuracy	41.1	VATT-Large
Video	MiT	Top-5 Accuracy	67.7	VATT-Large
Video	Kinetics-400	Top-1 Accuracy	82.1	VATT-Large
Video	Kinetics-400	Top-5 Accuracy	95.5	VATT-Large
Video	Kinetics-600	Top-1 Accuracy	83.6	VATT-Large
Video	Kinetics-600	Top-5 Accuracy	96.6	VATT-Large
Audio Classification	AudioSet	AUC	0.971	VATT-Base
Audio Classification	AudioSet	Test mAP	0.394	VATT-Base
Audio Classification	AudioSet	d-prime	2.895	VATT-Base
Zero-Shot Video Retrieval	MSR-VTT	text-to-video Median Rank	49	VATT-MBS
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	29.7	VATT-MBS
Zero-Shot Video Retrieval	YouCook2	text-to-video Mean Rank	13	VATT-MBS
Zero-Shot Video Retrieval	YouCook2	text-to-video R@10	45.5	VATT-MBS
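
The retrieval rows above report R@10, median rank, and mean rank. As a hedged illustration of how such numbers are commonly computed, the sketch below derives them from a text-to-video similarity matrix. It assumes that query i has exactly one correct video, at index i, and that ties are resolved in favor of the correct video; the function name and these conventions are choices made for this example and are not taken from the VATT evaluation code.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute recall@K, median rank, and mean rank for text-to-video retrieval.

    sim[i, j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    true_scores = np.diag(sim)[:, None]
    # Rank of the correct video = 1 + number of videos scored strictly higher.
    ranks = 1 + (sim > true_scores).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float((ranks <= k).mean()) for k in ks}
    metrics["median_rank"] = float(np.median(ranks))
    metrics["mean_rank"] = float(ranks.mean())
    return metrics

# Toy example: three queries, three videos.
toy_sim = [
    [0.9, 0.1, 0.2],  # correct video ranked 1st
    [0.8, 0.5, 0.1],  # correct video ranked 2nd
    [0.3, 0.7, 0.6],  # correct video ranked 2nd
]
toy_metrics = retrieval_metrics(toy_sim)
```

Higher R@K is better, while lower median and mean ranks are better, so the two kinds of columns in the table read in opposite directions.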
