Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

2020-06-29 | NeurIPS 2020 12
Tasks: Audio Classification, Self-Supervised Audio Classification, Action Recognition In Videos, Self-Supervised Action Recognition


Abstract

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
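The abstract names deflation but does not describe its mechanics. As a minimal sketch of the general idea, and not necessarily the authors' implementation, one way to apply a video (3D) convolution to a static image is to sum its kernel over the temporal axis, giving a 2D kernel. Under the assumptions used here (a single channel, no temporal padding, and a video made of identical frames), the deflated 2D convolution reproduces the 3D output. All function names below are hypothetical.

```python
def conv3d_valid(video, kernel):
    """Valid cross-correlation of a (T, H, W) video with a (kt, kh, kw) kernel."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        frame = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            frame.append(row)
        out.append(frame)
    return out


def conv2d_valid(image, kernel2d):
    """2D valid cross-correlation, expressed as a 3D one with a single frame."""
    return conv3d_valid([image], [kernel2d])[0]


def deflate_kernel(kernel):
    """Collapse a (kt, kh, kw) kernel to (kh, kw) by summing over time (assumed scheme)."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [
        [sum(kernel[dt][dy][dx] for dt in range(kt)) for dx in range(kw)]
        for dy in range(kh)
    ]


# Toy data: a 3x3 image and a 2x2x2 spatio-temporal kernel.
image = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 1.0, 1.0]]
kernel = [
    [[1.0, 0.0], [0.0, -1.0]],
    [[0.5, 2.0], [1.0, 0.0]],
]

# A "static video": the same image repeated once per temporal kernel tap.
static_video = [image for _ in range(len(kernel))]

video_out = conv3d_valid(static_video, kernel)[0]        # 3D network on the static video
image_out = conv2d_valid(image, deflate_kernel(kernel))  # deflated network on the image
```

Comparing `video_out` and `image_out` element by element shows the two paths agree in this toy setting. With temporal zero-padding or normalization layers the match is no longer exact, so a real network would likely need some additional correction.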

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	UCF101 (finetuned)	3-fold Accuracy	91.5	MMV
Activity Recognition	UCF101	3-fold Accuracy	95.2	MMV TSM-50x2
Activity Recognition	Kinetics-600	Top-1 Accuracy	55.5	MMV
Activity Recognition	HMDB51 (finetuned)	Top-1 Accuracy	70.1	MMV
Audio Classification	AudioSet	Test mAP	0.309	MMV
Action Recognition	UCF101 (finetuned)	3-fold Accuracy	91.5	MMV
Action Recognition	UCF101	3-fold Accuracy	95.2	MMV TSM-50x2
Action Recognition	Kinetics-600	Top-1 Accuracy	55.5	MMV
Action Recognition	HMDB51 (finetuned)	Top-1 Accuracy	70.1	MMV
Classification	AudioSet	Test mAP	0.309	MMV
