Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

Published 2020-06-29 · NeurIPS 2020
Tasks: Audio Classification · Self-Supervised Audio Classification · Action Recognition in Videos · Self-Supervised Action Recognition
Links: Paper · PDF · Code (official)

Abstract

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
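The "fine-grained representations of the visual and audio modalities ... whilst also integrating text into a common embedding" idea can be sketched numerically. In the sketch below, all dimensions, projection matrices, and function names are illustrative placeholders, not the paper's architecture: the actual model uses learned backbones trained with contrastive (NCE-style) losses, whereas here random linear heads simply illustrate the fine-and-coarse embedding layout, where video and audio meet in a fine space and text joins video only in a coarser space reached by a further projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper's real feature sizes differ.
D_V, D_A, D_T = 64, 32, 16   # per-modality input feature dims
D_FINE, D_COARSE = 24, 8     # shared embedding spaces

# Random projection heads standing in for learned ones.
W_v_fine = rng.normal(size=(D_V, D_FINE))
W_a_fine = rng.normal(size=(D_A, D_FINE))
W_fine_coarse = rng.normal(size=(D_FINE, D_COARSE))  # fine -> coarse
W_t_coarse = rng.normal(size=(D_T, D_COARSE))

def l2norm(x):
    """Unit-normalise so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def embed(video, audio, text):
    """Map each modality into the fine-and-coarse spaces."""
    v_fine = l2norm(video @ W_v_fine)    # video in the fine space
    a_fine = l2norm(audio @ W_a_fine)    # audio shares the fine space
    # Text only meets video in the coarse space; video gets there
    # via a further projection from its fine embedding.
    v_coarse = l2norm(v_fine @ W_fine_coarse)
    t_coarse = l2norm(text @ W_t_coarse)
    return v_fine, a_fine, v_coarse, t_coarse

# One "clip": the similarities an NCE-style loss would push up
# for matching pairs and down for mismatched ones.
v, a, vc, tc = embed(rng.normal(size=D_V),
                     rng.normal(size=D_A),
                     rng.normal(size=D_T))
sim_va = float(v @ a)    # fine-grained video-audio agreement
sim_vt = float(vc @ tc)  # coarse video-text agreement
```

With random heads the similarities are meaningless; training would maximise them for co-occurring video/audio/text and minimise them for negatives sampled from other clips.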

Results

Task                  Dataset              Metric           Value  Model
Action Recognition    UCF101 (finetuned)   3-fold Accuracy  91.5   MMV
Action Recognition    UCF101               3-fold Accuracy  95.2   MMV TSM-50x2
Action Recognition    Kinetics-600         Top-1 Accuracy   55.5   MMV
Action Recognition    HMDB51 (finetuned)   Top-1 Accuracy   70.1   MMV
Audio Classification  AudioSet             Test mAP         0.309  MMV

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons (2025-06-24)
Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier (2025-06-23)
Adaptive Differential Denoising for Respiratory Sounds Classification (2025-06-03)
Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds (2025-05-29)
Patient-Aware Feature Alignment for Robust Lung Sound Classification: Cohesion-Separation and Global Alignment Losses (2025-05-28)