Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Audiovisual Masked Autoencoders

Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

2022-12-09 | ICCV 2023 | Representation Learning | Audio Classification

Abstract

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
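
As a rough illustration of the masked autoencoding idea the abstract refers to, the sketch below hides a random subset of tokens from a joint audio-video sequence and scores a reconstruction only on the hidden positions. It is not the authors' implementation: the token counts, the 0.75 mask ratio, the concatenation of both modalities into one sequence, and the zero-predicting stand-in decoder are all assumptions chosen for illustration, and `random_mask` and `masked_mse` are hypothetical helper names.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (visible, masked) lists by random permutation."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = list(range(num_tokens))
    rng.shuffle(perm)
    masked = sorted(perm[:num_masked])
    visible = sorted(perm[num_masked:])
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared error computed over the masked token positions only."""
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)

# Hypothetical inputs: 16 video tokens and 8 audio tokens, 4 features each.
video_tokens = [[rng.random() for _ in range(4)] for _ in range(16)]
audio_tokens = [[rng.random() for _ in range(4)] for _ in range(8)]

# Assumption: both modalities are concatenated into one token sequence.
tokens = video_tokens + audio_tokens

# Assumption: 75% of tokens are hidden from the encoder.
visible, masked = random_mask(len(tokens), 0.75, rng)

# Stand-in for an encoder-decoder: predicts zeros for every token.
prediction = [[0.0] * 4 for _ in tokens]

loss = masked_mse(tokens, prediction, masked)
print(len(visible), len(masked), loss)
```

In a real model, only the `visible` tokens would be passed to the encoder, and a decoder would predict the content at the `masked` positions. The paper compares several architectures and objectives within this framework, which this sketch does not attempt to reproduce.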

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Action | 46 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Noun | 56.4 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Verb | 71.4 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Action | 45.8 | Audiovisual Masked Autoencoder (Video-only, Single) |
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Noun | 55.9 | Audiovisual Masked Autoencoder (Video-only, Single) |
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Verb | 70.8 | Audiovisual Masked Autoencoder (Video-only, Single) |
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Action | 19.7 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Noun | 27.2 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Audio Classification | EPIC-KITCHENS-100 | Top-1 Verb | 52.7 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Audio Classification | AudioSet | Test mAP | 0.518 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Audio Classification | AudioSet | Test mAP | 0.466 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Audio Classification | VGGSound | Top 1 Accuracy | 65 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Audio Classification | VGGSound | Top 1 Accuracy | 57.2 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Action | 46 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Noun | 56.4 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Verb | 71.4 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Action | 45.8 | Audiovisual Masked Autoencoder (Video-only, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Noun | 55.9 | Audiovisual Masked Autoencoder (Video-only, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Verb | 70.8 | Audiovisual Masked Autoencoder (Video-only, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Action | 19.7 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Noun | 27.2 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Classification | EPIC-KITCHENS-100 | Top-1 Verb | 52.7 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Classification | AudioSet | Test mAP | 0.518 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Classification | AudioSet | Test mAP | 0.466 | Audiovisual Masked Autoencoder (Audio-only, Single) |
| Classification | VGGSound | Top 1 Accuracy | 65 | Audiovisual Masked Autoencoder (Audiovisual, Single) |
| Classification | VGGSound | Top 1 Accuracy | 57.2 | Audiovisual Masked Autoencoder (Audio-only, Single) |

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)