Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

2021-04-22 | NeurIPS 2021
Tasks: Video Retrieval, Image Classification, Action Classification, Audio Classification, Zero-Shot Video Retrieval, Self-Supervised Learning, Text to Video Retrieval, General Classification, Action Recognition, Retrieval, Action Recognition In Videos, Temporal Action Localization
Paper | PDF | Code (official) | additional community implementations

Abstract

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer that shares weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures on the downstream tasks. In particular, VATT's vision Transformer achieves top-1 accuracy of 82.1% on Kinetics-400, 83.6% on Kinetics-600, 72.7% on Kinetics-700, and 41.1% on Moments in Time, setting new records while avoiding supervised pre-training. Transferring to image classification yields 78.7% top-1 accuracy on ImageNet, compared to 64.7% when training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record for waveform-based audio event recognition, achieving an mAP of 39.4% on AudioSet without any supervised pre-training. VATT's source code is publicly available.
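The abstract describes aligning modalities with multimodal contrastive losses. Below is a minimal sketch of a pairwise InfoNCE-style objective of that general kind, applied to video and audio embeddings; the tensor names, embedding dimension, and temperature value are illustrative assumptions, not the paper's exact configuration (VATT additionally uses a MIL-NCE-style loss for the video-text pair).

```python
# Sketch of a symmetric InfoNCE (noise-contrastive) loss between two modalities.
# Positives are matching rows of the two batches; all other rows are negatives.
import torch
import torch.nn.functional as F

def nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """z_a, z_b: projected embeddings of shape (B, D) from two modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature               # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: video and audio clips from the same sample form the positive pairs.
video_emb = torch.randn(8, 512)
audio_emb = torch.randn(8, 512)
loss = nce_loss(video_emb, audio_emb)
```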

Results

Task | Dataset | Metric | Value | Model
Video | MiT | Top-1 Accuracy | 41.1 | VATT-Large
Video | MiT | Top-5 Accuracy | 67.7 | VATT-Large
Video | Kinetics-400 | Top-1 Accuracy | 82.1 | VATT-Large
Video | Kinetics-400 | Top-5 Accuracy | 95.5 | VATT-Large
Video | Kinetics-600 | Top-1 Accuracy | 83.6 | VATT-Large
Video | Kinetics-600 | Top-5 Accuracy | 96.6 | VATT-Large
Audio Classification | AudioSet | AUC | 0.971 | VATT-Base
Audio Classification | AudioSet | Test mAP | 0.394 | VATT-Base
Audio Classification | AudioSet | d-prime | 2.895 | VATT-Base
Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 49 | VATT-MBS
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 29.7 | VATT-MBS
Zero-Shot Video Retrieval | YouCook2 | text-to-video Mean Rank | 13 | VATT-MBS
Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 45.5 | VATT-MBS
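
The zero-shot retrieval numbers above (text-to-video R@10 and median/mean rank) are conventionally computed from a text-by-video similarity matrix. A minimal sketch follows, assuming the correct video for each text query lies on the diagonal; the function and variable names are illustrative, not from the paper's released code.

```python
# Sketch of standard text-to-video retrieval metrics from a similarity matrix.
import numpy as np

def retrieval_metrics(sim: np.ndarray, k: int = 10):
    """sim[i, j] = similarity of text query i to video j; ground truth is the diagonal."""
    order = np.argsort(-sim, axis=1)                       # videos ranked per query, best first
    # Rank (1-based) at which each query's correct video appears.
    ranks = np.where(order == np.arange(sim.shape[0])[:, None])[1] + 1
    recall_at_k = 100.0 * float(np.mean(ranks <= k))       # reported as a percentage
    median_rank = float(np.median(ranks))
    return recall_at_k, median_rank

# Toy usage with random scores for 100 text-video pairs.
sim = np.random.randn(100, 100)
r10, med_r = retrieval_metrics(sim, k=10)
```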

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)