Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Attention Bottlenecks for Multimodal Fusion

Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

2021-06-30 · NeurIPS 2021
Tasks: Action Classification · Audio Classification · Video Classification · Action Recognition
Links: Paper · PDF · Code (official)

Abstract

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ("late-fusion") is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses "fusion bottlenecks" for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
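The restricted attention pattern described above can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch (not the authors' implementation): it omits learned Q/K/V projections, multi-head attention, MLP blocks, and layer normalisation, and assumes (for illustration only) that the bottleneck update is the average of the two modality-specific updates. All function names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Single-head attention with Q = K = V = tokens (projections omitted).
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

def bottleneck_fusion_layer(audio, video, bottleneck):
    """One fusion layer: each modality attends only over its own tokens
    plus a small set of shared bottleneck latents, so cross-modal
    information must flow through the bottleneck."""
    d = audio.shape[-1]
    nb = bottleneck.shape[0]
    # Audio stream attends over [audio tokens; bottleneck tokens].
    a_out = self_attention(np.concatenate([audio, bottleneck]), d)
    audio_new, b_from_audio = a_out[:-nb], a_out[-nb:]
    # Video stream attends over [video tokens; bottleneck tokens].
    v_out = self_attention(np.concatenate([video, bottleneck]), d)
    video_new, b_from_video = v_out[:-nb], v_out[-nb:]
    # Assumed for illustration: merge the two bottleneck updates by averaging.
    bottleneck_new = 0.5 * (b_from_audio + b_from_video)
    return audio_new, video_new, bottleneck_new

rng = np.random.default_rng(0)
audio = rng.standard_normal((16, 8))   # 16 audio tokens, dim 8
video = rng.standard_normal((32, 8))   # 32 video tokens, dim 8
bneck = rng.standard_normal((4, 8))    # 4 bottleneck latents
audio, video, bneck = bottleneck_fusion_layer(audio, video, bneck)
```

Note the cost argument from the abstract: each attention call scores at most (tokens + bottlenecks)² pairs per modality, rather than (audio + video tokens)² for full pairwise cross-modal attention, which is where the computational saving comes from when the bottleneck is small.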

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Video | Kinetics-Sounds | Top 1 Accuracy | 85 | MBT (AV)
Video | Kinetics-Sounds | Top 5 Accuracy | 96.8 | MBT (AV)
Video | MiT | Top 1 Accuracy | 37.3 | MBT (AV)
Video | MiT | Top 5 Accuracy | 61.2 | MBT (AV)
Video | Kinetics-400 | Acc@1 | 80.8 | MBT (AV)
Video | Kinetics-400 | Acc@5 | 94.6 | MBT (AV)
Action Recognition | EPIC-KITCHENS-100 | Action@1 | 43.4 | MBT
Action Recognition | EPIC-KITCHENS-100 | Noun@1 | 58 | MBT
Action Recognition | EPIC-KITCHENS-100 | Verb@1 | 64.8 | MBT
Audio Classification | AudioSet | Test mAP | 0.496 | MBT (AS-500K training + Video)
Audio Classification | VGGSound | Top 1 Accuracy | 52.3 | MBT (A)
Audio Classification | VGGSound | Top 5 Accuracy | 78.1 | MBT (A)
Audio Classification | VGGSound | Top 1 Accuracy | 51.2 | MBT (V)
Audio Classification | VGGSound | Top 5 Accuracy | 72.6 | MBT (V)
Audio Classification | VGGSound | Top 5 Accuracy | 85.6 | MBT (AV)

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)