Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and share only what is necessary. We find that such a strategy improves fusion performance while also reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.
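The bottleneck idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released implementation. It assumes a single attention head, omits residual connections, layer normalisation and MLP blocks, and assumes that the two per-modality bottleneck updates are merged by averaging. All function names, token counts and dimensions here are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Plain single-head scaled dot-product self-attention over one token set.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def bottleneck_fusion_layer(video, audio, bottleneck, params_v, params_a):
    # Each modality attends only to its own tokens plus the shared bottleneck
    # tokens; video and audio tokens never attend to each other directly.
    n_v, n_a = len(video), len(audio)
    out_v = self_attention(np.concatenate([video, bottleneck]), *params_v)
    out_a = self_attention(np.concatenate([audio, bottleneck]), *params_a)
    new_video, bn_from_video = out_v[:n_v], out_v[n_v:]
    new_audio, bn_from_audio = out_a[:n_a], out_a[n_a:]
    # Assumption: merge the two bottleneck updates by averaging.
    new_bottleneck = 0.5 * (bn_from_video + bn_from_audio)
    return new_video, new_audio, new_bottleneck

def attention_pairs(n_v, n_a, n_b):
    # Number of query-key pairs: full pairwise attention over all tokens
    # versus two per-modality attentions that each include the bottleneck.
    full = (n_v + n_a) ** 2
    bottlenecked = (n_v + n_b) ** 2 + (n_a + n_b) ** 2
    return full, bottlenecked

# Illustrative sizes only: 8 video tokens, 6 audio tokens, 2 bottleneck tokens.
rng = np.random.default_rng(0)
dim = 16
video = rng.normal(size=(8, dim))
audio = rng.normal(size=(6, dim))
bottleneck = rng.normal(size=(2, dim))
params_v = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]
params_a = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]

new_video, new_audio, new_bottleneck = bottleneck_fusion_layer(
    video, audio, bottleneck, params_v, params_a
)
```

In this sketch, cross-modal information reaches the video tokens only through the bottleneck tokens, so stacking several such layers is what lets audio influence video and vice versa. The `attention_pairs` helper gives a rough sense of the cost argument: when the number of bottleneck tokens is small, the two restricted attentions together involve fewer query-key pairs than one attention over all tokens.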
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-Sounds | Top 1 Accuracy | 85 | MBT (AV) |
| Video | Kinetics-Sounds | Top 5 Accuracy | 96.8 | MBT (AV) |
| Video | MiT | Top 1 Accuracy | 37.3 | MBT (AV) |
| Video | MiT | Top 5 Accuracy | 61.2 | MBT (AV) |
| Video | Kinetics-400 | Acc@1 | 80.8 | MBT (AV) |
| Video | Kinetics-400 | Acc@5 | 94.6 | MBT (AV) |
| Activity Recognition | EPIC-KITCHENS-100 | Action@1 | 43.4 | MBT |
| Activity Recognition | EPIC-KITCHENS-100 | Noun@1 | 58 | MBT |
| Activity Recognition | EPIC-KITCHENS-100 | Verb@1 | 64.8 | MBT |
| Audio Classification | AudioSet | Test mAP | 0.496 | MBT (AS-500K training + Video) |
| Audio Classification | VGGSound | Top 1 Accuracy | 52.3 | MBT (A) |
| Audio Classification | VGGSound | Top 5 Accuracy | 78.1 | MBT (A) |
| Audio Classification | VGGSound | Top 1 Accuracy | 51.2 | MBT (V) |
| Audio Classification | VGGSound | Top 5 Accuracy | 72.6 | MBT (V) |
| Audio Classification | VGGSound | Top 5 Accuracy | 85.6 | MBT (AV) |