Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass
Conventional audio-visual models have independent audio and video branches. In this work, we unify the two branches by designing a Unified Audio-Visual Model (UAVM). UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that its modality-independent counterparts do not have.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio-Visual Event Classification | VGGSound | Top-1 Accuracy (%) | 65.8 | UAVM (Audio + Video) |
| Audio-Visual Event Classification | VGGSound | Top-1 Accuracy (%) | 56.5 | UAVM (Audio Only) |
| Audio-Visual Event Classification | VGGSound | Top-1 Accuracy (%) | 49.9 | UAVM (Video Only) |
| Audio-Visual Event Classification | AudioSet | Test mAP | 0.504 | UAVM (Audio + Video) |
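To make the "unified branches" idea concrete, here is a minimal sketch of shared-weight multi-modal classification: modality-specific projections map audio and video features into a common space, and a single shared head classifies either modality (or a fusion of both). This is an illustrative toy, not the paper's actual transformer architecture; all dimensions and weight initializations below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- NOT taken from the paper.
AUDIO_DIM, VIDEO_DIM = 128, 512   # modality-specific feature sizes
SHARED_DIM, N_CLASSES = 256, 309  # VGGSound has 309 classes

# Modality-specific projections map each input into a shared space.
W_audio = rng.standard_normal((AUDIO_DIM, SHARED_DIM)) * 0.02
W_video = rng.standard_normal((VIDEO_DIM, SHARED_DIM)) * 0.02

# A single classifier head is applied to BOTH modalities --
# this weight sharing is the "unified" part of the design.
W_shared = rng.standard_normal((SHARED_DIM, N_CLASSES)) * 0.02

def classify(audio_feat=None, video_feat=None):
    """Run the shared head on whichever modalities are present,
    averaging the per-modality logits when both are available."""
    logits = []
    if audio_feat is not None:
        logits.append(audio_feat @ W_audio @ W_shared)
    if video_feat is not None:
        logits.append(video_feat @ W_video @ W_shared)
    return np.mean(logits, axis=0)

# One clip: pooled audio and video embeddings (random stand-ins).
a = rng.standard_normal(AUDIO_DIM)
v = rng.standard_normal(VIDEO_DIM)

audio_only = classify(audio_feat=a)               # unimodal inference
fused = classify(audio_feat=a, video_feat=v)      # audio-visual inference
print(audio_only.shape, fused.shape)
```

Because the same head serves every modality, the model can be evaluated with audio only, video only, or both, which matches the three UAVM rows in the results table above.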