Metric: Accuracy (Acc; higher is better)
| # | Model | Acc | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VAST | 80.7 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 2 | CoQo(Internvideo2) | 79.6 | No | - | - | - |
| 3 | VALOR | 78.9 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 4 | CAD | 78.26 | No | CAD -- Contextual Multi-modal Alignment for Dyna... | 2023-10-25 | - |
| 5 | LAVISH | 77.08 | No | Vision Transformers are Parameter-Efficient Audi... | 2022-12-15 | Code |
| 6 | ST-AVQA | 71.52 | No | Learning to Answer Questions in Dynamic Audio-Vi... | 2022-03-26 | Code |