Metric: Accuracy (higher is better)
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Text + Text (no Multimodal Pretext Training) | 93.2 | No | Towards Fast Adaptation of Pretrained Contrastiv... | 2022-06-05 | Code |
| 2 | FrozenBiLM | 86.7 | Yes | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 3 | Just Ask | 84.4 | Yes | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |
| 4 | SeViLA | 83.7 | No | - | - | - |
| 5 | HERO w/ pre-training | 77.75 | No | HERO: Hierarchical Encoder for Video+Language Om... | 2020-05-01 | Code |
| 6 | ATP | 65.1 | No | Revisiting the "Video" in Video-Language Underst... | 2022-06-03 | Code |
| 7 | FrozenBiLM (0-shot) | 58.4 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 8 | Just Ask (0-shot) | 51.1 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |