Metric: Accuracy (higher is better)
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Text + Text (no Multimodal Pretext Training) | 40.2 | No | Towards Fast Adaptation of Pretrained Contrastiv... | 2022-06-05 | Code |
| 2 | FrozenBiLM | 39.6 | Yes | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 3 | VideoCoCa | 39 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 4 | Co-Tokenization | 38.2 | No | Video Question Answering with Iterative Video-Te... | 2022-08-01 | - |
| 5 | Just Ask (fine-tune) | 35.4 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |
| 6 | FrozenBiLM (0-shot) | 26.8 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 7 | Just Ask (0-shot) | 12.2 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |