Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
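Step (iii) above — zero-shot inference via masked language modeling — can be sketched as follows. The idea is to append a `[MASK]` token to the question prompt and rank a closed vocabulary of candidate answers by the masked LM's distribution at that position. This is a minimal illustration only: the `toy_logits` dictionary is a hypothetical stand-in for the frozen BiLM's output logits, not the actual model.

```python
import math

def select_answer(mask_logits, answer_vocab):
    """Pick the candidate answer with the highest log-probability
    under the masked LM's distribution at the [MASK] position.

    mask_logits: dict mapping vocabulary token -> raw logit at the mask.
    answer_vocab: list of candidate answer tokens to rank.
    """
    # Numerically stable log-softmax over the full vocabulary.
    m = max(mask_logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in mask_logits.values()))
    logprobs = {tok: v - log_z for tok, v in mask_logits.items()}
    # Restrict to the answer vocabulary and take the argmax.
    return max(answer_vocab, key=lambda a: logprobs.get(a, float("-inf")))

# Hypothetical logits a masked LM might produce for the prompt
# "Question: what animal is shown in the video? Answer: [MASK]."
toy_logits = {"dog": 3.1, "cat": 2.4, "car": -1.0, "the": 0.5}
print(select_answer(toy_logits, ["dog", "cat", "car"]))  # → dog
```

Restricting the softmax argmax to a fixed answer vocabulary is what turns open-ended mask filling into a VideoQA prediction; with real logits, the same ranking step applies unchanged.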
| Task | Dataset | Metric | Accuracy (%) | Model |
|---|---|---|---|---|
| Video Question Answering | LSMDC-FiB | Accuracy | 51.5 | FrozenBiLM (0-shot) |
| Video Question Answering | iVQA | Accuracy | 26.8 | FrozenBiLM (0-shot) |
| Video Question Answering | MSRVTT-QA | Accuracy | 16.7 | FrozenBiLM (0-shot) |
| Video Question Answering | MSVD-QA | Accuracy | 33.8 | FrozenBiLM (0-shot) |
| Video Question Answering | ActivityNet-QA | Accuracy | 25.9 | FrozenBiLM (0-shot) |
| Video Question Answering | TGIF-QA | Accuracy | 41.9 | FrozenBiLM (0-shot) |
| Video Question Answering | How2QA | Accuracy | 58.4 | FrozenBiLM (0-shot) |
| Video Question Answering | TVQA | Accuracy | 59.7 | FrozenBiLM (0-shot, with speech) |
| Video Question Answering | TVQA | Accuracy | 29.7 | FrozenBiLM (0-shot, no speech) |
| Video Question Answering | EgoSchema (full set) | Accuracy | 26.9 | FrozenBiLM (0-shot) |
| Video Question Answering | ActivityNet-QA | Accuracy | 24.7 | FrozenBiLM |
| Video Question Answering | iVQA | Accuracy | 39.6 | FrozenBiLM |
| Video Question Answering | MSRVTT-QA | Accuracy | 47.0 | FrozenBiLM |
| Video Question Answering | MSVD-QA | Accuracy | 54.8 | FrozenBiLM |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.2 | FrozenBiLM |
| Video Question Answering | How2QA | Accuracy | 86.7 | FrozenBiLM |
| Video Question Answering | TVQA | Accuracy | 82.0 | FrozenBiLM |