Text + Text (no Multimodal Pretext Training)

Reported on 3 benchmarks across 1 task · 1 paper · 3 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Reasoning: 3 results

Video Question Answering on ActivityNet-QA
Accuracy · 2022-06-05
41.4
best: 61.6 (Tarsier (34B))
SOTA
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval arXiv:2206.02082
Video Question Answering on iVQA
Accuracy · 2022-06-05
40.2
SOTA
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval arXiv:2206.02082
Video Question Answering on How2QA
Accuracy · 2022-06-05
93.2
SOTA
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval arXiv:2206.02082