Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VideoCoCa

Reported on 41 benchmarks across 5 tasks · 1 paper · 28 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
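The matching rule in the note can be illustrated with a minimal sketch. The `Result` record, the helper names, and the rows marked hypothetical are assumptions made for illustration only; they are not the site's actual schema or code. The sketch groups rows by the exact model-name string and flags names that are reported by more than one paper, which is the case the note warns about.

```python
from collections import defaultdict
from typing import NamedTuple


class Result(NamedTuple):
    # Hypothetical record layout, not the site's real schema.
    model: str      # model name exactly as reported
    paper: str      # paper identifier
    benchmark: str
    metric: str
    value: float


def group_by_exact_name(results):
    """Group results by the exact model-name string (case-sensitive, no normalization)."""
    groups = defaultdict(list)
    for r in results:
        groups[r.model].append(r)
    return dict(groups)


def papers_per_model(groups):
    """Map each model name to the set of distinct papers that report it."""
    return {name: {r.paper for r in rows} for name, rows in groups.items()}


results = [
    # Two rows taken from this page.
    Result("VideoCoCa", "arXiv:2212.04979", "VATEX", "CIDEr", 77.8),
    Result("VideoCoCa", "arXiv:2212.04979", "MSR-VTT", "CIDEr", 73.2),
    # Hypothetical rows with placeholder values, only to illustrate the caveat.
    Result("VideoCoCa", "hypothetical-paper-B", "VATEX", "CIDEr", 0.0),
    Result("VideoCoCa-small", "hypothetical-paper-B", "VATEX", "CIDEr", 0.0),
]

groups = group_by_exact_name(results)
papers = papers_per_model(groups)

for name, rows in groups.items():
    note = " (several papers share this name)" if len(papers[name]) > 1 else ""
    print(f"{name}: {len(rows)} results{note}")
```

Under exact matching, the hypothetical "VideoCoCa-small" row stays separate, while the hypothetical second paper that reuses the name "VideoCoCa" is merged into the same group even though it might describe a different variant.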

Computer Vision · 37 results

Video Captioning on VATEX
BLEU-4 · uses extra data · 2022-12-09
39.7
best: 45.6 (VALOR)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on VATEX
CIDEr · uses extra data · 2022-12-09
77.8
best: 99.5 (VAST)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on VATEX
ROUGE-L · uses extra data · 2022-12-09
54.5
best: 57.4 (VALOR)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on ActivityNet Captions
BLEU-4 · uses extra data · 2022-12-09
14.7
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on ActivityNet Captions
CIDEr · uses extra data · 2022-12-09
39.3
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Action Recognition on UCF101
Top-5 Accuracy · uses extra data · 2022-12-09
98.4
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Action Recognition on Kinetics
Top-1 Accuracy · uses extra data · 2022-12-09
70.1
best: 78.1 (TC-CLIP)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Action Recognition on Charades
mAP · uses extra data · 2022-12-09
25.8
best: 35.59 (MSQNet)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Action Recognition on HMDB51
Top-5 Accuracy · uses extra data · 2022-12-09
84.5
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on VATEX
text-to-video R@1 · uses extra data · 2022-12-09
53.2
best: 83.9 (GRAM)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on VATEX
text-to-video R@10 · uses extra data · 2022-12-09
90.1
best: 99.5 (GRAM)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on VATEX
text-to-video R@5 · uses extra data · 2022-12-09
83.3
best: 94 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on VATEX
video-to-text R@1 · uses extra data · 2022-12-09
73.6
best: 85.4 (InternVideo2-1B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on VATEX
video-to-text R@10 · uses extra data · 2022-12-09
97.2
best: 99.3 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on VATEX
video-to-text R@5 · uses extra data · 2022-12-09
93.2
best: 97.9 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on MSR-VTT-full
text-to-video R@1 · uses extra data · 2022-12-09
34.3
best: 46.3 (InternVL-G)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on MSR-VTT-full
text-to-video R@10 · uses extra data · 2022-12-09
67
best: 79.6 (InternVL-G)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on MSR-VTT-full
text-to-video R@5 · uses extra data · 2022-12-09
57.8
best: 70.5 (InternVL-G)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on MSR-VTT-full
video-to-text R@1 · uses extra data · 2022-12-09
64.7
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on MSR-VTT-full
video-to-text R@10 · uses extra data · 2022-12-09
91.4
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on MSR-VTT-full
video-to-text R@5 · uses extra data · 2022-12-09
85.2
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on ActivityNet
text-to-video R@1 · uses extra data · 2022-12-09
34.5
best: 63.2 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on ActivityNet
text-to-video R@10 · uses extra data · 2022-12-09
76.6
best: 92.5 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on ActivityNet
text-to-video R@5 · uses extra data · 2022-12-09
63.2
best: 85.6 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on ActivityNet
video-to-text R@1 · uses extra data · 2022-12-09
33
best: 56.5 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on ActivityNet
video-to-text R@10 · uses extra data · 2022-12-09
75.3
best: 90.3 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Video Retrieval on ActivityNet
video-to-text R@5 · uses extra data · 2022-12-09
61.6
best: 82.8 (InternVideo2-6B)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on MSR-VTT
BLEU-4 · uses extra data · 2022-12-09
53.8
best: 57.8 (mPLUG-2)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on MSR-VTT
CIDEr · uses extra data · 2022-12-09
73.2
best: 80 (mPLUG-2)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on MSR-VTT
ROUGE-L · uses extra data · 2022-12-09
68
best: 70.1 (mPLUG-2)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on YouCook2
BLEU-4 · uses extra data · 2022-12-09
14.2
best: 18.2 (VAST)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on YouCook2
CIDEr · uses extra data · 2022-12-09
1.28
best: 116.4 (HowToCaption)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on YouCook2
ROUGE-L · uses extra data · 2022-12-09
37.7
best: 47.04 (UniVL + MELTR)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Captioning on ActivityNet Captions
ROUGE-L · uses extra data · 2022-12-09
35
best: 36.56 (VLTinT (ae-test split) C3D/Ling)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Action Recognition on UCF101
Top-1 Accuracy · uses extra data · 2022-12-09
86.6
best: 92.8 (OTI(ViT-L/14))
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Action Recognition on Kinetics
Top-5 Accuracy · uses extra data · 2022-12-09
88.9
best: 95.7 (TC-CLIP)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Zero-Shot Action Recognition on HMDB51
Top-1 Accuracy · uses extra data · 2022-12-09
58.7
best: 64.7 (MOV (ViT-L/14))
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979

Natural Language Processing · 2 results

Visual Question Answering (VQA) on MSVD-QA
Accuracy · uses extra data · 2022-12-09
0.569
best: 0.61 (VLAB)
SOTA
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Visual Question Answering (VQA) on MSRVTT-QA
Accuracy · uses extra data · 2022-12-09
0.463
best: 0.496 (VLAB)
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979

Reasoning · 2 results

Video Question Answering on ActivityNet-QA
Accuracy · uses extra data · 2022-12-09
56.1
best: 61.6 (Tarsier (34B))
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979
Video Question Answering on iVQA
Accuracy · uses extra data · 2022-12-09
39
best: 40.2 (Text + Text (no Multimodal Pretext Training))
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners arXiv:2212.04979