Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VideoCoCa

Reported on 41 benchmarks across 5 tasks · 1 paper · 28 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
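
To illustrate the note above, here is a minimal sketch of exact-name matching. The records, names, and helper functions are hypothetical and are not taken from the site's data or code. It only shows how grouping by the exact name string can merge results from different papers under one model entry while keeping a differently spelled variant separate.

```python
from collections import defaultdict

# Hypothetical records: (model_name, paper_id, benchmark). Illustrative only, not real data.
records = [
    ("ModelX", "paper-A", "Benchmark-1"),
    ("ModelX", "paper-B", "Benchmark-2"),
    ("ModelX-Large", "paper-A", "Benchmark-1"),
]


def group_by_exact_name(rows):
    """Group rows by the exact model-name string (case- and whitespace-sensitive)."""
    groups = defaultdict(list)
    for name, paper, benchmark in rows:
        groups[name].append((paper, benchmark))
    return dict(groups)


def names_shared_across_papers(groups):
    """Return the names whose grouped results come from more than one paper."""
    return sorted(
        name for name, rows in groups.items()
        if len({paper for paper, _ in rows}) > 1
    )


groups = group_by_exact_name(records)
shared = names_shared_across_papers(groups)
```

In this sketch, a name that appears in `shared` is one whose listed results may describe different model variants, which is the case the note warns about.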

Computer Vision · 37 results

  • Video Captioning on VATEX
    BLEU-4 · uses extra data · 2022-12-09
    39.7
    best: 45.6 (VALOR)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on VATEX
    CIDEr · uses extra data · 2022-12-09
    77.8
    best: 99.5 (VAST)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on VATEX
    ROUGE-L · uses extra data · 2022-12-09
    54.5
    best: 57.4 (VALOR)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on ActivityNet Captions
    BLEU-4 · uses extra data · 2022-12-09
    14.7
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on ActivityNet Captions
    CIDEr · uses extra data · 2022-12-09
    39.3
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Action Recognition on UCF101
    Top-5 Accuracy · uses extra data · 2022-12-09
    98.4
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Action Recognition on Kinetics
    Top-1 Accuracy · uses extra data · 2022-12-09
    70.1
    best: 78.1 (TC-CLIP)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Action Recognition on Charades
    mAP · uses extra data · 2022-12-09
    25.8
    best: 35.59 (MSQNet)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Action Recognition on HMDB51
    Top-5 Accuracy · uses extra data · 2022-12-09
    84.5
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on VATEX
    text-to-video R@1 · uses extra data · 2022-12-09
    53.2
    best: 83.9 (GRAM)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on VATEX
    text-to-video R@10 · uses extra data · 2022-12-09
    90.1
    best: 99.5 (GRAM)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on VATEX
    text-to-video R@5 · uses extra data · 2022-12-09
    83.3
    best: 94 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on VATEX
    video-to-text R@1 · uses extra data · 2022-12-09
    73.6
    best: 85.4 (InternVideo2-1B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on VATEX
    video-to-text R@10 · uses extra data · 2022-12-09
    97.2
    best: 99.3 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on VATEX
    video-to-text R@5 · uses extra data · 2022-12-09
    93.2
    best: 97.9 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on MSR-VTT-full
    text-to-video R@1 · uses extra data · 2022-12-09
    34.3
    best: 46.3 (InternVL-G)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on MSR-VTT-full
    text-to-video R@10 · uses extra data · 2022-12-09
    67
    best: 79.6 (InternVL-G)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on MSR-VTT-full
    text-to-video R@5 · uses extra data · 2022-12-09
    57.8
    best: 70.5 (InternVL-G)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on MSR-VTT-full
    video-to-text R@1 · uses extra data · 2022-12-09
    64.7
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on MSR-VTT-full
    video-to-text R@10 · uses extra data · 2022-12-09
    91.4
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on MSR-VTT-full
    video-to-text R@5 · uses extra data · 2022-12-09
    85.2
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on ActivityNet
    text-to-video R@1 · uses extra data · 2022-12-09
    34.5
    best: 63.2 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on ActivityNet
    text-to-video R@10 · uses extra data · 2022-12-09
    76.6
    best: 92.5 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on ActivityNet
    text-to-video R@5 · uses extra data · 2022-12-09
    63.2
    best: 85.6 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on ActivityNet
    video-to-text R@1 · uses extra data · 2022-12-09
    33
    best: 56.5 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on ActivityNet
    video-to-text R@10 · uses extra data · 2022-12-09
    75.3
    best: 90.3 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Video Retrieval on ActivityNet
    video-to-text R@5 · uses extra data · 2022-12-09
    61.6
    best: 82.8 (InternVideo2-6B)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on MSR-VTT
    BLEU-4 · uses extra data · 2022-12-09
    53.8
    best: 57.8 (mPLUG-2)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on MSR-VTT
    CIDEr · uses extra data · 2022-12-09
    73.2
    best: 80 (mPLUG-2)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on MSR-VTT
    ROUGE-L · uses extra data · 2022-12-09
    68
    best: 70.1 (mPLUG-2)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on YouCook2
    BLEU-4 · uses extra data · 2022-12-09
    14.2
    best: 18.2 (VAST)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on YouCook2
    CIDEr · uses extra data · 2022-12-09
    1.28
    best: 116.4 (HowToCaption)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on YouCook2
    ROUGE-L · uses extra data · 2022-12-09
    37.7
    best: 47.04 (UniVL + MELTR)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Captioning on ActivityNet Captions
    ROUGE-L · uses extra data · 2022-12-09
    35
    best: 36.56 (VLTinT (ae-test split) C3D/Ling)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Action Recognition on UCF101
    Top-1 Accuracy · uses extra data · 2022-12-09
    86.6
    best: 92.8 (OTI(ViT-L/14))
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Action Recognition on Kinetics
    Top-5 Accuracy · uses extra data · 2022-12-09
    88.9
    best: 95.7 (TC-CLIP)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Zero-Shot Action Recognition on HMDB51
    Top-1 Accuracy · uses extra data · 2022-12-09
    58.7
    best: 64.7 (MOV (ViT-L/14))
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979

Natural Language Processing · 2 results

  • Visual Question Answering (VQA) on MSVD-QA
    Accuracy · uses extra data · 2022-12-09
    0.569
    best: 0.61 (VLAB)
    SOTA
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Visual Question Answering (VQA) on MSRVTT-QA
    Accuracy · uses extra data · 2022-12-09
    0.463
    best: 0.496 (VLAB)
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979

Reasoning · 2 results

  • Video Question Answering on ActivityNet-QA
    Accuracy · uses extra data · 2022-12-09
    56.1
    best: 61.6 (Tarsier (34B))
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979
  • Video Question Answering on iVQA
    Accuracy · uses extra data · 2022-12-09
    39
    best: 40.2 (Text + Text (no Multimodal Pretext Training))
    VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners · arXiv:2212.04979