Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Vid2Seq

Reported on 32 benchmarks across 3 tasks · 2 papers · 24 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
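
The exact-name matching described in the note can be illustrated with a minimal sketch. The row layout and the `group_by_exact_name` helper are hypothetical and are not the site's actual code; the sample rows reuse values listed on this page, and the paper for the variant row is not given here, so it is left as `None`.

```python
from collections import defaultdict

# Hypothetical row layout (an assumption, not the site's real schema).
rows = [
    {"model": "Vid2Seq", "paper": "arXiv:2302.14115",
     "benchmark": "YouCook2", "metric": "CIDEr", "value": 47.1},
    {"model": "Vid2Seq", "paper": "arXiv:2309.13952",
     "benchmark": "VidChapters-7M", "metric": "CIDEr", "value": 55.7},
    # Paper not stated on this page, so it is left unset.
    {"model": "Vid2Seq (VidChapters-7M PT)", "paper": None,
     "benchmark": "ViTT", "metric": "SODA", "value": 9.1},
]


def group_by_exact_name(rows):
    """Group result rows by the exact model-name string (no normalization)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["model"]].append(row)
    return dict(groups)


groups = group_by_exact_name(rows)

# Rows from two different papers land in the same "Vid2Seq" group,
# while the differently named variant gets a group of its own.
papers_for_vid2seq = {r["paper"] for r in groups["Vid2Seq"]}
print(sorted(groups))
print(sorted(papers_for_vid2seq))
```

Under this kind of matching, two papers that both report a model as "Vid2Seq" are merged even if they trained different variants, which is why the same name can appear with different scores on one benchmark.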

Computer Vision · 33 results

  • Video Captioning on VidChapters-7M
    CIDEr · uses extra data · 2023-09-25
    55.7
    best: 120.5
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Dense Video Captioning on VidChapters-7M
    CIDEr · uses extra data · 2023-09-25
    55.7
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    CIDEr · 2023-09-25
    55.7
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    P@0.5 · 2023-09-25
    43.1
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    P@0.7 · 2023-09-25
    26.4
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    P@3s · 2023-09-25
    24
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    P@5s · 2023-09-25
    30.3
    best: 52 (Chapter-Llama)
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    R@0.5 · 2023-09-25
    48.2
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    R@0.7 · 2023-09-25
    28.5
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    R@3s · 2023-09-25
    28.5
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    R@5s · 2023-09-25
    36.4
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Chaptering on VidChapters-7M
    SODA · 2023-09-25
    0.114
    SOTA
    VidChapters-7M: Video Chapters at Scale · arXiv:2309.13952
  • Video Captioning on YouCook2
    CIDEr · uses extra data · 2023-02-27
    47.1
    best: 116.4 (HowToCaption)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on YouCook2
    SODA · uses extra data · 2023-02-27
    7.9
    best: 10.73 (HiCM²)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on ViTT
    CIDEr · uses extra data · 2023-02-27
    43.5
    best: 51.2 (HiCM²)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on ViTT
    METEOR · uses extra data · 2023-02-27
    8.5
    best: 9.6 (HiCM²)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on ViTT
    SODA · uses extra data · 2023-02-27
    0.135
    best: 9.1 (Vid2Seq (VidChapters-7M PT))
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Dense Video Captioning on YouCook2
    CIDEr · uses extra data · 2023-02-27
    47.1
    best: 71.84 (HiCM²)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Dense Video Captioning on YouCook2
    METEOR · uses extra data · 2023-02-27
    9.3
    best: 12.8 (HiCM²)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Dense Video Captioning on YouCook2
    SODA · uses extra data · 2023-02-27
    7.9
    best: 10.73 (HiCM²)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Dense Video Captioning on ViTT
    CIDEr · uses extra data · 2023-02-27
    43.5
    best: 51.2 (HiCM²)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Dense Video Captioning on ViTT
    METEOR · uses extra data · 2023-02-27
    8.5
    best: 9.6 (HiCM²)
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Dense Video Captioning on ViTT
    SODA · uses extra data · 2023-02-27
    0.135
    best: 9.1 (Vid2Seq (VidChapters-7M PT))
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Dense Video Captioning on ActivityNet Captions
    METEOR · uses extra data · 2023-02-27
    17
    SOTA
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on MSR-VTT
    CIDEr · uses extra data · 2023-02-27
    64.6
    best: 80 (mPLUG-2)
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on MSR-VTT
    METEOR · uses extra data · 2023-02-27
    30.8
    best: 38.7 (MV-GPT)
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on MSVD
    CIDEr · uses extra data · 2023-02-27
    146.2
    best: 195.6 (MaMMUT)
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on MSVD
    METEOR · uses extra data · 2023-02-27
    45.3
    best: 51.2 (VLAB)
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on YouCook2
    METEOR · uses extra data · 2023-02-27
    9.3
    best: 22.56 (UniVL + MELTR)
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on ActivityNet Captions
    CIDEr · uses extra data · 2023-02-27
    28
    best: 39.3 (VideoCoCa)
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on ActivityNet Captions
    METEOR · uses extra data · 2023-02-27
    17
    best: 17.97 (VLTinT (ae-test split) C3D/Ling)
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Dense Video Captioning on ActivityNet Captions
    CIDEr · uses extra data · 2023-02-27
    28
    best: 33.33 (GVL)
    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · arXiv:2302.14115
  • Video Captioning on VidChapters-7M
    CIDEr · uses extra data
    120.5