Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

GVL

Reported on 16 benchmarks across 3 tasks · 1 paper · 3 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
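As a rough illustration of what exact-name matching implies, the sketch below groups result rows by the literal model-name string. The row layout and the `group_by_exact_name` helper are hypothetical and not this site's actual pipeline; the scores are the ActivityNet Captions values listed on this page.

```python
from collections import defaultdict

# Hypothetical rows: (model name, benchmark, metric, score).
# Scores are the ActivityNet Captions values shown on this page.
rows = [
    ("GVL", "ActivityNet Captions", "R@1,IoU=0.5", 49.18),
    ("GVL (paragraph-level)", "ActivityNet Captions", "R@1,IoU=0.5", 60.67),
    ("GVL", "ActivityNet Captions", "R@1,IoU=0.7", 29.69),
    ("GVL (paragraph-level)", "ActivityNet Captions", "R@1,IoU=0.7", 38.55),
]


def group_by_exact_name(rows):
    """Group rows by the literal name string, with no normalization."""
    grouped = defaultdict(list)
    for name, benchmark, metric, score in rows:
        grouped[name].append((benchmark, metric, score))
    return dict(grouped)


grouped = group_by_exact_name(rows)

# "GVL" and "GVL (paragraph-level)" land in separate groups.
gvl_scores = [score for _, _, score in grouped["GVL"]]
```

Under this assumption, a variant such as "GVL (paragraph-level)" is listed as a separate model, while two unrelated models that share the plain name "GVL" would be merged into one entry.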

Computer Vision · 16 results

Video Captioning on ActivityNet Captions
SODA · 2023-03-11
7.11
SOTA
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Dense Video Captioning on ActivityNet Captions
CIDEr · 2023-03-11
33.33
SOTA
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Dense Video Captioning on ActivityNet Captions
SODA · 2023-03-11
7.11
SOTA
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video on TACoS
R@1,IoU=0.3 · 2023-03-11
45.92
best: 58.1 (SG-DETR (w/ PT))
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video on TACoS
R@1,IoU=0.5 · 2023-03-11
34.57
best: 46.79 (DeCafNet)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video on ActivityNet Captions
R@1,IoU=0.5 · 2023-03-11
49.18
best: 60.67 (GVL (paragraph-level))
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video on ActivityNet Captions
R@1,IoU=0.7 · 2023-03-11
29.69
best: 38.55 (GVL (paragraph-level))
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video Captioning on YouCook2
CIDEr · 2023-03-11
26.52
best: 116.4 (HowToCaption)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video Captioning on YouCook2
METEOR · 2023-03-11
5.01
best: 22.56 (UniVL + MELTR)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video Captioning on YouCook2
SODA · 2023-03-11
4.91
best: 10.73 (HiCM²)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video Captioning on ActivityNet Captions
CIDEr · 2023-03-11
33.33
best: 39.3 (VideoCoCa)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Video Captioning on ActivityNet Captions
METEOR · 2023-03-11
10.03
best: 17.97 (VLTinT (ae-test split) C3D/Ling)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Dense Video Captioning on YouCook2
CIDEr · 2023-03-11
26.52
best: 71.84 (HiCM²)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Dense Video Captioning on YouCook2
METEOR · 2023-03-11
5.01
best: 12.8 (HiCM²)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Dense Video Captioning on YouCook2
SODA · 2023-03-11
4.91
best: 10.73 (HiCM²)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378
Dense Video Captioning on ActivityNet Captions
METEOR · 2023-03-11
10.03
best: 17 (Vid2Seq)
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos arXiv:2303.06378