VidChapters-7M
Modalities: Texts, Videos · MIT · Introduced: 2023-09-25
VidChapters-7M is a dataset of 817K user-chaptered videos comprising 7M chapters in total. It is created automatically and at scale by scraping user-annotated chapters from online videos, requiring no additional manual annotation. The dataset is designed for training and evaluating models on video chapter generation (with or without ground-truth boundaries) and video chapter grounding, as well as for video-language pretraining.
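User-annotated chapters are typically a list of (start time, title) marks, where each chapter implicitly ends where the next begins. The sketch below illustrates this structure with hypothetical field layouts; it does not reflect the dataset's actual schema or file format.

```python
from typing import List, Tuple

def chapters_to_segments(marks: List[Tuple[float, str]],
                         video_duration: float) -> List[Tuple[float, float, str]]:
    """Convert (start_time, title) chapter marks into (start, end, title)
    segments; each chapter ends where the next one begins, and the last
    chapter ends at the video's duration."""
    segments = []
    for i, (start, title) in enumerate(marks):
        end = marks[i + 1][0] if i + 1 < len(marks) else video_duration
        segments.append((start, end, title))
    return segments

# Illustrative example: a 6-minute video with three user-annotated chapters.
marks = [(0.0, "Intro"), (45.0, "Setup"), (210.0, "Demo")]
segments = chapters_to_segments(marks, 360.0)
# → [(0.0, 45.0, 'Intro'), (45.0, 210.0, 'Setup'), (210.0, 360.0, 'Demo')]
```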
Benchmarks
- Dense Video Captioning: CIDEr
- Language-Based Temporal Localization: R1@0.9, R@10s
- Temporal Localization: R1@0.9, R@10s
- Video Captioning: CIDEr
- Video Chaptering: CIDEr, SODA, P@0.5, P@0.7, P@3s, P@5s, R@0.5, R@0.7, R@3s, R@5s
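Several of the chaptering metrics above (P@3s, P@5s, R@3s, R@5s) score predicted chapter boundaries against ground truth within a time tolerance. The sketch below shows one plausible reading of such metrics: a predicted boundary counts if it lies within the tolerance of some ground-truth boundary, and vice versa for recall. The benchmark's exact matching protocol (e.g. one-to-one matching of boundaries) may differ.

```python
from typing import List, Tuple

def boundary_precision_recall(pred: List[float], gt: List[float],
                              tol: float) -> Tuple[float, float]:
    """Precision: fraction of predicted boundaries within tol seconds of
    any ground-truth boundary. Recall: fraction of ground-truth boundaries
    within tol seconds of any prediction."""
    matched_pred = sum(any(abs(p - g) <= tol for g in gt) for p in pred)
    matched_gt = sum(any(abs(p - g) <= tol for p in pred) for g in gt)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gt / len(gt) if gt else 0.0
    return precision, recall

# Illustrative boundaries (in seconds): two of three predictions fall
# within 5 s of a ground-truth boundary, so P@5s = R@5s = 2/3 here.
p, r = boundary_precision_recall([10.0, 52.0, 200.0], [9.0, 50.0, 120.0], tol=5.0)
```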