DeVAn
Dense Video Annotation for Video-Language Models
Videos
Introduced 2024-08-11
DeVAn is a multimodal dataset of 8.5K video clips that integrate visual and auditory information, selected from two previously published YouTube-based video datasets (YouTube-8M and YT-Temporal-1B). Over 10 months, a team of 24 human annotators (college- and graduate-level students) wrote 5 short captions (1 sentence each) and 5 long summaries (3-10 sentences each) for every clip. The resulting human-annotated dataset serves as ground truth for model training and evaluation.
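The per-clip annotation layout described above (5 one-sentence captions and 5 summaries of 3-10 sentences) can be sketched as a small record type with a sanity check. This is only an illustrative sketch: the class name `ClipAnnotation`, the `video_id` field, and the sentence-counting heuristic are assumptions made here, not the dataset's actual release format.

```python
import re
from dataclasses import dataclass
from typing import List

# Counts taken from the dataset description.
CAPTIONS_PER_CLIP = 5
SUMMARIES_PER_CLIP = 5
SUMMARY_SENTENCE_RANGE = (3, 10)


def count_sentences(text: str) -> int:
    """Rough heuristic: split on ., ! or ? followed by whitespace or end of text.

    Abbreviations such as "e.g." will be over-counted.
    """
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


@dataclass
class ClipAnnotation:
    video_id: str          # hypothetical identifier, not from the source
    captions: List[str]    # expected: 5 captions, 1 sentence each
    summaries: List[str]   # expected: 5 summaries, 3-10 sentences each

    def problems(self) -> List[str]:
        """Return human-readable deviations from the described layout."""
        issues = []
        if len(self.captions) != CAPTIONS_PER_CLIP:
            issues.append(f"expected {CAPTIONS_PER_CLIP} captions, got {len(self.captions)}")
        if len(self.summaries) != SUMMARIES_PER_CLIP:
            issues.append(f"expected {SUMMARIES_PER_CLIP} summaries, got {len(self.summaries)}")
        for i, caption in enumerate(self.captions):
            if count_sentences(caption) != 1:
                issues.append(f"caption {i} is not a single sentence")
        low, high = SUMMARY_SENTENCE_RANGE
        for i, summary in enumerate(self.summaries):
            n = count_sentences(summary)
            if not low <= n <= high:
                issues.append(f"summary {i} has {n} sentences, expected {low}-{high}")
        return issues
```

At 5 captions and 5 summaries per clip, 8.5K clips correspond to roughly 42.5K captions and 42.5K summaries in total.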