Papers With Code 2

Data sourced from the PWC Archive (CC-BY-SA 4.0).

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem

Published 2021-12-01 · CVPR 2022
Tasks: Moment Retrieval, Natural Language Moment Retrieval
Links: Paper · PDF · Code (official)

Abstract

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of videos and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours. We have released MAD's data and baselines code at https://github.com/Soldelli/MAD.

Results

Task: Video · Dataset: MAD

| Metric          | CLIP  | VLG-Net | Random Chance |
|-----------------|-------|---------|---------------|
| R@1, IoU=0.1    | 6.57  | 3.5     | 0.09          |
| R@1, IoU=0.3    | 3.13  | 2.63    | 0.04          |
| R@1, IoU=0.5    | 1.39  | 1.61    | 0.01          |
| R@5, IoU=0.1    | 15.05 | 11.74   | 0.44          |
| R@5, IoU=0.3    | 9.85  | 9.49    | 0.19          |
| R@5, IoU=0.5    | 5.44  | 6.23    | 0.07          |
| R@10, IoU=0.1   | 20.26 | 18.32   | 0.88          |
| R@10, IoU=0.3   | 14.13 | 15.2    | 0.39          |
| R@10, IoU=0.5   | 8.38  | 10.18   | 0.14          |
| R@50, IoU=0.1   | 37.92 | 38.41   | 4.33          |
| R@50, IoU=0.3   | 28.71 | 33.68   | 1.92          |
| R@50, IoU=0.5   | 18.8  | 25.33   | 0.71          |
| R@100, IoU=0.1  | 47.73 | 49.65   | 8.47          |
| R@100, IoU=0.3  | 36.98 | 43.95   | 3.8           |
| R@100, IoU=0.5  | 24.99 | 34.18   | 1.4           |
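The table above reports recall at K (R@K) under temporal IoU thresholds: a query counts as correctly grounded if at least one of the model's top-K proposed moments overlaps the ground-truth moment with IoU at or above the threshold. A minimal sketch of how such a metric is typically computed (function and variable names are illustrative, not taken from the MAD baselines code):

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # For overlapping segments the spanned interval equals the union;
    # for disjoint segments the intersection is 0, so the ratio is 0 either way.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0

def recall_at_k(ranked_moments, gt, k, iou_threshold):
    """1 if any of the top-k ranked moments matches gt above the IoU threshold, else 0."""
    return int(any(temporal_iou(m, gt) >= iou_threshold for m in ranked_moments[:k]))
```

Averaging `recall_at_k` over all test queries yields numbers comparable to the R@K, IoU=θ entries above; the near-zero Random Chance scores reflect how hard it is to hit a seconds-long moment inside a movie that can last up to three hours.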

Related Papers

- DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding (2025-06-16)
- DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos (2025-05-22)
- Retrieval Augmented Generation Evaluation for Health Documents (2025-05-07)
- Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection (2025-04-20)
- Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking (2025-04-11)
- TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos (2025-03-09)
- MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval (2025-02-18)
- Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval (2025-02-12)