Spatio-Temporal Video Grounding

12 benchmarks22 papers

Spatio-temporal video grounding is a computer vision and natural language processing (NLP) task that involves linking textual descriptions to specific spatio-temporal regions or moments in a video. In other words, it aims to determine which parts of a video correspond to a given textual query or description. This task is essential for various applications, including video summarization, content-based video retrieval, video captioning, and more.

Benchmarks