OVBench

Texts · Videos · Introduced 2024-12-31

OVBench is a benchmark tailored for real-time video understanding:

  • Memory, Perception, and Prediction of Temporal Contexts: Questions are framed relative to the present state of entities, requiring models to memorize past context, perceive present context, and predict future context.
  • Dynamic Spatio-temporal Interaction: The benchmark demands precise real-time interactions with video content, where actions, objects, and events must be understood in the context of their spatial and temporal relationships.
  • Contextual Awareness at Specific Moments: Real-time questions are contextual: the correct answer changes with the timestamp at which the question is asked, so models must track how the temporal context evolves.
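
To make the timestamp dependence concrete, here is a minimal Python sketch of the idea that a question asked at time t may only draw on frames up to t. The names `TimedQuestion`, `ask_time`, and `visible_frames` are hypothetical and chosen for illustration; they are not taken from OVBench's actual data format or evaluation code.

```python
from dataclasses import dataclass


@dataclass
class TimedQuestion:
    """Hypothetical record: a question tied to the moment it is asked."""
    text: str
    ask_time: float  # seconds into the video stream


def visible_frames(frame_times, ask_time):
    """Return the timestamps of frames available when the question is asked.

    Assumption: in a real-time setting, frames after ask_time are not
    yet observable, so they are excluded from the context.
    """
    return [t for t in frame_times if t <= ask_time]


# One frame per second for a six-second clip (illustrative values only).
frame_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

# The same question text asked at two different moments.
q_early = TimedQuestion("What is the person holding right now?", ask_time=2.0)
q_late = TimedQuestion("What is the person holding right now?", ask_time=4.5)

early_context = visible_frames(frame_times, q_early.ask_time)
late_context = visible_frames(frame_times, q_late.ask_time)
```

Under these assumptions, the identical question text is paired with a different context at each timestamp, which is why its answer can differ between the two moments.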