VNBench

Introduced 2024-06-13

VNBench is a synthetic benchmark for evaluating fine-grained, long-video understanding in video language models. Following the needle-in-a-haystack paradigm, it plants artificial "needles" (subtitles or images) into real "haystack" videos and then probes models with retrieval, ordering, and counting queries. Because the needles are inserted synthetically, ground truth is known exactly, and the benchmark can scale across video lengths and content types while assessing how well models locate, remember, and temporally reason about specific information in a video.

Diversity of “Needle” Types

Edit: Uses artificially added subtitles as the "needle". These subtitles are burned into video frames to simulate the scenario of finding specific textual information in a video.

Insert: Uses images as the "needle". These images are inserted as static segments between video frames to assess the model's ability to recognize and remember static images in a video.

Level Classification: Image needles are divided into two difficulty levels based on recognizability. The first level uses common objects (e.g., fruit images), while the second uses more challenging landmark or object images, increasing the task's difficulty.
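The two needle types above can be sketched in code. This is an illustrative toy model, not VNBench's actual pipeline: real insertion operates on pixel data (e.g., rendering subtitle text onto frames), while here a frame is reduced to an index plus an optional burned-in subtitle, and all function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    subtitle: str = ""  # "edit" needles are burned-in subtitle text


def edit_needle(frames, text, start, end):
    """'Edit' needle: burn subtitle text into frames [start, end)."""
    for f in frames[start:end]:
        f.subtitle = text
    return frames


def insert_needle(frames, segment, position):
    """'Insert' needle: splice a static image segment in at `position`."""
    return frames[:position] + segment + frames[position:]


# Toy 10-frame "haystack" with both needle types planted.
video = [Frame(i) for i in range(10)]
video = edit_needle(video, "the target word is apple", start=2, end=4)
segment = [Frame(100), Frame(100)]  # a static image held for two frames
video = insert_needle(video, segment, position=6)
```

Because the needle positions and contents are chosen programmatically, the ground-truth answer to any query about them is known exactly, which is what makes the benchmark scalable.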

Diversity of Video “Haystack”

Temporal Distribution: The video "haystacks" come from different data sources, with durations ranging from 10 to 180 seconds. This covers short, medium, and long videos to evaluate the model's adaptability to different video lengths.

Content Coverage: The video content spans a variety of scenes, ensuring broad evaluation coverage and diverse video sources.

Diversity of Queries

Retrieval Task: Requires the model to retrieve specific "needles" from videos, assessing the model's fine-grained understanding and information extraction ability.

Ordering Task: Requires the model to identify and order the timestamps of all inserted "needles" in the video, assessing the model's understanding of video temporal dynamics and event sequences.

Counting Task: Requires the model to count the occurrences of specific objects in the video, including recognizing and tracking repetitive patterns within and across frames. This assesses the model's understanding of spatial and temporal dimensions.
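Because every needle is planted synthetically, each of the three query types can be scored against exact ground truth. The scorers below are a minimal sketch under that assumption; the function names and the exact-match rules are illustrative, not VNBench's published evaluation code.

```python
def score_retrieval(pred: str, gold: str) -> bool:
    """Retrieval: did the model extract the needle's content?"""
    return pred.strip().lower() == gold.strip().lower()


def score_ordering(pred, gold) -> bool:
    """Ordering: needles must be listed in their order of appearance."""
    return list(pred) == list(gold)


def score_counting(pred: int, gold: int) -> bool:
    """Counting: occurrences of the target object must match exactly."""
    return pred == gold
```

Exact matching is feasible here precisely because the benchmark controls where and how often each needle appears; a natural-video benchmark would instead need human annotation or fuzzier scoring.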
