Dataset Card for SVBench

This dataset card aims to provide a comprehensive overview of the SVBench dataset, including its purpose, structure, and sources. For details, see our Project, Paper and GitHub repository.

Dataset Details

Dataset Description

SVBench is the first benchmark specifically designed to evaluate long-context streaming video understanding through temporal multi-turn question-answering (QA) chains. It addresses the limitations of existing video understanding benchmarks by emphasizing continuous temporal reasoning over streaming video data.

The dataset includes:

1,353 streaming videos (average duration: 1–8 minutes)
49,979 QA pairs organized into multi-turn dialogues
Temporal linkages between QA chains to simulate real-world streaming scenarios

Languages

NLP: English & Chinese (bilingual annotations)

License

Apache-2.0

Dataset Sources

YT-Temporal-1B|YouCook2|MovieChat|Panda-70M|Ego4D|ActivityNet.

Uses

Download SVBench dataset from Hugging Face:

git clone https://huggingface.co/yzy666/SVBench

Intended Use

Evaluate long-context reasoning capabilities of Large Vision-Language Models (LVLMs) in streaming video scenarios.
Benchmark temporal understanding through multi-turn dialogues.
Support research in streaming video QA, activity forecasting, and interactive AI assistants.

Direct Use

Training or fine-tuning LVLMs for streaming video tasks.
Testing model robustness in handling dynamic, time-sensitive queries.
Comparative analysis of open-source vs. closed-source LVLMs.

Restrictions

Videos and annotations are for research purposes only.
Users must comply with the original licenses of datasets.

Dataset Structure

Folder Structure Tree:

SVBench/
├── Con/
│   ├── Con_EN/
│   └── Con_ZH/
├── Dialogue/
│   ├── Dialogue_EN/
│   └── Dialogue_ZH/
├── Meta/
│   ├── Meta_EN/
│   └── Meta_ZH/
├── Path/
├── Src/
├── Streaming/
│   ├── Streaming_EN/
│   └── Streaming_ZH/
├── Video/
└── Your_Model_Name/
    ├── dialogue/
    │   └── --NDulaHyrE.json
    └── streaming/
        └── -4h8cuweoKo.json

Dataset Division:
- Training Set: 42,605 QA pairs and 1,153 videos
- Testing Set: 7,374 QA pairs and 200 videos

Data Fields

| Key | Description | | ------------------------- | ------------------------------------------------------------ | | Video_Name | Unique identifier or title of the video file. | | Sort_of_Set | Category or type of the dataset subset (e.g., "Train", "Test"). | | Path_of_QandA | File path to the question-answer pairs. | | Path_of_Con | Path to relationship files. | | Path_of_StreamingPathData | Path to Q&A sequence for streaming evaluation. Each streaming path contains all Q&A sequences in streaming order within the path. | | Path_of_Dialogue | Path to Q&A sequence for dialogue evaluation. Each dialogue contains all Q&A sequences in order within the dialogue. | | Path_of_Streaming | Path to Q&A sequence for streaming evaluation represented only by serial numbers (e.g., the path [[0,0],[1,2],[2,3]...] indicates starting from the 1st question of the 1st chain, then proceeding to the 3rd question of the 2nd chain, and then to the 4th question of the 3rd chain, and so on). | | Path_of_Video | Absolute file path to the raw video file. | | Video_Duration | Total duration of the video in seconds. | | Source_of_Dataset | Origin of the dataset. |

Leaderboard Submission

Submit results via https://forms.gle/Rmi6u4WGhyEZ2X7g8.

Submission Instructions:

Save result files following the case structure under the SVBench/[Your_Model_Name] directory
Compress the entire [Your_Model_Name] folder into a ZIP archive
Upload the generated ZIP file through this submission portal

Important Notes:

Maintain the exact directory hierarchy: [Your_Model_Name]/[dialogue/streaming]/[Video_Name.json]
Ensure your model name follows exact capitalization (case-sensitive)
Package only the required result files (avoid including source code/executables)

Submission Form

Dataset Annotation

See GitHub repository for details.

Semi-automated annotation using a hybrid approach:

Automatic QA Generation: Leveraging GPT-4 to generate dialogue chains based on video transcripts.
Human Verification: Annotators validate temporal links and refine QA pairs for consistency.
Temporal Linking: Artificial alignment of QA chains across video segments.
Quality Control: Inter-annotator agreement >90% for temporal links.

Citation

If you find our data useful, please consider citing our work!

@article{yang2025svbench,
  title={SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding},
  author={Yang, Zhenyu and Hu, Yuhang and Du, Zemin and Xue, Dizhan and Qian, Shengsheng and Wu, Jiahong and Yang, Fan and Dong, Weiming and Xu, Changsheng},
  journal={arXiv preprint arXiv:2502.10810},
  year={2025}
}

Dataset Card for SVBench

This dataset card aims to provide a comprehensive overview of the SVBench dataset, including its purpose, structure, and sources. For details, see our Project, Paper and GitHub repository.

Dataset Details

Dataset Description

The dataset includes:

1,353 streaming videos (average duration: 1–8 minutes)
49,979 QA pairs organized into multi-turn dialogues
Temporal linkages between QA chains to simulate real-world streaming scenarios

Languages

NLP: English & Chinese (bilingual annotations)

License

Apache-2.0

Dataset Sources

YT-Temporal-1B|YouCook2|MovieChat|Panda-70M|Ego4D|ActivityNet.

Uses

Download SVBench dataset from Hugging Face:

git clone https://huggingface.co/yzy666/SVBench

Intended Use

Evaluate long-context reasoning capabilities of Large Vision-Language Models (LVLMs) in streaming video scenarios.
Benchmark temporal understanding through multi-turn dialogues.
Support research in streaming video QA, activity forecasting, and interactive AI assistants.

Direct Use

Training or fine-tuning LVLMs for streaming video tasks.
Testing model robustness in handling dynamic, time-sensitive queries.
Comparative analysis of open-source vs. closed-source LVLMs.

Restrictions

Videos and annotations are for research purposes only.
Users must comply with the original licenses of datasets.

Dataset Structure

Folder Structure Tree:

SVBench/
├── Con/
│   ├── Con_EN/
│   └── Con_ZH/
├── Dialogue/
│   ├── Dialogue_EN/
│   └── Dialogue_ZH/
├── Meta/
│   ├── Meta_EN/
│   └── Meta_ZH/
├── Path/
├── Src/
├── Streaming/
│   ├── Streaming_EN/
│   └── Streaming_ZH/
├── Video/
└── Your_Model_Name/
    ├── dialogue/
    │   └── --NDulaHyrE.json
    └── streaming/
        └── -4h8cuweoKo.json

Dataset Division:
- Training Set: 42,605 QA pairs and 1,153 videos
- Testing Set: 7,374 QA pairs and 200 videos

Data Fields

Leaderboard Submission

Submit results via https://forms.gle/Rmi6u4WGhyEZ2X7g8.

Submission Instructions:

Save result files following the case structure under the SVBench/[Your_Model_Name] directory
Compress the entire [Your_Model_Name] folder into a ZIP archive
Upload the generated ZIP file through this submission portal

Important Notes:

Maintain the exact directory hierarchy: [Your_Model_Name]/[dialogue/streaming]/[Video_Name.json]
Ensure your model name follows exact capitalization (case-sensitive)
Package only the required result files (avoid including source code/executables)

Submission Form

Dataset Annotation

See GitHub repository for details.

Semi-automated annotation using a hybrid approach:

Automatic QA Generation: Leveraging GPT-4 to generate dialogue chains based on video transcripts.
Human Verification: Annotators validate temporal links and refine QA pairs for consistency.
Temporal Linking: Artificial alignment of QA chains across video segments.
Quality Control: Inter-annotator agreement >90% for temporal links.

Citation

If you find our data useful, please consider citing our work!

@article{yang2025svbench,
  title={SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding},
  author={Yang, Zhenyu and Hu, Yuhang and Du, Zemin and Xue, Dizhan and Qian, Shengsheng and Wu, Jiahong and Yang, Fan and Dong, Weiming and Xu, Changsheng},
  journal={arXiv preprint arXiv:2502.10810},
  year={2025}
}