# StreamingBench
StreamingBench evaluates Multimodal Large Language Models (MLLMs) on real-time, streaming video understanding tasks.
## Overview

As MLLMs continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before queries are made. This falls far short of the human ability to process and respond to video streams in real time, capturing the dynamic nature of multimedia content. To bridge this gap, StreamingBench introduces the first comprehensive benchmark for streaming video understanding in MLLMs.
## Key Evaluation Aspects

- **Real-time Visual Understanding**: Can the model process and respond to visual changes in real time?
- **Omni-source Understanding**: Does the model integrate visual and audio inputs synchronously in real-time video streams?
- **Contextual Understanding**: Can the model comprehend the broader context within video streams?

## Dataset Statistics

- 900 diverse videos
- 4,500 human-annotated QA pairs
- Five questions per video at different timestamps
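The per-timestamp protocol above can be sketched as follows. This is a minimal, hypothetical illustration (the `QAPair`, `evaluate`, and dummy-model names are not the benchmark's actual API): each video carries several questions anchored at timestamps, and when answering a question the model may only see frames up to that question's timestamp, never future frames.

```python
# Hypothetical sketch of a StreamingBench-style evaluation loop.
# Assumption: frames are (time, frame) pairs and a question at time t
# is answered using only frames with time <= t.
from dataclasses import dataclass

@dataclass
class QAPair:
    timestamp: float  # seconds into the stream when the question is asked
    question: str
    answer: str

def evaluate(frames, qa_pairs, model):
    """Return accuracy; the model sees only frames up to each question's timestamp."""
    correct = 0
    for qa in qa_pairs:
        visible = [f for t, f in frames if t <= qa.timestamp]
        prediction = model(visible, qa.question)
        correct += prediction == qa.answer
    return correct / len(qa_pairs)

# Toy example: a stand-in "model" that answers from the most recent visible frame.
frames = [(0.0, "red"), (5.0, "green"), (10.0, "blue")]
qa = [QAPair(6.0, "current color?", "green"),
      QAPair(11.0, "current color?", "blue")]
acc = evaluate(frames, qa, lambda visible, q: visible[-1])
print(acc)  # 1.0
```

The key design point this sketch captures is the streaming constraint: unlike offline evaluation, truncating the frame list at each question's timestamp prevents the model from conditioning on frames that have not yet "arrived".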