# StreamingBench
StreamingBench evaluates Multimodal Large Language Models (MLLMs) on real-time, streaming video understanding tasks.
## Overview

As MLLMs continue to advance, they remain largely focused on offline video comprehension, where all frames are pre-loaded before queries are made. This falls far short of the human ability to process and respond to video streams in real time, capturing the dynamic nature of multimedia content. To bridge this gap, StreamingBench introduces the first comprehensive benchmark for streaming video understanding in MLLMs.
## Key Evaluation Aspects

- **Real-time Visual Understanding**: Can the model process and respond to visual changes in real time?
- **Omni-source Understanding**: Does the model integrate visual and audio inputs synchronously in real-time video streams?
- **Contextual Understanding**: Can the model comprehend the broader context within video streams?

## Dataset Statistics

- 900 diverse videos
- 4,500 human-annotated QA pairs
- Five questions per video at different timestamps
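The per-timestamp protocol above can be sketched as follows. This is a minimal, hypothetical illustration (the `QAPair`, `evaluate`, and dummy-model names are not the benchmark's actual API): each video carries several questions anchored at timestamps, and when answering a question the model may only see frames up to that question's timestamp, never future frames.

```python
# Hypothetical sketch of a StreamingBench-style evaluation loop.
# Assumption: frames are (time, frame) pairs and a question at time t
# is answered using only frames with time <= t.
from dataclasses import dataclass

@dataclass
class QAPair:
    timestamp: float  # seconds into the stream when the question is asked
    question: str
    answer: str

def evaluate(frames, qa_pairs, model):
    """Return accuracy; the model sees only frames up to each question's timestamp."""
    correct = 0
    for qa in qa_pairs:
        visible = [f for t, f in frames if t <= qa.timestamp]
        prediction = model(visible, qa.question)
        correct += prediction == qa.answer
    return correct / len(qa_pairs)

# Toy example: a stand-in "model" that answers from the most recent visible frame.
frames = [(0.0, "red"), (5.0, "green"), (10.0, "blue")]
qa = [QAPair(6.0, "current color?", "green"),
      QAPair(11.0, "current color?", "blue")]
acc = evaluate(frames, qa, lambda visible, q: visible[-1])
print(acc)  # 1.0
```

The key design point this sketch captures is the streaming constraint: unlike offline evaluation, truncating the frame list at each question's timestamp prevents the model from conditioning on frames that have not yet "arrived".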