Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin

2024-06-12
Zero-Shot Video Question Answer, Question Answering, cross-modal alignment, Video Question Answering, Video Understanding, Language Modelling

Abstract

Benefiting from the advancements in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the human memory mechanism. Our model is able to process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, both of which are closely tied to understanding online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenarios, we propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method in this challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks and achieve state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at https://invinciblewyq.github.io/vstream-page/

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | NExT-QA (Open-ended VideoQA) | Accuracy | 61.6 | Flash-VStream |
| Question Answering | NExT-QA (Open-ended VideoQA) | Confidence Score | 3.4 | Flash-VStream |
| Question Answering | MSVD-QA | Accuracy | 80.3 | Flash-VStream |
| Question Answering | MSVD-QA | Confidence Score | 3.9 | Flash-VStream |
| Question Answering | MSRVTT-QA | Accuracy | 72.4 | Flash-VStream |
| Question Answering | MSRVTT-QA | Confidence Score | 3.4 | Flash-VStream |
| Question Answering | ActivityNet-QA | Accuracy | 51.9 | Flash-VStream |
| Question Answering | ActivityNet-QA | Confidence Score | 3.4 | Flash-VStream |
| Video Question Answering | OVBench | AVG | 31.2 | Flash-Vstream (7B) |
| Video Question Answering | MSVD-QA | Accuracy | 80.3 | Flash-VStream |
| Video Question Answering | MSVD-QA | Confidence Score | 3.9 | Flash-VStream |
| Video Question Answering | MSRVTT-QA | Accuracy | 72.4 | Flash-VStream |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.4 | Flash-VStream |
| Video Question Answering | ActivityNet-QA | Accuracy | 51.9 | Flash-VStream |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.4 | Flash-VStream |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)