Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VideoChat: Chat-Centric Video Understanding

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao

2023-05-10

Tasks: Zero-Shot Video Question Answer, Question Answering, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), Video-based Generative Performance Benchmarking (Correctness of Information), Video Question Answering, Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Video-based Generative Performance Benchmarking (Detail Orientation), Video Understanding

Abstract

In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Question Answering | NExT-QA (Open-ended VideoQA) | Accuracy | 56.6 | VideoChat
Question Answering | NExT-QA (Open-ended VideoQA) | Confidence Score | 3.2 | VideoChat
Question Answering | VNBench | Accuracy | 12.4 | VideoChat2
Question Answering | MSVD-QA | Accuracy | 56.3 | Video Chat-7B
Question Answering | MSVD-QA | Confidence Score | 2.8 | Video Chat-7B
Question Answering | TGIF-QA | Accuracy | 34.4 | Video Chat-7B
Question Answering | TGIF-QA | Confidence Score | 2.3 | Video Chat-7B
Question Answering | MSRVTT-QA | Accuracy | 45 | Video Chat-7B
Question Answering | MSRVTT-QA | Confidence Score | 2.5 | Video Chat-7B
Question Answering | ActivityNet-QA | Accuracy | 26.5 | Video Chat
Question Answering | ActivityNet-QA | Confidence Score | 2.2 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.24 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 2.53 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.23 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.5 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 1.94 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | mean | 2.29 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.53 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.32 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.5 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 1.94 | Video Chat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.24 | Video Chat
Video Question Answering | ActivityNet-QA | Accuracy | 26.5 | Video Chat
Video Question Answering | ActivityNet-QA | Confidence Score | 2.2 | Video Chat
Video Question Answering | MVBench | Avg. | 35.5 | VideoChat
Video Question Answering | VNBench | Accuracy | 12.4 | VideoChat2
Video Question Answering | MSVD-QA | Accuracy | 56.3 | Video Chat-7B
Video Question Answering | MSVD-QA | Confidence Score | 2.8 | Video Chat-7B
Video Question Answering | TGIF-QA | Accuracy | 34.4 | Video Chat-7B
Video Question Answering | TGIF-QA | Confidence Score | 2.3 | Video Chat-7B
Video Question Answering | MSRVTT-QA | Accuracy | 45 | Video Chat-7B
Video Question Answering | MSRVTT-QA | Confidence Score | 2.5 | Video Chat-7B
Generative Visual Question Answering | VideoInstruct | Consistency | 2.24 | Video Chat
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 2.53 | Video Chat
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.23 | Video Chat
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.5 | Video Chat
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 1.94 | Video Chat
Generative Visual Question Answering | VideoInstruct | mean | 2.29 | Video Chat
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.53 | Video Chat
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.32 | Video Chat
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.5 | Video Chat
Generative Visual Question Answering | VideoInstruct | gpt-score | 1.94 | Video Chat
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.24 | Video Chat
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.32 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.24 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 2.53 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.23 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.5 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 1.94 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.29 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.53 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.32 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.5 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 1.94 | Video Chat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.24 | Video Chat
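
The VideoInstruct results above list five per-dimension scores and a "mean" of 2.29. The page does not say how that mean is computed; the sketch below assumes it is the unweighted arithmetic mean of the five named dimensions, rounded to two decimals. The helper name `mean_score` is hypothetical and used only for illustration.

```python
# Per-dimension VideoInstruct scores for Video Chat, copied from the results above.
scores = {
    "Consistency": 2.24,
    "Contextual Understanding": 2.53,
    "Correctness of Information": 2.23,
    "Detail Orientation": 2.5,
    "Temporal Understanding": 1.94,
}


def mean_score(values):
    """Unweighted arithmetic mean, rounded to two decimals.

    Assumption: the listing does not document how its 'mean' row is
    derived; this is one plausible reading, not a confirmed definition.
    """
    values = list(values)
    return round(sum(values) / len(values), 2)


reported_mean = 2.29  # the "mean" row in the results above
computed_mean = mean_score(scores.values())
print(computed_mean, computed_mean == reported_mean)
```

Under that assumption the five scores sum to 11.44, which gives 2.288 before rounding, consistent with the reported 2.29. Note that the separate gpt-score rows list 2.32 for one dimension where the named Correctness of Information row lists 2.23; the named-row values are the ones that reproduce the reported mean.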
