
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li

Date: 2023-09-27
Venue: CVPR 2024
Tasks: Zero-Shot Video Question Answer; Video-based Generative Performance Benchmarking; Video-based Generative Performance Benchmarking (Contextual Understanding); Zero-Shot Video Retrieval; Video-based Generative Performance Benchmarking (Correctness of Information); Video Question Answering; Video-based Generative Performance Benchmarking (Consistency); Video-based Generative Performance Benchmarking (Temporal Understanding); Video-based Generative Performance Benchmarking (Detail Orientation); Video Understanding

Abstract

Recent progress in Large Language Models (LLMs) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of the LLM and visual backbone, minimal GPU memory is left for effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose the Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder; the branch is tuned while the backbone is kept frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into all image conversation models that use this version of CLIP, enabling video conversations without the need for video instructions. In addition, we develop a unique asymmetric token masking strategy inside the branch, with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks while using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art results in video chatting with video instruction tuning, outperforming previous SOTAs by a large margin.
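The idea of a trainable temporal branch running alongside a frozen per-frame encoder can be illustrated with a toy sketch. This is not the authors' architecture, which this page does not describe: `frozen_frame_encoder`, `TemporalBranch`, `encode_video`, the mean pooling, the additive fusion, and the zero initialisation are all hypothetical choices made purely for illustration.

```python
from typing import List

Vector = List[float]


def frozen_frame_encoder(frame: Vector) -> Vector:
    """Stand-in for a pretrained image encoder; it has no trainable state."""
    # Hypothetical fixed transform: halve every input feature.
    return [0.5 * x for x in frame]


class TemporalBranch:
    """Hypothetical trainable branch that mixes information across frames."""

    def __init__(self, num_frames: int) -> None:
        # One learnable weight per frame. Zero initialisation (an assumption)
        # means the branch contributes nothing before training.
        self.frame_weights = [0.0] * num_frames

    def __call__(self, frame_features: List[Vector]) -> Vector:
        dim = len(frame_features[0])
        # Weighted sum over frames, computed separately for each feature.
        return [
            sum(w * feats[d] for w, feats in zip(self.frame_weights, frame_features))
            for d in range(dim)
        ]


def encode_video(frames: List[Vector], branch: TemporalBranch) -> Vector:
    """Combine the frozen per-frame path with the trainable temporal path."""
    frame_features = [frozen_frame_encoder(f) for f in frames]
    dim = len(frame_features[0])
    # Image-model path: average the per-frame features.
    pooled = [sum(f[d] for f in frame_features) / len(frame_features) for d in range(dim)]
    # Temporal path: add the branch output to the pooled features.
    temporal = branch(frame_features)
    return [p + t for p, t in zip(pooled, temporal)]
```

In this toy setup only `TemporalBranch.frame_weights` would ever be updated during training. With the weights at zero, `encode_video` returns exactly the frame-averaged output of the frozen encoder, which is one simple way such a branch could be attached without disturbing the pretrained model.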

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Question Answering | MSVD-QA | Accuracy | 67 | BT-Adapter (zero-shot)
Question Answering | MSVD-QA | Confidence Score | 3.6 | BT-Adapter (zero-shot)
Question Answering | MSRVTT-QA | Accuracy | 51.2 | BT-Adapter (zero-shot)
Question Answering | MSRVTT-QA | Confidence Score | 2.9 | BT-Adapter (zero-shot)
Question Answering | ActivityNet-QA | Accuracy | 46.1 | BT-Adapter (zero-shot)
Question Answering | ActivityNet-QA | Confidence Score | 3.2 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.46 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.27 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.68 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.69 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.34 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | mean | 2.69 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.2 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 2.89 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.16 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.46 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.13 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | mean | 2.46 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.27 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.89 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.68 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.16 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.69 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.46 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.34 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.13 | BT-Adapter (zero-shot)
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.46 | BT-Adapter
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.2 | BT-Adapter (zero-shot)
Video Question Answering | ActivityNet-QA | Accuracy | 46.1 | BT-Adapter (zero-shot)
Video Question Answering | ActivityNet-QA | Confidence Score | 3.6 | BT-Adapter (zero-shot)
Video Question Answering | MSVD-QA | Accuracy | 67 | BT-Adapter (zero-shot)
Video Question Answering | MSVD-QA | Confidence Score | 3.6 | BT-Adapter (zero-shot)
Video Question Answering | MSRVTT-QA | Accuracy | 51.2 | BT-Adapter (zero-shot)
Video Question Answering | MSRVTT-QA | Confidence Score | 2.9 | BT-Adapter (zero-shot)
Video Question Answering | ActivityNet-QA | Confidence Score | 3.2 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | Consistency | 2.46 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.27 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.68 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.69 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.34 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | mean | 2.69 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | Consistency | 2.2 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 2.89 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.16 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.46 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.13 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | mean | 2.46 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.27 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.89 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.68 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.16 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.69 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.46 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.34 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.13 | BT-Adapter (zero-shot)
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.46 | BT-Adapter
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.2 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.68 | BT-Adapter
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.16 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.46 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.27 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.68 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.69 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.34 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.69 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.2 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 2.89 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.16 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.46 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.13 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.46 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.27 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.89 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.68 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.16 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.69 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.46 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.34 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.13 | BT-Adapter (zero-shot)
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.46 | BT-Adapter
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.2 | BT-Adapter (zero-shot)
VCGBench-Diverse | VideoInstruct | Consistency | 2.27 | BT-Adapter
VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.59 | BT-Adapter
VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.2 | BT-Adapter
VCGBench-Diverse | VideoInstruct | Dense Captioning | 1.03 | BT-Adapter
VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.62 | BT-Adapter
VCGBench-Diverse | VideoInstruct | Reasoning | 3.62 | BT-Adapter
VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.35 | BT-Adapter
VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.29 | BT-Adapter
VCGBench-Diverse | VideoInstruct | mean | 2.19 | BT-Adapter
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 40.9 | BT-Adapter
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 73.5 | BT-Adapter
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 64.7 | BT-Adapter
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 35.6 | BT-Adapter
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 72.6 | BT-Adapter
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 61.9 | BT-Adapter
Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 19.5 | BT-Adapter
Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 45 | BT-Adapter
Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 35.9 | BT-Adapter
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 37 | BT-Adapter
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 78.9 | BT-Adapter
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 66.7 | BT-Adapter
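
For readers unfamiliar with the text-to-video R@K metric in the retrieval rows above, the following is a minimal sketch of how Recall@K is commonly computed from a text-to-video similarity matrix. The function name `recall_at_k`, the matrix layout, and the convention that query i matches video i are assumptions for illustration; this is not the evaluation code used to produce the numbers listed here.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the correct video for query i sits at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up scores, not data from the paper).
toy_similarity = [
    [0.9, 0.1, 0.2],  # correct video is ranked first
    [0.8, 0.3, 0.5],  # correct video is ranked third
    [0.1, 0.7, 0.6],  # correct video is ranked second
]
r_at_1 = recall_at_k(toy_similarity, 1)
r_at_3 = recall_at_k(toy_similarity, 3)
```

Because a larger K can only admit more correct matches, R@1 <= R@5 <= R@10 holds for any single model and dataset, which is consistent with every dataset in the retrieval rows.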

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models (2025-07-08)