Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li
Recent progress in Large Language Models (LLMs) has spurred various advancements in image-language conversation agents, but how to build a proficient video-based dialogue system remains under exploration. Given the scale of the LLM and the visual backbone, little GPU memory is left for effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder; the branch is tuned while the backbone is kept frozen. Once pretrained, BT-Adapter can be seamlessly integrated into all image conversation models that use the same version of CLIP, enabling video conversations without the need for video instructions. In addition, we develop a unique asymmetric token masking strategy inside the branch, with tailor-made training tasks for BT-Adapter, which facilitates faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art results in video chatting when using video instruction tuning, outperforming previous state-of-the-art methods by a large margin.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | MSVD-QA | Accuracy | 67 | BT-Adapter (zero-shot) |
| Question Answering | MSVD-QA | Confidence Score | 3.6 | BT-Adapter (zero-shot) |
| Question Answering | MSRVTT-QA | Accuracy | 51.2 | BT-Adapter (zero-shot) |
| Question Answering | MSRVTT-QA | Confidence Score | 2.9 | BT-Adapter (zero-shot) |
| Question Answering | ActivityNet-QA | Accuracy | 46.1 | BT-Adapter (zero-shot) |
| Question Answering | ActivityNet-QA | Confidence Score | 3.2 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.46 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.27 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.68 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.69 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.34 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | mean | 2.69 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.2 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 2.89 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.16 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.46 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.13 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | mean | 2.46 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.27 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.89 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.68 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.16 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.69 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.46 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.34 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.13 | BT-Adapter (zero-shot) |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.46 | BT-Adapter |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.2 | BT-Adapter (zero-shot) |
| Video Question Answering | ActivityNet-QA | Accuracy | 46.1 | BT-Adapter (zero-shot) |
| Video Question Answering | ActivityNet-QA | Confidence score | 3.6 | BT-Adapter (zero-shot) |
| Video Question Answering | MSVD-QA | Accuracy | 67 | BT-Adapter (zero-shot) |
| Video Question Answering | MSVD-QA | Confidence Score | 3.6 | BT-Adapter (zero-shot) |
| Video Question Answering | MSRVTT-QA | Accuracy | 51.2 | BT-Adapter (zero-shot) |
| Video Question Answering | MSRVTT-QA | Confidence Score | 2.9 | BT-Adapter (zero-shot) |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.2 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | Consistency | 2.46 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.27 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.68 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.69 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.34 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | mean | 2.69 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | Consistency | 2.2 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 2.89 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.16 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.46 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.13 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | mean | 2.46 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 3.27 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.89 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.68 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.16 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.69 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.46 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.34 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.13 | BT-Adapter (zero-shot) |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.46 | BT-Adapter |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.2 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.68 | BT-Adapter |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.16 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.46 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.27 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.68 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.69 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.34 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.69 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.2 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 2.89 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.16 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.46 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.13 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.46 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.27 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.89 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.68 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.16 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.69 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.46 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.34 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.13 | BT-Adapter (zero-shot) |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.46 | BT-Adapter |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.2 | BT-Adapter (zero-shot) |
| VCGBench-Diverse | VideoInstruct | Consistency | 2.27 | BT-Adapter |
| VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.59 | BT-Adapter |
| VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.2 | BT-Adapter |
| VCGBench-Diverse | VideoInstruct | Dense Captioning | 1.03 | BT-Adapter |
| VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.62 | BT-Adapter |
| VCGBench-Diverse | VideoInstruct | Reasoning | 3.62 | BT-Adapter |
| VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.35 | BT-Adapter |
| VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.29 | BT-Adapter |
| VCGBench-Diverse | VideoInstruct | mean | 2.19 | BT-Adapter |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 40.9 | BT-Adapter |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 64.7 | BT-Adapter |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 73.5 | BT-Adapter |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 35.6 | BT-Adapter |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 61.9 | BT-Adapter |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 72.6 | BT-Adapter |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 19.5 | BT-Adapter |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 35.9 | BT-Adapter |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 45 | BT-Adapter |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 37 | BT-Adapter |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 66.7 | BT-Adapter |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 78.9 | BT-Adapter |
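
The retrieval rows above report text-to-video Recall@K (R@1, R@5, R@10), reported here as a percentage: the share of text queries whose ground-truth video appears among the top K ranked videos. The following is a minimal sketch of how such a metric is commonly computed, not the authors' evaluation code. The function names `rank_videos` and `recall_at_k` and the toy similarity matrix are hypothetical, and the sketch assumes exactly one correct video per query.

```python
from typing import List, Sequence


def rank_videos(similarity: Sequence[Sequence[float]]) -> List[List[int]]:
    """For each text query (row), return video indices sorted by descending similarity."""
    return [
        sorted(range(len(row)), key=lambda j: -row[j])
        for row in similarity
    ]


def recall_at_k(rankings: Sequence[Sequence[int]], targets: Sequence[int], k: int) -> float:
    """Percentage of queries whose target video is within the top-k ranked videos.

    Assumes one ground-truth video per query (an assumption of this sketch).
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return 100.0 * hits / len(targets)


# Toy text-to-video similarity matrix (made-up values): row i is query i,
# column j is video j, and the correct video for query i is video i.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # target ranked 1st
    [0.4, 0.2, 0.5, 0.1],  # target ranked 3rd
    [0.3, 0.6, 0.5, 0.1],  # target ranked 2nd
    [0.2, 0.3, 0.4, 0.1],  # target ranked 4th
]
targets = [0, 1, 2, 3]
rankings = rank_videos(similarity)

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, targets, k):.1f}")
```

In a zero-shot setting the similarity matrix would come from the pretrained text and video encoders without fine-tuning on the target dataset; only the ranking and counting step is shown here.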