Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Yanwei Li, Chengyao Wang, Jiaya Jia

Published: 2023-11-28
Tasks: Zero-Shot Video Question Answer · Question Answering · Video-based Generative Performance Benchmarking · Video Question Answering · Image Captioning · Visual Question Answering
Links: Paper · PDF · Code (official)

Abstract

In this work, we present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive number of visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens: a context token and a content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates the visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. Generally, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is shown to surpass previous methods on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID
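The dual-token idea in the abstract can be illustrated with a small sketch. This is a hypothetical simplification, not the paper's implementation: it assumes the context token comes from single-head dot-product attention of a pooled instruction embedding over the frame's patch features, and the content token from mean-pooling those features. The function name `dual_token` and all shapes are illustrative.

```python
import numpy as np

def dual_token(frame_patches, text_query):
    """Hypothetical sketch of the dual-token idea: collapse one frame's
    patch features into a context token plus a content token."""
    # frame_patches: (N, D) visual features; text_query: (D,) instruction embedding
    d = frame_patches.shape[1]
    # Context token: attention of the instruction over the frame's patches.
    scores = frame_patches @ text_query / np.sqrt(d)   # (N,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over patches
    context_token = weights @ frame_patches            # (D,)
    # Content token: mean-pool of the patches (coarse visual gist).
    content_token = frame_patches.mean(axis=0)         # (D,)
    return np.stack([context_token, content_token])    # (2, D)

rng = np.random.default_rng(0)
patches = rng.normal(size=(256, 64))   # e.g. 256 patch features for one frame
query = rng.normal(size=64)
tokens = dual_token(patches, query)
print(tokens.shape)  # (2, 64)

# Why this matters for long video: two tokens per frame instead of all patches.
# An hour of video at 1 fps is 3600 frames:
print(3600 * 2)      # 7200 tokens with the dual-token scheme
print(3600 * 256)    # 921600 tokens if every patch token were kept
```

The headline claim (supporting hour-long videos) follows from this budget: a few thousand tokens fit in an LLM context window, while hundreds of thousands do not.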

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Question Answering | MSVD-QA | Accuracy | 70 | LLaMA-VID-13B (2 Token) |
| Question Answering | MSVD-QA | Confidence Score | 3.7 | LLaMA-VID-13B (2 Token) |
| Question Answering | MSVD-QA | Accuracy | 69.7 | LLaMA-VID-7B (2 Token) |
| Question Answering | MSVD-QA | Confidence Score | 3.7 | LLaMA-VID-7B (2 Token) |
| Question Answering | MSRVTT-QA | Accuracy | 58.9 | LLaMA-VID-13B (2 Token) |
| Question Answering | MSRVTT-QA | Confidence Score | 3.3 | LLaMA-VID-13B (2 Token) |
| Question Answering | MSRVTT-QA | Accuracy | 57.7 | LLaMA-VID-7B (2 Token) |
| Question Answering | MSRVTT-QA | Confidence Score | 3.2 | LLaMA-VID-7B (2 Token) |
| Question Answering | ActivityNet-QA | Accuracy | 47.5 | LLaMA-VID-13B (2 Token) |
| Question Answering | ActivityNet-QA | Confidence Score | 3.3 | LLaMA-VID-13B (2 Token) |
| Question Answering | ActivityNet-QA | Accuracy | 47.4 | LLaMA-VID-7B (2 Token) |
| Question Answering | ActivityNet-QA | Confidence Score | 3.3 | LLaMA-VID-7B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.63 | LLaMA-VID-13B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.6 | LLaMA-VID-13B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.07 | LLaMA-VID-13B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 3.05 | LLaMA-VID-13B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.58 | LLaMA-VID-13B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Mean | 2.99 | LLaMA-VID-13B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.51 | LLaMA-VID-7B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.53 | LLaMA-VID-7B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.96 | LLaMA-VID-7B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 3 | LLaMA-VID-7B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.46 | LLaMA-VID-7B (2 Token) |
| Visual Question Answering (VQA) | VideoInstruct | Mean | 2.89 | LLaMA-VID-7B (2 Token) |
| Video Question Answering | OVBench | AVG | 41.9 | LLaMA-VID (7B) |
| Video Question Answering | ActivityNet-QA | Accuracy | 47.5 | LLaMA-VID-13B (2 Token) |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | LLaMA-VID-13B (2 Token) |
| Video Question Answering | ActivityNet-QA | Accuracy | 47.4 | LLaMA-VID-7B (2 Token) |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | LLaMA-VID-7B (2 Token) |
| Video Question Answering | MSVD-QA | Accuracy | 70 | LLaMA-VID-13B (2 Token) |
| Video Question Answering | MSVD-QA | Confidence Score | 3.7 | LLaMA-VID-13B (2 Token) |
| Video Question Answering | MSVD-QA | Accuracy | 69.7 | LLaMA-VID-7B (2 Token) |
| Video Question Answering | MSVD-QA | Confidence Score | 3.7 | LLaMA-VID-7B (2 Token) |
| Video Question Answering | MSRVTT-QA | Accuracy | 58.9 | LLaMA-VID-13B (2 Token) |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.3 | LLaMA-VID-13B (2 Token) |
| Video Question Answering | MSRVTT-QA | Accuracy | 57.7 | LLaMA-VID-7B (2 Token) |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.2 | LLaMA-VID-7B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Consistency | 2.63 | LLaMA-VID-13B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.6 | LLaMA-VID-13B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Correctness of Information | 3.07 | LLaMA-VID-13B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Detail Orientation | 3.05 | LLaMA-VID-13B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.58 | LLaMA-VID-13B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Mean | 2.99 | LLaMA-VID-13B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Consistency | 2.51 | LLaMA-VID-7B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.53 | LLaMA-VID-7B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.96 | LLaMA-VID-7B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Detail Orientation | 3 | LLaMA-VID-7B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.46 | LLaMA-VID-7B (2 Token) |
| Generative Visual Question Answering | VideoInstruct | Mean | 2.89 | LLaMA-VID-7B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.63 | LLaMA-VID-13B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.6 | LLaMA-VID-13B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.07 | LLaMA-VID-13B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 3.05 | LLaMA-VID-13B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.58 | LLaMA-VID-13B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 2.99 | LLaMA-VID-13B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.51 | LLaMA-VID-7B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.53 | LLaMA-VID-7B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.96 | LLaMA-VID-7B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 3 | LLaMA-VID-7B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.46 | LLaMA-VID-7B (2 Token) |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 2.89 | LLaMA-VID-7B (2 Token) |

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)