Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, Jiankun Yang

2024-11-04

Tasks: Zero-Shot Video Question Answer · Video Question Answering · Video Understanding · Caption Generation · Multiple-choice · Video-based Generative Performance Benchmarking (Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding, Consistency)
Paper · PDF · Code (official)

Abstract

The past year has witnessed the significant advancement of video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods tailored to long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: the CLIP-based visual-prompt alignment that extracts visual information relevant to the user's instructions, the prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling, and the CLIP context extension designed for the lengthy prompts common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and a visual context of only 1024 tokens, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at https://github.com/farewellthree/PPLLaVA.
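The prompt-guided pooling the abstract describes can be sketched roughly: score each visual token by its CLIP-style similarity to the prompt embedding, then pool the sequence convolution-style (fixed windows over the token axis) using those scores as weights, with the window layout chosen to hit an arbitrary target length. A minimal NumPy sketch under those assumptions — the function name, `temperature`, and shapes are hypothetical and this is not the authors' implementation:

```python
import numpy as np

def prompt_guided_pool(visual_tokens, prompt_emb, out_len, temperature=0.1):
    """Compress (T, D) visual tokens to (out_len, D), weighting each token
    by its CLIP-style cosine similarity to the (D,) prompt embedding.
    Requires out_len <= T. Illustrative sketch only."""
    # Cosine similarity between each visual token and the prompt.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb)
    relevance = np.exp(v @ p / temperature)  # sharpen with a softmax-style weight

    # Convolution-style pooling: split the token axis into out_len windows
    # and take a relevance-weighted average inside each window.
    T = visual_tokens.shape[0]
    edges = np.linspace(0, T, out_len + 1).astype(int)
    pooled = np.stack([
        np.average(visual_tokens[a:b], axis=0, weights=relevance[a:b])
        for a, b in zip(edges[:-1], edges[1:])
    ])
    return pooled
```

The key property is that `out_len` is a free parameter, so the same module can compress an hour-long video aggressively or leave a short clip nearly intact, which matches the paper's claim of handling both regimes with one model.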

Results

Video Question Answering

| Dataset | Metric | Value | Model |
|---|---|---|---|
| MSVD-QA | Accuracy | 77.1 | PPLLaVA-7B |
| MSVD-QA | Confidence Score | 4.0 | PPLLaVA-7B |
| MSRVTT-QA | Accuracy | 64.3 | PPLLaVA-7B |
| MSRVTT-QA | Confidence Score | 3.5 | PPLLaVA-7B |
| ActivityNet-QA | Accuracy | 60.7 | PPLLaVA-7B |
| ActivityNet-QA | Confidence Score | 3.6 | PPLLaVA-7B |
| MVBench | Avg. | 59.2 | PPLLaVA-7B |

Video-based Generative Performance Benchmarking (VideoInstruct, GPT-scored)

| Metric | PPLLaVA-7B | PPLLaVA-7B-dpo |
|---|---|---|
| Correctness of Information | 3.32 | 3.85 |
| Detail Orientation | 3.20 | 3.56 |
| Contextual Understanding | 3.88 | 4.21 |
| Temporal Understanding | 3.00 | 3.21 |
| Consistency | 3.20 | 3.81 |
| Mean | 3.32 | 3.73 |

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
- GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning (2025-07-09)
- Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)