Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan

Published: 2024-07-22

Tasks: Zero-Shot Video Question Answering, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), Video-based Generative Performance Benchmarking (Correctness of Information), Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Video-based Generative Performance Benchmarking (Detail Orientation), Large Language Model, Video Understanding, Language Modelling

Paper | PDF | Code (official)

Abstract

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as much spatial detail as possible (e.g., with 12x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for detailed video understanding. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets. Code has been made available at: https://github.com/apple/ml-slowfast-llava.
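The two-stream input design described above can be sketched in a few lines. This is a minimal, illustrative NumPy mock-up of the idea (sample a few frames at full spatial detail for the Slow pathway; keep every frame but pool spatially for the Fast pathway; concatenate into one token sequence), not the paper's exact implementation; the function names, strides, and shapes here are assumptions chosen for clarity.

```python
import numpy as np

def pool2d(grid, stride):
    """Average-pool an (H, W, D) token grid with the given spatial stride."""
    H, W, D = grid.shape
    h, w = H // stride, W // stride
    return grid[:h * stride, :w * stride].reshape(h, stride, w, stride, D).mean(axis=(1, 3))

def slowfast_tokens(frames, slow_step=8, fast_stride=6):
    """Aggregate per-frame visual tokens with a two-stream SlowFast design.

    frames: (T, H, W, D) array of token grids from an image encoder.
    slow_step and fast_stride are illustrative hyperparameters, not the
    paper's exact configuration.
    """
    T, H, W, D = frames.shape
    # Slow pathway: low frame rate, full spatial detail.
    slow = frames[::slow_step].reshape(-1, D)
    # Fast pathway: every frame, aggressive spatial pooling (e.g. 6x) for motion cues.
    fast = np.stack([pool2d(f, fast_stride) for f in frames]).reshape(-1, D)
    # Both streams form one token sequence, keeping the total within the LLM's budget.
    return np.concatenate([slow, fast], axis=0)
```

With 48 frames of 24x24 token grids, the Slow stream contributes 6 full-resolution frames and the Fast stream 48 heavily pooled ones, so the combined sequence stays far shorter than feeding all 48 frames at full resolution.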

Results

Task | Dataset | Metric | Value | Model
Question Answering | NExT-QA | Accuracy | 64.2 | SlowFast-LLaVA-34B
Question Answering | MSVD-QA | Accuracy | 79.9 | SlowFast-LLaVA-34B
Question Answering | MSVD-QA | Confidence Score | 4.1 | SlowFast-LLaVA-34B
Question Answering | TGIF-QA | Accuracy | 80.6 | SlowFast-LLaVA-34B
Question Answering | TGIF-QA | Confidence Score | 4.3 | SlowFast-LLaVA-34B
Question Answering | MSRVTT-QA | Accuracy | 67.4 | SlowFast-LLaVA-34B
Question Answering | MSRVTT-QA | Confidence Score | 3.7 | SlowFast-LLaVA-34B
Question Answering | IntentQA | Accuracy | 60.1 | SlowFast-LLaVA-34B
Question Answering | EgoSchema (subset) | Accuracy | 47.2 | SlowFast-LLaVA-34B
Question Answering | ActivityNet-QA | Accuracy | 59.2 | SlowFast-LLaVA-34B
Question Answering | ActivityNet-QA | Confidence Score | 3.5 | SlowFast-LLaVA-34B
Visual Question Answering (VQA) | VideoInstruct | mean | 3.32 | SlowFast-LLaVA-34B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.84 | SlowFast-LLaVA-34B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.48 | SlowFast-LLaVA-34B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.96 | SlowFast-LLaVA-34B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.77 | SlowFast-LLaVA-34B
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.57 | SlowFast-LLaVA-34B
Video Question Answering | NExT-QA | Accuracy | 64.2 | SlowFast-LLaVA-34B
Video Question Answering | MSVD-QA | Accuracy | 79.9 | SlowFast-LLaVA-34B
Video Question Answering | MSVD-QA | Confidence Score | 4.1 | SlowFast-LLaVA-34B
Video Question Answering | TGIF-QA | Accuracy | 80.6 | SlowFast-LLaVA-34B
Video Question Answering | TGIF-QA | Confidence Score | 4.3 | SlowFast-LLaVA-34B
Video Question Answering | MSRVTT-QA | Accuracy | 67.4 | SlowFast-LLaVA-34B
Video Question Answering | MSRVTT-QA | Confidence Score | 3.7 | SlowFast-LLaVA-34B
Video Question Answering | IntentQA | Accuracy | 60.1 | SlowFast-LLaVA-34B
Video Question Answering | EgoSchema (subset) | Accuracy | 47.2 | SlowFast-LLaVA-34B
Video Question Answering | ActivityNet-QA | Accuracy | 59.2 | SlowFast-LLaVA-34B
Video Question Answering | ActivityNet-QA | Confidence Score | 3.5 | SlowFast-LLaVA-34B
Generative Visual Question Answering | VideoInstruct | mean | 3.32 | SlowFast-LLaVA-34B
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.84 | SlowFast-LLaVA-34B
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.48 | SlowFast-LLaVA-34B
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.96 | SlowFast-LLaVA-34B
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.77 | SlowFast-LLaVA-34B
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.57 | SlowFast-LLaVA-34B
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.48 | SlowFast-LLaVA-34B
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 3.32 | SlowFast-LLaVA-34B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.84 | SlowFast-LLaVA-34B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.48 | SlowFast-LLaVA-34B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.96 | SlowFast-LLaVA-34B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.77 | SlowFast-LLaVA-34B
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.57 | SlowFast-LLaVA-34B

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)