Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


vid-TLDR: Training Free Token merging for Light-weight Video Transformer

Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim

2024-03-20 · CVPR 2024

Tasks: Zero-Shot Video Question Answer, Video Retrieval, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Video Question Answering, Action Recognition, Visual Question Answering (VQA)

Paper · PDF · Code (official)

Abstract

Video Transformers have become the prevalent solution for various video downstream tasks with superior expressive power and flexibility. However, these video transformers suffer from heavy computational costs induced by the massive number of tokens across the entire video frames, which has been the major barrier to training the model. Further, the patches irrelevant to the main contents, e.g., backgrounds, degrade the generalization performance of models. To tackle these issues, we propose training free token merging for lightweight video Transformer (vid-TLDR) that aims to enhance the efficiency of video Transformers by merging the background tokens without additional training. For vid-TLDR, we introduce a novel approach to capture the salient regions in videos only with the attention map. Further, we introduce the saliency-aware token merging strategy by dropping the background tokens and sharpening the object scores. Our experiments show that vid-TLDR significantly mitigates the computational complexity of video Transformers while achieving competitive performance compared to the base model without vid-TLDR. Code is available at https://github.com/mlvlab/vid-TLDR.
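The abstract describes the core mechanism: score token saliency directly from the attention map, drop low-saliency background tokens, and merge the rest without any training. The sketch below illustrates that idea in a minimal, self-contained form; the function name `saliency_token_merge`, the `keep_ratio` parameter, and the squaring step standing in for the paper's score-sharpening are all illustrative assumptions, not the authors' implementation (see the official repo for that).

```python
import numpy as np

def saliency_token_merge(tokens, attn, keep_ratio=0.5):
    """Illustrative sketch of attention-based saliency scoring and
    background-token merging in the spirit of vid-TLDR (hypothetical,
    not the authors' code).

    tokens: (N, D) token embeddings
    attn:   (N, N) attention map, rows summing to 1
    """
    # Saliency of each token: how much attention it receives on average.
    saliency = attn.mean(axis=0)                    # (N,)
    # Sharpen scores so salient tokens stand out (stand-in for the
    # paper's sharpening step; the exact operation differs).
    saliency = saliency ** 2
    saliency = saliency / saliency.sum()

    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-saliency)                   # most salient first
    keep, drop = order[:n_keep], order[n_keep:]

    merged = tokens[keep].copy()
    if len(drop) > 0:
        # Fold each background token into its most similar kept token,
        # then average, so no information is silently discarded.
        sims = tokens[drop] @ tokens[keep].T        # (n_drop, n_keep)
        target = sims.argmax(axis=1)
        counts = np.ones(n_keep)
        for t, d in zip(target, drop):
            merged[t] += tokens[d]
            counts[t] += 1
        merged /= counts[:, None]
    return merged, keep
```

Because the saliency comes from an attention map the model already computes, a scheme like this adds no trainable parameters, which is what makes the approach training-free.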

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Video Retrieval | SSv2-template retrieval | text-to-video R@1 | 90.2 | vid-TLDR (UMT-L) |
| Video Retrieval | SSv2-template retrieval | text-to-video R@5 | 100 | vid-TLDR (UMT-L) |
| Video Retrieval | SSv2-template retrieval | text-to-video R@10 | 100 | vid-TLDR (UMT-L) |
| Video Retrieval | ActivityNet | text-to-video R@1 | 66.7 | vid-TLDR (UMT-L) |
| Video Retrieval | ActivityNet | text-to-video R@5 | 88.6 | vid-TLDR (UMT-L) |
| Video Retrieval | ActivityNet | text-to-video R@10 | 94.4 | vid-TLDR (UMT-L) |
| Video Retrieval | ActivityNet | video-to-text R@1 | 63.9 | vid-TLDR (UMT-L) |
| Video Retrieval | ActivityNet | video-to-text R@5 | 88.7 | vid-TLDR (UMT-L) |
| Video Retrieval | ActivityNet | video-to-text R@10 | 94.5 | vid-TLDR (UMT-L) |
| Video Retrieval | SSv2-label retrieval | text-to-video R@1 | 73.1 | vid-TLDR (UMT-L) |
| Video Retrieval | SSv2-label retrieval | text-to-video R@5 | 93.3 | vid-TLDR (UMT-L) |
| Video Retrieval | SSv2-label retrieval | text-to-video R@10 | 96.6 | vid-TLDR (UMT-L) |
| Video Retrieval | DiDeMo | text-to-video R@1 | 72.3 | vid-TLDR (UMT-L) |
| Video Retrieval | DiDeMo | text-to-video R@5 | 91.2 | vid-TLDR (UMT-L) |
| Video Retrieval | DiDeMo | text-to-video R@10 | 94.2 | vid-TLDR (UMT-L) |
| Video Retrieval | DiDeMo | video-to-text R@1 | 68.5 | vid-TLDR (UMT-L) |
| Video Retrieval | DiDeMo | video-to-text R@5 | 89.8 | vid-TLDR (UMT-L) |
| Video Retrieval | DiDeMo | video-to-text R@10 | 93.8 | vid-TLDR (UMT-L) |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 58.1 | vid-TLDR (UMT-L) |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 81 | vid-TLDR (UMT-L) |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 81.6 | vid-TLDR (UMT-L) |
| Video Retrieval | MSR-VTT | video-to-text R@1 | 58.7 | vid-TLDR (UMT-L) |
| Video Retrieval | MSR-VTT | video-to-text R@5 | 81.6 | vid-TLDR (UMT-L) |
| Video Retrieval | MSR-VTT | video-to-text R@10 | 86.9 | vid-TLDR (UMT-L) |
| Video Retrieval | LSMDC | text-to-video R@1 | 43.1 | vid-TLDR (UMT-L) |
| Video Retrieval | LSMDC | text-to-video R@5 | 64.5 | vid-TLDR (UMT-L) |
| Video Retrieval | LSMDC | text-to-video R@10 | 71.4 | vid-TLDR (UMT-L) |
| Video Retrieval | LSMDC | video-to-text R@1 | 40.7 | vid-TLDR (UMT-L) |
| Video Retrieval | LSMDC | video-to-text R@5 | 70.2 | vid-TLDR (UMT-L) |
| Video Retrieval | LSMDC | video-to-text R@10 | 63.6 | vid-TLDR (UMT-L) |
| Video Retrieval | MSVD | text-to-video R@1 | 57.9 | vid-TLDR (UMT-L) |
| Video Retrieval | MSVD | text-to-video R@5 | 83.8 | vid-TLDR (UMT-L) |
| Video Retrieval | MSVD | text-to-video R@10 | 89.4 | vid-TLDR (UMT-L) |
| Video Retrieval | MSVD | video-to-text R@1 | 82.7 | vid-TLDR (UMT-L) |
| Video Retrieval | MSVD | video-to-text R@5 | 94.5 | vid-TLDR (UMT-L) |
| Video Retrieval | MSVD | video-to-text R@10 | 96.3 | vid-TLDR (UMT-L) |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.47 | vid-TLDR (UMT-L) |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.549 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 42.1 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 63.9 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 72.4 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 37.7 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@5 | 59.8 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 69.4 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 50 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 77.6 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 85.5 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 75.7 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@5 | 90 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@10 | 95.1 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 52 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 74 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 81 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 52 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 75.9 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 83.8 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 42.8 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 69.4 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 79.6 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 41.2 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@5 | 68.2 | vid-TLDR (UMT-L) |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 79.1 | vid-TLDR (UMT-L) |
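The R@k figures above are recall-at-k: the percentage of queries whose ground-truth match appears among the top-k retrieved items. A minimal sketch of how such a metric is computed from a text-video similarity matrix, under the common assumption that query i's correct item sits at column i (the function name `recall_at_k` is illustrative):

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@k as a percentage, assuming the ground-truth item for
    query i is item i (standard retrieval-benchmark convention).

    sim: (num_queries, num_items) similarity matrix.
    """
    ranked = np.argsort(-sim, axis=1)         # item indices, best match first
    gt = np.arange(sim.shape[0])[:, None]     # ground-truth index per query
    hits = (ranked[:, :k] == gt).any(axis=1)  # ground truth within top-k?
    return 100.0 * hits.mean()
```

For example, with `sim = np.array([[0.1, 0.9], [0.2, 0.8]])` the first query ranks its correct item second, so `recall_at_k(sim, 1)` is 50.0 while `recall_at_k(sim, 2)` is 100.0.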

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
- LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
- Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
- Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)