Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou

2023-10-29 | Video Retrieval, Video-Text Retrieval, Form, Video Question Answering, Retrieval, Language Modelling

Abstract

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
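The abstract describes aggregation only at a high level (merging similar frames, and similar patches within a frame). As a rough illustration of that idea, and not the authors' algorithm, the sketch below greedily averages the most similar pair of neighbouring tokens until a target count remains. The function names `cosine` and `aggregate_adjacent`, the greedy adjacent-pair strategy, and the toy 2-D feature vectors are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def aggregate_adjacent(tokens, keep):
    """Hypothetical helper: merge the most similar neighbouring tokens until `keep` remain.

    Each merged token is the mean of all original tokens it absorbed.
    """
    # Track (vector, number of original tokens merged into it).
    merged = [(list(t), 1) for t in tokens]
    while len(merged) > max(keep, 1):
        # Index of the adjacent pair with the highest cosine similarity.
        i = max(range(len(merged) - 1),
                key=lambda k: cosine(merged[k][0], merged[k + 1][0]))
        (u, nu), (v, nv) = merged[i], merged[i + 1]
        # Size-weighted average keeps the result equal to the mean of the originals.
        avg = [(a * nu + b * nv) / (nu + nv) for a, b in zip(u, v)]
        merged[i:i + 2] = [(avg, nu + nv)]
    return [vec for vec, _ in merged]

# Toy "video": 8 frame features forming two visually distinct scenes.
frames = [[1, 0], [1, 0.1], [0.9, 0], [1, 0],
          [0, 1], [0.1, 1], [0, 1], [0, 0.9]]
condensed = aggregate_adjacent(frames, keep=2)  # 8 tokens -> 2 tokens
print(len(condensed), condensed)
```

Going from 8 tokens to 2 mirrors the 75% reduction quoted in the abstract. The same function could be applied to the patch tokens of a single frame treated as a flat list, although that ignores their 2-D layout.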

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Condensed Movies | text-to-video R@1 | 24.9 | TESTA (ViT-B/16) |
| Video | Condensed Movies | text-to-video R@5 | 46.5 | TESTA (ViT-B/16) |
| Video | Condensed Movies | text-to-video R@10 | 55.1 | TESTA (ViT-B/16) |
| Video | ActivityNet | text-to-video R@1 | 54.8 | TESTA (ViT-B/16) |
| Video | ActivityNet | text-to-video R@5 | 80.8 | TESTA (ViT-B/16) |
| Video | ActivityNet | text-to-video R@10 | 89.6 | TESTA (ViT-B/16) |
| Video | DiDeMo | text-to-video R@1 | 61.2 | TESTA (ViT-B/16) |
| Video | DiDeMo | text-to-video R@5 | 87.2 | TESTA (ViT-B/16) |
| Video | DiDeMo | text-to-video R@10 | 91.5 | TESTA (ViT-B/16) |
| Video | QuerYD | text-to-video R@1 | 83.4 | TESTA (ViT-B/16) |
| Video | QuerYD | text-to-video R@5 | 93.8 | TESTA (ViT-B/16) |
| Video | QuerYD | text-to-video R@10 | 95.3 | TESTA (ViT-B/16) |
| Video Question Answering | ActivityNet-QA | Accuracy | 45 | TESTA (ViT-B/16) |
| Video Retrieval | Condensed Movies | text-to-video R@1 | 24.9 | TESTA (ViT-B/16) |
| Video Retrieval | Condensed Movies | text-to-video R@5 | 46.5 | TESTA (ViT-B/16) |
| Video Retrieval | Condensed Movies | text-to-video R@10 | 55.1 | TESTA (ViT-B/16) |
| Video Retrieval | ActivityNet | text-to-video R@1 | 54.8 | TESTA (ViT-B/16) |
| Video Retrieval | ActivityNet | text-to-video R@5 | 80.8 | TESTA (ViT-B/16) |
| Video Retrieval | ActivityNet | text-to-video R@10 | 89.6 | TESTA (ViT-B/16) |
| Video Retrieval | DiDeMo | text-to-video R@1 | 61.2 | TESTA (ViT-B/16) |
| Video Retrieval | DiDeMo | text-to-video R@5 | 87.2 | TESTA (ViT-B/16) |
| Video Retrieval | DiDeMo | text-to-video R@10 | 91.5 | TESTA (ViT-B/16) |
| Video Retrieval | QuerYD | text-to-video R@1 | 83.4 | TESTA (ViT-B/16) |
| Video Retrieval | QuerYD | text-to-video R@5 | 93.8 | TESTA (ViT-B/16) |
| Video Retrieval | QuerYD | text-to-video R@10 | 95.3 | TESTA (ViT-B/16) |
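
The retrieval rows report text-to-video R@K, the percentage of text queries whose correct video appears among the top K ranked results. As a minimal sketch of how such a number is commonly computed, the function below assumes a square similarity matrix in which query i's correct video is video i. The name `recall_at_k`, the matrix layout, and the toy scores are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose correct video ranks in the top k.

    sim[i][j] is the score of text query i against video j;
    video i is assumed to be the ground-truth match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)

# Toy example with 3 queries and 3 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.2],  # correct video ranked first
    [0.8, 0.3, 0.1],  # correct video ranked second
    [0.2, 0.4, 0.3],  # correct video ranked second
]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

With these toy scores only the first query retrieves its video at rank 1, so R@1 is one third of 100 while R@2 is 100.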
