TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou

2023-10-29

Video Retrieval Video-Text Retrieval Form Video Question Answering Retrieval Language Modelling

Abstract

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
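The aggregation idea in the abstract (merge similar frames, and similar patches within a frame, so that fewer tokens reach later encoder blocks) can be illustrated with a small sketch. This is not the authors' implementation: the greedy adjacent-pair merging rule, the cosine similarity measure, and the names `cosine` and `aggregate_tokens` are assumptions chosen for illustration, and the paper's actual matching procedure may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate_tokens(tokens, keep):
    """Greedily merge the most similar adjacent pair until `keep` tokens remain.

    Hypothetical simplification: each merge replaces two neighbours with their
    size-weighted average, so a merged token stays the mean of its originals.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Each entry is (vector, number of original tokens it represents).
    merged = [(list(t), 1) for t in tokens]
    while len(merged) > keep:
        # Index of the adjacent pair with the highest similarity.
        i = max(
            range(len(merged) - 1),
            key=lambda k: cosine(merged[k][0], merged[k + 1][0]),
        )
        (va, ca), (vb, cb) = merged[i], merged[i + 1]
        total = ca + cb
        avg = [(x * ca + y * cb) / total for x, y in zip(va, vb)]
        merged[i:i + 2] = [(avg, total)]
    return [vec for vec, _ in merged]


# Eight toy "frame" features reduced to two, i.e. a 75% reduction.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
reduced = aggregate_tokens(frames, keep=len(frames) // 4)
```

The same function could be applied along the temporal axis (one vector per frame) or the spatial axis (one vector per patch). Restricting merges to adjacent items is a design choice of this sketch that keeps temporal order intact.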

Results

Task	Dataset	Metric	Value	Model
Video Retrieval	Condensed Movies	text-to-video R@1	24.9	TESTA (ViT-B/16)
Video Retrieval	Condensed Movies	text-to-video R@5	46.5	TESTA (ViT-B/16)
Video Retrieval	Condensed Movies	text-to-video R@10	55.1	TESTA (ViT-B/16)
Video Retrieval	ActivityNet	text-to-video R@1	54.8	TESTA (ViT-B/16)
Video Retrieval	ActivityNet	text-to-video R@5	80.8	TESTA (ViT-B/16)
Video Retrieval	ActivityNet	text-to-video R@10	89.6	TESTA (ViT-B/16)
Video Retrieval	DiDeMo	text-to-video R@1	61.2	TESTA (ViT-B/16)
Video Retrieval	DiDeMo	text-to-video R@5	87.2	TESTA (ViT-B/16)
Video Retrieval	DiDeMo	text-to-video R@10	91.5	TESTA (ViT-B/16)
Video Retrieval	QuerYD	text-to-video R@1	83.4	TESTA (ViT-B/16)
Video Retrieval	QuerYD	text-to-video R@5	93.8	TESTA (ViT-B/16)
Video Retrieval	QuerYD	text-to-video R@10	95.3	TESTA (ViT-B/16)
Video Question Answering	ActivityNet-QA	Accuracy	45	TESTA (ViT-B/16)
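
For readers unfamiliar with the R@K columns, the sketch below shows one common way text-to-video Recall@K is computed from a query-by-video similarity matrix. The function name `recall_at_k`, the toy scores, and the assumption that query i has exactly one correct video (video i) are illustrative. The exact evaluation protocol behind the numbers above is not stated in this listing.

```python
def recall_at_k(sim, k):
    """Percentage of text queries whose correct video ranks in the top k.

    sim[i][j] is the score of text query i against video j; the correct
    video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of videos scored strictly higher than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy scores: queries 0 and 2 rank their video first, query 1 ranks it second.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.6],
]
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
```

With these toy scores, two of three queries are hits at k = 1 and all three at k = 2.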
