
CAST: Cross-Attention in Space and Time for Video Action Recognition

DongHo Lee, Jongseo Lee, Jinwoo Choi

Published: 2023-11-30 · NeurIPS 2023
Tasks: Action Classification · Video Understanding · Action Recognition · Action Recognition In Videos
Links: Paper · PDF · Code

Abstract

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
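The abstract describes a bottleneck cross-attention mechanism through which a spatial expert and a temporal expert exchange information. The sketch below illustrates the general idea in PyTorch: tokens from one stream are projected into a narrow bottleneck, attend to tokens from the other stream, and update the original tokens through a residual connection. All module names, dimensions, and shapes here are illustrative assumptions, not the authors' released implementation (see the Code link above for that).

    # Minimal sketch of bidirectional bottleneck cross-attention between two
    # expert streams. Dimensions and names are assumptions for illustration.
    import torch
    import torch.nn as nn

    class BottleneckCrossAttention(nn.Module):
        """Tokens from one stream attend to tokens from the other stream."""
        def __init__(self, dim=768, bottleneck_dim=192, num_heads=4):
            super().__init__()
            # Project into a smaller bottleneck so the information exchange
            # between experts stays cheap (hypothetical sizes).
            self.down = nn.Linear(dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, dim)
            self.attn = nn.MultiheadAttention(bottleneck_dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(bottleneck_dim)
            self.norm_kv = nn.LayerNorm(bottleneck_dim)

        def forward(self, query_tokens, context_tokens):
            q = self.norm_q(self.down(query_tokens))      # (B, Nq, bottleneck)
            kv = self.norm_kv(self.down(context_tokens))  # (B, Nk, bottleneck)
            out, _ = self.attn(q, kv, kv)                  # cross-attention
            return query_tokens + self.up(out)             # residual update

    # Usage: the spatial expert attends to temporal tokens and vice versa,
    # so both streams can make a joint, synergistic prediction downstream.
    B, N, D = 2, 196, 768
    spatial_tokens = torch.randn(B, N, D)   # tokens from the spatial expert
    temporal_tokens = torch.randn(B, N, D)  # tokens from the temporal expert

    temporal_to_spatial = BottleneckCrossAttention()
    spatial_to_temporal = BottleneckCrossAttention()
    spatial_updated = temporal_to_spatial(spatial_tokens, temporal_tokens)
    temporal_updated = spatial_to_temporal(temporal_tokens, spatial_tokens)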

Results

Task                 | Dataset                | Metric         | Value | Model
Video                | Kinetics-400           | Acc@1          | 85.3  | CAST (ViT-B/16)
Activity Recognition | EPIC-KITCHENS-100      | Action@1       | 49.3  | CAST (ViT-B/16)
Activity Recognition | EPIC-KITCHENS-100      | Noun@1         | 60.9  | CAST (ViT-B/16)
Activity Recognition | EPIC-KITCHENS-100      | Verb@1         | 72.5  | CAST (ViT-B/16)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 71.6  | CAST (ViT-B/16)
Action Recognition   | EPIC-KITCHENS-100      | Action@1       | 49.3  | CAST (ViT-B/16)
Action Recognition   | EPIC-KITCHENS-100      | Noun@1         | 60.9  | CAST (ViT-B/16)
Action Recognition   | EPIC-KITCHENS-100      | Verb@1         | 72.5  | CAST (ViT-B/16)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 71.6  | CAST (ViT-B/16)

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)