
CAST: Cross-Attention in Space and Time for Video Action Recognition

DongHo Lee, Jongseo Lee, Jinwoo Choi

Published: 2023-11-30 · NeurIPS 2023
Tasks: Action Classification · Video Understanding · Action Recognition · Action Recognition In Videos
Links: Paper · PDF · Code

Abstract

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
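The abstract describes a bottleneck cross-attention mechanism through which a spatial expert and a temporal expert exchange information. The sketch below illustrates the general idea in PyTorch: tokens from one stream are projected into a narrow bottleneck, attend to tokens from the other stream, and update the original tokens through a residual connection. All module names, dimensions, and shapes here are illustrative assumptions, not the authors' released implementation (see the Code link above for that).

    # Minimal sketch of bidirectional bottleneck cross-attention between two
    # expert streams. Dimensions and names are assumptions for illustration.
    import torch
    import torch.nn as nn

    class BottleneckCrossAttention(nn.Module):
        """Tokens from one stream attend to tokens from the other stream."""
        def __init__(self, dim=768, bottleneck_dim=192, num_heads=4):
            super().__init__()
            # Project into a smaller bottleneck so the information exchange
            # between experts stays cheap (hypothetical sizes).
            self.down = nn.Linear(dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, dim)
            self.attn = nn.MultiheadAttention(bottleneck_dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(bottleneck_dim)
            self.norm_kv = nn.LayerNorm(bottleneck_dim)

        def forward(self, query_tokens, context_tokens):
            q = self.norm_q(self.down(query_tokens))      # (B, Nq, bottleneck)
            kv = self.norm_kv(self.down(context_tokens))  # (B, Nk, bottleneck)
            out, _ = self.attn(q, kv, kv)                  # cross-attention
            return query_tokens + self.up(out)             # residual update

    # Usage: the spatial expert attends to temporal tokens and vice versa,
    # so both streams can make a joint, synergistic prediction downstream.
    B, N, D = 2, 196, 768
    spatial_tokens = torch.randn(B, N, D)   # tokens from the spatial expert
    temporal_tokens = torch.randn(B, N, D)  # tokens from the temporal expert

    temporal_to_spatial = BottleneckCrossAttention()
    spatial_to_temporal = BottleneckCrossAttention()
    spatial_updated = temporal_to_spatial(spatial_tokens, temporal_tokens)
    temporal_updated = spatial_to_temporal(temporal_tokens, spatial_tokens)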

Results

Task                 | Dataset                | Metric         | Value | Model
Video                | Kinetics-400           | Acc@1          | 85.3  | CAST (ViT-B/16)
Activity Recognition | EPIC-KITCHENS-100      | Action@1       | 49.3  | CAST (ViT-B/16)
Activity Recognition | EPIC-KITCHENS-100      | Noun@1         | 60.9  | CAST (ViT-B/16)
Activity Recognition | EPIC-KITCHENS-100      | Verb@1         | 72.5  | CAST (ViT-B/16)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 71.6  | CAST (ViT-B/16)
Action Recognition   | EPIC-KITCHENS-100      | Action@1       | 49.3  | CAST (ViT-B/16)
Action Recognition   | EPIC-KITCHENS-100      | Noun@1         | 60.9  | CAST (ViT-B/16)
Action Recognition   | EPIC-KITCHENS-100      | Verb@1         | 72.5  | CAST (ViT-B/16)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 71.6  | CAST (ViT-B/16)

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)