Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition

Jongseo Lee, Joohyun Chang, DongHo Lee, Jinwoo Choi

2025-03-30 · Action Classification · Audio Classification · Video Recognition · Video Understanding · Action Recognition

Abstract

We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics, EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, where it consistently shows balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, Kinetics-Sounds, and EPIC-SOUNDS. Given the favorable performance of CAVA across these datasets, we demonstrate effective information exchange among the multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
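The core idea described above — experts exchanging information through cross-attention over a small set of bottleneck tokens — can be sketched in a few lines. This is a minimal pure-Python illustration of single-head dot-product cross-attention, not the authors' implementation; the function names (`cross_attention`, `bottleneck_exchange`) and the absence of learned projections are simplifying assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention:
    each query attends over all keys and returns a weighted sum of values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def bottleneck_exchange(expert_a_tokens, expert_b_bottleneck):
    """Expert A's tokens query expert B's much smaller bottleneck token set,
    so information flows from B to A at reduced cost (the B-CA idea)."""
    return cross_attention(expert_a_tokens, expert_b_bottleneck, expert_b_bottleneck)

# Hypothetical toy example: two spatial tokens query one audio bottleneck token.
spatial_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_bottleneck = [[0.5, 0.5]]
fused = bottleneck_exchange(spatial_tokens, audio_bottleneck)
```

In a full model each expert would produce its bottleneck tokens with learned projections, and the exchange would run in every layer for each pair of experts (spatial, temporal, audio); the sketch only shows the attention mechanics.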

Results

| Task                 | Dataset         | Metric          | Value | Model        |
|----------------------|-----------------|-----------------|-------|--------------|
| Video                | Kinetics-Sounds | Top 1 Accuracy  | 93.3  | CA2ST(B/16)  |
| Video                | Kinetics-Sounds | Top 1 Accuracy  | 92.9  | CAVA(B/16)   |
| Activity Recognition | UCF101          | 3-fold Accuracy | 97.2  | CA2ST(B/16)  |
| Audio Classification | EPIC-SOUNDS     | Accuracy        | 61    | CA2ST(B/16)  |
| Audio Classification | EPIC-SOUNDS     | Accuracy        | 60.3  | CAVA(B/16)   |
| Audio Classification | VGGSound        | Top 1 Accuracy  | 68.3  | CA2ST(B/16)  |
| Audio Classification | VGGSound        | Top 1 Accuracy  | 68.2  | CAVA(B/16)   |
| Action Recognition   | UCF101          | 3-fold Accuracy | 97.2  | CA2ST(B/16)  |
| Classification       | EPIC-SOUNDS     | Accuracy        | 61    | CA2ST(B/16)  |
| Classification       | EPIC-SOUNDS     | Accuracy        | 60.3  | CAVA(B/16)   |
| Classification       | VGGSound        | Top 1 Accuracy  | 68.3  | CA2ST(B/16)  |
| Classification       | VGGSound        | Top 1 Accuracy  | 68.2  | CAVA(B/16)   |

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)