Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Relational Self-Attention: What's Missing in Attention for Video Understanding

Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

2021-11-02 · NeurIPS 2021 · Video Understanding · Action Recognition · Temporal Action Localization

Paper · PDF · Code (official)

Abstract

Convolution has arguably been the most important feature transform for modern neural networks, leading to the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
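To make the abstract's distinction concrete, here is a heavily simplified numpy sketch of the idea of augmenting a dynamic attention kernel with an aggregated relational (correlation) context. It is not the paper's RSA formulation (which operates over local spatio-temporal neighborhoods with dedicated relational kernels); the projection matrices `wq`, `wk`, `wv`, `wr` and the global-attention setting are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relational_self_attention(x, wq, wk, wv, wr):
    """Toy sketch: x is (T, C), T spatio-temporal positions, C channels.
    A dynamic kernel is generated from query-key correlations; the output
    aggregates both the values (basic context) and the pairwise correlation
    map itself (a stand-in for the paper's relational context).
    wr is (T, C) only because this toy version attends globally; the real
    RSA uses fixed-size local neighborhoods."""
    q, k, v = x @ wq, x @ wk, x @ wv          # (T, C) each
    corr = (q @ k.T) / np.sqrt(q.shape[-1])   # pairwise relations, (T, T)
    kernel = softmax(corr, axis=-1)           # dynamically generated kernel
    basic_ctx = kernel @ v                    # standard attention output
    rel_ctx = (kernel @ corr) @ wr            # aggregate the relation map, (T, C)
    return basic_ctx + rel_ctx
```

The point of the sketch is the second term: plain self-attention only uses `corr` to weight the values, whereas a relational transform also treats the correlation structure itself as a feature to be aggregated.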

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Activity Recognition | Diving-48 | Accuracy | 84.2 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 56.1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Activity Recognition | Something-Something V1 | Top-5 Accuracy | 82.8 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 55.5 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-5 Accuracy | 82.6 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 54 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-5 Accuracy | 81.1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 51.9 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-5 Accuracy | 79.6 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.7 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.3 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 90.8 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 66 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 89.8 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 64.8 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 89.1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Diving-48 | Accuracy | 84.2 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 56.1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Action Recognition | Something-Something V1 | Top-5 Accuracy | 82.8 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 55.5 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-5 Accuracy | 82.6 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 54 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-5 Accuracy | 81.1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 51.9 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-5 Accuracy | 79.6 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 67.7 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 91.1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 67.3 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 90.8 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 66 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 89.8 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 64.8 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 89.1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
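The table mixes single-clip and 2-clip evaluation and reports Top-1/Top-5 accuracy. As a reference, here is a minimal sketch of how those metrics are typically computed, with multi-clip evaluation done by averaging per-clip class scores before scoring once per video. The `topk_accuracy` helper and the toy score arrays are illustrative assumptions, not from the paper.

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.
    scores: (N, num_classes), labels: (N,)"""
    topk = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores
    return (topk == labels[:, None]).any(axis=1).mean()

# Two-clip evaluation: average the per-clip class scores, then score once.
clip_a = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3]])
clip_b = np.array([[0.5, 0.4, 0.1],
                   [0.1, 0.3, 0.6]])
labels = np.array([0, 2])

avg = (clip_a + clip_b) / 2
print(topk_accuracy(clip_a, labels, k=1))   # 0.5: sample 2 is wrong from clip A alone
print(topk_accuracy(avg, labels, k=1))      # 1.0: averaging the two clips fixes it
```

This is why the 2-clip rows in the table score slightly above their single-clip counterparts: averaging clip scores smooths out per-clip prediction noise.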

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)