Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Relational Self-Attention: What's Missing in Attention for Video Understanding

Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

2021-11-02 · NeurIPS 2021 · Video Understanding · Action Recognition · Temporal Action Localization

Paper · PDF · Code (official)

Abstract

Convolution has arguably been the most important feature transform for modern neural networks, leading to the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
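To make the abstract's distinction concrete, here is a heavily simplified numpy sketch of the idea of augmenting a dynamic attention kernel with an aggregated relational (correlation) context. It is not the paper's RSA formulation (which operates over local spatio-temporal neighborhoods with dedicated relational kernels); the projection matrices `wq`, `wk`, `wv`, `wr` and the global-attention setting are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relational_self_attention(x, wq, wk, wv, wr):
    """Toy sketch: x is (T, C), T spatio-temporal positions, C channels.
    A dynamic kernel is generated from query-key correlations; the output
    aggregates both the values (basic context) and the pairwise correlation
    map itself (a stand-in for the paper's relational context).
    wr is (T, C) only because this toy version attends globally; the real
    RSA uses fixed-size local neighborhoods."""
    q, k, v = x @ wq, x @ wk, x @ wv          # (T, C) each
    corr = (q @ k.T) / np.sqrt(q.shape[-1])   # pairwise relations, (T, T)
    kernel = softmax(corr, axis=-1)           # dynamically generated kernel
    basic_ctx = kernel @ v                    # standard attention output
    rel_ctx = (kernel @ corr) @ wr            # aggregate the relation map, (T, C)
    return basic_ctx + rel_ctx
```

The point of the sketch is the second term: plain self-attention only uses `corr` to weight the values, whereas a relational transform also treats the correlation structure itself as a feature to be aggregated.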

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Activity Recognition | Diving-48 | Accuracy | 84.2 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 56.1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Activity Recognition | Something-Something V1 | Top-5 Accuracy | 82.8 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 55.5 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-5 Accuracy | 82.6 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 54 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-5 Accuracy | 81.1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 51.9 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V1 | Top-5 Accuracy | 79.6 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.7 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.3 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 90.8 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 66 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 89.8 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 64.8 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 89.1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Diving-48 | Accuracy | 84.2 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 56.1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Action Recognition | Something-Something V1 | Top-5 Accuracy | 82.8 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 55.5 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-5 Accuracy | 82.6 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 54 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-5 Accuracy | 81.1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 51.9 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V1 | Top-5 Accuracy | 79.6 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 67.7 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 91.1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 67.3 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 90.8 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 66 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 89.8 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 64.8 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 89.1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) |
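The table mixes single-clip and 2-clip evaluation and reports Top-1/Top-5 accuracy. As a reference, here is a minimal sketch of how those metrics are typically computed, with multi-clip evaluation done by averaging per-clip class scores before scoring once per video. The `topk_accuracy` helper and the toy score arrays are illustrative assumptions, not from the paper.

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.
    scores: (N, num_classes), labels: (N,)"""
    topk = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores
    return (topk == labels[:, None]).any(axis=1).mean()

# Two-clip evaluation: average the per-clip class scores, then score once.
clip_a = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3]])
clip_b = np.array([[0.5, 0.4, 0.1],
                   [0.1, 0.3, 0.6]])
labels = np.array([0, 2])

avg = (clip_a + clip_b) / 2
print(topk_accuracy(clip_a, labels, k=1))   # 0.5: sample 2 is wrong from clip A alone
print(topk_accuracy(avg, labels, k=1))      # 1.0: averaging the two clips fixes it
```

This is why the 2-clip rows in the table score slightly above their single-clip counterparts: averaging clip scores smooths out per-clip prediction noise.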

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)