Region-based Non-local Operation for Video Classification

Guoxi Huang, Adrian G. Bors

2020-07-17

Action Classification, Video Classification, General Classification, Action Recognition, Classification, Action Recognition In Videos

Abstract

Convolutional Neural Networks (CNNs) model long-range dependencies by deeply stacking convolution operations with small window sizes, which makes optimization difficult. This paper presents region-based non-local (RNL) operations as a family of self-attention mechanisms that directly capture long-range dependencies without a deep stack of local operations. Given an intermediate feature map, our method recalibrates the feature at a position by aggregating information from the neighboring regions of all positions. By combining a channel attention module with the proposed RNL, we design an attention chain that can be integrated into off-the-shelf CNNs for end-to-end training. We evaluate our method on two video classification benchmarks. In our experiments, the method outperforms other attention mechanisms, and we achieve state-of-the-art performance on the Something-Something V1 dataset.
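The recalibration step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it works on a 1D sequence of feature vectors rather than a video feature map, and the mean-pooled region summary, the scaled dot-product affinity, the softmax normalization, and the residual connection are all assumptions chosen for illustration. The names `region_means` and `region_nonlocal` are hypothetical.

```python
import math


def region_means(x, radius):
    """Summarize each position by the mean of its neighborhood (assumed pooling)."""
    n, dim = len(x), len(x[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)  # clip at borders
        window = x[lo:hi]
        out.append([sum(v[c] for v in window) / len(window) for c in range(dim)])
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_nonlocal(x, radius=1):
    """Recalibrate every position using the region summaries of all positions.

    x: list of n feature vectors (lists of floats).
    Returns (y, weights) with y[i] = x[i] + sum_j weights[i][j] * x[j].
    """
    regions = region_means(x, radius)
    n, dim = len(x), len(x[0])
    y, weights = [], []
    for i in range(n):
        # Affinity between region i and every region j (assumed scaled dot product).
        scores = [
            sum(a * b for a, b in zip(regions[i], regions[j])) / math.sqrt(dim)
            for j in range(n)
        ]
        w = softmax(scores)
        agg = [sum(w[j] * x[j][c] for j in range(n)) for c in range(dim)]
        y.append([x[i][c] + agg[c] for c in range(dim)])  # assumed residual add
        weights.append(w)
    return y, weights


# Toy input: four positions with two channels each.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y, weights = region_nonlocal(features, radius=1)
```

Every position attends to every other position in a single step, so distant positions can interact without stacking many local operations. Computing the affinity from region summaries rather than single positions is what makes the sketch region-based.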

Results

Task	Dataset	Metric	Value	Model
Video	Kinetics-400	Acc@1	77.4	RNL+TSM Ensemble(ResNet50, 8 + 16 frames)
Activity Recognition	Something-Something V1	Top 1 Accuracy	54.1	RNL+TSM Ensemble(R50+R101, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 5 Accuracy	82.2	RNL+TSM Ensemble(R50+R101, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 1 Accuracy	52.7	RNL+TSM Ensemble(ResNet50, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 5 Accuracy	81.5	RNL+TSM Ensemble(ResNet50, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 1 Accuracy	54.1	RNL+TSM Ensemble(R50+R101, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 5 Accuracy	82.2	RNL+TSM Ensemble(R50+R101, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 1 Accuracy	52.7	RNL+TSM Ensemble(ResNet50, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 5 Accuracy	81.5	RNL+TSM Ensemble(ResNet50, ImageNet pretrained)
