MotionSqueeze: Neural Motion Feature Learning for Video Understanding

Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho

Date: 2020-07-20
Venue: ECCV 2020
Tasks: Action Classification, Video Classification, Video Understanding, Action Recognition

Abstract

Motion plays a crucial role in understanding videos and most state-of-the-art neural models for video classification incorporate motion information typically using optical flows extracted by a separate off-the-shelf method. As the frame-by-frame optical flows require heavy computation, incorporating motion information has remained a major computational bottleneck for video understanding. In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features. We propose a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction. Inserted in the middle of any neural network, it learns to establish correspondences across frames and convert them into motion features, which are readily fed to the next downstream layer for better prediction. We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on Something-Something-V1&V2 datasets.
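The abstract describes the module as establishing correspondences across frames and converting them into motion features, without giving details. As a rough illustration only, the NumPy sketch below shows one generic way this idea can be realized: a local correlation between the feature maps of two adjacent frames, followed by a soft-argmax that turns the correlation scores into a per-position displacement estimate. The function names, the `max_disp` window size, and the `temperature` value are all hypothetical choices made for this example; they are not taken from the paper or its released code.

```python
import numpy as np


def local_correlation(feat_a, feat_b, max_disp=1):
    """Cosine similarity between each position of feat_a and nearby positions of feat_b.

    feat_a, feat_b: arrays of shape (C, H, W), features of two adjacent frames.
    Returns an array of shape (P, P, H, W), P = 2 * max_disp + 1, where entry
    [i, j, y, x] compares feat_a[:, y, x] with feat_b at offset
    (dy, dx) = (i - max_disp, j - max_disp). Out-of-bounds positions score 0.
    """
    _, h, w = feat_a.shape
    eps = 1e-8
    # L2-normalise along channels so dot products become cosine similarities.
    a = feat_a / (np.linalg.norm(feat_a, axis=0, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=0, keepdims=True) + eps)
    p = 2 * max_disp + 1
    b_pad = np.pad(b, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    corr = np.zeros((p, p, h, w))
    for i in range(p):
        for j in range(p):
            shifted = b_pad[:, i:i + h, j:j + w]
            corr[i, j] = (a * shifted).sum(axis=0)
    return corr


def soft_argmax_displacement(corr, temperature=0.01):
    """Turn correlation scores into an expected displacement per position.

    Returns an array of shape (2, H, W) holding (dx, dy). A softmax over the
    candidate offsets keeps the operation differentiable, which is what lets
    such a step be trained end to end inside a network (assumption of this sketch).
    """
    p, _, h, w = corr.shape
    max_disp = (p - 1) // 2
    logits = corr.reshape(p * p, h, w) / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    offsets = np.arange(p) - max_disp
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
    flow_y = (weights * dy.reshape(-1, 1, 1)).sum(axis=0)
    flow_x = (weights * dx.reshape(-1, 1, 1)).sum(axis=0)
    return np.stack([flow_x, flow_y])


# Toy usage: the second "frame" is the first one shifted one pixel to the right.
rng = np.random.default_rng(0)
frame_a = rng.standard_normal((32, 6, 6))
frame_b = np.roll(frame_a, shift=1, axis=2)
flow = soft_argmax_displacement(local_correlation(frame_a, frame_b))
```

In this toy setup the estimated horizontal displacement should be close to one pixel wherever the shifted content stays inside the search window. In a real network the resulting displacement map would be passed through further learned layers before being combined with the appearance features; that part is omitted here.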

Results

Task	Dataset	Metric	Value	Model
Video	Kinetics-400	Acc@1	76.4	MSNet-R50 (16 frames, ImageNet pretrained)
Video	Something-Something V1	Top-5 Accuracy	84	MSNet-R50En (ours)
Video	Something-Something V2	Top-5 Accuracy	91	MSNet-R50En (ours)
Activity Recognition	HMDB-51	Average accuracy of 3 splits	77.4	MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 1 Accuracy	55.1	MSNet-R50En (ensemble)
Activity Recognition	Something-Something V1	Top 1 Accuracy	54.4	MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 5 Accuracy	83.8	MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 1 Accuracy	52.1	MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 5 Accuracy	82.3	MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 1 Accuracy	50.9	MSNet-R50 (8 frames, ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 5 Accuracy	80.3	MSNet-R50 (8 frames, ImageNet pretrained)
Activity Recognition	Something-Something V2	Top-1 Accuracy	66.6	MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Activity Recognition	Something-Something V2	Top-5 Accuracy	90.6	MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Activity Recognition	Something-Something V2	Top-1 Accuracy	64.7	MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition	Something-Something V2	Top-5 Accuracy	89.4	MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition	Something-Something V2	Top-1 Accuracy	63	MSNet-R50 (8 frames, ImageNet pretrained)
Activity Recognition	Something-Something V2	Top-5 Accuracy	88.4	MSNet-R50 (8 frames, ImageNet pretrained)
Action Recognition	HMDB-51	Average accuracy of 3 splits	77.4	MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 1 Accuracy	55.1	MSNet-R50En (ensemble)
Action Recognition	Something-Something V1	Top 1 Accuracy	54.4	MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 5 Accuracy	83.8	MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 1 Accuracy	52.1	MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 5 Accuracy	82.3	MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 1 Accuracy	50.9	MSNet-R50 (8 frames, ImageNet pretrained)
Action Recognition	Something-Something V1	Top 5 Accuracy	80.3	MSNet-R50 (8 frames, ImageNet pretrained)
Action Recognition	Something-Something V2	Top-1 Accuracy	66.6	MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Action Recognition	Something-Something V2	Top-5 Accuracy	90.6	MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Action Recognition	Something-Something V2	Top-1 Accuracy	64.7	MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition	Something-Something V2	Top-5 Accuracy	89.4	MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition	Something-Something V2	Top-1 Accuracy	63	MSNet-R50 (8 frames, ImageNet pretrained)
Action Recognition	Something-Something V2	Top-5 Accuracy	88.4	MSNet-R50 (8 frames, ImageNet pretrained)
Video Classification	Something-Something V1	Top-5 Accuracy	84	MSNet-R50En (ours)
Video Classification	Something-Something V2	Top-5 Accuracy	91	MSNet-R50En (ours)
