Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MotionSqueeze: Neural Motion Feature Learning for Video Understanding

Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho

2020-07-20 | ECCV 2020 8 | Action Classification | Video Classification | Video Understanding | Action Recognition

Abstract

Motion plays a crucial role in understanding videos and most state-of-the-art neural models for video classification incorporate motion information typically using optical flows extracted by a separate off-the-shelf method. As the frame-by-frame optical flows require heavy computation, incorporating motion information has remained a major computational bottleneck for video understanding. In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features. We propose a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction. Inserted in the middle of any neural network, it learns to establish correspondences across frames and convert them into motion features, which are readily fed to the next downstream layer for better prediction. We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on Something-Something-V1&V2 datasets.
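The abstract describes the module only at a high level: it establishes correspondences across frames and converts them into motion features. The NumPy sketch below is one plausible illustration of that idea, not the authors' implementation. It correlates each position of one frame's feature map with a local window in the next frame, then reduces the correlation scores to a per-position displacement with a soft-argmax. The function names, the window size `max_disp`, the softmax `temperature`, and the choice of soft-argmax are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(feat, eps=1e-8):
    """Scale each channel vector of a (C, H, W) feature map to unit length."""
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True))
    return feat / (norm + eps)


def local_correlation(feat_a, feat_b, max_disp):
    """Correlate every position of feat_a with a local window of feat_b.

    feat_a, feat_b: (C, H, W) feature maps of two adjacent frames.
    Returns (D, D, H, W) with D = 2 * max_disp + 1. Entry
    [dy + max_disp, dx + max_disp, y, x] is the dot product of
    feat_a[:, y, x] and feat_b[:, y + dy, x + dx]; positions outside
    the frame count as zero vectors.
    """
    _, h, w = feat_a.shape
    d = 2 * max_disp + 1
    padded = np.pad(feat_b, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    corr = np.zeros((d, d, h, w), dtype=np.float64)
    for i in range(d):
        for j in range(d):
            shifted = padded[:, i:i + h, j:j + w]
            corr[i, j] = (feat_a * shifted).sum(axis=0)
    return corr


def soft_argmax_displacement(corr, temperature=0.01):
    """Turn correlation scores into a (3, H, W) map: dx, dy, confidence.

    A softmax over the D * D candidate offsets gives matching weights;
    the displacement is the weighted mean offset (a differentiable
    stand-in for argmax), and confidence is the largest weight.
    """
    d = corr.shape[0]
    max_disp = (d - 1) // 2
    h, w = corr.shape[2:]
    scores = corr.reshape(d * d, h, w) / temperature
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)
    offsets = np.arange(-max_disp, max_disp + 1)
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
    disp_y = (weights * dy.reshape(-1, 1, 1)).sum(axis=0)
    disp_x = (weights * dx.reshape(-1, 1, 1)).sum(axis=0)
    confidence = weights.max(axis=0)
    return np.stack([disp_x, disp_y, confidence])


# Usage: the second frame is the first one shifted by 1 row and 2 columns.
rng = np.random.default_rng(0)
frame_a = l2_normalize(rng.standard_normal((64, 8, 8)))
frame_b = np.roll(frame_a, shift=(1, 2), axis=(1, 2))
motion = soft_argmax_displacement(local_correlation(frame_a, frame_b, max_disp=3))
```

In this toy setup `motion[0]` and `motion[1]` hold the estimated horizontal and vertical displacements. Away from the wrapped border they recover the synthetic shift, and a downstream layer could consume the three-channel map like any other feature. A trainable module would additionally learn a transformation of this map, which the sketch leaves out.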

Results

Task | Dataset | Metric | Value | Model
Video | Kinetics-400 | Acc@1 | 76.4 | MSNet-R50 (16 frames, ImageNet pretrained)
Video | Something-Something V1 | Top-5 Accuracy | 84 | MSNet-R50En (ours)
Video | Something-Something V2 | Top-5 Accuracy | 91 | MSNet-R50En (ours)
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 77.4 | MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition | Something-Something V1 | Top 1 Accuracy | 55.1 | MSNet-R50En (ensemble)
Activity Recognition | Something-Something V1 | Top 1 Accuracy | 54.4 | MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Activity Recognition | Something-Something V1 | Top 5 Accuracy | 83.8 | MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Activity Recognition | Something-Something V1 | Top 1 Accuracy | 52.1 | MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition | Something-Something V1 | Top 5 Accuracy | 82.3 | MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition | Something-Something V1 | Top 1 Accuracy | 50.9 | MSNet-R50 (8 frames, ImageNet pretrained)
Activity Recognition | Something-Something V1 | Top 5 Accuracy | 80.3 | MSNet-R50 (8 frames, ImageNet pretrained)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 66.6 | MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 90.6 | MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 64.7 | MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 89.4 | MSNet-R50 (16 frames, ImageNet pretrained)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 63 | MSNet-R50 (8 frames, ImageNet pretrained)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 88.4 | MSNet-R50 (8 frames, ImageNet pretrained)
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 77.4 | MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition | Something-Something V1 | Top 1 Accuracy | 55.1 | MSNet-R50En (ensemble)
Action Recognition | Something-Something V1 | Top 1 Accuracy | 54.4 | MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Action Recognition | Something-Something V1 | Top 5 Accuracy | 83.8 | MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Action Recognition | Something-Something V1 | Top 1 Accuracy | 52.1 | MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition | Something-Something V1 | Top 5 Accuracy | 82.3 | MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition | Something-Something V1 | Top 1 Accuracy | 50.9 | MSNet-R50 (8 frames, ImageNet pretrained)
Action Recognition | Something-Something V1 | Top 5 Accuracy | 80.3 | MSNet-R50 (8 frames, ImageNet pretrained)
Action Recognition | Something-Something V2 | Top-1 Accuracy | 66.6 | MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Action Recognition | Something-Something V2 | Top-5 Accuracy | 90.6 | MSNet-R50En (8+16 ensemble, ImageNet pretrained)
Action Recognition | Something-Something V2 | Top-1 Accuracy | 64.7 | MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition | Something-Something V2 | Top-5 Accuracy | 89.4 | MSNet-R50 (16 frames, ImageNet pretrained)
Action Recognition | Something-Something V2 | Top-1 Accuracy | 63 | MSNet-R50 (8 frames, ImageNet pretrained)
Action Recognition | Something-Something V2 | Top-5 Accuracy | 88.4 | MSNet-R50 (8 frames, ImageNet pretrained)
Video Classification | Something-Something V1 | Top-5 Accuracy | 84 | MSNet-R50En (ours)
Video Classification | Something-Something V2 | Top-5 Accuracy | 91 | MSNet-R50En (ours)

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)