Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding

2020-12-13 · Action Classification · Video Recognition · Action Recognition · Temporal Action Localization

Links: Paper · PDF · Code (official)

Abstract

Conventionally, spatiotemporal modeling networks and their complexity are the two most studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First, besides traditionally treating H x W x T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture video dynamics thoroughly. Second, our model is built on 2D CNN backbones, and model complexity is kept firmly in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be viewed as a generalized video modeling framework that specializes to existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) show its superiority. The proposed MVFNet achieves state-of-the-art performance with the complexity of a 2D CNN.
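The core idea of the abstract — modeling a T x H x W clip not only in the Height-Width plane but also along the Height-Time and Width-Time views, using cheap separable 1D convolutions — can be illustrated with a toy, dependency-free sketch. This is not the paper's implementation (the real MVF module operates channel-wise inside a 2D-CNN residual block); the function names and the single-channel flat-tensor layout here are illustrative assumptions.

```python
from itertools import product

def conv1d_along_axis(x, shape, kernel, axis):
    """1D convolution (zero padding, stride 1) along one axis of a flat
    single-channel tensor `x` of shape (T, H, W). Toy stand-in for the
    separable convolutions the MVF module uses for efficiency."""
    T, H, W = shape
    k = len(kernel)
    pad = k // 2

    def idx(t, h, w):
        return (t * H + h) * W + w

    out = [0.0] * len(x)
    for t, h, w in product(range(T), range(H), range(W)):
        acc = 0.0
        for j in range(k):
            pos = [t, h, w]
            pos[axis] += j - pad  # slide the kernel along the chosen axis
            if 0 <= pos[axis] < shape[axis]:
                acc += kernel[j] * x[idx(*pos)]
        out[idx(t, h, w)] = acc
    return out

def multi_view_sketch(x, shape, k_t, k_h, k_w):
    """Fuse three 'views' of the clip by summing 1D convolutions along
    the temporal, height, and width axes (a simplified additive fusion;
    the paper's module is more elaborate)."""
    y_t = conv1d_along_axis(x, shape, k_t, axis=0)  # Height-Width-over-Time view
    y_h = conv1d_along_axis(x, shape, k_h, axis=1)  # Width-Time view
    y_w = conv1d_along_axis(x, shape, k_w, axis=2)  # Height-Time view
    return [a + b + c for a, b, c in zip(y_t, y_h, y_w)]

# Tiny 2x2x2 clip; identity kernels make each view return the input,
# so the fused output is exactly 3x the input.
clip = [float(i) for i in range(8)]
fused = multi_view_sketch(clip, (2, 2, 2), [0, 1, 0], [0, 1, 0], [0, 1, 0])
```

Because each view is a 1D convolution, the cost grows linearly in the kernel size per axis rather than cubically as with a full 3D kernel — the efficiency argument behind using separable convolutions.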

Results

Task | Dataset | Metric | Value | Model
Video | Kinetics-400 | Acc@1 | 79.1 | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only)
Video | Kinetics-400 | Acc@5 | 93.8 | MVFNet-ResNet101 (ensemble, ImageNet pretrained, RGB only)
Activity Recognition | Something-Something V1 | Top-1 Accuracy | 54 | MVFNet-R50EN
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 66.3 | MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Action Recognition | Something-Something V1 | Top-1 Accuracy | 54 | MVFNet-R50EN
Action Recognition | Something-Something V2 | Top-1 Accuracy | 66.3 | MVFNet-ResNet50 (center crop, 8+16 ensemble, ImageNet pretrained, RGB only)

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)