Papers With Code
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification

Youngwan Lee, Hyung-Il Kim, Kimin Yun, Jinyoung Moon

2020-12-01 · 3D Architecture · Video Classification · General Classification · Action Recognition
Paper · PDF · Code (official)

Abstract

Two directions in video classification research have recently attracted attention: temporal modeling and efficient 3D architectures. However, existing temporal modeling methods are not efficient, and efficient 3D architectures pay little attention to temporal modeling. To bridge this gap, we propose an efficient temporal-modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA builds a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network to model both short-range and long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named D(2+1)D, that decomposes a 3D depthwise convolution into a spatial and a temporal depthwise convolution, making the network more lightweight and efficient. Using the proposed temporal modeling method (T-OSA) and the efficient factorized component (D(2+1)D), we construct two VoV3D networks, VoV3D-M and VoV3D-L. Thanks to its efficient and effective temporal modeling, VoV3D-L surpasses a state-of-the-art temporal modeling method on both Something-Something and Kinetics-400 with 6x fewer model parameters and 16x less computation. Furthermore, VoV3D shows better temporal modeling ability than X3D, a state-of-the-art efficient 3D architecture of comparable model capacity. We hope that VoV3D can serve as a baseline for efficient video classification.
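The efficiency gain of the D(2+1)D factorization can be illustrated with a quick per-channel parameter count. The sketch below is pure Python and uses an illustrative 3x3x3 kernel (the actual kernel sizes in VoV3D may differ); it compares a full depthwise 3D convolution against the factorized spatial + temporal depthwise form the abstract describes:

```python
# Parameter counts for a depthwise 3D convolution versus the
# D(2+1)D factorization (spatial depthwise conv + temporal depthwise conv).
# Kernel sizes (t=3, h=3, w=3) and channel count are illustrative assumptions.

def depthwise_3d_params(t, h, w, channels):
    # One t*h*w kernel per channel (depthwise: no cross-channel mixing).
    return t * h * w * channels

def d2plus1d_params(t, h, w, channels):
    # Factorized form: a 1*h*w spatial depthwise conv followed by
    # a t*1*1 temporal depthwise conv.
    spatial = h * w * channels
    temporal = t * channels
    return spatial + temporal

channels = 64
full = depthwise_3d_params(3, 3, 3, channels)    # 27 params per channel -> 1728
factorized = d2plus1d_params(3, 3, 3, channels)  # 9 + 3 params per channel -> 768

print(full, factorized)
print(f"reduction: {full / factorized:.2f}x")    # 2.25x fewer parameters
```

The same covered spatiotemporal extent is kept while the per-channel cost drops from t*h*w to h*w + t, which is where the lightweight design comes from; the paper's full 6x parameter and 16x computation savings also involve the overall network design, not this factorization alone.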

Results

Identical results are listed under both the Activity Recognition and Action Recognition tasks; they are consolidated below. All entries use single-clip evaluation.

| Dataset | Model | Frames | Pretraining | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|---|---|
| Something-Something V1 | VoV3D-L | 32 | Kinetics | 54.59 | 82.3 |
| Something-Something V1 | VoV3D-M | 32 | Kinetics | 52.68 | 80.43 |
| Something-Something V1 | VoV3D-L | 32 | from scratch | 50.6 | 78.7 |
| Something-Something V1 | VoV3D-M | 32 | from scratch | 49.8 | 78.0 |
| Something-Something V1 | VoV3D-L | 16 | from scratch | 49.5 | 78.0 |
| Something-Something V1 | VoV3D-M | 16 | from scratch | 48.1 | 76.9 |
| Something-Something V2 | VoV3D-L | 32 | Kinetics | 67.35 | 90.5 |
| Something-Something V2 | VoV3D-L | 32 | from scratch | 65.8 | 89.5 |
| Something-Something V2 | VoV3D-M | 32 | Kinetics | 65.24 | 89.48 |
| Something-Something V2 | VoV3D-M | 32 | from scratch | 64.2 | 88.8 |
| Something-Something V2 | VoV3D-L | 16 | from scratch | 64.1 | 88.6 |
| Something-Something V2 | VoV3D-M | 16 | from scratch | 63.2 | 88.2 |

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)