Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TDN: Temporal Difference Networks for Efficient Action Recognition

Limin Wang, Zhan Tong, Bin Ji, Gangshan Wu

2020-12-18 · CVPR 2021
Tasks: Action Classification · Action Recognition · Action Recognition in Videos
Paper · PDF · Code (official)

Abstract

Temporal modeling still remains challenging for action recognition in videos. To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition. The core of our TDN is to devise an efficient temporal module (TDM) by explicitly leveraging a temporal difference operator, and systematically assess its effect on short-term and long-term motion modeling. To fully capture temporal information over the entire video, our TDN is established with a two-level difference modeling paradigm. Specifically, for local motion modeling, temporal difference over consecutive frames is used to supply 2D CNNs with finer motion pattern, while for global motion modeling, temporal difference across segments is incorporated to capture long-range structure for motion feature excitation. TDN provides a simple and principled temporal modeling framework and could be instantiated with the existing CNNs at a small extra computational cost. Our TDN presents a new state of the art on the Something-Something V1 & V2 datasets and is on par with the best performance on the Kinetics-400 dataset. In addition, we conduct in-depth ablation studies and plot the visualization results of our TDN, hopefully providing insightful analysis on temporal difference modeling. We release the code at https://github.com/MCG-NJU/TDN.
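The two-level paradigm described in the abstract boils down to two difference operators: a short-term difference over consecutive frames (local motion) and a cross-segment difference over features (long-range motion excitation). Below is a minimal NumPy sketch of these two operators; the function names and array shapes are illustrative assumptions, and the actual TDM modules in the official repository wrap these differences in convolutional branches and channel attention rather than using them raw.

```python
import numpy as np

def short_term_difference(frames):
    """Stacked RGB differences between consecutive frames.

    frames: (T, H, W, C) array of frames sampled around a center frame.
    Returns (T-1, H, W, C) difference maps that highlight local motion;
    in TDN these supplement the center frame's 2D CNN features with
    finer motion patterns.
    """
    return frames[1:] - frames[:-1]

def long_term_difference(seg_a, seg_b):
    """Cross-segment feature difference for motion feature excitation.

    seg_a, seg_b: (C, H, W) feature maps from adjacent video segments.
    In the paper this difference modulates the segment features via
    attention; here we simply return the raw difference to illustrate
    the operator itself.
    """
    return seg_b - seg_a

# Toy example: three 1x1 single-channel "frames" with values 0, 1, 3.
frames = np.array([0.0, 1.0, 3.0]).reshape(3, 1, 1, 1)
local_motion = short_term_difference(frames)  # differences 1.0 and 2.0
```

Because the differences are plain subtractions, both operators add negligible compute on top of the backbone, which is how TDN keeps the extra cost small.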

Results

Task                 | Dataset               | Metric         | Value | Model
---------------------|-----------------------|----------------|-------|------------------------------------------------------------------------
Video                | Kinetics-400          | Acc@1          | 79.4  | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only)
Video                | Kinetics-400          | Acc@5          | 94.4  | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only)
Activity Recognition | Something-Something V1 | Top-1 Accuracy | 56.8  | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Activity Recognition | Something-Something V1 | Top-5 Accuracy | 84.1  | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 69.6  | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 92.2  | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 68.2  | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.6  | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Action Recognition   | Something-Something V1 | Top-1 Accuracy | 56.8  | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Action Recognition   | Something-Something V1 | Top-5 Accuracy | 84.1  | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 69.6  | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 92.2  | TDN ResNet101 (one clip, three crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 68.2  | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 91.6  | TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)