
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding

Xiaokai Chen, Ke Gao

2018-05-19
Tasks: Gesture Recognition, Video Understanding, Action Recognition In Videos

Abstract

Many of the leading approaches to video understanding are data-hungry and time-consuming, and fail to capture the gist of spatial-temporal evolution efficiently. Recent research shows that CNNs can reason about static relations between entities in images. To further exploit this capacity for reasoning about dynamic evolution, we introduce a novel network module called DenseImage Network (DIN), with two main contributions. 1) A novel compact representation of video that distills its significant spatial-temporal evolution into a matrix called DenseImage, primed for efficient video encoding. 2) A simple yet powerful learning strategy for video understanding, based on DenseImage and a temporal-order-preserving CNN, which contains a local temporal correlation constraint that captures temporal evolution at multiple time scales using different filter widths. Extensive experiments on two recent challenging benchmarks demonstrate that DenseImage Network can accurately capture the common spatial-temporal evolution between similar actions, even under enormous visual variation or different time scales. Moreover, we obtain state-of-the-art results in action and gesture recognition at much lower time and memory cost, indicating its immense potential for video representation and understanding.
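The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that a DenseImage is built by stacking one feature vector per frame, in temporal order, into a T x D matrix, and that the temporal-order-preserving CNN can be approximated by kernels that slide along the time axis only, with several filter widths. The random frame features, the function names (`build_dense_image`, `temporal_conv`, `multi_scale_encode`), and the max-pooling over time are all hypothetical choices made for illustration.

```python
import numpy as np

def build_dense_image(frame_features):
    """Stack per-frame feature vectors, in temporal order, into a T x D matrix."""
    return np.stack(frame_features, axis=0)

def temporal_conv(dense_image, kernel):
    """Slide a (width x D) kernel along the time axis only.

    Each kernel spans the full feature dimension, so rows (time steps) are
    never reordered or mixed across distant positions. Output length is
    T - width + 1.
    """
    width = kernel.shape[0]
    steps = dense_image.shape[0] - width + 1
    return np.array(
        [np.sum(dense_image[t:t + width] * kernel) for t in range(steps)]
    )

def multi_scale_encode(dense_image, kernels):
    """Apply kernels of different widths and max-pool each response over time."""
    return np.array([temporal_conv(dense_image, k).max() for k in kernels])

# Assumption: random vectors stand in for per-frame CNN features.
rng = np.random.default_rng(0)
T, D = 8, 16
frame_features = [rng.standard_normal(D) for _ in range(T)]

dense_image = build_dense_image(frame_features)

# Assumption: filter widths 2, 3 and 5 represent short to long time scales.
kernels = [rng.standard_normal((w, D)) for w in (2, 3, 5)]
encoding = multi_scale_encode(dense_image, kernels)

print(dense_image.shape, encoding.shape)  # (8, 16) (3,)
```

In this sketch a wider kernel sees more consecutive frames at once, which is one plausible reading of "multiple time scales with different filter widths"; a trained model would learn the kernels rather than sample them.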

Results

Task | Dataset | Metric | Value | Model
Activity Recognition | Jester (Gesture Recognition) | Val | 95.31 | DIN
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 34.11 | DIN
Action Recognition | Jester (Gesture Recognition) | Val | 95.31 | DIN
Action Recognition | Something-Something V2 | Top-1 Accuracy | 34.11 | DIN
Action Recognition In Videos | Jester (Gesture Recognition) | Val | 95.31 | DIN
Action Recognition In Videos | Something-Something V2 | Top-1 Accuracy | 34.11 | DIN

Related Papers

Efficient Deployment of Spiking Neural Networks on SpiNNaker2 for DVS Gesture Recognition Using Neuromorphic Intermediate Representation (2025-09-04)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)