Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification

Yunlong Bian, Chuang Gan, Xiao Liu, Fu Li, Xiang Long, Yandong Li, Heng Qi, Jie Zhou, Shilei Wen, Yuanqing Lin

2017-08-12 · Action Classification · Video Recognition · Video Classification · General Classification
Paper · PDF

Abstract

This paper describes our solution for the video recognition task of the ActivityNet Kinetics challenge, which won first place. Most existing state-of-the-art video recognition approaches favor an end-to-end pipeline. One exception is the DevNet framework. Its merit is that the video data are first used to learn a network (i.e., by fine-tuning or training from scratch); then, instead of directly using the end-to-end classification scores (e.g., softmax scores), features are extracted from the learned network and fed into off-the-shelf machine learning models to conduct video classification. However, the effectiveness of this line of work has long been ignored and underestimated. In this submission, we use this strategy extensively. In particular, we investigate four temporal modeling approaches built on the learned features: the Multi-group Shifting Attention Network, Temporal Xception Network, Multi-stream Sequence Model, and Fast-Forward Sequence Model. Experimental results on the challenging Kinetics dataset demonstrate that our proposed temporal modeling approaches significantly improve on existing approaches in large-scale video recognition tasks. Most remarkably, our best single model, the Multi-group Shifting Attention Network, achieves 77.7% top-1 accuracy and 93.2% top-5 accuracy on the validation set.
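The two-stage strategy the abstract describes (first extract per-frame features from a learned network, then apply an off-the-shelf temporal model on top of them) can be sketched with a simple learned attention pooling over frame features. This is an illustrative minimal sketch only, not the paper's Multi-group Shifting Attention architecture: the single attention query `w`, the random features standing in for Inception-ResNet outputs, and the linear classifier are all assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frame_feats, w):
    """Pool per-frame features into one video-level descriptor.

    frame_feats: (T, D) features extracted off-line from a trained CNN
                 (here random numbers stand in for real features).
    w:           (D,) a single learned attention query -- a hypothetical
                 stand-in for the paper's multi-group shifting attention.
    """
    scores = frame_feats @ w              # (T,) relevance of each frame
    alpha = softmax(scores)               # temporal attention weights, sum to 1
    return alpha @ frame_feats            # (D,) weighted average over time

rng = np.random.default_rng(0)
T, D, C = 16, 128, 400                    # frames, feature dim, Kinetics-400 classes
feats = rng.standard_normal((T, D))       # stand-in for per-frame CNN features
w = rng.standard_normal(D)                # stand-in for a learned attention query
video_vec = attention_pool(feats, w)      # stage 2: temporal modeling
logits = video_vec @ rng.standard_normal((D, C))
probs = softmax(logits)                   # class distribution over 400 labels
```

The point of the sketch is the decoupling: the CNN that produced `feats` is never touched again, so many temporal models (attention, sequence models, etc.) can be trained cheaply on the same cached features.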

Results

Task  | Dataset      | Metric | Value | Model
Video | Kinetics-400 | Acc@1  | 73    | Inception-ResNet
Video | Kinetics-400 | Acc@5  | 90.9  | Inception-ResNet

Related Papers

- DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
- ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
- Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation (2025-06-14)
- SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis (2025-06-09)
- From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos (2025-06-05)
- Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification (2025-05-30)
- Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition (2025-05-29)
- SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding (2025-05-22)