VidTr: Video Transformer Without Convolutions

Yanyi Zhang, Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, Joseph Tighe

2021-04-23 · ICCV 2021
Tasks: Action Classification · Video Classification · Action Recognition

Abstract

We introduce the Video Transformer (VidTr) with separable attention for video classification. Compared with commonly used 3D networks, VidTr aggregates spatio-temporal information via stacked attentions and provides better performance with higher efficiency. We first introduce the vanilla video transformer and show that a transformer module can perform spatio-temporal modeling from raw pixels, but with heavy memory usage. We then present VidTr, which reduces the memory cost by 3.3$\times$ while keeping the same performance. To further optimize the model, we propose standard-deviation-based topK pooling for attention ($pool_{topK\_std}$), which reduces computation by dropping non-informative features along the temporal dimension. VidTr achieves state-of-the-art performance on five commonly used datasets with lower computational requirements, showing both the efficiency and effectiveness of our design. Finally, error analysis and visualization show that VidTr is especially good at predicting actions that require long-term temporal reasoning.
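
For illustration, below is a minimal PyTorch sketch of the two ideas the abstract describes: attending over time and space in separate passes instead of jointly over all spatio-temporal tokens, and pooling away low-information temporal positions by keeping the K time steps whose features have the highest standard deviation. Class names, tensor shapes, and the residual structure are assumptions for readability, not the authors' reference implementation.

```python
# A minimal sketch, assuming PyTorch; illustrative only.
import torch
import torch.nn as nn


class SeparableAttentionBlock(nn.Module):
    """Attend over time and space in two separate passes instead of
    jointly over all T*H*W tokens (the idea behind VidTr's
    separable attention)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, S, C) -- batch, frames, spatial tokens, channels.
        B, T, S, C = x.shape

        # Temporal pass: each spatial location attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, C)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]
        x = xt.reshape(B, S, T, C).permute(0, 2, 1, 3)

        # Spatial pass: each frame attends across its spatial tokens.
        xs = x.reshape(B * T, S, C)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        return xs.reshape(B, T, S, C)


def topk_std_pool(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k temporal positions whose features vary most.

    Scores each frame by the standard deviation of its features and
    drops low-variance (uninformative) time steps, loosely following
    the pool_{topK_std} idea from the abstract.
    """
    scores = x.flatten(2).std(dim=2)                         # (B, T)
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep temporal order
    idx = idx[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
    return x.gather(1, idx)                                  # (B, k, S, C)
```

As a usage example, `topk_std_pool(SeparableAttentionBlock(768)(torch.randn(2, 16, 196, 768)), 8)` halves the temporal length from 16 to 8 frames while keeping the most variable ones, which is how the pooling trims computation in later layers.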

Results

Task | Dataset | Metric | Value | Model
Video | Kinetics-700 | Top-1 Accuracy | 70.8 | En-VidTr-L
Video | Kinetics-700 | Top-5 Accuracy | 89.4 | En-VidTr-L
Video | Kinetics-700 | Top-1 Accuracy | 70.2 | VidTr-L
Video | Kinetics-700 | Top-5 Accuracy | 89.0 | VidTr-L
Video | Kinetics-700 | Top-1 Accuracy | 69.5 | VidTr-M
Video | Kinetics-700 | Top-5 Accuracy | 88.3 | VidTr-M
Video | Kinetics-700 | Top-1 Accuracy | 67.3 | VidTr-S
Video | Kinetics-700 | Top-5 Accuracy | 87.7 | VidTr-S
Video | Charades | mAP | 47.3 | En-VidTr-L
Video | Charades | mAP | 43.5 | VidTr-L
Video | Kinetics-400 | Acc@1 | 80.5 | En-VidTr-L
Video | Kinetics-400 | Acc@5 | 94.6 | En-VidTr-L
Video | Kinetics-400 | Acc@1 | 79.7 | En-VidTr-M
Video | Kinetics-400 | Acc@5 | 94.2 | En-VidTr-M
Video | Kinetics-400 | Acc@1 | 79.4 | En-VidTr-S
Video | Kinetics-400 | Acc@5 | 94.0 | En-VidTr-S
Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 74.4 | VidTr-L
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 60.2 | VidTr-L
Activity Recognition | UCF101 | 3-fold Accuracy | 96.7 | VidTr-L
Action Recognition | HMDB-51 | Average accuracy of 3 splits | 74.4 | VidTr-L
Action Recognition | Something-Something V2 | Top-1 Accuracy | 60.2 | VidTr-L
Action Recognition | UCF101 | 3-fold Accuracy | 96.7 | VidTr-L

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)