Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Is Space-Time Attention All You Need for Video Understanding?

Gedas Bertasius, Heng Wang, Lorenzo Torresani

Published: 2021-02-09
Tasks: Action Classification, Anomaly Detection, Video Question Answering, Video Classification, General Classification, Video Understanding, Action Recognition
Links: Paper · PDF · Code (official) · community implementations

Abstract

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.
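The "divided attention" scheme described above can be sketched in a few lines: within each block, every patch token first attends across frames at the same spatial location (temporal attention), then attends across patches within its own frame (spatial attention). The sketch below is a simplified single-head NumPy illustration under assumed shapes; the actual TimeSformer block also uses layer normalization, multi-head attention, an MLP, and a classification token (see the official repository for the real implementation). The function and parameter names here are hypothetical.

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    # x: (batch, seq, dim) -- single-head scaled dot-product self-attention
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                # softmax over keys
    return w @ v

def divided_space_time_block(x, T, N, params):
    # x: (B, T*N, D) frame-level patch tokens; T frames, N patches per frame.
    # params: hypothetical dict with "time" and "space" (Wq, Wk, Wv) triples.
    B, _, D = x.shape
    # 1) Temporal attention: each spatial location attends across the T frames.
    xt = x.reshape(B, T, N, D).transpose(0, 2, 1, 3).reshape(B * N, T, D)
    xt = xt + attention(xt, *params["time"])          # residual connection
    # 2) Spatial attention: each frame's N patches attend to one another.
    xs = xt.reshape(B, N, T, D).transpose(0, 2, 1, 3).reshape(B * T, N, D)
    xs = xs + attention(xs, *params["space"])         # residual connection
    return xs.reshape(B, T, N, D).reshape(B, T * N, D)
```

Applying the two attentions separately keeps the cost per block at O(T²·N + N²·T) comparisons instead of the O((T·N)²) of joint space-time attention, which is what makes longer clips tractable.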

Results

Task                     | Dataset                | Metric         | Value | Model
-------------------------|------------------------|----------------|-------|---------------
Video                    | Kinetics-400           | Acc@1          | 80.7  | TimeSformer-L
Video                    | Kinetics-400           | Acc@5          | 94.7  | TimeSformer-L
Video                    | Kinetics-400           | Parameters (M) | 121.4 | TimeSformer-L
Video                    | Kinetics-400           | Acc@1          | 79.7  | TimeSformer-HR
Video                    | Kinetics-400           | Acc@5          | 94.4  | TimeSformer-HR
Video                    | Kinetics-400           | Acc@1          | 78    | TimeSformer
Video                    | Kinetics-400           | Acc@5          | 93.7  | TimeSformer
Anomaly Detection        | UBnormal               | RBDC           | 0.04  | TimeSformer
Anomaly Detection        | UBnormal               | TBDC           | 0.05  | TimeSformer
Video Question Answering | Howto100M-QA           | Accuracy       | 62.1  | TimeSformer
Activity Recognition     | Diving-48              | Accuracy       | 81    | TimeSformer-L
Activity Recognition     | Diving-48              | Accuracy       | 78    | TimeSformer-HR
Activity Recognition     | Diving-48              | Accuracy       | 75    | TimeSformer
Activity Recognition     | Something-Something V2 | Top-1 Accuracy | 62.5  | TimeSformer-HR
Activity Recognition     | Something-Something V2 | Top-1 Accuracy | 62.3  | TimeSformer-L
Activity Recognition     | Something-Something V2 | Top-1 Accuracy | 59.5  | TimeSformer
Action Recognition       | Diving-48              | Accuracy       | 81    | TimeSformer-L
Action Recognition       | Diving-48              | Accuracy       | 78    | TimeSformer-HR
Action Recognition       | Diving-48              | Accuracy       | 75    | TimeSformer
Action Recognition       | Something-Something V2 | Top-1 Accuracy | 62.5  | TimeSformer-HR
Action Recognition       | Something-Something V2 | Top-1 Accuracy | 62.3  | TimeSformer-L
Action Recognition       | Something-Something V2 | Top-1 Accuracy | 59.5  | TimeSformer

Related Papers

Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems (2025-07-21)
3DKeyAD: High-Resolution 3D Point Cloud Anomaly Detection via Keypoint-Guided Point Clustering (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
A Privacy-Preserving Framework for Advertising Personalization Incorporating Federated Learning and Differential Privacy (2025-07-16)
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
Modeling Code: Is Text All You Need? (2025-07-15)