TimeSformer

Computer Vision · Introduced 2021 · 18 papers

Description

TimeSformer is a convolution-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [Vision Transformer](https://www.paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vectors as token embeddings that can be fed to a Transformer encoder, analogously to token features computed from words in NLP.
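The tokenization step above can be sketched as follows. This is a minimal NumPy illustration, not the reference implementation: the patch size, embedding dimension, and the random projection and positional matrices (which would be learned parameters in the real model) are illustrative assumptions.

```python
import numpy as np

def video_to_tokens(video, patch=16, dim=8, seed=0):
    """Split each frame into non-overlapping patches, linearly embed each
    patch, and add positional information (random stand-ins for learned
    parameters, for illustration only)."""
    rng = np.random.default_rng(seed)
    T, H, W, C = video.shape
    ph, pw = H // patch, W // patch
    # (T, H, W, C) -> (T * ph * pw, patch * patch * C): one row per patch
    patches = (video.reshape(T, ph, patch, pw, patch, C)
                    .transpose(0, 1, 3, 2, 4, 5)
                    .reshape(T * ph * pw, patch * patch * C))
    E = rng.standard_normal((patch * patch * C, dim))  # linear projection (stand-in)
    pos = rng.standard_normal((T * ph * pw, dim))      # positional embeddings (stand-in)
    return patches @ E + pos

# 2 frames of 32x32 RGB -> 2 * (32/16)^2 = 8 tokens of dimension 8
video = np.zeros((2, 32, 32, 3))
tokens = video_to_tokens(video)
print(tokens.shape)  # (8, 8)
```

The key difference from ViT is only that patches are gathered from all frames of the clip, so the token sequence spans the space-time volume and self-attention can relate patches across both dimensions.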

Papers Using This Method

- DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation (2025-06-05)
- Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks (2025-06-04)
- SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation (2025-05-13)
- FedDA-TSformer: Federated Domain Adaptation with Vision TimeSformer for Left Ventricle Segmentation on Gated Myocardial Perfusion SPECT Image (2025-02-23)
- EITNet: An IoT-Enhanced Framework for Real-Time Basketball Action Recognition (2024-10-13)
- 3D-LSPTM: An Automatic Framework with 3D-Large-Scale Pretrained Model for Laryngeal Cancer Detection Using Laryngoscopic Videos (2024-09-02)
- Motion meets Attention: Video Motion Prompts (2024-07-03)
- Pig aggression classification using CNN, Transformers and Recurrent Networks (2024-03-13)
- P-Age: Pexels Dataset for Robust Spatio-Temporal Apparent Age Classification (2023-11-04)
- Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data (2023-04-04)
- Video Question Answering Using CLIP-Guided Visual-Text Attention (2023-03-06)
- CholecTriplet2022: Show me a tool and tell me the triplet -- an endoscopic vision challenge for surgical action triplet detection (2023-02-13)
- MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection (2022-11-20)
- One Model is Not Enough: Ensembles for Isolated Sign Language Recognition (2022-07-04)
- Context-aware Proposal Network for Temporal Action Detection (2022-06-18)
- VIDI: A Video Dataset of Incidents (2022-05-26)
- IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers (2021-06-23)
- Is Space-Time Attention All You Need for Video Understanding? (2021-02-09)