Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Object-Region Video Transformers

Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

Published: 2021-10-13 · CVPR 2022
Tasks: Action Detection · Few-Shot Action Recognition · Video Understanding · Action Recognition
Links: Paper · PDF · Code

Abstract

Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer-layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and \emph{object regions}. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48 and Epic-Kitchen100. We show strong performance improvement across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at \url{https://roeiherz.github.io/ORViT/}
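The abstract's "Object-Region Attention" fuses pooled object-region descriptors with uniform patch tokens through joint self-attention, so that patch tokens absorb object context. Below is a minimal NumPy sketch of that idea under simplifying assumptions: single-head attention with identity query/key/value projections, random inputs, and the illustrative name `object_region_attention`. It is not the paper's implementation (see the project page for the released code).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_region_attention(patch_tokens, object_tokens):
    """Joint self-attention over patch and object-region tokens.

    patch_tokens:  (P, D) uniform patch tokens
    object_tokens: (O, D) pooled object-region descriptors
    Returns the P patch tokens, now enriched with object context.
    """
    # attend over the union of patch and object tokens
    tokens = np.concatenate([patch_tokens, object_tokens], axis=0)  # (P+O, D)
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)         # (P+O, P+O)
    out = attn @ tokens                                             # (P+O, D)
    # keep only the patch-token outputs for the rest of the network
    return out[: patch_tokens.shape[0]]

P, O, D = 8, 2, 16
rng = np.random.default_rng(0)
patches = rng.normal(size=(P, D))
objects = rng.normal(size=(O, D))
enriched = object_region_attention(patches, objects)
print(enriched.shape)  # (8, 16)
```

The key design point the sketch illustrates: object tokens join the attention pool early, so every patch token's output is a mixture that can draw on object information, while the sequence length handed to subsequent layers stays at P.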

Results

Task                 | Dataset                | Metric         | Value | Model
---------------------|------------------------|----------------|-------|-------------------------------------
Activity Recognition | Diving-48              | Accuracy       | 88    | ORViT TimeSformer
Activity Recognition | EPIC-KITCHENS-100      | Action@1       | 45.7  | ORViT Mformer-L (ORViT blocks)
Activity Recognition | EPIC-KITCHENS-100      | Noun@1         | 58.7  | ORViT Mformer-L (ORViT blocks)
Activity Recognition | EPIC-KITCHENS-100      | Verb@1         | 68.4  | ORViT Mformer-L (ORViT blocks)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 69.5  | ORViT Mformer-L (ORViT blocks)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.5  | ORViT Mformer-L (ORViT blocks)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.9  | ORViT Mformer (ORViT blocks)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 90.5  | ORViT Mformer (ORViT blocks)
Activity Recognition | AVA v2.2               | mAP            | 26.6  | ORViT MViT-B, 16x4 (K400 pretraining)
Action Recognition   | Diving-48              | Accuracy       | 88    | ORViT TimeSformer
Action Recognition   | EPIC-KITCHENS-100      | Action@1       | 45.7  | ORViT Mformer-L (ORViT blocks)
Action Recognition   | EPIC-KITCHENS-100      | Noun@1         | 58.7  | ORViT Mformer-L (ORViT blocks)
Action Recognition   | EPIC-KITCHENS-100      | Verb@1         | 68.4  | ORViT Mformer-L (ORViT blocks)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 69.5  | ORViT Mformer-L (ORViT blocks)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 91.5  | ORViT Mformer-L (ORViT blocks)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 67.9  | ORViT Mformer (ORViT blocks)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 90.5  | ORViT Mformer (ORViT blocks)
Action Recognition   | AVA v2.2               | mAP            | 26.6  | ORViT MViT-B, 16x4 (K400 pretraining)

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)