Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and \emph{object regions}. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48, and EPIC-KITCHENS-100. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at \url{https://roeiherz.github.io/ORViT/}
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | Diving-48 | Accuracy | 88 | ORViT TimeSformer |
| Action Recognition | EPIC-KITCHENS-100 | Action@1 | 45.7 | ORViT Mformer-L (ORViT blocks) |
| Action Recognition | EPIC-KITCHENS-100 | Noun@1 | 58.7 | ORViT Mformer-L (ORViT blocks) |
| Action Recognition | EPIC-KITCHENS-100 | Verb@1 | 68.4 | ORViT Mformer-L (ORViT blocks) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 69.5 | ORViT Mformer-L (ORViT blocks) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 91.5 | ORViT Mformer-L (ORViT blocks) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 67.9 | ORViT Mformer (ORViT blocks) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 90.5 | ORViT Mformer (ORViT blocks) |
| Action Detection | AVA v2.2 | mAP | 26.6 | ORViT MViT-B, 16x4 (K400 pretraining) |