Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Hopper: Multi-hop Transformer for Spatiotemporal Reasoning

Honglu Zhou, Asim Kadav, Farley Lai, Alexandru Niculescu-Mizil, Martin Renqiang Min, Mubbasir Kapadia, Hans Peter Graf

2021-03-19 · ICLR 2021 · Video Object Tracking
Paper · PDF · Code (official)

Abstract

This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained, or carried by other objects. Existing deep-learning-based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer to reason about object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks, automatically hopping over critical frames in an iterative fashion to predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy at just 1 FPS by hopping through only a few critical frames. We also demonstrate that Hopper can perform long-term reasoning by building CATER-h, a dataset that requires multi-step reasoning to localize objects of interest correctly.
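The "hopping" idea described above — iteratively attending over per-frame features and jumping to the most relevant frame at each step — can be sketched in PyTorch. This is an illustrative toy module, not the paper's implementation: the module name, the GRU-based query update, and the single-head attention are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class MultiHopFrameSelector(nn.Module):
    """Toy sketch of multi-hop reasoning over frames (not the official Hopper code).

    At each hop, a learned query attends over per-frame features, the frame with
    the highest attention weight is recorded as a "critical frame", and the
    attended context refines the query for the next hop.
    """

    def __init__(self, dim: int, num_hops: int = 3):
        super().__init__()
        self.num_hops = num_hops
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.update = nn.GRUCell(dim, dim)  # assumed query-update rule

    def forward(self, query: torch.Tensor, frame_feats: torch.Tensor):
        # query: (B, dim) localization query; frame_feats: (B, T, dim) per-frame features
        hops = []
        q = query
        for _ in range(self.num_hops):
            # Attend the current query over all frames.
            out, weights = self.attn(q.unsqueeze(1), frame_feats, frame_feats)
            # "Hop" to the frame that received the most attention this step.
            idx = weights.squeeze(1).argmax(dim=-1)  # (B,)
            hops.append(idx)
            # Refine the query with the attended context before the next hop.
            q = self.update(out.squeeze(1), q)
        # Final query state and the (B, num_hops) indices of visited frames.
        return q, torch.stack(hops, dim=1)
```

In this sketch the hop sequence is read off via a hard argmax purely for illustration; an actual trainable model would need a differentiable selection mechanism (e.g. soft attention weights supervised end-to-end).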

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Object Tracking | CATER | L1 | 0.85 | Hopper |
| Video Object Tracking | CATER | Top-1 Accuracy (%) | 73.2 | Hopper |
| Video Object Tracking | CATER | Top-5 Accuracy (%) | 93.8 | Hopper |
| Object Tracking | CATER | L1 | 0.85 | Hopper |
| Object Tracking | CATER | Top-1 Accuracy (%) | 73.2 | Hopper |
| Object Tracking | CATER | Top-5 Accuracy (%) | 93.8 | Hopper |

Related Papers

HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
Enhancing Self-Supervised Fine-Grained Video Object Tracking with Dynamic Memory Prediction (2025-04-30)
Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking (2024-12-20)
Exploring Enhanced Contextual Information for Video-Level Object Tracking (2024-12-15)
Referring Video Object Segmentation via Language-aligned Track Selection (2024-12-02)
Teaching VLMs to Localize Specific Objects from In-context Examples (2024-11-20)
NT-VOT211: A Large-Scale Benchmark for Night-time Visual Object Tracking (2024-10-27)
Depth Attention for Robust RGB Tracking (2024-10-27)