Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning What and Where: Disentangling Location and Identity Tracking Without Supervision

Manuel Traub, Sebastian Otte, Tobias Menge, Matthias Karlbauer, Jannik Thümmel, Martin V. Butz

2022-05-26 · Video Object Tracking
Paper · PDF · Code (official)

Abstract

Our brain can almost effortlessly decompose visual data streams into background and salient objects. Moreover, it can anticipate object motion and interactions, which are crucial abilities for conceptual planning and reasoning. Recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object representations, object permanence, and object reasoning. Here we introduce a self-supervised LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal and ventral pathways in the brain, Loci tackles the binding problem by processing separate, slot-wise encodings of 'what' and 'where'. Loci's predictive coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Besides exhibiting superior performance in current benchmarks, Loci effectively extracts objects from video streams and separates them into location and Gestalt components. We believe that this separation offers a representation that will facilitate effective planning and reasoning on conceptual levels.
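The core idea of the slot-wise disentanglement can be illustrated with a minimal sketch: each slot's latent code is split into an identity ("what"/Gestalt) part and a location ("where") part, which downstream dynamics modules can then process separately. The function name, slot count, and code dimensions below are illustrative assumptions, not the paper's actual API or hyperparameters.

```python
import numpy as np

def split_slots(latent, num_slots, gestalt_dim, position_dim):
    """Split a flat latent vector into per-slot 'what' (Gestalt) and
    'where' (position) codes, in the spirit of Loci's disentangled slots.
    All names and shapes here are assumptions for illustration."""
    slot_dim = gestalt_dim + position_dim
    assert latent.size == num_slots * slot_dim
    slots = latent.reshape(num_slots, slot_dim)
    gestalt = slots[:, :gestalt_dim]    # identity code: 'what'
    position = slots[:, gestalt_dim:]   # location code: 'where'
    return gestalt, position

# Example: 4 slots, 8-dim Gestalt code, 3-dim position (e.g. x, y, scale)
latent = np.arange(4 * (8 + 3), dtype=float)
what, where = split_slots(latent, num_slots=4, gestalt_dim=8, position_dim=3)
print(what.shape, where.shape)  # → (4, 8) (4, 3)
```

Keeping the two codes in separate tensors is what lets object interactions be modeled in a disentangled latent space, as the abstract describes.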

Results

Task              Dataset  Metric          Value  Model
Video             CATER    L1              0.14   Loci
Video             CATER    Top 1 Accuracy  90.7   Loci
Video             CATER    Top 5 Accuracy  98.5   Loci
Object Tracking   CATER    L1              0.14   Loci
Object Tracking   CATER    Top 1 Accuracy  90.7   Loci
Object Tracking   CATER    Top 5 Accuracy  98.5   Loci

Related Papers

HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
Enhancing Self-Supervised Fine-Grained Video Object Tracking with Dynamic Memory Prediction (2025-04-30)
Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking (2024-12-20)
Exploring Enhanced Contextual Information for Video-Level Object Tracking (2024-12-15)
Referring Video Object Segmentation via Language-aligned Track Selection (2024-12-02)
Teaching VLMs to Localize Specific Objects from In-context Examples (2024-11-20)
NT-VOT211: A Large-Scale Benchmark for Night-time Visual Object Tracking (2024-10-27)
Depth Attention for Robust RGB Tracking (2024-10-27)