
TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification

Huaqi Tao, Bingxi Liu, Calvin Chen, Tingjun Huang, He Li, Jinqiang Cui, Hong Zhang

Published: 2025-03-09
Tasks: Visual Place Recognition, text similarity, Robot Navigation, Text Spotting, STS
Links: Paper · PDF · Code (official)

Abstract

Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene text typically appears in indoor spaces, serving to distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene text. Finally, the discriminative text is filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at https://github.com/HqiTao/TextInPlace.
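To make the two-stage pipeline described in the abstract concrete, below is a minimal sketch of coarse retrieval by global-descriptor similarity followed by text-similarity re-ranking of the top-K candidates. This is an illustrative assumption, not the paper's implementation: descriptor extraction and text spotting are assumed to have already run, and the function names, the `alpha` weighting, and the Jaccard text-similarity measure are all hypothetical stand-ins (the official repo at https://github.com/HqiTao/TextInPlace has the actual VPR branch and bridging text spotter).

```python
import numpy as np

def retrieve_and_rerank(query_desc, query_texts, db_descs, db_texts,
                        top_k=10, alpha=0.5):
    """Two-stage place recognition sketch (illustrative, not the paper's code).

    Stage 1: coarse retrieval by cosine similarity of global descriptors.
    Stage 2: re-rank the top-K candidates using spotted scene text.
    """
    # Stage 1: cosine similarity between the query descriptor and the database.
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    q = query_desc / np.linalg.norm(query_desc)
    visual_sim = db @ q
    candidates = np.argsort(-visual_sim)[:top_k]

    def text_sim(a, b):
        # Jaccard overlap of the filtered (discriminative) text sets;
        # a placeholder for whatever text-similarity measure the paper uses.
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    # Stage 2: blend visual and text similarity; alpha is a made-up knob.
    scores = np.asarray([
        alpha * visual_sim[i] + (1 - alpha) * text_sim(query_texts, db_texts[i])
        for i in candidates
    ])
    return candidates[np.argsort(-scores)]
```

The point of the split is that global descriptors are cheap to compare across the whole database, while text spotting and matching only need to run on the short candidate list, which is where repetitive indoor structures cause the visual-only ranking to fail.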

Related Papers

Visual Place Recognition for Large-Scale UAV Applications (2025-07-20)
Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition (2025-07-04)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning (2025-06-26)
Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation (2025-06-25)
ADA-DPM: A Neural Descriptors-based Adaptive Noise Point Filtering Strategy for SLAM (2025-06-22)
GeNIE: A Generalizable Navigation System for In-the-Wild Environments (2025-06-22)
Adversarial Attacks and Detection in Visual Place Recognition for Safer Robot Navigation (2025-06-19)