See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, Xiao Wang

2022-08-18
Tasks: Text-based Person Retrieval with Noisy Correspondence, Text-based Person Retrieval, Person Retrieval, Retrieval, Text based Person Retrieval

Abstract

Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between the visual and textual modalities. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Unlike previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at the sentence, phrase, and word levels, while the BMM module aims to mine more semantic alignments between the visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReID, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
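
The retrieval step that the abstract describes (matching a text query against images in a common latent space) can be illustrated with a minimal sketch. This is not the IVT model: the embeddings below are hand-written toy vectors, cosine similarity is an assumed scoring function, and the names `cosine` and `rank_gallery` are hypothetical helpers introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(text_emb, image_embs):
    """Return gallery indices ordered from most to least similar to the text query."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings standing in for the outputs of a shared encoder (assumption).
query = [1.0, 0.0]
gallery = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
ranking = rank_gallery(query, gallery)
print(ranking)  # gallery image 1 is closest to the query, image 0 is farthest
```

In a real system the vectors would come from the trained network, and the ranking would be computed for every text query against the full image gallery.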

Results

Task: Text-based Person Retrieval with Noisy Correspondence (all rows). Model: IVT (all rows).

Dataset | Metric | Value
ICFG-PEDES | Rank-1 | 50.21
ICFG-PEDES | Rank-5 | 69.14
ICFG-PEDES | Rank-10 | 76.18
ICFG-PEDES | mAP | 34.72
ICFG-PEDES | mINP | 8.77
RSTPReid | Rank-1 | 43.65
RSTPReid | Rank-5 | 66.5
RSTPReid | Rank-10 | 75.7
RSTPReid | mAP | 37.22
RSTPReid | mINP | 20.47
CUHK-PEDES | Rank-1 | 58.59
CUHK-PEDES | Rank-5 | 78.51
CUHK-PEDES | Rank-10 | 85.61
CUHK-PEDES | mAP | 57.19
CUHK-PEDES | mINP | 45.78
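
For readers unfamiliar with the metrics above, the following is a minimal sketch of how Rank-k, mAP, and mINP are commonly computed from ranked retrieval lists. It uses the usual definitions (Rank-k: a correct match appears in the top k; AP: mean precision at each correct match; INP: number of correct matches divided by the rank of the last correct match). The exact evaluation protocol behind the reported numbers is not stated on this page, so the code is an assumption-laden illustration; `query_metrics` and `evaluate` are hypothetical names, and the data is a toy example.

```python
def query_metrics(ranked_ids, query_id, ks=(1, 5, 10)):
    """Per-query Rank-k hits, average precision, and inverse negative penalty."""
    # 1-based positions in the ranked list where the correct identity appears.
    hits = [pos for pos, pid in enumerate(ranked_ids, start=1) if pid == query_id]
    if not hits:
        raise ValueError("query identity does not appear in the ranked list")
    rank_k = {k: float(hits[0] <= k) for k in ks}
    # Precision at the n-th correct match is n / position.
    ap = sum(n / pos for n, pos in enumerate(hits, start=1)) / len(hits)
    # INP rewards finding the hardest (last) correct match early.
    inp = len(hits) / hits[-1]
    return rank_k, ap, inp


def evaluate(rankings, query_ids, ks=(1, 5, 10)):
    """Average per-query metrics over all queries and report them as percentages."""
    totals = {f"Rank-{k}": 0.0 for k in ks}
    totals.update({"mAP": 0.0, "mINP": 0.0})
    for ranked_ids, query_id in zip(rankings, query_ids):
        rank_k, ap, inp = query_metrics(ranked_ids, query_id, ks)
        for k in ks:
            totals[f"Rank-{k}"] += rank_k[k]
        totals["mAP"] += ap
        totals["mINP"] += inp
    return {name: 100.0 * value / len(query_ids) for name, value in totals.items()}


# Toy example: two text queries, each with a ranked list of gallery identities.
rankings = [["A", "B", "A", "C"], ["C", "B", "A", "B"]]
query_ids = ["A", "B"]
result = evaluate(rankings, query_ids)
print(result)
```

Here the first query finds its identity at positions 1 and 3 and the second at positions 2 and 4, so only one of the two queries counts as a Rank-1 hit while both count at Rank-5.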