See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, Xiao Wang

2022-08-18

Text-based Person Retrieval with Noisy Correspondence, Text-based Person Retrieval, Person Retrieval, Retrieval

Abstract

Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between the visual and textual modalities. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Different from previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at the sentence, phrase, and word levels, while the BMM module aims to mine more semantic alignments between the visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReid, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
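To make the "common latent space" idea concrete, the following is a minimal sketch of how retrieval could work once text and images are embedded in the same space: the gallery images are ranked by their similarity to the query text embedding. The use of cosine similarity, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration; the abstract does not specify the similarity function, and this is not the IVT implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(text_embedding, gallery):
    """Return gallery image ids sorted from most to least similar to the text.

    `gallery` is a hypothetical dict mapping image id -> embedding that lives
    in the same latent space as `text_embedding`.
    """
    scored = [(cosine(text_embedding, emb), image_id) for image_id, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]


# Toy embeddings, invented for illustration only.
gallery = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.7, 0.7]}
ranking = rank_gallery([0.9, 0.1], gallery)
print(ranking)
```

In this toy case the query points mostly along the first axis, so `img_a` is ranked first and `img_b` last.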

Results

Task	Dataset	Metric	Value	Model
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	Rank-1	50.21	IVT
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	Rank-5	69.14	IVT
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	Rank-10	76.18	IVT
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	mAP	34.72	IVT
Text-based Person Retrieval with Noisy Correspondence	ICFG-PEDES	mINP	8.77	IVT
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	Rank-1	43.65	IVT
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	Rank-5	66.5	IVT
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	Rank-10	75.7	IVT
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	mAP	37.22	IVT
Text-based Person Retrieval with Noisy Correspondence	RSTPReid	mINP	20.47	IVT
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	Rank-1	58.59	IVT
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	Rank-5	78.51	IVT
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	Rank-10	85.61	IVT
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	mAP	57.19	IVT
Text-based Person Retrieval with Noisy Correspondence	CUHK-PEDES	mINP	45.78	IVT
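
For readers unfamiliar with the metrics in the table, the sketch below shows how Rank-k, mAP, and mINP are commonly computed from a ranked gallery list per query. It assumes the usual re-identification definitions (Rank-k as a top-k hit rate, AP as the mean of precision at each correct match, and INP as the number of correct matches divided by the rank of the last correct match); the function names and the toy rankings are illustrative assumptions, not the authors' evaluation code. Values are scaled to percentages to match the table.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """1.0 if an item with the query's identity appears in the top k, else 0.0."""
    return 1.0 if query_id in ranked_ids[:k] else 0.0


def average_precision(ranked_ids, query_id):
    """Mean of precision values measured at each correct match in the ranking."""
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def inverse_negative_penalty(ranked_ids, query_id):
    """Number of correct matches divided by the rank of the last (hardest) one."""
    positions = [p for p, g in enumerate(ranked_ids, start=1) if g == query_id]
    return len(positions) / positions[-1] if positions else 0.0


def evaluate(rankings, ks=(1, 5, 10)):
    """Average the per-query metrics; `rankings` is a list of (query_id, ranked_ids)."""
    n = len(rankings)
    results = {
        f"Rank-{k}": 100.0 * sum(rank_k_hit(r, q, k) for q, r in rankings) / n
        for k in ks
    }
    results["mAP"] = 100.0 * sum(average_precision(r, q) for q, r in rankings) / n
    results["mINP"] = 100.0 * sum(inverse_negative_penalty(r, q) for q, r in rankings) / n
    return results


# Two toy queries with invented identity labels, for illustration only.
rankings = [("a", ["b", "a", "c", "a"]), ("b", ["b", "c", "a"])]
results = evaluate(rankings)
print(results)
```

mINP penalizes a model when the hardest correct match is ranked far down the list, which is one plausible reason it can sit well below Rank-1 and mAP in the table.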
