Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval

Ding Jiang, Mang Ye

2023-03-22 · CVPR 2023

Tasks: Text-based Person Retrieval with Noisy Correspondence, Image-Text Matching, Text Matching, Masked Language Modeling, Text Similarity, Text-based Person Retrieval, Person Retrieval, Retrieval, Language Modelling

Paper · PDF · Code (official)

Abstract

Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the necessary underlying alignment capabilities required to match multimodal data effectively. In addition, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm. This achieves cross-modal interaction by integrating the visual cues into the textual tokens with a cross-modal multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% for Rank-1 accuracy compared to prior methods.
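The Similarity Distribution Matching objective described above can be sketched as follows. This is a hypothetical re-implementation from the abstract alone, not the official code: the function name `sdm_loss`, the temperature value, and the epsilon smoothing are assumptions; the essential idea — minimizing the KL divergence between the softmax over image-text cosine similarities and the normalized identity-label matching distribution — follows the text.

```python
import torch
import torch.nn.functional as F

def sdm_loss(image_feats, text_feats, pids, temperature=0.02, eps=1e-8):
    """Similarity Distribution Matching (illustrative sketch).

    Minimizes KL(predicted similarity distribution || normalized
    label-matching distribution) in both retrieval directions.
    """
    # Normalize embeddings so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarities, sharpened by a temperature. Shape (B, B).
    sim = image_feats @ text_feats.t() / temperature

    # Label matching distribution: pairs sharing a person ID are
    # positives; normalize each row to a probability distribution.
    labels = (pids.unsqueeze(1) == pids.unsqueeze(0)).float()
    labels = labels / labels.sum(dim=1, keepdim=True)

    # Predicted image-to-text and text-to-image distributions.
    pred_i2t = F.softmax(sim, dim=1)
    pred_t2i = F.softmax(sim.t(), dim=1)

    # KL divergence of predictions from the label distribution.
    kl_i2t = (pred_i2t * (torch.log(pred_i2t + eps)
                          - torch.log(labels + eps))).sum(dim=1).mean()
    kl_t2i = (pred_t2i * (torch.log(pred_t2i + eps)
                          - torch.log(labels + eps))).sum(dim=1).mean()
    return kl_i2t + kl_t2i
```

In a training loop this term would be summed with the masked-language-modeling loss of the Implicit Relation Reasoning module and an identity loss; the exact weighting is not specified in the abstract.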

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Text based Person Retrieval | ICFG-PEDES | R@1 | 63.46 | IRRA |
| Text based Person Retrieval | ICFG-PEDES | R@5 | 80.25 | IRRA |
| Text based Person Retrieval | ICFG-PEDES | R@10 | 85.82 | IRRA |
| Text based Person Retrieval | ICFG-PEDES | mAP | 38.06 | IRRA |
| Text based Person Retrieval | ICFG-PEDES | mINP | 7.93 | IRRA |
| Text based Person Retrieval | RSTPReid | R@1 | 60.2 | IRRA |
| Text based Person Retrieval | RSTPReid | R@5 | 81.3 | IRRA |
| Text based Person Retrieval | RSTPReid | R@10 | 88.2 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | Rank-1 | 60.76 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | Rank-5 | 78.26 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | Rank-10 | 84.01 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | mAP | 35.87 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | mINP | 6.8 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | Rank-1 | 58.75 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | Rank-5 | 81.9 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | Rank-10 | 88.25 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | mAP | 46.38 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | mINP | 24.78 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | Rank-1 | 69.74 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | Rank-5 | 87.09 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | Rank-10 | 92.2 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | mAP | 62.28 | IRRA |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | mINP | 45.84 | IRRA |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)