Delong Liu, Haiwen Li, Zhaohui Hou, Zhicheng Zhao, Fei Su, Yuan Dong
Person retrieval has attracted growing attention. Existing methods fall mainly into two retrieval modes: image-only and text-only. However, neither mode makes full use of the available information, and both struggle to meet diverse application requirements. To address these limitations, we propose a new Composed Person Retrieval (CPR) task, which combines visual and textual queries to identify individuals of interest in large-scale person image databases. The foremost difficulty of the CPR task is the lack of annotated datasets. We therefore first introduce a scalable automatic data synthesis pipeline, which decomposes complex multimodal data generation into the creation of textual quadruples followed by identity-consistent image synthesis using fine-tuned generative models. A multimodal filtering method is then designed to ensure that the resulting SynCPR dataset retains 1.15 million high-quality, fully synthetic triplets. Additionally, to improve the representation of composed person queries, we propose a novel Fine-grained Adaptive Feature Alignment (FAFA) framework based on fine-grained dynamic alignment and masked feature reasoning. For objective evaluation, we manually annotate the Image-Text Composed Person Retrieval (ITCPR) test set. Extensive experiments demonstrate the effectiveness of the SynCPR dataset and the superiority of the proposed FAFA framework over state-of-the-art methods. All code and data will be provided at https://github.com/Delong-liu-bupt/Composed_Person_Retrieval.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval with Multi-Modal Query | ITCPR dataset | Rank-1 | 46.54 | FAFA |
| Image Retrieval with Multi-Modal Query | ITCPR dataset | mAP | 55.6 | FAFA |
| Image Retrieval with Multi-Modal Query | ITCPR dataset | Rank-1 | 45.55 | Word4Per (FAFA old version) |
| Image Retrieval with Multi-Modal Query | ITCPR dataset | mAP | 55.26 | Word4Per (FAFA old version) |
| Cross-Modal Information Retrieval | ITCPR dataset | Rank-1 | 46.54 | FAFA |
| Cross-Modal Information Retrieval | ITCPR dataset | mAP | 55.6 | FAFA |
| Cross-Modal Information Retrieval | ITCPR dataset | Rank-1 | 45.55 | Word4Per (FAFA old version) |
| Cross-Modal Information Retrieval | ITCPR dataset | mAP | 55.26 | Word4Per (FAFA old version) |
| Cross-Modal Retrieval | ITCPR dataset | Rank-1 | 46.54 | FAFA |
| Cross-Modal Retrieval | ITCPR dataset | mAP | 55.6 | FAFA |
| Cross-Modal Retrieval | ITCPR dataset | Rank-1 | 45.55 | Word4Per (FAFA old version) |
| Cross-Modal Retrieval | ITCPR dataset | mAP | 55.26 | Word4Per (FAFA old version) |
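
The table reports Rank-1 and mAP, so a short sketch of how these two retrieval metrics are commonly computed may help when reading the numbers. This is a minimal illustration under assumptions, not the evaluation code used for ITCPR: the function names are hypothetical, each query is assumed to come with a gallery ranking already converted to a list of 0/1 relevance flags (best match first), and every correct gallery image is assumed to appear somewhere in that list.

```python
from typing import Sequence


def rank_k_accuracy(ranked_relevance: Sequence[Sequence[int]], k: int = 1) -> float:
    """Fraction of queries with at least one correct match in the top k results."""
    hits = sum(1 for rel in ranked_relevance if any(rel[:k]))
    return hits / len(ranked_relevance)


def average_precision(rel: Sequence[int]) -> float:
    """Average of precision values taken at the rank of each correct match."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(rel, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    # A query with no correct match in its ranking contributes 0 (an assumption).
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """Mean of the per-query average precision."""
    return sum(average_precision(r) for r in ranked_relevance) / len(ranked_relevance)


# Toy example (made-up rankings, not ITCPR data): three queries.
toy = [
    [1, 0, 0],      # correct match at rank 1
    [0, 1, 0],      # correct match at rank 2
    [0, 0, 1, 1],   # correct matches at ranks 3 and 4
]
print(f"Rank-1: {100 * rank_k_accuracy(toy, k=1):.2f}")
print(f"mAP:    {100 * mean_average_precision(toy):.2f}")
```

Values in the table appear to be percentages, which is why the toy example scales the results by 100 before printing.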