Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, Peng Hu
Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community that aims to retrieve a target person based on a textual query. Although numerous TIReID methods have been proposed and achieve promising performance, they implicitly assume that the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, image-text pairs are inevitably under-correlated or even falsely correlated, a.k.a. noisy correspondence (NC), due to low-quality images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) a Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, enabling the model to learn correct and reliable visual-semantic associations; and 2) a Triplet Alignment Loss (TAL) that relaxes the conventional triplet ranking loss, which considers only the hardest negative, to a log-exponential upper bound over all negatives, thus preventing model collapse under NC while still focusing on hard negatives for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, to evaluate the performance and robustness of RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE.
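The core idea behind TAL, as described above, can be sketched in a few lines: a conventional triplet ranking loss takes the hardest (most similar) negative per anchor, while a log-sum-exp over all negatives upper-bounds that max and lets every negative contribute a gradient. The sketch below is a simplified NumPy illustration of this relaxation, not the paper's exact formulation; the margin `margin`, temperature `tau`, and the assumption that positives lie on the diagonal of the similarity matrix are all illustrative choices.

```python
import numpy as np

def triplet_hardest(sim, margin=0.2):
    """Conventional triplet ranking loss using only the hardest negative.

    sim: (N, N) similarity matrix; sim[i, i] is the matched (positive) pair.
    """
    pos = np.diag(sim)
    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)           # mask out the positives
    hardest = neg.max(axis=1)                # most similar negative per anchor
    return np.maximum(0.0, margin - pos + hardest).mean()

def triplet_tal_sketch(sim, margin=0.2, tau=0.02):
    """TAL-style relaxation (sketch): replace the hard max over negatives
    with a temperature-scaled log-sum-exp, which upper-bounds the max
    (tau * log(sum exp(x/tau)) >= max(x)) and spreads gradient over all
    negatives instead of concentrating it on the single hardest one.
    """
    pos = np.diag(sim)
    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)           # exp(-inf) = 0, so positives drop out
    lse = tau * np.log(np.exp(neg / tau).sum(axis=1))
    return np.maximum(0.0, margin - pos + lse).mean()
```

Because log-sum-exp dominates the max, the relaxed loss is never smaller than the hardest-negative loss; as `tau -> 0` it recovers the hard max, so the temperature trades off between focusing on hard negatives and distributing supervision across all of them.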
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text-based Person Retrieval | ICFG-PEDES | R@1 | 67.68 | RDE |
| Text-based Person Retrieval | ICFG-PEDES | R@5 | 82.47 | RDE |
| Text-based Person Retrieval | ICFG-PEDES | R@10 | 87.36 | RDE |
| Text-based Person Retrieval | ICFG-PEDES | mAP | 40.06 | RDE |
| Text-based Person Retrieval | ICFG-PEDES | mINP | 7.87 | RDE |
| Text-based Person Retrieval | RSTPReid | R@1 | 65.35 | RDE |
| Text-based Person Retrieval | RSTPReid | R@5 | 83.95 | RDE |
| Text-based Person Retrieval | RSTPReid | R@10 | 89.9 | RDE |
| Text-based Person Retrieval | RSTPReid | mAP | 50.88 | RDE |
| Text-based Person Retrieval | RSTPReid | mINP | 28.08 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | R@1 | 66.54 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | R@5 | 81.7 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | R@10 | 86.7 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | mAP | 39.08 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | ICFG-PEDES | mINP | 7.55 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | R@1 | 64.45 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | R@5 | 83.5 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | R@10 | 90 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | mAP | 49.78 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | RSTPReid | mINP | 27.43 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | R@1 | 74.46 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | R@5 | 89.42 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | R@10 | 93.63 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | mAP | 66.13 | RDE |
| Text-based Person Retrieval with Noisy Correspondence | CUHK-PEDES | mINP | 49.66 | RDE |