Alex Ergasti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Text-based person search (TBPS) is a problem that has gained significant interest within the research community. The task consists of retrieving one or more images of a specific individual based on a textual description. Its multi-modal nature requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. The first is inter-identity noise: because text descriptions are inherently vague and imprecise, the same description of visual attributes can often apply to different people. The second is intra-identity variation, i.e., nuisances such as pose and illumination that can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description. In doing so, the model is encouraged to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text. This loss ensures that every attribute is taken into consideration in the person retrieval process. Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric w.r.t. the current state of the art.
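To make the random patch masking behind the Visual Reconstruction Loss concrete, the following is a minimal sketch of how masked and visible patch indices could be drawn for one image. It is an illustration under stated assumptions, not the MARS implementation: the function name `random_patch_mask`, the 14x14 patch grid, and the 0.75 mask ratio are all hypothetical choices, since the abstract does not specify them.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets, uniformly at random.

    num_patches and mask_ratio are illustrative assumptions; the abstract
    does not state the values used by MARS.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    # The first num_masked shuffled indices are hidden from the encoder;
    # a decoder would be trained to reconstruct the pixels of those patches.
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Hypothetical 14x14 patch grid (196 patches) with 75% of patches masked.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In a text-aided setup such as the one described above, the reconstruction of the masked patches would additionally be conditioned on the textual description; that conditioning is not shown here.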
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text based Person Retrieval | CUHK-PEDES | R@1 | 77.62 | MARS |
| Text based Person Retrieval | CUHK-PEDES | R@5 | 90.63 | MARS |
| Text based Person Retrieval | CUHK-PEDES | R@10 | 94.27 | MARS |
| Text based Person Retrieval | CUHK-PEDES | mAP | 71.41 | MARS |
| Text based Person Retrieval | ICFG-PEDES | R@1 | 67.6 | MARS |
| Text based Person Retrieval | ICFG-PEDES | R@5 | 81.47 | MARS |
| Text based Person Retrieval | ICFG-PEDES | R@10 | 85.79 | MARS |
| Text based Person Retrieval | ICFG-PEDES | mAP | 44.93 | MARS |
| Text based Person Retrieval | RSTPReid | R@1 | 67.55 | MARS |
| Text based Person Retrieval | RSTPReid | R@5 | 86.65 | MARS |
| Text based Person Retrieval | RSTPReid | R@10 | 91.35 | MARS |
| Text based Person Retrieval | RSTPReid | mAP | 52.92 | MARS |
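
For readers unfamiliar with the metrics in the table, the sketch below shows one common way R@K and mAP are computed for retrieval: R@K checks whether a correct identity appears among the top K ranked gallery images, while mAP averages the precision at every correct match. The helper names (`rank_at_k`, `average_precision`, `evaluate`) and the toy rankings are hypothetical, and the exact evaluation protocol used for the reported numbers may differ.

```python
def rank_at_k(ranked_ids, query_id, k):
    """Return 1.0 if the query identity appears among the top-k results."""
    return 1.0 if query_id in ranked_ids[:k] else 0.0


def average_precision(ranked_ids, query_id):
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def evaluate(queries, ks=(1, 5, 10)):
    """Average R@K and AP over (query_id, ranked_gallery_ids) pairs, in percent."""
    n = len(queries)
    results = {}
    for k in ks:
        total = sum(rank_at_k(ranked, qid, k) for qid, ranked in queries)
        results[f"R@{k}"] = 100.0 * total / n
    total_ap = sum(average_precision(ranked, qid) for qid, ranked in queries)
    results["mAP"] = 100.0 * total_ap / n
    return results


# Toy example: each text query has a ground-truth identity and a ranked
# list of gallery identities (most similar first).
toy_queries = [
    (1, [1, 2, 1, 3]),  # correct match at rank 1
    (2, [3, 1, 2, 2]),  # first correct match at rank 3
]
print(evaluate(toy_queries, ks=(1, 5)))
```

Because mAP rewards ranking every image of the target identity highly rather than only the first one, it is generally a stricter measure than R@1.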