Haochen Han, Kaiyao Miao, Qinghua Zheng, Minnan Luo
Despite the success of multimodal learning in cross-modal retrieval, this progress relies on correct correspondence among multimedia data. However, collecting such ideal data is expensive and time-consuming. In practice, most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on datasets with such noisy correspondence degrades performance, because cross-modal retrieval methods can wrongly force mismatched pairs to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) that provides reliable similarity scores. We treat a binary classification task as the meta-process, which encourages the MSCN to learn to discriminate between positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy that uses meta-data as prior knowledge to remove noisy samples. Extensive experiments demonstrate the strengths of our method under both synthetic and real-world noise, on Flickr30K, MS-COCO, and Conceptual Captions.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval with Multi-Modal Query | COCO-Noisy | Image-to-text R@1 | 78.1 | MSCN |
| Image Retrieval with Multi-Modal Query | COCO-Noisy | Image-to-text R@10 | 98.8 | MSCN |
| Image Retrieval with Multi-Modal Query | COCO-Noisy | Image-to-text R@5 | 97.2 | MSCN |
| Image Retrieval with Multi-Modal Query | COCO-Noisy | R-Sum | 524.6 | MSCN |
| Image Retrieval with Multi-Modal Query | COCO-Noisy | Text-to-image R@1 | 64.3 | MSCN |
| Image Retrieval with Multi-Modal Query | COCO-Noisy | Text-to-image R@10 | 95.8 | MSCN |
| Image Retrieval with Multi-Modal Query | COCO-Noisy | Text-to-image R@5 | 90.4 | MSCN |
| Image Retrieval with Multi-Modal Query | CC152K | Image-to-text R@1 | 40.1 | MSCN |
| Image Retrieval with Multi-Modal Query | CC152K | Image-to-text R@10 | 76.6 | MSCN |
| Image Retrieval with Multi-Modal Query | CC152K | Image-to-text R@5 | 65.7 | MSCN |
| Image Retrieval with Multi-Modal Query | CC152K | R-Sum | 366.7 | MSCN |
| Image Retrieval with Multi-Modal Query | CC152K | Text-to-image R@1 | 40.6 | MSCN |
| Image Retrieval with Multi-Modal Query | CC152K | Text-to-image R@10 | 76.3 | MSCN |
| Image Retrieval with Multi-Modal Query | CC152K | Text-to-image R@5 | 67.4 | MSCN |
| Image Retrieval with Multi-Modal Query | Flickr30K-Noisy | Image-to-text R@1 | 77.4 | MSCN |
| Image Retrieval with Multi-Modal Query | Flickr30K-Noisy | Image-to-text R@10 | 97.6 | MSCN |
| Image Retrieval with Multi-Modal Query | Flickr30K-Noisy | Image-to-text R@5 | 94.9 | MSCN |
| Image Retrieval with Multi-Modal Query | Flickr30K-Noisy | R-Sum | 501.9 | MSCN |
| Image Retrieval with Multi-Modal Query | Flickr30K-Noisy | Text-to-image R@1 | 59.6 | MSCN |
| Image Retrieval with Multi-Modal Query | Flickr30K-Noisy | Text-to-image R@10 | 89.2 | MSCN |
| Image Retrieval with Multi-Modal Query | Flickr30K-Noisy | Text-to-image R@5 | 83.2 | MSCN |
| Cross-Modal Information Retrieval | COCO-Noisy | Image-to-text R@1 | 78.1 | MSCN |
| Cross-Modal Information Retrieval | COCO-Noisy | Image-to-text R@10 | 98.8 | MSCN |
| Cross-Modal Information Retrieval | COCO-Noisy | Image-to-text R@5 | 97.2 | MSCN |
| Cross-Modal Information Retrieval | COCO-Noisy | R-Sum | 524.6 | MSCN |
| Cross-Modal Information Retrieval | COCO-Noisy | Text-to-image R@1 | 64.3 | MSCN |
| Cross-Modal Information Retrieval | COCO-Noisy | Text-to-image R@10 | 95.8 | MSCN |
| Cross-Modal Information Retrieval | COCO-Noisy | Text-to-image R@5 | 90.4 | MSCN |
| Cross-Modal Information Retrieval | CC152K | Image-to-text R@1 | 40.1 | MSCN |
| Cross-Modal Information Retrieval | CC152K | Image-to-text R@10 | 76.6 | MSCN |
| Cross-Modal Information Retrieval | CC152K | Image-to-text R@5 | 65.7 | MSCN |
| Cross-Modal Information Retrieval | CC152K | R-Sum | 366.7 | MSCN |
| Cross-Modal Information Retrieval | CC152K | Text-to-image R@1 | 40.6 | MSCN |
| Cross-Modal Information Retrieval | CC152K | Text-to-image R@10 | 76.3 | MSCN |
| Cross-Modal Information Retrieval | CC152K | Text-to-image R@5 | 67.4 | MSCN |
| Cross-Modal Information Retrieval | Flickr30K-Noisy | Image-to-text R@1 | 77.4 | MSCN |
| Cross-Modal Information Retrieval | Flickr30K-Noisy | Image-to-text R@10 | 97.6 | MSCN |
| Cross-Modal Information Retrieval | Flickr30K-Noisy | Image-to-text R@5 | 94.9 | MSCN |
| Cross-Modal Information Retrieval | Flickr30K-Noisy | R-Sum | 501.9 | MSCN |
| Cross-Modal Information Retrieval | Flickr30K-Noisy | Text-to-image R@1 | 59.6 | MSCN |
| Cross-Modal Information Retrieval | Flickr30K-Noisy | Text-to-image R@10 | 89.2 | MSCN |
| Cross-Modal Information Retrieval | Flickr30K-Noisy | Text-to-image R@5 | 83.2 | MSCN |
| Cross-Modal Retrieval | COCO-Noisy | Image-to-text R@1 | 78.1 | MSCN |
| Cross-Modal Retrieval | COCO-Noisy | Image-to-text R@10 | 98.8 | MSCN |
| Cross-Modal Retrieval | COCO-Noisy | Image-to-text R@5 | 97.2 | MSCN |
| Cross-Modal Retrieval | COCO-Noisy | R-Sum | 524.6 | MSCN |
| Cross-Modal Retrieval | COCO-Noisy | Text-to-image R@1 | 64.3 | MSCN |
| Cross-Modal Retrieval | COCO-Noisy | Text-to-image R@10 | 95.8 | MSCN |
| Cross-Modal Retrieval | COCO-Noisy | Text-to-image R@5 | 90.4 | MSCN |
| Cross-Modal Retrieval | CC152K | Image-to-text R@1 | 40.1 | MSCN |
| Cross-Modal Retrieval | CC152K | Image-to-text R@10 | 76.6 | MSCN |
| Cross-Modal Retrieval | CC152K | Image-to-text R@5 | 65.7 | MSCN |
| Cross-Modal Retrieval | CC152K | R-Sum | 366.7 | MSCN |
| Cross-Modal Retrieval | CC152K | Text-to-image R@1 | 40.6 | MSCN |
| Cross-Modal Retrieval | CC152K | Text-to-image R@10 | 76.3 | MSCN |
| Cross-Modal Retrieval | CC152K | Text-to-image R@5 | 67.4 | MSCN |
| Cross-Modal Retrieval | Flickr30K-Noisy | Image-to-text R@1 | 77.4 | MSCN |
| Cross-Modal Retrieval | Flickr30K-Noisy | Image-to-text R@10 | 97.6 | MSCN |
| Cross-Modal Retrieval | Flickr30K-Noisy | Image-to-text R@5 | 94.9 | MSCN |
| Cross-Modal Retrieval | Flickr30K-Noisy | R-Sum | 501.9 | MSCN |
| Cross-Modal Retrieval | Flickr30K-Noisy | Text-to-image R@1 | 59.6 | MSCN |
| Cross-Modal Retrieval | Flickr30K-Noisy | Text-to-image R@10 | 89.2 | MSCN |
| Cross-Modal Retrieval | Flickr30K-Noisy | Text-to-image R@5 | 83.2 | MSCN |
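
The R-Sum rows in the table are consistent with the usual definition of R-Sum as the sum of the six recall values (R@1, R@5, R@10 in each retrieval direction). The sketch below recomputes it from the table's numbers as a sanity check. The dictionary layout and the helper name `r_sum` are illustrative assumptions, not part of the authors' code.

```python
# Recall values (in percent) copied from the results table above,
# ordered as (R@1, R@5, R@10) for each retrieval direction.
# The structure and names here are hypothetical, for illustration only.
RESULTS = {
    "COCO-Noisy": {"i2t": (78.1, 97.2, 98.8), "t2i": (64.3, 90.4, 95.8)},
    "CC152K": {"i2t": (40.1, 65.7, 76.6), "t2i": (40.6, 67.4, 76.3)},
    "Flickr30K-Noisy": {"i2t": (77.4, 94.9, 97.6), "t2i": (59.6, 83.2, 89.2)},
}


def r_sum(i2t, t2i):
    """Sum the six recall values, rounded to one decimal like the table."""
    return round(sum(i2t) + sum(t2i), 1)


if __name__ == "__main__":
    for dataset, recalls in RESULTS.items():
        print(dataset, r_sum(recalls["i2t"], recalls["t2i"]))
```

Running this reproduces the reported R-Sum values for the three datasets (524.6, 366.7, and 501.9), so the single R-Sum number can be read as a compact summary of both retrieval directions.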