Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words

Xiaopeng Lu, Tiancheng Zhao, Kyusong Lee

2021-01-01 | ACL 2021 5
Tasks: Cross-Modal Retrieval, Information Retrieval, Cross-Modal Information Retrieval, Retrieval, Image Retrieval

Abstract

Text-to-image retrieval is an essential task in cross-modal information retrieval: given a textual query, retrieve the relevant images from a large, unlabelled dataset. In this paper, we propose VisualSparta (Visual-text Sparse Transformer Matching), a novel model that significantly improves both accuracy and efficiency. VisualSparta outperforms previous state-of-the-art scalable methods on MSCOCO and Flickr30K. It also offers substantial retrieval speed advantages: for an index of 1 million images, VisualSparta on CPU achieves a ~391X speedup over CPU vector search and a ~5.4X speedup over GPU-accelerated vector search. Experiments show that this speed advantage grows with dataset size, because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that achieves real-time search over large-scale datasets while significantly improving accuracy over previous state-of-the-art methods.
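
To make the inverted-index idea from the abstract concrete, here is a minimal sketch of weighted bag-of-words retrieval. It is not the authors' implementation: the function names, the toy image IDs, and the hand-written term weights are all assumptions for illustration. In the actual model the per-image term weights would come from a learned scorer; here they are simply given. Each image is scored by summing its stored weights for the terms that appear in the query.

```python
from collections import defaultdict
import heapq


def build_inverted_index(image_term_weights):
    """Map each term to a posting list of (image_id, weight) pairs."""
    index = defaultdict(list)
    for image_id, weights in image_term_weights.items():
        for term, weight in weights.items():
            if weight > 0:  # keep the index sparse: skip zero-weight terms
                index[term].append((image_id, weight))
    return index


def search(index, query, top_k=3):
    """Score images by summing stored weights over the distinct query terms."""
    scores = defaultdict(float)
    # Simplification: whitespace tokenization, each distinct term counted once.
    for term in set(query.lower().split()):
        for image_id, weight in index.get(term, []):
            scores[image_id] += weight
    return heapq.nlargest(top_k, scores.items(), key=lambda kv: kv[1])


# Hypothetical precomputed term weights per image (illustrative values only).
toy_weights = {
    "img_dog": {"dog": 2.0, "grass": 1.0, "ball": 0.5},
    "img_cat": {"cat": 2.5, "sofa": 1.0},
    "img_park": {"grass": 1.5, "tree": 1.25, "dog": 0.5},
}

index = build_inverted_index(toy_weights)
results = search(index, "dog on grass")
print(results)
```

Only the posting lists for the query's terms are touched at search time, so the cost depends on the length of those lists rather than on a dense comparison against every image vector. This is one plausible reading of why the abstract reports a growing speed advantage on larger indexes.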

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval | Flickr30K 1K test | R@1 | 57.4 | VisualSparta |
| Image Retrieval | Flickr30K 1K test | R@10 | 88.1 | VisualSparta |
| Image Retrieval | Flickr30K 1K test | R@5 | 82 | VisualSparta |
| Image Retrieval | Flickr30k | QPS | 451.4 | VisualSparta |
| Image Retrieval | Flickr30k | Recall@1 | 57.4 | VisualSparta |
| Image Retrieval | Flickr30k | Recall@10 | 88.1 | VisualSparta |
| Image Retrieval | Flickr30k | Recall@5 | 82 | VisualSparta |
| Image Retrieval | COCO (Common Objects in Context) | QPS | 451.4 | VisualSparta |
| Image Retrieval | COCO (Common Objects in Context) | Recall@10 | 96.3 | VisualSparta |
| Image Retrieval | COCO (Common Objects in Context) | recall@1 | 68.2 | VisualSparta |
| Image Retrieval | COCO (Common Objects in Context) | recall@5 | 91.8 | VisualSparta |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 44.4 | VisualSparta |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 82.4 | VisualSparta |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 72.8 | VisualSparta |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 44.4 | VisualSparta |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 82.4 | VisualSparta |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 72.8 | VisualSparta |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 44.4 | VisualSparta |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 82.4 | VisualSparta |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 72.8 | VisualSparta |
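
The R@K (Recall@K) values above are percentages. As a rough illustration of how such a metric is commonly computed for text-to-image retrieval, the sketch below assumes exactly one relevant image per query and counts a query as a hit when that image appears in the top K results. The function name and the toy rankings are hypothetical, and the evaluation protocol behind the numbers in the table may differ in detail.

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Percentage of queries whose relevant image is ranked in the top k.

    Assumes one relevant image per query (an assumption of this sketch).
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return 100.0 * hits / len(relevant_id_per_query)


# Toy data: four queries, each with a ranked list of image IDs
# and a single ground-truth image.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["c", "a", "b"],
]
relevant = ["a", "a", "a", "b"]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, relevant, k):.1f}")
```

Because a hit at rank K is also a hit at every larger K, Recall@K can only stay the same or rise as K grows, which matches the ordering R@1 ≤ R@5 ≤ R@10 in each group of rows above.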

Related Papers

Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)