Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Detector-Free Weakly Supervised Grounding by Separation

Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky

2021-04-20 · ICCV 2021 · Phrase Grounding

Paper · PDF · Code (official)

Abstract

Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. We directly learn everything from the images and associated free-form text pairs, thus potentially gaining an advantage on the categories unsupported by the detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing `text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Using this approach we demonstrate a significant accuracy improvement, of up to $8.5\%$ over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a significant complementary improvement (above $7\%$) over the detector-based approaches for WSG.
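The core of the training-data synthesis described above is random alpha-blending: two arbitrary images are mixed by a spatial alpha map, and the network must recover that map from the blend when conditioned on one image's text. A minimal sketch of the blending step, with purely illustrative names and a simple low-resolution-noise alpha map (the paper's actual alpha-map construction and conditioning network are not reproduced here):

```python
import numpy as np

def blend_pair(img_a, img_b, rng):
    """Synthesize one GbS-style training example (illustrative sketch).

    A random smooth alpha map mixes img_a into img_b; the segmentation
    network would be trained to recover this alpha map from `blended`,
    conditioned on the text paired with img_a.
    """
    h, w, _ = img_a.shape
    # Coarse random values upsampled to a smooth spatial alpha map in [0, 1).
    coarse = rng.random((4, 4))
    alpha = np.kron(coarse, np.ones((h // 4, w // 4)))[:h, :w]
    alpha = alpha[..., None]  # broadcast over the channel axis
    blended = alpha * img_a + (1.0 - alpha) * img_b
    return blended, alpha
```

At test time no blending is applied: the query image is fed in unblended with the query phrase as condition, so the recovered "alpha map" becomes the localization heatmap for the phrase.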

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Phrase Grounding | Visual Genome | Pointing Game Accuracy | 55.91 | GbS VG
Phrase Grounding | Visual Genome | Pointing Game Accuracy | 54.55 | GbS Ensemble MS-COCO
Phrase Grounding | Flickr30k | Pointing Game Accuracy | 85.9 | GbS Ensemble + 12-in-1
Phrase Grounding | Flickr30k | Pointing Game Accuracy | 75.6 | GbS Ensemble MS-COCO
Phrase Grounding | ReferIt | Pointing Game Accuracy | 58.21 | GbS Ensemble MS-COCO

Related Papers

Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models (2025-06-12)
Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures (2025-05-16)
A Comparison of Object Detection and Phrase Grounding Models in Chest X-ray Abnormality Localization using Eye-tracking Data (2025-03-02)
Progressive Local Alignment for Medical Multimodal Pre-training (2025-02-25)
Anatomical grounding pre-training for medical phrase grounding (2025-02-23)
VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback (2025-01-29)
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding (2025-01-28)
Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension (2025-01-02)