Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality

Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, Yu Chen

2023-05-23 · Representation Learning · Attribute · Systematic Generalization · Contrastive Learning · Image Retrieval
Paper · PDF

Abstract

Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has highlighted severe limitations of these models in their ability to perform compositional reasoning over objects, attributes, and relations. Scene graphs have emerged as an effective way to understand images compositionally. These are graph-structured semantic representations of images that contain objects, their attributes, and their relations with other objects in a scene. In this work, we consider the scene graph parsed from text as a proxy for the image scene graph and propose a graph decomposition and augmentation framework along with a coarse-to-fine contrastive learning objective between images and text that aligns sentences of various complexities to the same image. In addition, we propose novel negative mining techniques in the scene graph space for improving attribute binding and relation understanding. Through extensive experiments, we demonstrate the effectiveness of our approach, which significantly improves attribute binding, relation understanding, systematic generalization, and productivity on multiple recently proposed benchmarks (for example, improvements of up to $18\%$ for systematic generalization and $16.5\%$ for relation understanding over a strong baseline), while achieving similar or better performance than CLIP on various general multimodal tasks.
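
The coarse-to-fine objective described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration of a CLIP-style InfoNCE loss applied at several text granularities per image (e.g., the full caption plus shorter sub-captions derived from decomposed scene-graph subgraphs). It is not the authors' released implementation; the function name, batch layout, and temperature value are assumptions for illustration, and the paper's scene-graph-based hard-negative mining is not reproduced here.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_contrastive_loss(image_feats, text_feats_per_level, temperature=0.07):
    """Sketch of a coarse-to-fine, CLIP-style InfoNCE objective.

    image_feats: (B, D) L2-normalized image embeddings.
    text_feats_per_level: list of (B, D) L2-normalized text embeddings, one tensor per
        granularity level (full caption, decomposed sub-captions, ...), index-aligned
        with the images in the batch.
    Returns the symmetric image-text InfoNCE loss averaged over granularity levels.
    """
    targets = torch.arange(image_feats.size(0), device=image_feats.device)
    losses = []
    for text_feats in text_feats_per_level:
        logits = image_feats @ text_feats.t() / temperature  # (B, B) similarity logits
        loss_i2t = F.cross_entropy(logits, targets)          # image -> matching text
        loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> matching image
        losses.append(0.5 * (loss_i2t + loss_t2i))
    return torch.stack(losses).mean()
```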

Results

Task | Dataset | Metric | Value | Model
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 44.5 | Swin-T (MosaiCLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 92.1 | Swin-T (MosaiCLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 44.4 | RN-50 (MosaiCLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 92.6 | RN-50 (MosaiCLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 41.5 | MosaiCLIP (YFCC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 48.8 | MosaiCLIP (YFCC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 41.4 | RN-50 (NegCLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 82 | RN-50 (NegCLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 40.9 | MosaiCLIP (CC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 72.4 | MosaiCLIP (CC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 39.6 | Swin-T (NegCLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 80.3 | Swin-T (NegCLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 39.5 | CLIP (YFCC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 39.8 | CLIP (YFCC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 39 | NegCLIP (YFCC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 38.8 | NegCLIP (YFCC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 38.3 | CLIP-FT (YFCC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 36.4 | CLIP-FT (YFCC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 37.5 | NegCLIP (CC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 53.1 | NegCLIP (CC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 37.3 | Swin-T (CLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 44.1 | Swin-T (CLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 36.7 | RN-50 (CLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 42.9 | RN-50 (CLIP, CC-12M)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 35.6 | CLIP-FT (CC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 45.8 | CLIP-FT (CC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 35 | CLIP (CC-FT)
Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 45.1 | CLIP (CC-FT)
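
For reference, Recall@1 in the table above is the fraction of queries whose ground-truth match is ranked first. Below is a minimal sketch, assuming a query-by-candidate similarity matrix with the ground-truth pair on the diagonal; the construction of CREPE's HN-Atom / HN-Comp hard-negative candidate sets follows the benchmark and is not reproduced here.

```python
import torch

def recall_at_1(similarity: torch.Tensor) -> float:
    """similarity: (num_queries, num_candidates) score matrix where entry [i, i]
    is the ground-truth pair for query i. Returns the fraction of queries whose
    top-ranked candidate is the ground truth."""
    top1 = similarity.argmax(dim=1)  # highest-scoring candidate per query
    hits = top1 == torch.arange(similarity.size(0), device=similarity.device)
    return hits.float().mean().item()
```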

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)