Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna

2022-12-13 · CVPR 2023 · Tasks: Negation, Retrieval, Image Retrieval

Paper · PDF · Code (official)

Abstract

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities, plus 183K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 12%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
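The evaluation protocol described in the abstract scores each query against one correct caption plus generated hard negatives, then checks whether the correct caption ranks first (Recall@1). A minimal sketch of that metric is below; the similarity matrix is a toy stand-in for CLIP-style image-text embedding scores, not the paper's actual evaluation code.

```python
import numpy as np

def recall_at_1(sim, correct_idx):
    """Fraction of queries whose top-ranked candidate is the correct one.

    sim: (n_queries, n_candidates) similarity matrix, one row per query.
    correct_idx: index of the ground-truth caption for each query.
    """
    top1 = sim.argmax(axis=1)
    return float(np.mean(top1 == np.asarray(correct_idx)))

# Toy example: 2 image queries, each scored against a candidate set of
# 1 correct caption plus hard-negative captions.
sim = np.array([
    [0.9, 0.1, 0.3],  # correct caption (index 0) ranked first
    [0.2, 0.7, 0.4],  # a hard negative outscores the correct caption
])
print(recall_at_1(sim, correct_idx=[0, 0]))  # 1 of 2 queries correct
```

With harder negative sets (e.g. CREPE's atomic, swapping, and negation foils), the top-ranked candidate is more often a foil, which drives this score down.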

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Question Answering | IntentQA | Accuracy | 20 | Random
Question Answering | EgoSchema (fullset) | Accuracy | 20 | Random
Question Answering | EgoSchema (subset) | Accuracy | 20 | Random
Video Question Answering | IntentQA | Accuracy | 20 | Random
Video Question Answering | EgoSchema (fullset) | Accuracy | 20 | Random
Video Question Answering | EgoSchema (subset) | Accuracy | 20 | Random

Image Retrieval on CREPE (Compositional REPresentation Evaluation), Recall@1:

Model | HN-Atom + HN-Comp, SC | HN-Atom + HN-Comp, UC | HN-Atom, UC | HN-Comp, UC
--- | --- | --- | --- | ---
ViT-L-14 (LAION400M) | 39.44 | 33.81 | 47.86 | 60.78
ViT-B-16+240 (LAION400M) | 37.32 | 32.26 | 46.53 | 60.19
ViT-B-16 (LAION400M) | 37.01 | 30.81 | 44.93 | 59.00
ViT-B-32 (LAION400M) | 34.28 | 28.00 | 42.75 | 54.80
RN50 (YFCC15M) | 23.38 | 20.08 | 39.85 | 39.83
RN50 (CC12M) | 23.26 | 19.96 | 34.88 | 45.27
RN101 (YFCC15M) | 22.74 | 20.50 | 39.50 | 39.56
Random | 9.09 | 9.09 | 20.00 | 14.29

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)