Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified in the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over $370K$ image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the pairs. To test productivity, CREPE contains $17K$ image-text pairs with nine different complexities plus $183K$ hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to $12\%$. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
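The abstract reports results as Recall@1 over retrieval sets that mix a ground-truth candidate with hard negatives. The following is a minimal sketch of how such a metric could be computed, not the authors' evaluation code. It assumes each query comes with a list of similarity scores whose first entry belongs to the ground truth and whose remaining entries belong to hard negatives; the function name `recall_at_1` and this layout are hypothetical.

```python
from typing import Sequence


def recall_at_1(score_rows: Sequence[Sequence[float]]) -> float:
    """Fraction of queries whose ground-truth candidate is ranked first.

    Assumed layout (hypothetical, for illustration only): in every row,
    index 0 is the ground-truth score and the rest are hard negatives.
    A tie between the ground truth and a negative counts as a miss.
    """
    if not score_rows:
        raise ValueError("need at least one query")
    hits = 0
    for scores in score_rows:
        if len(scores) < 2:
            raise ValueError("each query needs at least one negative")
        if scores[0] > max(scores[1:]):
            hits += 1
    return hits / len(score_rows)


# Two toy queries: the first is retrieved correctly, the second is not.
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.5, 0.1],
]
print(recall_at_1(toy_scores))
```

Counting ties as misses is a conservative choice made for this sketch; an evaluation could equally break ties at random.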
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Question Answering | IntentQA | Accuracy | 20 | Random |
| Question Answering | EgoSchema (fullset) | Accuracy | 20 | Random |
| Question Answering | EgoSchema (subset) | Accuracy | 20 | Random |
| Video Question Answering | IntentQA | Accuracy | 20 | Random |
| Video Question Answering | EgoSchema (fullset) | Accuracy | 20 | Random |
| Video Question Answering | EgoSchema (subset) | Accuracy | 20 | Random |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, SC) | 39.44 | ViT-L-14 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, UC) | 33.81 | ViT-L-14 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 47.86 | ViT-L-14 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 60.78 | ViT-L-14 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, SC) | 37.32 | ViT-B-16+240 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, UC) | 32.26 | ViT-B-16+240 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 46.53 | ViT-B-16+240 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 60.19 | ViT-B-16+240 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, SC) | 37.01 | ViT-B-16 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, UC) | 30.81 | ViT-B-16 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 44.93 | ViT-B-16 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 59 | ViT-B-16 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, SC) | 34.28 | ViT-B-32 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, UC) | 28 | ViT-B-32 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 42.75 | ViT-B-32 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 54.8 | ViT-B-32 (LAION400M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, SC) | 23.38 | RN50 (YFCC15M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, UC) | 20.08 | RN50 (YFCC15M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 39.85 | RN50 (YFCC15M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 39.83 | RN50 (YFCC15M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, SC) | 23.26 | RN50 (CC12M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, UC) | 19.96 | RN50 (CC12M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 34.88 | RN50 (CC12M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 45.27 | RN50 (CC12M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, SC) | 22.74 | RN101 (YFCC15M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, UC) | 20.5 | RN101 (YFCC15M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 39.5 | RN101 (YFCC15M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 39.56 | RN101 (YFCC15M) |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, SC) | 9.09 | Random |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom + HN-Comp, UC) | 9.09 | Random |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Atom, UC) | 20 | Random |
| Image Retrieval | CREPE (Compositional REPresentation Evaluation) | Recall@1 (HN-Comp, UC) | 14.29 | Random |
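The table can be read more easily by comparing the SC and UC rows per model and by checking the Random rows against uniform guessing. The sketch below does both, using Recall@1 values copied from the HN-Atom + HN-Comp rows above. Two things are assumptions rather than facts stated in the table: that SC and UC correspond to the seen and unseen splits mentioned in the abstract, and that the Random values come from uniform guessing over 11, 5, and 7 candidates (sizes inferred only from 9.09, 20, and 14.29).

```python
# Recall@1 (percent) from the "HN-Atom + HN-Comp" rows: (SC, UC) per model.
SC_UC = {
    "ViT-L-14 (LAION400M)": (39.44, 33.81),
    "ViT-B-16+240 (LAION400M)": (37.32, 32.26),
    "ViT-B-16 (LAION400M)": (37.01, 30.81),
    "ViT-B-32 (LAION400M)": (34.28, 28.0),
    "RN50 (YFCC15M)": (23.38, 20.08),
    "RN50 (CC12M)": (23.26, 19.96),
    "RN101 (YFCC15M)": (22.74, 20.5),
}


def sc_uc_gap(table):
    """Return the SC minus UC difference in percentage points per model."""
    return {model: round(sc - uc, 2) for model, (sc, uc) in table.items()}


def chance_percent(num_candidates):
    """Recall@1 (percent) of uniform random guessing over num_candidates."""
    return round(100.0 / num_candidates, 2)


gaps = sc_uc_gap(SC_UC)
for model, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{model:26s} SC - UC = {gap:.2f} points")

# Assumed candidate-set sizes, inferred from the Random rows only.
for label, size in [("HN-Atom + HN-Comp", 11), ("HN-Atom", 5), ("HN-Comp", 7)]:
    print(f"{label:18s} chance = {chance_percent(size):.2f}%")
```

Under these assumptions every listed model scores lower on UC than on SC in the combined setting, and the RN50 and RN101 UC scores sit roughly 11 points above the 9.09% chance level.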