Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Such models are generally either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both; and they often target only specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval | COCO (Common Objects in Context) | Recall@1 | 38.38 | FLAVA (zero-shot) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@5 | 67.47 | FLAVA (zero-shot) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@1 | 33.29 | CLIP (zero-shot) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@5 | 62.47 | CLIP (zero-shot) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 42.74 | FLAVA (ViT-B, zero-shot) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 76.76 | FLAVA (ViT-B, zero-shot) |
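The Recall@K numbers above follow the standard zero-shot cross-modal retrieval protocol: encode images and captions with the unimodal encoders, score every image-caption pair by cosine similarity of the contrastive embeddings, and count a query as correct if a ground-truth match appears among its top-K neighbors. The sketch below illustrates the metric only; the embedding tensors and the one-caption-per-image pairing are placeholder assumptions (COCO actually has five captions per image), and producing the embeddings with FLAVA's encoders and projection heads is not shown.

```python
import torch

def recall_at_k(query_embs, gallery_embs, gt_index, ks=(1, 5)):
    """Zero-shot retrieval Recall@K from precomputed embeddings.

    query_embs:   (N_q, D) tensor, e.g. caption embeddings for image retrieval
    gallery_embs: (N_g, D) tensor, e.g. image embeddings
    gt_index:     (N_q,) tensor giving the gallery index matching each query
    """
    # Cosine similarity via L2 normalization followed by a dot product.
    q = torch.nn.functional.normalize(query_embs, dim=-1)
    g = torch.nn.functional.normalize(gallery_embs, dim=-1)
    sims = q @ g.t()  # (N_q, N_g) similarity matrix

    results = {}
    for k in ks:
        topk = sims.topk(k, dim=-1).indices                  # (N_q, k) best gallery items per query
        hit = (topk == gt_index.unsqueeze(-1)).any(dim=-1)   # ground truth within top-K?
        results[f"Recall@{k}"] = hit.float().mean().item() * 100
    return results

# Example with random placeholder embeddings; in practice these would come
# from the model's image/text encoders and contrastive projection heads.
image_embs = torch.randn(5000, 768)  # one embedding per test image
text_embs = torch.randn(5000, 768)   # one embedding per caption (simplified pairing)
gt = torch.arange(5000)              # caption i matches image i

print(recall_at_k(text_embs, image_embs, gt))  # image retrieval (text -> image)
print(recall_at_k(image_embs, text_embs, gt))  # image-to-text retrieval
```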