Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan

Published 2022-04-29 · DeepMind

Tasks: Zero-Shot Video Question Answering · Question Answering · Few-Shot Learning · Zero-Shot Cross-Modal Retrieval · Generative Visual Question Answering · Temporal/Causal QA · Video Question Answering · Video Understanding · Visual Question Answering (VQA) · Meme Classification · Zero-Shot Learning · Medical Visual Question Answering · Language Modelling · Multiple-Choice · Visual Question Answering

Abstract

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
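The "key architectural innovations" referenced above center on tanh-gated cross-attention layers interleaved into a frozen pretrained language model, so that visual features can condition text generation without disturbing what the language model already knows. Below is a minimal PyTorch sketch of such a gated cross-attention block; the layer sizes, the omission of layer norms, and the class name are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style tanh-gated cross-attention (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Gates start at zero, so tanh(gate) = 0 and the frozen language
        # model's original behaviour is preserved at initialisation.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # text:   (batch, text_len, dim)   hidden states of the frozen LM
        # vision: (batch, vision_len, dim) visual tokens (e.g. from a resampler)
        attn_out, _ = self.attn(query=text, key=vision, value=vision)
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.ff_gate) * self.ff(text)
        return text

# Toy shapes: 2 samples, 16 text tokens, 64 visual tokens, width 512.
block = GatedCrossAttentionBlock(dim=512)
out = block(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

The zero-initialised gates are what allow training to start from the frozen models' strengths: the network initially ignores the visual stream and learns to open the gates gradually.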

Results

Task | Dataset | Metric | Value | Model
Question Answering | STAR Benchmark | Accuracy | 41.8 | Flamingo-9B
Question Answering | NExT-QA | WUPS | 33.5 | Flamingo (32-shot)
Question Answering | NExT-QA | WUPS | 26.7 | Flamingo (0-shot)
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 47.4 | Flamingo
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 31.0 | Flamingo (32-shot)
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 17.4 | Flamingo (0-shot)
Visual Question Answering (VQA) | OK-VQA | Accuracy | 50.6 | Flamingo-80B
Visual Question Answering (VQA) | OK-VQA | Accuracy | 44.7 | Flamingo-9B
Visual Question Answering (VQA) | OK-VQA | Accuracy | 41.2 | Flamingo-3B
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 56.3 | Flamingo-80B
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 51.8 | Flamingo-9B
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 49.2 | Flamingo-3B
Visual Question Answering (VQA) | PMC-VQA | Accuracy | 26.4 | Open-Flamingo
Video Question Answering | STAR Benchmark | Average Accuracy | 42.8 | Flamingo-9B (4-shot)
Video Question Answering | STAR Benchmark | Average Accuracy | 42.4 | Flamingo-80B (4-shot)
Video Question Answering | STAR Benchmark | Average Accuracy | 41.8 | Flamingo-9B (0-shot)
Video Question Answering | STAR Benchmark | Average Accuracy | 39.7 | Flamingo-80B (0-shot)
Action/Activity Recognition | RareAct | mWAP | 60.8 | 🦩 Flamingo
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 89.3 | Flamingo
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 98.8 | Flamingo
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.7 | Flamingo
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 79.5 | Flamingo
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 95.3 | Flamingo
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 97.9 | Flamingo
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 65.9 | Flamingo
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 87.3 | Flamingo
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 92.9 | Flamingo
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 48.0 | Flamingo
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 73.3 | Flamingo
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 82.1 | Flamingo
Meme Classification | Hateful Memes | ROC-AUC | 0.866 | Flamingo (fine-tuned)
Meme Classification | Hateful Memes | ROC-AUC | 0.700 | Flamingo (32-shot)
Generative Visual Question Answering | PMC-VQA | BLEU-1 | 4.1 | Open-Flamingo

Accuracy and recall values are percentages; ROC-AUC is on a 0-1 scale.
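The k-shot entries above come from in-context prompting rather than fine-tuning: k support examples are interleaved with the query image or video in a single multimodal prompt, and the model's completion is taken as its answer. The sketch below shows one plausible way to assemble such a VQA prompt; the <image> placeholder and the Question:/Answer: template are illustrative assumptions, not necessarily the paper's exact special tokens.

```python
def build_vqa_prompt(support, query_question):
    """support: list of (image_path, question, answer) few-shot examples.

    Returns an interleaved text prompt; the referenced images are fed to
    the vision encoder in the same order as their <image> placeholders.
    """
    parts = []
    for image_path, question, answer in support:
        # Each support example contributes an image placeholder plus a
        # completed question/answer pair the model can imitate.
        parts.append(f"<image>Question: {question} Answer: {answer}")
    # The query ends after "Answer:" so the model generates the answer.
    parts.append(f"<image>Question: {query_question} Answer:")
    return "".join(parts)

# Example: a 2-shot prompt (paths and questions are made up).
prompt = build_vqa_prompt(
    [("img1.jpg", "What animal is this?", "a flamingo"),
     ("img2.jpg", "What color is the car?", "red")],
    "How many people are visible?",
)
print(prompt)
```

No weights are updated at any point; moving from the 0-shot to the 32-shot rows above amounts only to packing more support examples into the prompt.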

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)