Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Image Captioning on COCO Captions

Metric: METEOR (higher is better)

Results

| # | Model | METEOR | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | CoCa | 33.9 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 2 | SimVLM | 33.4 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 3 | OFA | 32.5 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 4 | GIT | 32.2 | No | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 5 | mPLUG | 32 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 6 | Prompt Tuning | 31.51 | No | Prompt Tuning for Generative Multimodal Pretrain... | 2022-08-04 | Code |
| 7 | LEMON | 31.4 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 8 | Prismer | 31.4 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 9 | L-Verse | 31.4 | No | L-Verse: Bidirectional Generation Between Image ... | 2021-11-22 | Code |
| 10 | VinVL | 31.1 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 11 | ExpansionNet v2 (No VL pretraining) | 30.6 | No | Exploiting Multiple Sequence Lengths in Fast End... | 2022-08-13 | Code |
| 12 | GRIT (No VL pretraining - base) | 30.6 | No | GRIT: Faster and Better Image captioning Transfo... | 2022-07-20 | Code |
| 13 | Oscar | 30.6 | No | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 14 | Xmodal-Ctx | 30.4 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 15 | PTP-BLIP (14M) | 30.4 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 16 | Xmodal-Ctx | 30 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 17 | X-Transformer | 29.5 | No | X-Linear Attention Networks for Image Captioning | 2020-03-31 | Code |
| 18 | LaDiC (ours, 30 steps) | 29.5 | No | LaDiC: Are Diffusion Models Really Inferior to A... | 2024-04-16 | Code |
| 19 | AoANet + VC | 29.3 | No | Visual Commonsense R-CNN | 2020-02-27 | Code |
| 20 | Meshed-Memory Transformer | 29.2 | No | Meshed-Memory Transformer for Image Captioning | 2019-12-17 | Code |
| 21 | Transformer_NSC | 28.9 | No | A Better Variant of Self-Critical Sequence Train... | 2020-03-22 | Code |
| 22 | CLIP Text Encoder (RL w/ CIDEr-reward) | 28.7 | No | Fine-grained Image Captioning with CLIP Reward | 2022-05-26 | Code |
| 23 | RefineCap (w/ REINFORCE) | 28.3 | No | RefineCap: Concept-Aware Refinement for Image Ca... | 2021-09-08 | - |
| 24 | SmallCap d=16, Large | 28.3 | No | SmallCap: Lightweight Image Captioning Prompted ... | 2022-09-30 | Code |
| 25 | RDN | 28.1 | No | Reflective Decoding Network for Image Captioning | 2019-08-30 | - |
| 26 | ClipCap (Transformer) | 27.45 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 27 | ClipCap (MLP + GPT2 tuning) | 27.1 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 28 | CapDec | 25.1 | No | Text-Only Training for Image Captioning using No... | 2022-11-01 | Code |
| 29 | From Captions to Visual Concepts and Back | 23.6 | No | From Captions to Visual Concepts and Back | 2014-11-18 | Code |
| 30 | VLKD (ViT-B/16) | 19.7 | No | - | - | - |