Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Image Captioning on COCO Captions

Metric: BLEU-4 (higher is better)
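For reference, BLEU-4 is the geometric mean of clipped 1- to 4-gram precisions between a candidate caption and its references, scaled by a brevity penalty. A minimal single-reference sketch (stdlib only; real COCO evaluation uses multiple references per image and corpus-level statistics, so this simplified version is illustrative, not the official scorer):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Simplified single-reference BLEU-4: geometric mean of clipped
    1-4-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())      # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:                      # any zero precision -> score 0
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

cand = "a man is riding a horse on the beach".split()
ref = "a man is riding a brown horse on the beach".split()
print(round(bleu4(cand, ref), 3))
```

Leaderboard scores are conventionally reported on a 0-100 scale, i.e. this value multiplied by 100.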


Results

| # | Model | BLEU-4 | Extra Data | Paper | Date | Code |
|---|-------|--------|------------|-------|------|------|
| 1 | mPLUG | 46.5 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 2 | OFA | 44.9 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 3 | GIT | 44.1 | No | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 4 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 43.7 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 5 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 43.5 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 6 | ExpansionNet v2 (No VL pretraining) | 42.7 | No | Exploiting Multiple Sequence Lengths in Fast End... | 2022-08-13 | Code |
| 7 | LEMON | 42.6 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 8 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 42.4 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 9 | GRIT (No VL pretraining - base) | 42.4 | No | GRIT: Faster and Better Image captioning Transfo... | 2022-07-20 | Code |
| 10 | Prompt Tuning | 41.81 | No | Prompt Tuning for Generative Multimodal Pretrain... | 2022-08-04 | Code |
| 11 | Oscar | 41.7 | No | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 12 | Xmodal-Ctx | 41.4 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 13 | Xmodal-Ctx + OSCAR | 41.3 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 14 | X-VLM (base) | 41.3 | No | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 15 | VinVL | 41.0 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 16 | CoCa | 40.9 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 17 | SimVLM | 40.6 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 18 | Prismer | 40.4 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 19 | PTP-BLIP (14M) | 40.1 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 20 | L-Verse | 39.9 | No | L-Verse: Bidirectional Generation Between Image ... | 2021-11-22 | Code |
| 21 | Xmodal-Ctx | 39.7 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 22 | X-Transformer | 39.7 | No | X-Linear Attention Networks for Image Captioning | 2020-03-31 | Code |
| 23 | AoANet + VC | 39.5 | No | Visual Commonsense R-CNN | 2020-02-27 | Code |
| 24 | Transformer_NSC | 39.4 | No | A Better Variant of Self-Critical Sequence Train... | 2020-03-22 | Code |
| 25 | Meshed-Memory Transformer | 39.1 | No | Meshed-Memory Transformer for Image Captioning | 2019-12-17 | Code |
| 26 | CLIP Text Encoder (RL w/ CIDEr-reward) | 38.2 | No | Fine-grained Image Captioning with CLIP Reward | 2022-05-26 | Code |
| 27 | RefineCap (w/ REINFORCE) | 37.8 | No | RefineCap: Concept-Aware Refinement for Image Ca... | 2021-09-08 | - |
| 28 | RDN | 37.3 | No | Reflective Decoding Network for Image Captioning | 2019-08-30 | - |
| 29 | SmallCap (d=16, Large) | 37.2 | No | SmallCap: Lightweight Image Captioning Prompted ... | 2022-09-30 | Code |
| 30 | ClipCap (Transformer) | 33.53 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 31 | ClipCap (MLP + GPT2 tuning) | 32.15 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 32 | CapDec | 26.4 | No | Text-Only Training for Image Captioning using No... | 2022-11-01 | Code |
| 33 | From Captions to Visual Concepts and Back | 25.7 | No | From Captions to Visual Concepts and Back | 2014-11-18 | Code |
| 34 | VLKD (ViT-B/16) | 16.7 | No | - | - | - |
| 35 | LaDiC (ours, 30 steps) | 0.382 | No | LaDiC: Are Diffusion Models Really Inferior to A... | 2024-04-16 | Code |

Results are sorted by BLEU-4 in descending order.