Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Image Captioning on COCO Captions

Metric: CIDEr (higher is better)
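CIDEr scores a candidate caption by its TF-IDF-weighted n-gram agreement with the human reference captions, averaged over n = 1..4 and scaled by 10 (which is why leaderboard values exceed 100). Below is a minimal, simplified sketch of that idea; the function name `cider_like` is illustrative, and the official CIDEr-D variant additionally clips n-gram counts against the references and applies a Gaussian length penalty.

```python
# Simplified CIDEr-style score: TF-IDF-weighted n-gram cosine similarity
# between a candidate caption and its references, averaged over n = 1..4.
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cider_like(candidate, references, corpus_refs, max_n=4):
    """candidate: str; references: list[str] for the same image;
    corpus_refs: list[list[str]], the reference sets of every image in the
    dataset (used only to estimate document frequencies)."""
    num_images = len(corpus_refs)
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: number of images whose references use the n-gram.
        df = Counter()
        for refs in corpus_refs:
            seen = set()
            for r in refs:
                seen |= set(ngrams(r.split(), n))
            df.update(seen)

        def tfidf(counts):
            # Common n-grams ("a", "the") get low weight via the IDF term.
            return {g: c * math.log(num_images / df.get(g, 1))
                    for g, c in counts.items()}

        cand_vec = tfidf(ngrams(candidate.split(), n))
        sims = []
        for ref in references:
            ref_vec = tfidf(ngrams(ref.split(), n))
            dot = sum(cand_vec.get(g, 0.0) * w for g, w in ref_vec.items())
            norm = (math.sqrt(sum(v * v for v in cand_vec.values())) *
                    math.sqrt(sum(v * v for v in ref_vec.values())))
            sims.append(dot / norm if norm else 0.0)
        score += sum(sims) / len(sims)
    return 10.0 * score / max_n  # CIDEr is conventionally scaled by 10
```

With this toy corpus, a caption identical to its reference scores 10.0 (the maximum of the simplified version), while an unrelated caption scores near zero.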


Results

| # | Model | CIDEr | Extra Data | Paper | Date | Code |
|---|-------|-------|------------|-------|------|------|
| 1 | mPLUG | 155.1 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 2 | OFA | 154.9 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 3 | VALOR | 152.5 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 4 | GIT | 151.1 | No | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 5 | VAST | 149 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 6 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 145.8 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 7 | LEMON | 145.5 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 8 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 145.2 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 9 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 144.5 | No | BLIP-2: Bootstrapping Language-Image Pre-trainin... | 2023-01-30 | Code |
| 10 | GRIT (no VL pretraining, base) | 144.2 | No | GRIT: Faster and Better Image captioning Transfo... | 2022-07-20 | Code |
| 11 | ExpansionNet v2 (no VL pretraining) | 143.7 | No | Exploiting Multiple Sequence Lengths in Fast End... | 2022-08-13 | Code |
| 12 | CoCa | 143.6 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 13 | SimVLM | 143.3 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 14 | Xmodal-Ctx + OSCAR | 142.2 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 15 | Prompt Tuning | 141.4 | No | Prompt Tuning for Generative Multimodal Pretrain... | 2022-08-04 | Code |
| 16 | VinVL | 140.9 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 17 | X-VLM (base) | 140.8 | No | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 18 | Oscar | 140 | No | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 19 | Xmodal-Ctx | 139.9 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 20 | Prismer | 136.5 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 21 | Xmodal-Ctx | 135.9 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 22 | PTP-BLIP (14M) | 135 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 23 | X-Transformer | 132.8 | No | X-Linear Attention Networks for Image Captioning | 2020-03-31 | Code |
| 24 | Meshed-Memory Transformer | 131.2 | No | Meshed-Memory Transformer for Image Captioning | 2019-12-17 | Code |
| 25 | Transformer_NSC | 129.6 | No | A Better Variant of Self-Critical Sequence Train... | 2020-03-22 | Code |
| 26 | RefineCap (w/ REINFORCE) | 127.2 | No | RefineCap: Concept-Aware Refinement for Image Ca... | 2021-09-08 | - |
| 27 | LaDiC (30 steps) | 126.2 | No | LaDiC: Are Diffusion Models Really Inferior to A... | 2024-04-16 | Code |
| 28 | RDN | 125.2 | No | Reflective Decoding Network for Image Captioning | 2019-08-30 | - |
| 29 | CLIP Text Encoder (RL w/ CIDEr reward) | 124.9 | No | Fine-grained Image Captioning with CLIP Reward | 2022-05-26 | Code |
| 30 | SmallCap (d=16, Large) | 121.8 | No | SmallCap: Lightweight Image Captioning Prompted ... | 2022-09-30 | Code |
| 31 | ClipCap (Transformer) | 113.08 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 32 | ClipCap (MLP + GPT2 tuning) | 108.35 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 33 | VirTex (ResNet-101) | 94 | No | VirTex: Learning Visual Representations from Tex... | 2020-06-11 | Code |
| 34 | CapDec | 91.8 | No | Text-Only Training for Image Captioning using No... | 2022-11-01 | Code |
| 35 | KOSMOS-1 (1.6B) (zero-shot) | 84.7 | No | - | - | - |
| 36 | VLKD (ViT-B/16) | 58.3 | No | - | - | - |