Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Image Captioning on COCO Captions

Metric: SPICE (higher is better)
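For readers unfamiliar with the metric: SPICE scores a caption by an F-score over semantic tuples (objects, attributes, relations) extracted from the candidate and reference captions. The sketch below illustrates only that F-score step, assuming the tuples have already been extracted and using exact matching. The function name `spice_like_f1` and the example tuples are hypothetical. The real metric also parses captions into scene graphs and matches synonyms, so this is not the official implementation, and leaderboard values appear to be the score scaled by 100.

```python
def spice_like_f1(candidate_tuples, reference_tuples):
    """Simplified SPICE-style F1 over exact-matching semantic tuples.

    Assumption: tuples are already extracted; the real metric derives them
    from scene-graph parses and also allows synonym matches.
    """
    cand = set(candidate_tuples)
    ref = set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)  # share of candidate tuples found in the references
    recall = matched / len(ref)      # share of reference tuples recovered by the candidate
    return 2 * precision * recall / (precision + recall)


# Hypothetical tuples for one candidate caption and its references.
candidate = [("girl",), ("table",), ("girl", "standing")]
reference = [("girl",), ("table",), ("girl", "sitting"), ("girl", "at", "table")]

score = spice_like_f1(candidate, reference)
print(round(100 * score, 2))  # scaled by 100 to match the leaderboard's range
```

Two of the three candidate tuples match, so precision is 2/3 and recall is 2/4, which shows why a higher score is better: it rewards captions that recover more of the reference content without adding unsupported content.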


Results

| # | Model | SPICE | Extra Data | Paper | Date | Code |
|---|-------|-------|------------|-------|------|------|
| 1 | VAST | 27 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 2 | OFA | 26.6 | No | OFA: Unifying Architectures, Tasks, and Modaliti... | 2022-02-07 | Code |
| 3 | GIT | 26.3 | No | GIT: A Generative Image-to-text Transformer for ... | 2022-05-27 | Code |
| 4 | mPLUG | 26 | No | mPLUG: Effective and Efficient Vision-Language L... | 2022-05-24 | Code |
| 5 | VALOR | 25.7 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 6 | LEMON | 25.5 | No | Scaling Up Vision-Language Pre-training for Imag... | 2021-11-24 | - |
| 7 | SimVLM | 25.4 | No | SimVLM: Simple Visual Language Model Pretraining... | 2021-08-24 | Code |
| 8 | VinVL | 25.2 | No | VinVL: Revisiting Visual Representations in Visi... | 2021-01-02 | Code |
| 9 | Xmodal-Ctx + OSCAR | 24.9 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 10 | ExpansionNet v2 (No VL pretraining) | 24.7 | No | Exploiting Multiple Sequence Lengths in Fast End... | 2022-08-13 | Code |
| 11 | CoCa | 24.7 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 12 | Oscar | 24.5 | No | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 13 | Prompt Tuning | 24.42 | No | Prompt Tuning for Generative Multimodal Pretrain... | 2022-08-04 | Code |
| 14 | Prismer | 24.4 | No | Prismer: A Vision-Language Model with Multi-Task... | 2023-03-04 | Code |
| 15 | GRIT (No VL pretraining - base) | 24.3 | No | GRIT: Faster and Better Image captioning Transfo... | 2022-07-20 | Code |
| 16 | Xmodal-Ctx | 24 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 17 | PTP-BLIP (14M) | 23.7 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 18 | Xmodal-Ctx | 23.7 | No | Beyond a Pre-Trained Object Detector: Cross-Moda... | 2022-05-09 | Code |
| 19 | X-Transformer | 23.4 | No | X-Linear Attention Networks for Image Captioning | 2020-03-31 | Code |
| 20 | L-Verse | 23.3 | No | L-Verse: Bidirectional Generation Between Image ... | 2021-11-22 | Code |
| 21 | Transformer_NSC | 22.8 | No | A Better Variant of Self-Critical Sequence Train... | 2020-03-22 | Code |
| 22 | Meshed-Memory Transformer | 22.6 | No | Meshed-Memory Transformer for Image Captioning | 2019-12-17 | Code |
| 23 | RefineCap (w/ REINFORCE) | 22.5 | No | RefineCap: Concept-Aware Refinement for Image Ca... | 2021-09-08 | - |
| 24 | LaDiC | 22.4 | No | LaDiC: Are Diffusion Models Really Inferior to A... | 2024-04-16 | Code |
| 25 | SmallCap d=16, Large | 21.5 | No | SmallCap: Lightweight Image Captioning Prompted ... | 2022-09-30 | Code |
| 26 | ClipCap (Transformer) | 21.05 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 27 | ClipCap (MLP + GPT2 tuning) | 20.12 | No | ClipCap: CLIP Prefix for Image Captioning | 2021-11-18 | Code |
| 28 | Virtex (ResNet-101) | 18.5 | No | VirTex: Learning Visual Representations from Tex... | 2020-06-11 | Code |
| 29 | KOSMOS-1 (1.6B) (zero-shot) | 16.8 | No | - | - | - |
| 30 | VLKD (ViT-B/16) | 13.4 | No | - | - | - |