


Deep Visual-Semantic Alignments for Generating Image Descriptions

Andrej Karpathy, Li Fei-Fei

Date: 2014-12-07
Venue: CVPR 2015
Tasks: Cross-Modal Retrieval, Image Captioning, Image-to-Text Retrieval, Retrieval

Abstract

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
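
To make the idea of a structured objective over a shared embedding more concrete, here is a minimal sketch of one way such an alignment objective can be written. The abstract does not spell out the formulation, so everything here is an assumption for illustration: the scoring rule (each word vector is matched with its best-scoring region vector and the dot products are summed), the max-margin ranking loss with a margin of 1, and the names `pair_score` and `ranking_loss`. Region and word vectors are taken to already live in the same embedding space.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def pair_score(regions, words):
    """Image-sentence score: each word picks its best-matching region."""
    return sum(max(dot(r, w) for r in regions) for w in words)

def ranking_loss(images, sentences, margin=1.0):
    """Max-margin ranking loss over a batch where images[k] matches sentences[k].

    The matched pair (k, k) should outscore every mismatched pair by `margin`.
    """
    n = len(images)
    S = [[pair_score(images[k], sentences[l]) for l in range(n)] for k in range(n)]
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            loss += max(0.0, S[k][l] - S[k][k] + margin)  # wrong sentence for image k
            loss += max(0.0, S[l][k] - S[k][k] + margin)  # wrong image for sentence k
    return loss

# Toy batch: two images with one region each, two sentences with one word each.
images = [[[1.0, 0.0]], [[0.0, 1.0]]]
aligned = [[[2.0, 0.0]], [[0.0, 2.0]]]   # sentence k matches image k
swapped = [[[0.0, 2.0]], [[2.0, 0.0]]]   # sentences assigned to the wrong images

print(ranking_loss(images, aligned))  # correctly aligned batch
print(ranking_loss(images, swapped))  # misaligned batch incurs a penalty
```

In this toy setup the aligned batch incurs no loss, while the swapped batch is penalized for every mismatched pair that outscores its matched pair.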

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Captioning | Flickr30k Captions test | BLEU-4 | 15.7 | BRNN |
| Image Captioning | Flickr30k Captions test | CIDEr | 24.7 | BRNN |
| Image Captioning | Flickr30k Captions test | METEOR | 15.3 | BRNN |
| Image Retrieval | Flickr30K 1K test | R@1 | 15.2 | DVSA (R-CNN, AlexNet) |
| Image Retrieval | Flickr30K 1K test | R@10 | 50.5 | DVSA (R-CNN, AlexNet) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 41.2 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 81.1 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 70.5 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 25.3 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 66.4 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 53.4 | Dual-Path (ResNet) |
| Question Generation | COCO Visual Question Answering (VQA) real images 1.0 open ended | BLEU-1 | 62.5 | coco-Caption [Karpathy and Li, 2014] |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 41.2 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 81.1 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 70.5 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 25.3 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 66.4 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 53.4 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 41.2 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 81.1 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 70.5 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 25.3 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 66.4 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 53.4 | Dual-Path (ResNet) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 74.8 | DVSA |
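
The R@K and Recall@K metrics in the results above are retrieval recall at cutoff K: the percentage of queries for which a correct item appears among the top K ranked results. The sketch below shows one common way to compute it. The function name `recall_at_k`, the toy rankings, and the convention that a query counts as a hit when any of its relevant items is in the top K are assumptions for illustration, not the evaluation code behind the reported numbers.

```python
def recall_at_k(rankings, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    rankings: one ranked list of candidate ids per query, best first.
    relevant: one set of correct ids per query, in the same order.
    """
    hits = sum(
        1
        for ranked, rel in zip(rankings, relevant)
        if any(item in rel for item in ranked[:k])
    )
    return 100.0 * hits / len(rankings)

# Toy example: four queries, each ranking the same three candidates.
rankings = [
    ["a", "b", "c"],
    ["c", "a", "b"],
    ["b", "c", "a"],
    ["a", "c", "b"],
]
relevant = [{"a"}, {"a"}, {"a"}, {"b"}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, relevant, k):.1f}")
```

Because a larger cutoff can only add hits, R@1 is never higher than R@5, which in turn is never higher than R@10, as in the COCO 2014 rows.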

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)