Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SciCap: Generating Captions for Scientific Figures

Ting-Yao Hsu, C. Lee Giles, Ting-Hao 'Kenneth' Huang

2021-10-22 · Findings (EMNLP) 2021 · Image Captioning
Paper · PDF · Code (official)

Abstract

Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing - including figure-type classification, sub-figure identification, text normalization, and caption text selection - SCICAP contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.
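The caption text selection step described above corresponds to the three caption variants evaluated in the results table: first sentence only, single-sentence captions, and captions of at most 100 words. A minimal sketch of such filters, assuming a naive regex-based sentence splitter; the function names and the splitting rule are illustrative assumptions, not the authors' actual pipeline:

```python
import re

def first_sentence(caption: str) -> str:
    """Return the first sentence of a caption.

    Naive split on sentence-ending punctuation followed by whitespace;
    illustrative only (e.g. it would split incorrectly on "Fig. 1").
    """
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return parts[0]

def within_word_limit(caption: str, limit: int = 100) -> bool:
    """True if the caption has at most `limit` whitespace-separated words."""
    return len(caption.split()) <= limit

caption = "Accuracy over epochs. Shaded area shows one standard deviation."
print(first_sentence(caption))          # first sentence of the caption
print(within_word_limit(caption, 100))  # length filter
```

A real pipeline would also need the figure-type classification and sub-figure identification stages, which are beyond a few lines of text processing.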

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Image Captioning | SCICAP | BLEU-4 | 0.0219 | CNN+LSTM (Vision only, First sentence) |
| Image Captioning | SCICAP | BLEU-4 | 0.0213 | CNN+LSTM (Text only, First sentence) |
| Image Captioning | SCICAP | BLEU-4 | 0.0212 | CNN+LSTM (Text only, Single-Sent Caption) |
| Image Captioning | SCICAP | BLEU-4 | 0.0207 | CNN+LSTM (Vision only, Single-Sent Caption) |
| Image Captioning | SCICAP | BLEU-4 | 0.0205 | CNN+LSTM (Vision + Text, First sentence) |
| Image Captioning | SCICAP | BLEU-4 | 0.0202 | CNN+LSTM (Vision + Text, Single-Sent Caption) |
| Image Captioning | SCICAP | BLEU-4 | 0.0172 | CNN+LSTM (Vision only, Caption w/ <=100 words) |
| Image Captioning | SCICAP | BLEU-4 | 0.0168 | CNN+LSTM (Vision + Text, Caption w/ <=100 words) |
| Image Captioning | SCICAP | BLEU-4 | 0.0165 | CNN+LSTM (Text only, Caption w/ <=100 words) |
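Every row above reports a BLEU-4 score. For readers unfamiliar with the metric, here is a minimal pure-Python sketch of sentence-level BLEU-4 (geometric mean of modified 1- to 4-gram precisions times a brevity penalty, with add-one smoothing so short hypotheses do not score exactly zero). The paper's numbers come from standard evaluation tooling, so treat this as an illustration of the metric, not the authors' scorer:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing on each precision.

    reference, hypothesis: lists of tokens.
    """
    precisions = []
    for n in range(1, 5):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Modified precision: clip each hypothesis n-gram count by the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    # Brevity penalty punishes hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(hypothesis), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "the accuracy increases with the number of training epochs".split()
hyp = "accuracy increases with training epochs".split()
print(f"BLEU-4: {bleu4(ref, hyp):.4f}")
```

Scores around 0.02, as in the table, indicate very little 4-gram overlap between generated and ground-truth captions, which is consistent with the paper's framing of scientific figure captioning as a steep challenge.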

Related Papers

- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
- HalLoc: Token-level Localization of Hallucinations for Vision Language Models (2025-06-12)
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (2025-06-11)
- A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
- Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning (2025-06-11)
- Edit Flows: Flow Matching with Edit Operations (2025-06-10)
- Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings (2025-06-10)