TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Dense Relational Captioning: Triple-Stream Networks for Re...

Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning

Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, In So Kweon

2019-03-14CVPR 2019 6POSImage Captioning
PaperPDFCode(official)

Abstract

Our goal in this work is to train an image captioning model that generates more dense and informative captions. We introduce "relational captioning," a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in an image. Relational captioning is a framework that is advantageous in both diversity and amount of information, leading to image understanding based on relationships. Part-of speech (POS, i.e. subject-object-predicate categories) tags can be assigned to every English word. We leverage the POS as a prior to guide the correct sequence of words in a caption. To this end, we propose a multi-task triple-stream network (MTTSNet) which consists of three recurrent units for the respective POS and jointly performs POS prediction and captioning. We demonstrate more diverse and richer representations generated by the proposed model against several baselines and competing methods.

Results

TaskDatasetMetricValueModel
Image Captioningrelational captioning datasetImage-Level Recall34.27MTTSNet

Related Papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos2025-07-16Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval2025-06-28LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops2025-06-17Hybrid Meta-learners for Estimating Heterogeneous Treatment Effects2025-06-16HalLoc: Token-level Localization of Hallucinations for Vision Language Models2025-06-12ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs2025-06-11A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning2025-06-11Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning2025-06-11