Yekun Chai, Shuo Jin, Junliang Xing
Automatically translating images into text involves image scene understanding and language modeling. In this paper, we propose a novel model, termed RefineCap, that refines the output vocabulary of the language decoder using decoder-guided visual semantics, and implicitly learns the mapping between visual tag words and images. The proposed Visual-Concept Refinement method allows the generator to attend to semantic details in the image, thereby producing more semantically descriptive captions. Our model achieves superior performance on the MS-COCO dataset in comparison with previous visual-concept based models.
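To make the refinement idea concrete, below is a minimal PyTorch sketch of decoder-guided refinement of vocabulary logits with visual concept (tag) embeddings. It is an illustration of the general technique only, not the paper's exact architecture; the module name `VisualConceptRefiner` and all parameter shapes are assumptions for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualConceptRefiner(nn.Module):
    """Sketch: bias the decoder's vocabulary logits using attention over
    visual concept (tag) embeddings, with the decoder state as the query.
    Names and dimensions are hypothetical, not taken from RefineCap."""

    def __init__(self, hidden_dim: int, concept_dim: int, vocab_size: int):
        super().__init__()
        # Project the decoder hidden state into the concept embedding space.
        self.query = nn.Linear(hidden_dim, concept_dim)
        # Map the attended concept context to a vocabulary-sized refinement term.
        self.to_vocab = nn.Linear(concept_dim, vocab_size)

    def forward(self, decoder_state, concept_embs, base_logits):
        # decoder_state: (batch, hidden_dim)        current decoder hidden state
        # concept_embs:  (batch, n_concepts, dim)   embeddings of detected tag words
        # base_logits:   (batch, vocab_size)        unrefined vocabulary logits
        q = self.query(decoder_state).unsqueeze(1)          # (batch, 1, dim)
        scores = (q * concept_embs).sum(-1)                 # (batch, n_concepts)
        attn = F.softmax(scores / concept_embs.size(-1) ** 0.5, dim=-1)
        ctx = (attn.unsqueeze(-1) * concept_embs).sum(1)    # (batch, dim)
        # Additive refinement nudges probability mass toward visual tag words.
        return base_logits + self.to_vocab(ctx)
```

At each decoding step, the refiner would be applied to the decoder's raw logits before the softmax, so tokens matching detected visual concepts receive a learned boost; an additive bias is one simple choice, and gated or multiplicative variants would fit the same interface.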
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Captioning | COCO Captions | BLEU-1 | 80.2 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-2 | 64.5 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-3 | 49.9 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | BLEU-4 | 37.8 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | CIDEr | 127.2 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | METEOR | 28.3 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | ROUGE-L | 58.0 | RefineCap (w/ REINFORCE) |
| Image Captioning | COCO Captions | SPICE | 22.5 | RefineCap (w/ REINFORCE) |