Improved Adversarial Image Captioning
Pierre Dognin, Igor Melnyk, Youssef Mroueh, Jarret Ross, Tom Sercu
Abstract
In this paper we study image captioning as a conditional GAN training, proposing both a context-aware LSTM captioner and co-attentive discriminator, which enforces semantic alignment between images and captions. We investigate the viability of two discrete GAN training methods: Self-critical Sequence Training (SCST) and Gumbel Straight-Through (ST) and demonstrate that SCST shows more stable gradient behavior and improved results over Gumbel ST.
Related Papers
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos2025-07-16Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval2025-06-28HalLoc: Token-level Localization of Hallucinations for Vision Language Models2025-06-12ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs2025-06-11A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning2025-06-11Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning2025-06-11Edit Flows: Flow Matching with Edit Operations2025-06-10Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings2025-06-10