Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning

Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi

2022-08-13 · Image Captioning
Paper · PDF · Code (official)

Abstract

We introduce a method called the Expansion mechanism, which processes the input without being constrained by the number of elements in the sequence. By doing so, the model can learn more effectively than traditional attention-based approaches. To support this claim, we design a novel architecture, ExpansionNet v2, which achieves strong results on the MS COCO 2014 Image Captioning challenge and the state of the art in its respective category, with a score of 143.7 CIDEr-D on the offline test split, 140.8 CIDEr-D on the online evaluation server, and 72.9 All-CIDEr on the nocaps validation set. Additionally, we introduce an end-to-end training algorithm up to 2.8 times faster than established alternatives. Source code available at: https://github.com/jchenghu/ExpansionNet_v2
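The abstract does not spell out how the Expansion mechanism works internally; as an illustration only, one way to process a sequence independently of its length is to attend from a fixed set of learned query vectors, so any number of input tokens maps to a fixed-size representation. The function and variable names below (`expand_sequence`, `queries`) are ours, not the paper's, and this sketch should not be read as the paper's actual mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def expand_sequence(x, queries):
    """Map a length-n input to a fixed-length representation.

    x:       (n, d) input sequence, n may vary per example
    queries: (m, d) fixed set of learned "expansion" vectors
    Returns  (m, d): each output row is an attention-weighted
    mix of the input rows, so the output shape never depends on n.
    """
    d = x.shape[1]
    scores = queries @ x.T / np.sqrt(d)   # (m, n) similarity scores
    weights = softmax(scores, axis=-1)    # attention over input tokens
    return weights @ x                    # (m, d) fixed-size output

rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 64))      # fixed target length m = 16
short = rng.normal(size=(5, 64))         # a 5-token input
long = rng.normal(size=(37, 64))         # a 37-token input
print(expand_sequence(short, queries).shape)  # (16, 64)
print(expand_sequence(long, queries).shape)   # (16, 64)
```

Both inputs, despite different lengths, yield the same (16, 64) output, which is the property the abstract emphasizes: downstream layers see a representation whose size is unconstrained by the input sequence length.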

Results

Task             | Dataset                          | Metric  | Value | Model
Image Captioning | COCO Captions                    | BLEU-1  | 83.5  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | BLEU-4  | 42.7  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | CIDEr   | 143.7 | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | METEOR  | 30.6  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | ROUGE-L | 61.1  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | SPICE   | 24.7  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO (Common Objects in Context) | CIDEr   | 143.7 | ExpansionNet v2

Related Papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
HalLoc: Token-level Localization of Hallucinations for Vision Language Models (2025-06-12)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (2025-06-11)
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning (2025-06-11)
Edit Flows: Flow Matching with Edit Operations (2025-06-10)
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings (2025-06-10)