Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning

Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi

2022-08-13 · Image Captioning
Paper · PDF · Code (official)

Abstract

We introduce a method called the Expansion mechanism, which processes the input without being constrained by the number of elements in the sequence. By doing so, the model can learn more effectively than traditional attention-based approaches. To support this claim, we design a novel architecture, ExpansionNet v2, which achieves strong results on the MS COCO 2014 Image Captioning challenge and the state of the art in its respective category, with a score of 143.7 CIDEr-D on the offline test split, 140.8 CIDEr-D on the online evaluation server, and 72.9 All-CIDEr on the nocaps validation set. Additionally, we introduce an end-to-end training algorithm up to 2.8 times faster than established alternatives. Source code available at: https://github.com/jchenghu/ExpansionNet_v2
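The abstract does not spell out how the Expansion mechanism works internally; as an illustration only, one way to process a sequence independently of its length is to attend from a fixed set of learned query vectors, so any number of input tokens maps to a fixed-size representation. The function and variable names below (`expand_sequence`, `queries`) are ours, not the paper's, and this sketch should not be read as the paper's actual mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def expand_sequence(x, queries):
    """Map a length-n input to a fixed-length representation.

    x:       (n, d) input sequence, n may vary per example
    queries: (m, d) fixed set of learned "expansion" vectors
    Returns  (m, d): each output row is an attention-weighted
    mix of the input rows, so the output shape never depends on n.
    """
    d = x.shape[1]
    scores = queries @ x.T / np.sqrt(d)   # (m, n) similarity scores
    weights = softmax(scores, axis=-1)    # attention over input tokens
    return weights @ x                    # (m, d) fixed-size output

rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 64))      # fixed target length m = 16
short = rng.normal(size=(5, 64))         # a 5-token input
long = rng.normal(size=(37, 64))         # a 37-token input
print(expand_sequence(short, queries).shape)  # (16, 64)
print(expand_sequence(long, queries).shape)   # (16, 64)
```

Both inputs, despite different lengths, yield the same (16, 64) output, which is the property the abstract emphasizes: downstream layers see a representation whose size is unconstrained by the input sequence length.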

Results

Task             | Dataset                          | Metric  | Value | Model
Image Captioning | COCO Captions                    | BLEU-1  | 83.5  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | BLEU-4  | 42.7  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | CIDEr   | 143.7 | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | METEOR  | 30.6  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | ROUGE-L | 61.1  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO Captions                    | SPICE   | 24.7  | ExpansionNet v2 (No VL pretraining)
Image Captioning | COCO (Common Objects in Context) | CIDEr   | 143.7 | ExpansionNet v2

Related Papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
HalLoc: Token-level Localization of Hallucinations for Vision Language Models (2025-06-12)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (2025-06-11)
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning (2025-06-11)
Edit Flows: Flow Matching with Edit Operations (2025-06-10)
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings (2025-06-10)