Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model

Mohammad Faiyaz Khan, S. M. Sadiq-Ur-Rahman Shifath, Md. Saiful Islam

2021-02-14 · Image Captioning

Abstract

Image Captioning is an arduous task of producing syntactically and semantically correct textual descriptions of an image in natural language with context related to the image. Existing notable pieces of research in Bengali Image Captioning (BIC) are based on encoder-decoder architecture. This paper presents an end-to-end image captioning system utilizing a multimodal architecture by combining a one-dimensional convolutional neural network (CNN) to encode sequence information with a pre-trained ResNet-50 image encoder for extracting region-based visual features. We investigate our approach's performance on the BanglaLekhaImageCaptions dataset using the existing evaluation metrics and perform a human evaluation for qualitative analysis. Experiments show that our approach's language encoder captures the fine-grained information in the caption, and combined with the image features, it generates accurate and diversified captions. Our work outperforms all the existing BIC works and achieves a new state-of-the-art (SOTA) performance by scoring 0.651 on BLEU-1, 0.572 on CIDEr, 0.297 on METEOR, 0.434 on ROUGE, and 0.357 on SPICE.
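To make the "one-dimensional CNN to encode sequence information" idea concrete, here is a minimal pure-Python sketch of a 1D convolution over a sequence of token embeddings, followed by global max pooling and a merge with an image feature vector. It is an illustration only, not the authors' implementation: the function names (`conv1d`, `global_max_pool`, `fuse`), the ReLU activation, the pooling step, and fusion by concatenation are all assumptions, and the image vector stands in for features that a ResNet-50 encoder would produce.

```python
from typing import List

Vector = List[float]


def conv1d(seq: List[Vector], kernels: List[List[Vector]], bias: Vector) -> List[Vector]:
    """Valid 1D convolution over the time axis, followed by ReLU.

    seq:     T token embeddings, each of dimension D.
    kernels: F filters, each spanning K consecutive tokens (K x D weights).
    Returns  (T - K + 1) feature rows, each with F values.
    """
    k = len(kernels[0])
    out = []
    for t in range(len(seq) - k + 1):
        window = seq[t:t + k]
        row = []
        for f, kernel in enumerate(kernels):
            s = bias[f]
            for w_vec, x_vec in zip(kernel, window):
                s += sum(w * x for w, x in zip(w_vec, x_vec))
            row.append(max(0.0, s))  # ReLU (assumed activation)
        out.append(row)
    return out


def global_max_pool(features: List[Vector]) -> Vector:
    """Collapse the time axis by taking the maximum of each filter's responses."""
    return [max(col) for col in zip(*features)]


def fuse(image_vec: Vector, text_vec: Vector) -> Vector:
    """Merge the two modalities by concatenation (an assumed fusion scheme)."""
    return image_vec + text_vec


# Toy example: 3 tokens with 2-dimensional embeddings, one filter of width 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]]]
text_features = global_max_pool(conv1d(tokens, filters, bias=[0.0]))
joint = fuse([0.5], text_features)  # [0.5] is a stand-in for ResNet-50 features
```

In a real system the filter weights would be learned and the joint vector would feed a decoder that predicts the next caption word; the sketch only shows how a fixed-width filter slides over the partial caption.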

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Captioning | BanglaLekhaImageCaptions | BLEU-1 | 65.1 | CNN + 1D CNN |
| Image Captioning | BanglaLekhaImageCaptions | BLEU-2 | 42.6 | CNN + 1D CNN |
| Image Captioning | BanglaLekhaImageCaptions | BLEU-3 | 27.8 | CNN + 1D CNN |
| Image Captioning | BanglaLekhaImageCaptions | BLEU-4 | 17.5 | CNN + 1D CNN |
| Image Captioning | BanglaLekhaImageCaptions | CIDEr | 57.2 | CNN + 1D CNN |
| Image Captioning | BanglaLekhaImageCaptions | METEOR | 29.7 | CNN + 1D CNN |
| Image Captioning | BanglaLekhaImageCaptions | ROUGE-L | 43.4 | CNN + 1D CNN |
| Image Captioning | BanglaLekhaImageCaptions | SPICE | 35.7 | CNN + 1D CNN |
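
The values in this table are on a 0 to 100 scale, while the abstract reports the same scores on a 0 to 1 scale (for example, BLEU-1 of 65.1 here versus 0.651 there). As a reminder of what the BLEU-n rows measure, the following is a minimal sentence-level sketch of BLEU: clipped n-gram precision combined by a geometric mean and multiplied by a brevity penalty. It is not the evaluation script used for the paper; benchmark numbers are normally computed at corpus level with a standard toolkit, and the pre-tokenized input and the function names here are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Sequence[str], references: List[Sequence[str]], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it appears in the single reference where it is most frequent."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def brevity_penalty(candidate: Sequence[str], references: List[Sequence[str]]) -> float:
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate: Sequence[str], references: List[Sequence[str]], max_n: int = 1) -> float:
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Classic clipping example: repeating a common word does not inflate the score.
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
score = bleu("the the the the the the the".split(), refs, max_n=1)  # 2/7
```

Because higher-order n-grams are harder to match, BLEU-4 is typically well below BLEU-1 for the same system, which is the pattern the table shows (65.1 down to 17.5).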

Related Papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
HalLoc: Token-level Localization of Hallucinations for Vision Language Models (2025-06-12)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (2025-06-11)
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning (2025-06-11)
Edit Flows: Flow Matching with Edit Operations (2025-06-10)
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings (2025-06-10)