Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SEM-POS: Grammatically and Semantically Correct Video Captioning

Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa

Published: 2023-03-26
Tasks: POS, Video Captioning

Abstract

Generating grammatically and semantically correct captions in video captioning is a challenging task. Captions generated by existing methods are either produced word-by-word, without regard to grammatical structure, or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts-of-speech (POS) components with visual-spatial features. We use novel combinations of POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' - to supervise the corresponding POS blocks: Det + Subject, Aux Verb, Verb, and Det + Object. The global-local fusion network, together with the POS blocks, helps align the visual features with the language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on the benchmark MSVD and MSR-VTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions than existing methods, achieving a new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of these contributions on the proposed method.
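The abstract describes fusing per-POS-block language features with visual-spatial features. The paper's exact GLFB architecture is not reproduced here; the sketch below is a minimal, hypothetical gated-fusion block in NumPy, where a learned gate (here, randomly initialized `W_g`) mixes a visual feature vector with one POS block's feature vector per time step. All names and the gating formulation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def glfb_fuse(visual, pos, W_g):
    """Hypothetical gated fusion: gate = sigmoid([visual; pos] @ W_g),
    fused = gate * visual + (1 - gate) * pos (elementwise convex mix)."""
    z = np.concatenate([visual, pos], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(z @ W_g)))  # values in (0, 1)
    return gate * visual + (1.0 - gate) * pos

d = 8                                        # feature dimension (assumed)
visual = rng.standard_normal((4, d))         # 4 frames of visual-spatial features
pos = rng.standard_normal((4, d))            # features from one POS block, e.g. 'Verb'
W_g = rng.standard_normal((2 * d, d)) * 0.1  # untrained gate weights, for illustration
fused = glfb_fuse(visual, pos, W_g)
print(fused.shape)  # (4, 8)
```

Because the gate is elementwise in (0, 1), each fused value lies between the corresponding visual and POS feature values; in the actual model such a block would be trained end-to-end under the POS supervision described above.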

Results

| Task             | Dataset    | Metric  | Value  | Model   |
|------------------|------------|---------|--------|---------|
| Video Captioning | MSR-VTT    | BLEU-4  | 45.2   | SEM-POS |
| Video Captioning | MSR-VTT    | CIDEr   | 53.1   | SEM-POS |
| Video Captioning | MSR-VTT    | GS      | 192.6  | SEM-POS |
| Video Captioning | MSR-VTT    | METEOR  | 30.7   | SEM-POS |
| Video Captioning | MSR-VTT    | ROUGE-L | 64.1   | SEM-POS |
| Video Captioning | MSVD-CTN   | CIDEr   | 37.16  | SEM-POS |
| Video Captioning | MSVD-CTN   | ROUGE-L | 25.39  | SEM-POS |
| Video Captioning | MSVD-CTN   | SPICE   | 14.46  | SEM-POS |
| Video Captioning | MSVD       | BLEU-4  | 60.1   | SEM-POS |
| Video Captioning | MSVD       | CIDEr   | 108.3  | SEM-POS |
| Video Captioning | MSVD       | GS      | 607.1  | SEM-POS |
| Video Captioning | MSVD       | METEOR  | 38.5   | SEM-POS |
| Video Captioning | MSVD       | ROUGE-L | 76     | SEM-POS |
| Video Captioning | MSRVTT-CTN | CIDEr   | 26.01  | SEM-POS |
| Video Captioning | MSRVTT-CTN | ROUGE-L | 20.11  | SEM-POS |
| Video Captioning | MSRVTT-CTN | SPICE   | 12.09  | SEM-POS |
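Several of the reported metrics (BLEU-4 above) score n-gram overlap between a generated caption and a reference. As a rough illustration of what BLEU-4 measures, here is a simplified sentence-level version: the geometric mean of modified 1- to 4-gram precisions times a brevity penalty. Note that benchmark numbers like those in the table are computed at the corpus level with standard evaluation toolkits, not with this toy function.

```python
from collections import Counter
import math

def bleu4(candidate, reference):
    """Simplified sentence-level BLEU-4 against a single reference."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(c[i:i + n]) for i in range(len(c) - n + 1))
        ref = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        # Modified precision: clip each candidate n-gram count by its reference count.
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

print(round(bleu4("a man is playing a guitar", "a man is playing a guitar"), 2))  # 1.0
```

A perfect match scores 1.0 (reported tables usually scale by 100), and any caption missing all 4-gram overlap scores 0 under this unsmoothed variant, which is why production toolkits apply smoothing and corpus-level aggregation.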

Related Papers

- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
- Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
- video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
- LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops (2025-06-17)
- Hybrid Meta-learners for Estimating Heterogeneous Treatment Effects (2025-06-16)
- VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks (2025-06-10)
- ARGUS: Hallucination and Omission Evaluation in Video-LLMs (2025-06-09)