Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Diverse Video Captioning by Adaptive Spatio-temporal Attention

Zohreh Ghaderi, Leonard Salewski, Hendrik P. A. Lensch

2022-08-19. Tasks: Text Generation, Video Captioning

Abstract

To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures, an adapted transformer for a single joint spatio-temporal video analysis as well as a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme to reduce the number of required incoming frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD, as well as on the large-scale MSR-VTT and the VATEX benchmark datasets considering multiple Natural Language Generation (NLG) metrics. Additional evaluations on diversity scores highlight the expressiveness and diversity in the structure of our generated captions.
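
The abstract mentions estimating semantic concepts by aggregating all ground truth captions of each sample, but gives no procedure here. As a rough illustration only, the sketch below assumes that concepts are the most frequent non-stopword tokens across a clip's reference captions. The function name `aggregate_concepts`, the `top_k` parameter, the stopword list, and the example captions are all hypothetical and are not taken from the paper; the authors' method may differ.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to", "with"}


def aggregate_concepts(captions, top_k=5):
    """Return the top_k most frequent non-stopword tokens across all captions.

    Assumption: a "concept" is simply a word that many reference captions share.
    """
    counts = Counter()
    for caption in captions:
        tokens = re.findall(r"[a-z]+", caption.lower())
        # Count each word once per caption so a single long caption cannot dominate.
        counts.update(set(tokens) - STOPWORDS)
    # Sort by descending count, then alphabetically, so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Made-up reference captions for one video clip.
captions = [
    "A man is playing a guitar",
    "A man plays the guitar on stage",
    "Someone is playing guitar",
]
concepts = aggregate_concepts(captions, top_k=3)
print(concepts)
```

Counting each word once per caption is a design choice in this sketch: it rewards agreement between annotators rather than repetition inside a single caption.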

Results

Task              Dataset  Metric   Value   Model
Video Captioning  MSR-VTT  BLEU-4   44.21   VASTA (Vatex-backbone)
Video Captioning  MSR-VTT  CIDEr    56.08   VASTA (Vatex-backbone)
Video Captioning  MSR-VTT  METEOR   30.24   VASTA (Vatex-backbone)
Video Captioning  MSR-VTT  ROUGE-L  62.9    VASTA (Vatex-backbone)
Video Captioning  MSR-VTT  BLEU-4   43.4    VASTA (Kinetics-backbone)
Video Captioning  MSR-VTT  CIDEr    55      VASTA (Kinetics-backbone)
Video Captioning  MSR-VTT  METEOR   30.2    VASTA (Kinetics-backbone)
Video Captioning  MSR-VTT  ROUGE-L  62.5    VASTA (Kinetics-backbone)
Video Captioning  VATEX    BLEU-4   36.25   VASTA (Kinetics-backbone)
Video Captioning  VATEX    CIDEr    65.07   VASTA (Kinetics-backbone)
Video Captioning  VATEX    METEOR   25.32   VASTA (Kinetics-backbone)
Video Captioning  VATEX    ROUGE-L  51.88   VASTA (Kinetics-backbone)
Video Captioning  MSVD     BLEU-4   59.2    VASTA (Vatex-backbone)
Video Captioning  MSVD     CIDEr    119.7   VASTA (Vatex-backbone)
Video Captioning  MSVD     METEOR   40.65   VASTA (Vatex-backbone)
Video Captioning  MSVD     ROUGE-L  76.7    VASTA (Vatex-backbone)
Video Captioning  MSVD     BLEU-4   56.1    VASTA (Kinetics-backbone)
Video Captioning  MSVD     CIDEr    106.4   VASTA (Kinetics-backbone)
Video Captioning  MSVD     METEOR   39.1    VASTA (Kinetics-backbone)
Video Captioning  MSVD     ROUGE-L  74.5    VASTA (Kinetics-backbone)
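
To make the backbone comparison easier to read, the following sketch computes the per-metric gap between the Vatex-backbone and Kinetics-backbone variants on the two datasets where both are listed (MSR-VTT and MSVD). The scores are copied from the results table above; the dictionary layout and the helper name `backbone_gaps` are illustrative choices, not part of the original page.

```python
# (dataset, metric) -> (Vatex-backbone score, Kinetics-backbone score),
# copied from the results table above.
scores = {
    ("MSR-VTT", "BLEU-4"): (44.21, 43.4),
    ("MSR-VTT", "CIDEr"): (56.08, 55.0),
    ("MSR-VTT", "METEOR"): (30.24, 30.2),
    ("MSR-VTT", "ROUGE-L"): (62.9, 62.5),
    ("MSVD", "BLEU-4"): (59.2, 56.1),
    ("MSVD", "CIDEr"): (119.7, 106.4),
    ("MSVD", "METEOR"): (40.65, 39.1),
    ("MSVD", "ROUGE-L"): (76.7, 74.5),
}


def backbone_gaps(table):
    """Return Vatex-backbone minus Kinetics-backbone for each (dataset, metric)."""
    return {key: round(vatex - kinetics, 2) for key, (vatex, kinetics) in table.items()}


gaps = backbone_gaps(scores)
for (dataset, metric), gap in gaps.items():
    print(f"{dataset:8s} {metric:8s} {gap:+.2f}")
```

On these listed numbers the Vatex-backbone variant scores higher on every metric, with the largest gap on MSVD CIDEr. VATEX is omitted because only the Kinetics-backbone variant is reported for it.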

Related Papers

Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (2025-07-15)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Exploiting Leaderboards for Large-Scale Distribution of Malicious Models (2025-07-11)
CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs (2025-07-09)