Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Vladimir Iashin, Esa Rahtu

Published: 2020-05-17
Tasks: Temporal Action Proposal Generation, Video Captioning, Dense Video Captioning

Abstract

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the benefit only on a dataset from a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture to a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder, as a part of the Bi-modal Transformer, can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding results. The code is available at v-iashin.github.io/bmt
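The core idea in the abstract — an encoder in which each modality stream self-attends and then cross-attends to the other modality — can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's actual implementation: the function names, single-head attention, and feature dimensions are assumptions for clarity (the real BMT uses multi-head attention, layer normalization, and feed-forward sublayers).

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: (Tq, d), (Tk, d), (Tk, d) -> (Tq, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def bimodal_encoder_layer(audio, visual):
    """One bi-modal encoder layer (sketch): per-stream self-attention,
    then cross-attention where each stream queries the other modality."""
    audio_self = attention(audio, audio, audio)
    visual_self = attention(visual, visual, visual)
    # Audio queries attend over visual features, and vice versa;
    # each stream keeps its own temporal length.
    audio_out = attention(audio_self, visual_self, visual_self)
    visual_out = attention(visual_self, audio_self, audio_self)
    return audio_out, visual_out

rng = np.random.default_rng(0)
audio = rng.standard_normal((20, 64))   # 20 audio timesteps, 64-dim features
visual = rng.standard_normal((30, 64))  # 30 visual timesteps, 64-dim features
a_out, v_out = bimodal_encoder_layer(audio, visual)
print(a_out.shape, v_out.shape)  # (20, 64) (30, 64)
```

Note how the two sequences may have different lengths (here 20 audio vs. 30 visual steps): cross-attention only requires that the feature dimension match, which is what lets the same module "digest any two modalities" as the abstract claims.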

Results

Task | Dataset | Metric | Value | Model
Video | ActivityNet Captions | Average F1 | 60.27 | BMT
Video | ActivityNet Captions | Average Precision | 48.23 | BMT
Video | ActivityNet Captions | Average Recall | 80.31 | BMT
Temporal Action Localization | ActivityNet Captions | Average F1 | 60.27 | BMT
Temporal Action Localization | ActivityNet Captions | Average Precision | 48.23 | BMT
Temporal Action Localization | ActivityNet Captions | Average Recall | 80.31 | BMT
Zero-Shot Learning | ActivityNet Captions | Average F1 | 60.27 | BMT
Zero-Shot Learning | ActivityNet Captions | Average Precision | 48.23 | BMT
Zero-Shot Learning | ActivityNet Captions | Average Recall | 80.31 | BMT
Action Localization | ActivityNet Captions | Average F1 | 60.27 | BMT
Action Localization | ActivityNet Captions | Average Precision | 48.23 | BMT
Action Localization | ActivityNet Captions | Average Recall | 80.31 | BMT
Video Captioning | ActivityNet Captions | BLEU-3 | 3.84 | BMT
Video Captioning | ActivityNet Captions | BLEU-4 | 1.88 | BMT
Video Captioning | ActivityNet Captions | METEOR | 8.44 | BMT
Dense Video Captioning | ActivityNet Captions | BLEU-3 | 3.84 | BMT
Dense Video Captioning | ActivityNet Captions | BLEU-4 | 1.88 | BMT
Dense Video Captioning | ActivityNet Captions | METEOR | 8.44 | BMT

Related Papers

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks (2025-06-10)
ARGUS: Hallucination and Omission Evaluation in Video-LLMs (2025-06-09)
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks (2025-05-22)
FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks (2025-05-19)