Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Vladimir Iashin, Esa Rahtu

Published: 2020-05-17
Tasks: Temporal Action Proposal Generation, Video Captioning, Dense Video Captioning

Abstract

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the benefit only on a dataset from a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture to a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder, as a part of the Bi-modal Transformer, can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding results. The code is available at v-iashin.github.io/bmt
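The core idea in the abstract — an encoder in which each modality stream self-attends and then cross-attends to the other modality — can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's actual implementation: the function names, single-head attention, and feature dimensions are assumptions for clarity (the real BMT uses multi-head attention, layer normalization, and feed-forward sublayers).

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: (Tq, d), (Tk, d), (Tk, d) -> (Tq, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def bimodal_encoder_layer(audio, visual):
    """One bi-modal encoder layer (sketch): per-stream self-attention,
    then cross-attention where each stream queries the other modality."""
    audio_self = attention(audio, audio, audio)
    visual_self = attention(visual, visual, visual)
    # Audio queries attend over visual features, and vice versa;
    # each stream keeps its own temporal length.
    audio_out = attention(audio_self, visual_self, visual_self)
    visual_out = attention(visual_self, audio_self, audio_self)
    return audio_out, visual_out

rng = np.random.default_rng(0)
audio = rng.standard_normal((20, 64))   # 20 audio timesteps, 64-dim features
visual = rng.standard_normal((30, 64))  # 30 visual timesteps, 64-dim features
a_out, v_out = bimodal_encoder_layer(audio, visual)
print(a_out.shape, v_out.shape)  # (20, 64) (30, 64)
```

Note how the two sequences may have different lengths (here 20 audio vs. 30 visual steps): cross-attention only requires that the feature dimension match, which is what lets the same module "digest any two modalities" as the abstract claims.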

Results

Task | Dataset | Metric | Value | Model
Video | ActivityNet Captions | Average F1 | 60.27 | BMT
Video | ActivityNet Captions | Average Precision | 48.23 | BMT
Video | ActivityNet Captions | Average Recall | 80.31 | BMT
Temporal Action Localization | ActivityNet Captions | Average F1 | 60.27 | BMT
Temporal Action Localization | ActivityNet Captions | Average Precision | 48.23 | BMT
Temporal Action Localization | ActivityNet Captions | Average Recall | 80.31 | BMT
Zero-Shot Learning | ActivityNet Captions | Average F1 | 60.27 | BMT
Zero-Shot Learning | ActivityNet Captions | Average Precision | 48.23 | BMT
Zero-Shot Learning | ActivityNet Captions | Average Recall | 80.31 | BMT
Action Localization | ActivityNet Captions | Average F1 | 60.27 | BMT
Action Localization | ActivityNet Captions | Average Precision | 48.23 | BMT
Action Localization | ActivityNet Captions | Average Recall | 80.31 | BMT
Video Captioning | ActivityNet Captions | BLEU-3 | 3.84 | BMT
Video Captioning | ActivityNet Captions | BLEU-4 | 1.88 | BMT
Video Captioning | ActivityNet Captions | METEOR | 8.44 | BMT
Dense Video Captioning | ActivityNet Captions | BLEU-3 | 3.84 | BMT
Dense Video Captioning | ActivityNet Captions | BLEU-4 | 1.88 | BMT
Dense Video Captioning | ActivityNet Captions | METEOR | 8.44 | BMT

Related Papers

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks (2025-06-10)
ARGUS: Hallucination and Omission Evaluation in Video-LLMs (2025-06-09)
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks (2025-05-22)
FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks (2025-05-19)