Vladimir Iashin, Esa Rahtu
Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they either show poor results or demonstrate the importance of doing so only on a dataset with a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture to a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder, as a part of the Bi-modal Transformer, can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding performance. The code is available at v-iashin.github.io/bmt
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | ActivityNet Captions | Average F1 | 60.27 | BMT |
| Video | ActivityNet Captions | Average Precision | 48.23 | BMT |
| Video | ActivityNet Captions | Average Recall | 80.31 | BMT |
| Temporal Action Localization | ActivityNet Captions | Average F1 | 60.27 | BMT |
| Temporal Action Localization | ActivityNet Captions | Average Precision | 48.23 | BMT |
| Temporal Action Localization | ActivityNet Captions | Average Recall | 80.31 | BMT |
| Zero-Shot Learning | ActivityNet Captions | Average F1 | 60.27 | BMT |
| Zero-Shot Learning | ActivityNet Captions | Average Precision | 48.23 | BMT |
| Zero-Shot Learning | ActivityNet Captions | Average Recall | 80.31 | BMT |
| Action Localization | ActivityNet Captions | Average F1 | 60.27 | BMT |
| Action Localization | ActivityNet Captions | Average Precision | 48.23 | BMT |
| Action Localization | ActivityNet Captions | Average Recall | 80.31 | BMT |
| Video Captioning | ActivityNet Captions | BLEU-3 | 3.84 | BMT |
| Video Captioning | ActivityNet Captions | BLEU-4 | 1.88 | BMT |
| Video Captioning | ActivityNet Captions | METEOR | 8.44 | BMT |
| Dense Video Captioning | ActivityNet Captions | BLEU-3 | 3.84 | BMT |
| Dense Video Captioning | ActivityNet Captions | BLEU-4 | 1.88 | BMT |
| Dense Video Captioning | ActivityNet Captions | METEOR | 8.44 | BMT |
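
As a quick sanity check on the proposal-generation rows above, the reported Average F1 appears to be the harmonic mean of the reported Average Precision and Average Recall. The sketch below assumes that definition (the table itself does not state it) and uses a hypothetical helper name, `f1_score`, which is not taken from the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the table above (percentages).
average_precision = 48.23
average_recall = 80.31

f1 = f1_score(average_precision, average_recall)
print(f"{f1:.2f}")  # prints 60.27, matching the reported Average F1
```

The agreement to two decimals suggests the three numbers are mutually consistent. It does not by itself confirm how the averages over temporal IoU thresholds were taken.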