Diverse Video Captioning by Adaptive Spatio-temporal Attention

Zohreh Ghaderi, Leonard Salewski, Hendrik P. A. Lensch

2022-08-19 | Text Generation | Video Captioning

Abstract

To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures, an adapted transformer for a single joint spatio-temporal video analysis as well as a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme to reduce the number of required incoming frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD, as well as on the large-scale MSR-VTT and the VATEX benchmark datasets considering multiple Natural Language Generation (NLG) metrics. Additional evaluations on diversity scores highlight the expressiveness and diversity in the structure of our generated captions.
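The abstract states that semantic concepts are estimated by aggregating all ground truth captions of each sample, without giving details. The following is a minimal sketch of one way such an aggregation could work, not the authors' implementation. The stop-word list, the `min_captions` threshold, and the function names `caption_tokens`, `aggregate_concepts`, and `multi_hot` are all illustrative assumptions.

```python
from collections import Counter
import re

# Hypothetical stop-word list; the source does not specify one.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "in", "on", "of", "and", "with", "to"}
)


def caption_tokens(caption):
    """Lowercase a caption and return its set of non-stop-word tokens."""
    words = re.findall(r"[a-z]+", caption.lower())
    return {w for w in words if w not in STOPWORDS}


def aggregate_concepts(captions, min_captions=2):
    """Return the words that occur in at least `min_captions` captions.

    Each word is counted once per caption, so a concept is kept only
    when several annotators agree on it (an assumed criterion).
    """
    counts = Counter()
    for caption in captions:
        counts.update(caption_tokens(caption))
    return sorted(w for w, c in counts.items() if c >= min_captions)


def multi_hot(concepts, vocabulary):
    """Encode the concepts as a 0/1 vector over a fixed vocabulary."""
    concept_set = set(concepts)
    return [1 if w in concept_set else 0 for w in vocabulary]


# Toy ground truth captions for a single video sample.
captions = [
    "A man is playing a guitar",
    "A man plays the guitar on stage",
    "Someone is playing music",
]
concepts = aggregate_concepts(captions)
print(concepts)
print(multi_hot(concepts, ["guitar", "dog", "man"]))
```

A multi-hot vector of this kind could serve as a target for an auxiliary concept-prediction objective. Whether the paper uses such a target, and how it weights or filters words, is not stated in the text above.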

Results

Task	Dataset	Metric	Value	Model
Video Captioning	MSR-VTT	BLEU-4	44.21	VASTA (Vatex-backbone)
Video Captioning	MSR-VTT	CIDEr	56.08	VASTA (Vatex-backbone)
Video Captioning	MSR-VTT	METEOR	30.24	VASTA (Vatex-backbone)
Video Captioning	MSR-VTT	ROUGE-L	62.9	VASTA (Vatex-backbone)
Video Captioning	MSR-VTT	BLEU-4	43.4	VASTA (Kinetics-backbone)
Video Captioning	MSR-VTT	CIDEr	55	VASTA (Kinetics-backbone)
Video Captioning	MSR-VTT	METEOR	30.2	VASTA (Kinetics-backbone)
Video Captioning	MSR-VTT	ROUGE-L	62.5	VASTA (Kinetics-backbone)
Video Captioning	VATEX	BLEU-4	36.25	VASTA (Kinetics-backbone)
Video Captioning	VATEX	CIDEr	65.07	VASTA (Kinetics-backbone)
Video Captioning	VATEX	METEOR	25.32	VASTA (Kinetics-backbone)
Video Captioning	VATEX	ROUGE-L	51.88	VASTA (Kinetics-backbone)
Video Captioning	MSVD	BLEU-4	59.2	VASTA (Vatex-backbone)
Video Captioning	MSVD	CIDEr	119.7	VASTA (Vatex-backbone)
Video Captioning	MSVD	METEOR	40.65	VASTA (Vatex-backbone)
Video Captioning	MSVD	ROUGE-L	76.7	VASTA (Vatex-backbone)
Video Captioning	MSVD	BLEU-4	56.1	VASTA (Kinetics-backbone)
Video Captioning	MSVD	CIDEr	106.4	VASTA (Kinetics-backbone)
Video Captioning	MSVD	METEOR	39.1	VASTA (Kinetics-backbone)
Video Captioning	MSVD	ROUGE-L	74.5	VASTA (Kinetics-backbone)
