Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Joint Sequence Fusion Model for Video Question Answering and Retrieval

Youngjae Yu, Jongseok Kim, Gunhee Kim

2018-08-07 · ECCV 2018

Tasks: Question Answering, Video Retrieval, Video Question Answering, Semantic Similarity, Semantic Textual Similarity, Retrieval, Visual Question Answering (VQA), Multiple-choice

Abstract

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequences (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of the two sequences into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. We evaluate the JSFusion model on three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks on the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.
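The core composition step in the abstract can be sketched in a few lines. Below is an illustrative toy version (NumPy, not the authors' code): the Joint Semantic Tensor is approximated by an elementwise product of every (frame, word) feature pair via broadcasting, and the Convolutional Hierarchical Decoder is replaced by a simple mean-pool placeholder, since the real decoder uses learned convolutions with hierarchical attention.

```python
import numpy as np

def joint_semantic_tensor(video_feats, word_feats):
    """Compose a dense pairwise representation of two sequences.

    video_feats: (T, d) array of frame features
    word_feats:  (N, d) array of word features
    Returns a (T, N, d) 3D tensor whose (t, n) slice combines
    frame t with word n (here via a Hadamard product).
    """
    return video_feats[:, None, :] * word_feats[None, :, :]

def similarity_score(jst):
    # Placeholder for the Convolutional Hierarchical Decoder:
    # mean-pool the 3D tensor to a single scalar score.
    return float(jst.mean())

# Toy sizes: 8 frames and 5 words, each with 16-dim features.
video = np.random.randn(8, 16)
words = np.random.randn(5, 16)
jst = joint_semantic_tensor(video, words)
print(jst.shape)              # (8, 5, 16)
print(similarity_score(jst))
```

In the paper, the pairwise slices are produced by a learned fusion rather than a raw Hadamard product, and the decoder scores hierarchical matches instead of pooling uniformly; this sketch only shows the tensor shapes involved.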

Results

Task             Dataset       Metric                      Value   Model
Video Retrieval  MSR-VTT-1kA   text-to-video Median Rank   13      JSFusion
Video Retrieval  MSR-VTT-1kA   text-to-video R@1           10.2    JSFusion
Video Retrieval  MSR-VTT-1kA   text-to-video R@5           31.2    JSFusion
Video Retrieval  MSR-VTT-1kA   text-to-video R@10          43.2    JSFusion
Video Retrieval  MSR-VTT       text-to-video Median Rank   13      JSFusion
Video Retrieval  MSR-VTT       text-to-video R@1           10.2    JSFusion
Video Retrieval  MSR-VTT       text-to-video R@10          43.2    JSFusion
Video Retrieval  MSR-VTT       video-to-text R@5           31.2    JSFusion
Video Retrieval  LSMDC         text-to-video Median Rank   36      JSFusion
Video Retrieval  LSMDC         text-to-video R@1           9.1     JSFusion
Video Retrieval  LSMDC         text-to-video R@5           21.2    JSFusion
Video Retrieval  LSMDC         text-to-video R@10          34.1    JSFusion

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)