
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling

Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi

2023-03-10 · Question Answering · Video Retrieval · Video Question Answering · Visual Question Answering (VQA) · Multi-Label Classification · Multiple-choice · Retrieval · TGIF-Transition · TGIF-Action · TGIF-Frame

Abstract

Video-and-language understanding has a variety of applications in industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty dealing with the dense video frames or long text prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid adaptation to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules to sample long sequences and fuse multi-modal features, which reduces the computational costs and addresses the performance degradation caused by previous samplers. MuLTI can therefore handle longer sequences at limited computational cost. Then, to further enhance the model's performance and address the lack of pretraining tasks for video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released.
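
The Text-Guided MultiWay-Sampler described in the abstract combines an adapt-pooling residual mapping with attention so that a long video token sequence can be shortened under the guidance of the text. The released implementation is not reproduced here, so the sketch below is only a minimal PyTorch illustration of that idea: the class name, dimensions, pooling choices, and fusion rule are assumptions, not the authors' design.

# Hedged sketch of a text-guided sampler in the spirit of the abstract:
# text features guide attention over a long video token sequence, while an
# adaptive-pooling residual path shortens the same sequence to the target
# length. Module names, sizes, and the fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGuidedSampler(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, out_len: int = 32):
        super().__init__()
        self.out_len = out_len
        # Attention that lets text-conditioned queries attend over video tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T_v, D) long frame/patch sequence
        # text_tokens:  (B, T_t, D) encoded text sequence

        # Residual branch: adaptive pooling maps the long video sequence to out_len tokens.
        pooled = F.adaptive_avg_pool1d(video_tokens.transpose(1, 2), self.out_len).transpose(1, 2)

        # Query branch: shrink the text sequence to out_len query tokens as well.
        queries = F.adaptive_avg_pool1d(text_tokens.transpose(1, 2), self.out_len).transpose(1, 2)

        # Text-guided attention over the full video sequence.
        attended, _ = self.attn(self.norm_q(queries), self.norm_kv(video_tokens), self.norm_kv(video_tokens))

        # Fuse: attended multi-modal features plus the adapt-pooled residual mapping.
        return pooled + self.proj(attended)


if __name__ == "__main__":
    sampler = TextGuidedSampler()
    fused = sampler(torch.randn(2, 512, 768), torch.randn(2, 40, 768))
    print(fused.shape)  # torch.Size([2, 32, 768])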
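
Multiple Choice Modeling is described as a pretraining task that asks the model to pick the text that matches a video from a small set of candidates, bringing pretraining closer to multiple-choice video question answering. The snippet below is a hedged sketch of such an objective using in-batch distractors; the scoring rule, number of choices, and negative-sampling scheme are illustrative assumptions, not the paper's exact formulation.

# Hedged sketch of a multiple-choice style objective: for each video, the model
# scores its paired caption against distractor captions drawn from the batch
# and is trained to select the correct choice.
import torch
import torch.nn.functional as F


def multiple_choice_loss(video_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         num_choices: int = 4) -> torch.Tensor:
    """video_emb: (B, D) pooled video features; text_emb: (B, D) paired captions."""
    B = video_emb.size(0)
    logits = []
    for i in range(B):
        # Sample (num_choices - 1) distractor captions from other batch items.
        neg_idx = torch.randperm(B - 1)[: num_choices - 1]
        neg_idx = neg_idx + (neg_idx >= i).long()  # skip the positive index i
        candidates = torch.cat([text_emb[i : i + 1], text_emb[neg_idx]], dim=0)  # (num_choices, D)
        logits.append(candidates @ video_emb[i])  # similarity of each choice to the video
    logits = torch.stack(logits)  # (B, num_choices)
    # The positive caption is always placed at position 0 in this sketch.
    targets = torch.zeros(B, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    loss = multiple_choice_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(loss.item())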

Results

Task                             Dataset      Metric              Value  Model
Video Retrieval                  MSR-VTT-1kA  text-to-video R@1   54.7   MuLTI
Video Retrieval                  MSR-VTT-1kA  text-to-video R@5   77.7   MuLTI
Video Retrieval                  MSR-VTT-1kA  text-to-video R@10  86     MuLTI
Video Retrieval                  DiDeMo       text-to-video R@1   56.5   MuLTI
Video Retrieval                  DiDeMo       text-to-video R@5   80.2   MuLTI
Video Retrieval                  DiDeMo       text-to-video R@10  87     MuLTI
Visual Question Answering (VQA)  MSRVTT-QA    Accuracy            0.478  MuLTI
Visual Question Answering (VQA)  MSVD-QA      Accuracy            0.547  MuLTI

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)