
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling

Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi

2023-03-10 · Question Answering · Video Retrieval · Video Question Answering · Visual Question Answering (VQA) · Multi-Label Classification · Multiple-choice · Retrieval · TGIF-Transition · TGIF-Action · TGIF-Frame

Abstract

Video-and-language understanding has a variety of applications in industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty dealing with the dense video frames or long text prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid adaptation to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules to sample long sequences and fuse multi-modal features, which reduces the computational costs and addresses the performance degradation caused by previous samplers. MuLTI can therefore handle longer sequences at limited computational cost. Then, to further enhance the model's performance and address the lack of pretraining tasks for video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released.
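
The Text-Guided MultiWay-Sampler described in the abstract combines an adapt-pooling residual mapping with attention so that a long video token sequence can be shortened under the guidance of the text. The released implementation is not reproduced here, so the sketch below is only a minimal PyTorch illustration of that idea: the class name, dimensions, pooling choices, and fusion rule are assumptions, not the authors' design.

# Hedged sketch of a text-guided sampler in the spirit of the abstract:
# text features guide attention over a long video token sequence, while an
# adaptive-pooling residual path shortens the same sequence to the target
# length. Module names, sizes, and the fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGuidedSampler(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, out_len: int = 32):
        super().__init__()
        self.out_len = out_len
        # Attention that lets text-conditioned queries attend over video tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T_v, D) long frame/patch sequence
        # text_tokens:  (B, T_t, D) encoded text sequence

        # Residual branch: adaptive pooling maps the long video sequence to out_len tokens.
        pooled = F.adaptive_avg_pool1d(video_tokens.transpose(1, 2), self.out_len).transpose(1, 2)

        # Query branch: shrink the text sequence to out_len query tokens as well.
        queries = F.adaptive_avg_pool1d(text_tokens.transpose(1, 2), self.out_len).transpose(1, 2)

        # Text-guided attention over the full video sequence.
        attended, _ = self.attn(self.norm_q(queries), self.norm_kv(video_tokens), self.norm_kv(video_tokens))

        # Fuse: attended multi-modal features plus the adapt-pooled residual mapping.
        return pooled + self.proj(attended)


if __name__ == "__main__":
    sampler = TextGuidedSampler()
    fused = sampler(torch.randn(2, 512, 768), torch.randn(2, 40, 768))
    print(fused.shape)  # torch.Size([2, 32, 768])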
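
Multiple Choice Modeling is described as a pretraining task that asks the model to pick the text that matches a video from a small set of candidates, bringing pretraining closer to multiple-choice video question answering. The snippet below is a hedged sketch of such an objective using in-batch distractors; the scoring rule, number of choices, and negative-sampling scheme are illustrative assumptions, not the paper's exact formulation.

# Hedged sketch of a multiple-choice style objective: for each video, the model
# scores its paired caption against distractor captions drawn from the batch
# and is trained to select the correct choice.
import torch
import torch.nn.functional as F


def multiple_choice_loss(video_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         num_choices: int = 4) -> torch.Tensor:
    """video_emb: (B, D) pooled video features; text_emb: (B, D) paired captions."""
    B = video_emb.size(0)
    logits = []
    for i in range(B):
        # Sample (num_choices - 1) distractor captions from other batch items.
        neg_idx = torch.randperm(B - 1)[: num_choices - 1]
        neg_idx = neg_idx + (neg_idx >= i).long()  # skip the positive index i
        candidates = torch.cat([text_emb[i : i + 1], text_emb[neg_idx]], dim=0)  # (num_choices, D)
        logits.append(candidates @ video_emb[i])  # similarity of each choice to the video
    logits = torch.stack(logits)  # (B, num_choices)
    # The positive caption is always placed at position 0 in this sketch.
    targets = torch.zeros(B, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    loss = multiple_choice_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(loss.item())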

Results

Task                             Dataset      Metric              Value  Model
Video Retrieval                  MSR-VTT-1kA  text-to-video R@1   54.7   MuLTI
Video Retrieval                  MSR-VTT-1kA  text-to-video R@5   77.7   MuLTI
Video Retrieval                  MSR-VTT-1kA  text-to-video R@10  86     MuLTI
Video Retrieval                  DiDeMo       text-to-video R@1   56.5   MuLTI
Video Retrieval                  DiDeMo       text-to-video R@5   80.2   MuLTI
Video Retrieval                  DiDeMo       text-to-video R@10  87     MuLTI
Visual Question Answering (VQA)  MSRVTT-QA    Accuracy            0.478  MuLTI
Visual Question Answering (VQA)  MSVD-QA      Accuracy            0.547  MuLTI

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)