Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

Published: 2022-09-04 · CVPR 2023

Tasks: Question Answering, Fill Mask, Video Retrieval, Optical Flow Estimation, Text to Video Retrieval, Video Question Answering, Video Captioning, TGIF-Transition, Retrieval, Visual Question Answering (VQA), TGIF-Action, TGIF-Frame

Links: Paper · PDF · Code (official)

Abstract

Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
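As a rough illustration of the kind of objective the abstract describes, the sketch below shows a generic masked-visual-modeling loss: a fraction of video patches is masked, a small head predicts a reconstructive target (here, raw pixel values) for each patch, and the regression loss is taken only over the masked positions. All class, function, and tensor names are illustrative assumptions, not the paper's actual VIOLETv2 implementation.

# Minimal sketch of a masked visual modeling (MVM) objective, assuming a generic
# patch-level video encoder. Names are illustrative, not the paper's code.
import torch
import torch.nn as nn

class MVMHead(nn.Module):
    """Predicts a per-patch reconstruction target (e.g., raw pixels) from encoder features."""
    def __init__(self, hidden_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, target_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

def mvm_loss(features: torch.Tensor,   # (B, N, D) encoder outputs for N video patches
             targets: torch.Tensor,    # (B, N, T) reconstruction target per patch
             mask: torch.Tensor,       # (B, N) bool, True where the patch was masked
             head: MVMHead) -> torch.Tensor:
    """Regression loss computed only on masked patches."""
    preds = head(features)                              # (B, N, T)
    per_patch = (preds - targets).pow(2).mean(dim=-1)   # (B, N) MSE per patch
    masked = per_patch[mask]                            # keep only masked positions
    return masked.mean() if masked.numel() > 0 else per_patch.sum() * 0.0

# Toy usage: 2 clips, 16 patches each, 256-dim features, pixel targets of size 768.
features = torch.randn(2, 16, 256)
targets = torch.randn(2, 16, 768)
mask = torch.rand(2, 16) < 0.15                         # ~15% of patches masked
head = MVMHead(256, 768)
loss = mvm_loss(features, targets, mask, head)

Swapping the target tensor (pixel values, oriented gradients, depth maps, optical flow, discrete visual tokens, latent visual features) is what distinguishes the eight reconstructive targets the paper compares; the mask-and-regress structure stays the same.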

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | DiDeMo | text-to-video R@1 | 47.9 | VIOLETv2
Video Retrieval | DiDeMo | text-to-video R@5 | 76.5 | VIOLETv2
Video Retrieval | DiDeMo | text-to-video R@10 | 84.1 | VIOLETv2
Video Retrieval | MSR-VTT | text-to-video R@1 | 37.2 | VIOLETv2
Video Retrieval | MSR-VTT | text-to-video R@5 | 64.8 | VIOLETv2
Video Retrieval | MSR-VTT | text-to-video R@10 | 75.8 | VIOLETv2
Video Retrieval | LSMDC | text-to-video R@1 | 24.0 | VIOLETv2
Video Retrieval | LSMDC | text-to-video R@5 | 43.5 | VIOLETv2
Video Retrieval | LSMDC | text-to-video R@10 | 54.1 | VIOLETv2
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.547 | VIOLETv2
Video Question Answering | MSRVTT-QA | Accuracy | 44.5 | VIOLETv2
Video Question Answering | LSMDC-MC | Accuracy | 84.4 | VIOLETv2
Video Question Answering | MSRVTT-MC | Accuracy | 97.6 | VIOLETv2
Video Captioning | MSR-VTT | CIDEr | 58.0 | VIOLETv2
Video Captioning | MSVD | CIDEr | 139.2 | VIOLETv2

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)