
Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, Yu Qiao

Published: 2023-03-28 · ICCV 2023
Tasks: Video Retrieval · Action Classification · Zero-Shot Video Retrieval · Cross-Modal Alignment · Spatio-Temporal Action Localization · Video Question Answering · Action Recognition · Visual Question Answering (VQA)
Links: Paper · PDF · Code (official)

Abstract

Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporally sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens and selectively align the unmasked tokens with the IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks, including scene-related, temporal-related, and complex video-language understanding. Pre-trained on public sources alone for 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performance on various video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.
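
To make the core objective concrete, here is a minimal sketch of the masked-alignment idea the abstract describes: mask most video tokens, run the student only on the kept ones, and align those tokens with a frozen image foundation model acting as the unmasked teacher. The module names (student, teacher, proj) and the random masking are illustrative assumptions; the paper uses semantics-aware masking and a CLIP-style image model as the teacher, and the exact loss is a design choice.

```python
import torch
import torch.nn.functional as F

def umt_alignment_loss(student, teacher, proj, video_tokens, keep_ratio=0.2):
    """video_tokens: (B, N, D) patch embeddings of a video clip.
    student/teacher/proj are placeholder nn.Modules, not the authors' API."""
    B, N, _ = video_tokens.shape
    num_keep = max(1, int(N * keep_ratio))

    # Keep a random subset of token positions; the paper masks
    # low-semantics tokens (attention-guided), which this simplifies.
    idx = torch.rand(B, N, device=video_tokens.device).argsort(dim=1)[:, :num_keep]
    batch = torch.arange(B, device=video_tokens.device).unsqueeze(1)
    kept = video_tokens[batch, idx]                     # (B, num_keep, D)

    student_feats = proj(student(kept))                 # project to teacher dim
    with torch.no_grad():                               # the teacher stays frozen
        teacher_feats = teacher(video_tokens)[batch, idx]

    # Align each unmasked student token with the teacher token at the same
    # position (normalized MSE, i.e. cosine-style alignment).
    return F.mse_loss(F.normalize(student_feats, dim=-1),
                      F.normalize(teacher_feats, dim=-1))
```

Because the student only processes the kept ~20% of tokens, pre-training cost drops sharply, while the teacher's features supply the high-level semantics that pure pixel reconstruction lacks.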

Results

Task | Dataset | Metric | Value | Model
Action Classification | Kinetics-700 | Top-1 Accuracy | 83.6 | UMT-L (ViT-L/16)
Action Classification | Kinetics-700 | Top-5 Accuracy | 96.7 | UMT-L (ViT-L/16)
Action Classification | Moments in Time (MiT) | Top-1 Accuracy | 48.7 | UMT-L (ViT-L/16)
Action Classification | Moments in Time (MiT) | Top-5 Accuracy | 78.2 | UMT-L (ViT-L/16)
Action Classification | Kinetics-400 | Top-1 Accuracy | 90.6 | UMT-L (ViT-L/16)
Action Classification | Kinetics-400 | Top-5 Accuracy | 98.7 | UMT-L (ViT-L/16)
Action Classification | Kinetics-400 | Parameters (M) | 304 | UMT-L (ViT-L/16)
Action Classification | Kinetics-600 | Top-1 Accuracy | 90.5 | UMT-L (ViT-L/16)
Action Classification | Kinetics-600 | Top-5 Accuracy | 98.8 | UMT-L (ViT-L/16)
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy (%) | 47.1 | UMT-L (ViT-L/16)
Visual Question Answering (VQA) | MSVD-QA | Accuracy (%) | 55.2 | UMT-L (ViT-L/16)
Video Question Answering | ActivityNet-QA | Accuracy (%) | 47.9 | UMT-L (ViT-L/16)
Action Recognition | AVA v2.2 | mAP | 39.8 | UMT-L (ViT-L/16)
Video Retrieval | SSv2-template retrieval | text-to-video R@1 | 90.8 | UMT-L (ViT-L/16)
Video Retrieval | SSv2-template retrieval | text-to-video R@5 | 100 | UMT-L (ViT-L/16)
Video Retrieval | SSv2-template retrieval | text-to-video R@10 | 100 | UMT-L (ViT-L/16)
Video Retrieval | VATEX | text-to-video R@1 | 72 | Unmasked Teacher
Video Retrieval | VATEX | text-to-video R@5 | 95.1 | Unmasked Teacher
Video Retrieval | VATEX | text-to-video R@10 | 97.8 | Unmasked Teacher
Video Retrieval | VATEX | video-to-text R@1 | 86 | Unmasked Teacher
Video Retrieval | VATEX | video-to-text R@10 | 99.6 | Unmasked Teacher
Video Retrieval | ActivityNet | text-to-video R@1 | 66.8 | UMT-L (ViT-L/16)
Video Retrieval | ActivityNet | text-to-video R@5 | 89.1 | UMT-L (ViT-L/16)
Video Retrieval | ActivityNet | text-to-video R@10 | 94.9 | UMT-L (ViT-L/16)
Video Retrieval | ActivityNet | video-to-text R@1 | 64.4 | UMT-L (ViT-L/16)
Video Retrieval | ActivityNet | video-to-text R@5 | 89.1 | UMT-L (ViT-L/16)
Video Retrieval | ActivityNet | video-to-text R@10 | 94.8 | UMT-L (ViT-L/16)
Video Retrieval | SSv2-label retrieval | text-to-video R@1 | 73.3 | UMT-L (ViT-L/16)
Video Retrieval | SSv2-label retrieval | text-to-video R@5 | 92.7 | UMT-L (ViT-L/16)
Video Retrieval | SSv2-label retrieval | text-to-video R@10 | 96.6 | UMT-L (ViT-L/16)
Video Retrieval | DiDeMo | text-to-video R@1 | 70.4 | UMT-L (ViT-L/16)
Video Retrieval | DiDeMo | text-to-video R@5 | 90.1 | UMT-L (ViT-L/16)
Video Retrieval | DiDeMo | text-to-video R@10 | 93.5 | UMT-L (ViT-L/16)
Video Retrieval | DiDeMo | video-to-text R@1 | 65.7 | UMT-L (ViT-L/16)
Video Retrieval | DiDeMo | video-to-text R@5 | 89.6 | UMT-L (ViT-L/16)
Video Retrieval | DiDeMo | video-to-text R@10 | 93.3 | UMT-L (ViT-L/16)
Video Retrieval | MSR-VTT | text-to-video R@1 | 58.8 | UMT-L (ViT-L/16)
Video Retrieval | MSR-VTT | text-to-video R@5 | 81 | UMT-L (ViT-L/16)
Video Retrieval | MSR-VTT | text-to-video R@10 | 87.1 | UMT-L (ViT-L/16)
Video Retrieval | MSR-VTT | video-to-text R@1 | 58.6 | UMT-L (ViT-L/16)
Video Retrieval | MSR-VTT | video-to-text R@5 | 81.6 | UMT-L (ViT-L/16)
Video Retrieval | MSR-VTT | video-to-text R@10 | 86.5 | UMT-L (ViT-L/16)
Video Retrieval | LSMDC | text-to-video R@1 | 43 | UMT-L (ViT-L/16)
Video Retrieval | LSMDC | text-to-video R@5 | 65.5 | UMT-L (ViT-L/16)
Video Retrieval | LSMDC | text-to-video R@10 | 73 | UMT-L (ViT-L/16)
Video Retrieval | LSMDC | video-to-text R@1 | 41.4 | UMT-L (ViT-L/16)
Video Retrieval | LSMDC | video-to-text R@5 | 64.3 | UMT-L (ViT-L/16)
Video Retrieval | LSMDC | video-to-text R@10 | 71.5 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 42.6 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 64.4 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 73.1 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 38.6 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@5 | 59.8 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 69.6 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 49 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 76.9 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 84.7 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 74.5 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSVD | video-to-text R@5 | 89.7 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | MSVD | video-to-text R@10 | 92.8 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 48.6 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 72.9 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 79 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 49.9 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 74.8 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 81.4 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 25.2 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 43 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 50.5 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | LSMDC | video-to-text R@1 | 23.2 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | LSMDC | video-to-text R@5 | 37.7 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | LSMDC | video-to-text R@10 | 44.2 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 42.8 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 69.6 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 79.8 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 40.7 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@5 | 67.6 | UMT-L (ViT-L/16)
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 78.6 | UMT-L (ViT-L/16)
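
For reference, the R@K values above are Recall@K: the percentage of queries whose ground-truth match appears among the top K ranked candidates. A minimal sketch of the computation, with toy embeddings standing in for real text/video features:

```python
import numpy as np

def recall_at_k(text_emb, video_emb, k):
    """text_emb, video_emb: (N, D) L2-normalized arrays; row i of each matches."""
    sims = text_emb @ video_emb.T                       # (N, N) similarity matrix
    # Rank of the correct video per query: how many candidates score higher.
    ranks = (sims > sims.diagonal()[:, None]).sum(axis=1)
    return float((ranks < k).mean() * 100)              # reported in percent

# Toy check: noisy copies stand in for paired text/video embeddings.
rng = np.random.default_rng(0)
video = rng.normal(size=(100, 64))
video /= np.linalg.norm(video, axis=1, keepdims=True)
text = video + 0.5 * rng.normal(size=video.shape)
text /= np.linalg.norm(text, axis=1, keepdims=True)
print(recall_at_k(text, video, 1), recall_at_k(text, video, 5))
```

Swapping the roles of the two embedding matrices gives the video-to-text numbers; in the zero-shot rows the model is evaluated this way without fine-tuning on the target dataset.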

Related Papers

Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
CATVis: Context-Aware Thought Visualization (2025-07-15)
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)