Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo

Published: 2021-11-19 · CVPR 2022
Tasks: Super-Resolution · Video Retrieval · Vocal Bursts Intensity Prediction · Zero-Shot Video Retrieval · Text to Video Retrieval · Retrieval
Links: Paper · PDF · Code (official)

Abstract

We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream VL tasks. Existing works either extract low-quality video features or learn limited text embedding, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) the first high-resolution dataset including 371.5k hours of 720p videos, and 2) the most diversified dataset covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned video features with diversified texts. Our pre-training model achieves new state-of-the-art results in 10 VL understanding tasks and 2 more novel text-to-visual generation tasks. For example, we outperform SOTA models with relative increases of 40.4% R@1 in zero-shot MSR-VTT text-to-video retrieval task and 55.4% in high-resolution dataset LSMDC. The learned VL embedding is also effective in generating visually pleasing and semantically relevant results in text-to-visual editing and super-resolution tasks.
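The abstract describes enforcing interactions between learned video features and diversified texts. The paper's exact objectives are not reproduced here, but cross-modal alignment of this kind is commonly trained with a symmetric contrastive (InfoNCE) loss over a batch of paired video and text embeddings. A minimal sketch under that assumption (function name, temperature value, and the in-batch-negatives setup are illustrative, not HD-VILA's actual implementation):

```python
import numpy as np

def infonce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    video/text embeddings, each of shape (batch, dim). Pair i is the
    positive for row/column i; all other pairings act as negatives."""
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature                    # (batch, batch)
    labels = np.arange(len(logits))

    def xent(l):
        # Cross-entropy with the matching pair as the target class.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the video->text and text->video directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With well-aligned pairs the matching similarities dominate the off-diagonal ones and the loss approaches zero; the low temperature sharpens that contrast.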

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | ActivityNet | text-to-video R@1 | 28.5 | HD-VILA |
| Video Retrieval | ActivityNet | text-to-video R@5 | 57.4 | HD-VILA |
| Video Retrieval | ActivityNet | text-to-video R@50 | 94 | HD-VILA |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 4 | HD-VILA |
| Video Retrieval | DiDeMo | text-to-video R@1 | 28.8 | HD-VILA |
| Video Retrieval | DiDeMo | text-to-video R@5 | 57.4 | HD-VILA |
| Video Retrieval | DiDeMo | text-to-video R@10 | 69.1 | HD-VILA |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 4 | HD-VILA |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 35.6 | HD-VILA |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 65.3 | HD-VILA |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 78 | HD-VILA |
| Video Retrieval | MSR-VTT | text-to-video Median Rank | 3 | HD-VILA |
| Video Retrieval | LSMDC | text-to-video R@1 | 17.4 | HD-VILA |
| Video Retrieval | LSMDC | text-to-video R@5 | 34.1 | HD-VILA |
| Video Retrieval | LSMDC | text-to-video R@10 | 44.1 | HD-VILA |
| Video Retrieval | LSMDC | text-to-video Median Rank | 15 | HD-VILA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 14.6 | HD-VILA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 34.4 | HD-VILA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 44.1 | HD-VILA |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 15 | HD-VILA |
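The R@K and Median Rank numbers above follow the standard text-to-video retrieval evaluation: for each text query, rank all candidate videos by similarity and check where the ground-truth video lands. A minimal sketch of that computation (the function name and the convention that query i matches video i on the diagonal are illustrative assumptions):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute text-to-video R@K (percent) and Median Rank from a
    similarity matrix `sim` of shape (num_texts, num_videos), where
    sim[i, j] scores text i against video j and the ground-truth
    match for text i is assumed to be video i."""
    # Sort videos by descending score for each query.
    order = np.argsort(-sim, axis=1)
    # Rank (1 = best) at which the correct video appears per query.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MedR": float(np.median(ranks)),
    }
```

Higher R@K and lower Median Rank are better; a Median Rank of 3 on MSR-VTT means the correct video is typically among the top 3 of all candidates.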

Related Papers

- SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)