TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/CLIP Meets Video Captioning: Concept-Aware Representation ...

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

Bang Yang, Tong Zhang, Yuexian Zou

2021-11-30Representation LearningCaption GenerationVideo Captioning
PaperPDFCode(official)

Abstract

For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, then a task-oriented network is fine-tuned from scratch to cope with caption generation. This paper first investigates the impact of the recently proposed CLIP (Contrastive Language-Image Pre-training) on video captioning. Through the empirical study on INP vs. CLIP, we identify the potential deficiencies of INP and explore the key factors for accurate description generation. The results show that the INP-based model is tricky to capture concepts' semantics and sensitive to irrelevant background information. By contrast, the CLIP-based model significantly improves the caption quality and highlights the importance of concept-aware representation learning. With these findings, we propose Dual Concept Detection (DCD) further to inject concept knowledge into the model during training. DCD is an auxiliary task that requires a caption model to learn the correspondence between video content and concepts and the co-occurrence relations between concepts. Experiments on MSR-VTT and VATEX demonstrate the effectiveness of DCD, and the visualization results further reveal the necessity of learning concept-aware representations.

Results

TaskDatasetMetricValueModel
Video CaptioningMSR-VTTBLEU-448.2CLIP-DCD
Video CaptioningMSR-VTTCIDEr58.7CLIP-DCD
Video CaptioningMSR-VTTMETEOR31.3CLIP-DCD
Video CaptioningMSR-VTTROUGE-L64.8CLIP-DCD

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper2025-07-20Spectral Bellman Method: Unifying Representation and Exploration in RL2025-07-17Boosting Team Modeling through Tempo-Relational Representation Learning2025-07-17Similarity-Guided Diffusion for Contrastive Sequential Recommendation2025-07-16Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization?2025-07-16Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos2025-07-16A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction2025-07-15UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks2025-07-15