

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu

Published: 2023-06-15
Tasks: Question Answering, Video Retrieval, Video Question Answering, Video Captioning, Retrieval, Visual Question Answering (VQA)
Benchmarks: Video Captioning on MSR-VTT, TGIF-Frame
Paper · PDF · Code (official)

Abstract

Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of downstream tasks, including long-form/short-form video-text tasks and image-text tasks such as retrieval, captioning, and question answering. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Code and model are released at https://github.com/TXH-mercury/COSA.
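The concatenation step described in the abstract is straightforward to sketch. Below is a minimal Python illustration of building one pseudo long-form video-paragraph sample from generic (image, caption) pairs; the function and parameter names (make_pseudo_video_sample, num_frames) are hypothetical and not taken from the official repository, which may sample, order, and batch pairs differently.

    import random
    from typing import Any, List, Tuple

    def make_pseudo_video_sample(
        image_text_pairs: List[Tuple[Any, str]],
        num_frames: int = 4,
    ) -> Tuple[List[Any], str]:
        """Build one pseudo video-paragraph sample by concatenating
        several image-text pairs (hypothetical sketch of the COSA idea)."""
        # Sample distinct pairs; each image acts as one pseudo video "frame".
        sampled = random.sample(image_text_pairs, num_frames)
        frames = [image for image, _ in sampled]
        # Join the captions in the same order, so the i-th sentence of the
        # paragraph describes the i-th frame (explicit event-description
        # correspondence).
        paragraph = " ".join(caption for _, caption in sampled)
        return frames, paragraph

    # Example: ten captioned images yield one 4-frame pseudo video sample.
    pairs = [(f"img_{i}", f"A caption for image {i}.") for i in range(10)]
    frames, paragraph = make_pseudo_video_sample(pairs, num_frames=4)

Treating the sampled frames as a short video and the joined captions as its paragraph-level description is what turns an ordinary image-text corpus into the pseudo video-paragraph corpus the abstract describes.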

Results

Task                            | Dataset        | Metric            | Value | Model
Video Retrieval                 | ActivityNet    | text-to-video R@1 | 67.3  | COSA
Video Retrieval                 | DiDeMo         | text-to-video R@1 | 70.5  | COSA
Video Retrieval                 | MSR-VTT        | text-to-video R@1 | 57.9  | COSA
Video Retrieval                 | LSMDC          | text-to-video R@1 | 39.4  | COSA
Visual Question Answering (VQA) | MSVD-QA        | Accuracy          | 0.6   | COSA
Video Question Answering        | ActivityNet-QA | Accuracy          | 49.9  | COSA
Video Question Answering        | MSRVTT-QA      | Accuracy          | 49.2  | COSA
Video Captioning                | MSR-VTT        | BLEU-4            | 53.7  | COSA
Video Captioning                | MSR-VTT        | CIDEr             | 74.7  | COSA
Video Captioning                | VATEX          | BLEU-4            | 43.7  | COSA
Video Captioning                | VATEX          | CIDEr             | 96.5  | COSA
Video Captioning                | TVC            | BLEU-4            | 18.8  | COSA
Video Captioning                | TVC            | CIDEr             | 70.7  | COSA
Video Captioning                | YouCook2       | BLEU-4            | 10.1  | COSA
Video Captioning                | YouCook2       | CIDEr             | 1.31  | COSA
Video Captioning                | MSVD           | BLEU-4            | 76.5  | COSA
Video Captioning                | MSVD           | CIDEr             | 178.5 | COSA
