CoVR-2: Automatic Data Construction for Composed Video Retrieval

Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

2023-08-28
Tasks: Composed Video Retrieval (CoVR), Composed Image Retrieval (CoIR), Video Retrieval, Large Language Model, Retrieval, Zero-Shot Composed Image Retrieval (ZS-CIR), Language Modelling, Image Retrieval

Abstract

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
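The mining step described in the abstract can be illustrated with a minimal sketch. It assumes that "similar" captions are those that differ in exactly one word, which the abstract does not state. The function names, the sample captions, and the string template standing in for the large language model are all hypothetical and are not taken from the released code.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Find pairs of items whose captions differ in exactly one word.

    `captions` maps an item id to its caption string.
    Returns a list of (id_a, id_b, word_a, word_b) tuples.
    """
    buckets = defaultdict(list)
    for item_id, caption in captions.items():
        words = caption.lower().split()
        for i, word in enumerate(words):
            # Key = caption length, masked position, and the remaining words.
            # Two captions share a key only if they agree everywhere except
            # possibly at position i.
            key = (len(words), i, tuple(words[:i] + words[i + 1:]))
            buckets[key].append((item_id, word))

    pairs = []
    for entries in buckets.values():
        for (id_a, word_a), (id_b, word_b) in combinations(entries, 2):
            if word_a != word_b:  # skip identical captions
                pairs.append((id_a, id_b, word_a, word_b))
    return pairs


def template_modification(word_a, word_b):
    # Hypothetical stand-in for the large language model that writes the
    # modification text in the paper; a fixed template is used here instead.
    return f"replace {word_a} with {word_b}"


def build_triplets(captions):
    """Turn mined pairs into (query id, modification text, target id) triplets."""
    return [
        (id_a, template_modification(word_a, word_b), id_b)
        for id_a, id_b, word_a, word_b in mine_caption_pairs(captions)
    ]


# Hypothetical captions, for illustration only.
sample_captions = {
    "v1": "a dog running on the beach",
    "v2": "a cat running on the beach",
    "v3": "a man cooking pasta",
}
triplets = build_triplets(sample_captions)
print(triplets)
```

Bucketing by masked caption avoids comparing every caption against every other one, which matters when the source collection holds millions of captions.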

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | WebVid-CoVR | R@1 | 59.82 | BLIP-2 |
| Image Retrieval | CIRR | R@1 | 50.43 | CoVR-BLIP-2 |
| Image Retrieval | CIRR | R@5 | 81.08 | CoVR-BLIP-2 |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 60.57 | CoVR-BLIP-2 |
| Image Retrieval | Fashion IQ | R@10 | 49.96 | CoVR-BLIP-2 |
| Image Retrieval | Fashion IQ | R@50 | 71.17 | CoVR-BLIP-2 |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 48.3 | CoVR-BLIP-2 |
| Image Retrieval | Fashion IQ | R@10 | 38.15 | CoVR-BLIP-2 |
| Image Retrieval | Fashion IQ | R@50 | 58.44 | CoVR-BLIP-2 |
| Image Retrieval | CIRCO | mAP@10 | 29.55 | CoVR-BLIP-2 |
| Image Retrieval | CIRR | R@1 | 43.74 | CoVR-BLIP-2 |
| Image Retrieval | CIRR | R@10 | 83.95 | CoVR-BLIP-2 |
| Image Retrieval | CIRR | R@5 | 73.61 | CoVR-BLIP-2 |
| Image Retrieval | CIRR | R@50 | 96.1 | CoVR-BLIP-2 |
| Video Retrieval | WebVid-CoVR | R@1 | 59.82 | BLIP-2 |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 50.43 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 81.08 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 60.57 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@10 | 49.96 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@50 | 71.17 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 48.3 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@10 | 38.15 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@50 | 58.44 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 29.55 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 43.74 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | CIRR | R@10 | 83.95 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 73.61 | CoVR-BLIP-2 |
| Composed Image Retrieval (CoIR) | CIRR | R@50 | 96.1 | CoVR-BLIP-2 |
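
For readers unfamiliar with the metrics above, the following is a minimal sketch of how Recall@K (written R@K) and the Fashion IQ average are commonly computed. It assumes one correct target per query and a ranked list of candidate ids per query; the function names and the toy data are hypothetical, and the benchmarks' official evaluation scripts may differ in detail (for example, in how Fashion IQ categories are averaged).

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose target appears among the top-k results.

    `ranked_ids[i]` is the ranked candidate list for query i and
    `target_ids[i]` is the single correct target for that query.
    """
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids) if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


def mean_recall(r_at_10, r_at_50):
    """The (Recall@10 + Recall@50) / 2 average listed for Fashion IQ."""
    return (r_at_10 + r_at_50) / 2


# Toy rankings, for illustration only.
rankings = [["a", "b", "c"], ["d", "e", "f"]]
targets = ["b", "f"]
print(recall_at_k(rankings, targets, 1))
print(recall_at_k(rankings, targets, 2))
print(recall_at_k(rankings, targets, 3))

# Mean of the listed Fashion IQ R@10 and R@50 values.
print(mean_recall(49.96, 71.17))
print(mean_recall(38.15, 58.44))
```

The two Fashion IQ averages listed (60.57 and 48.3) agree, up to rounding, with the mean of the corresponding R@10 and R@50 rows.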
