
Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal

2023-09-18 · ICCV 2023
Tasks: Video Retrieval, Video-Text Retrieval, Text Retrieval, Text to Video Retrieval, Retrieval

Abstract

The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video for a text query is often challenging, as it requires reasoning about both high-level (scene) and low-level (object) visual clues and how they relate to the query. To this end, we propose a Unified Coarse-to-Fine Alignment model, dubbed UCoFiA. Specifically, our model captures cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation (ISA) module that weighs the importance of different visual features while aggregating the cross-modal similarities into a score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues across levels. By jointly considering cross-modal similarity at different granularities, UCoFiA enables the effective unification of multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4%, and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at https://github.com/Ziyang412/UCoFiA.
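For intuition, here is a minimal NumPy sketch of the level-fusion step the abstract describes: each level's similarity matrix is normalized with a few Sinkhorn-Knopp iterations before the levels are summed. This is an illustrative reconstruction, not the paper's implementation; the `sinkhorn_normalize` helper, the three-level toy setup, and the matrix sizes are assumptions, and the actual model (including the ISA module) lives in the official repository linked above.

```python
import numpy as np

def sinkhorn_normalize(sim, n_iters=4, eps=1e-8):
    """Balance a similarity matrix by alternating row and column
    normalization (Sinkhorn-Knopp iterations), so that no single
    query or video accumulates a disproportionate share of mass."""
    s = np.exp(sim)  # map similarities to positive entries
    for _ in range(n_iters):
        s = s / (s.sum(axis=1, keepdims=True) + eps)  # normalize rows
        s = s / (s.sum(axis=0, keepdims=True) + eps)  # normalize columns
    return s

# Toy example: three granularity levels of text-video similarity
# (e.g., video-, frame-, and patch-level scores; names are illustrative).
rng = np.random.default_rng(0)
levels = [rng.normal(size=(5, 5)) for _ in range(3)]

# Normalize each level before summing, so over- or under-represented
# levels do not dominate the final retrieval score.
final_sim = sum(sinkhorn_normalize(s) for s in levels)
print(final_sim.shape)  # (5, 5) fused text-video score matrix
```

In this toy run, final_sim[i, j] plays the role of the fused text-video score used to rank videos for each text query.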

Results

Task             Dataset      Metric              Value  Model
Video Retrieval  MSR-VTT-1kA  text-to-video R@1   49.4   UCoFiA
Video Retrieval  MSR-VTT-1kA  text-to-video R@5   72.1   UCoFiA
Video Retrieval  MSR-VTT-1kA  text-to-video R@10  83.5   UCoFiA
Video Retrieval  MSR-VTT-1kA  video-to-text R@1   47.1   UCoFiA
Video Retrieval  MSR-VTT-1kA  video-to-text R@5   74.3   UCoFiA
Video Retrieval  MSR-VTT-1kA  video-to-text R@10  83.0   UCoFiA
Video Retrieval  MSR-VTT      text-to-video R@1   49.4   UCoFiA
Video Retrieval  MSR-VTT      text-to-video R@5   72.1   UCoFiA
Video Retrieval  MSR-VTT      text-to-video R@10  83.5   UCoFiA

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)