Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Rudder: A Cross Lingual Video and Text Retrieval Dataset

Jayaprakash A, Abhishek, Rishabh Dabral, Ganesh Ramakrishnan, Preethi Jyothi

2021-03-09 · Video Retrieval · Video-Text Retrieval · Text Retrieval · Retrieval · Natural Language Queries

Paper · PDF · Code (official)

Abstract

Video retrieval using natural language queries requires learning semantically meaningful joint embeddings between the text and the audio-visual input. Often, such joint embeddings are learnt using pairwise (or triplet) contrastive loss objectives which cannot give enough attention to 'difficult-to-retrieve' samples during training. This problem is especially pronounced in data-scarce settings where the data is relatively small (10% of the large-scale MSR-VTT) to cover the rather complex audio-visual embedding space. In this context, we introduce Rudder - a multilingual video-text retrieval dataset that includes audio and textual captions in Marathi, Hindi, Tamil, Kannada, Malayalam and Telugu. Furthermore, we propose to compensate for data scarcity by using domain knowledge to augment supervision. To this end, in addition to the conventional three samples of a triplet (anchor, positive, and negative), we introduce a fourth term - a partial - to define a differential margin based partial-order loss. The partials are heuristically sampled such that they semantically lie in the overlap zone between the positives and the negatives, thereby resulting in broader embedding coverage. Our proposals consistently outperform the conventional max-margin and triplet losses and improve the state-of-the-art on the MSR-VTT and DiDeMo datasets. We report benchmark results on Rudder while also observing significant gains using the proposed partial order loss, especially when the language-specific retrieval models are jointly trained by availing the cross-lingual alignment across the language-specific datasets.
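To make the quadruplet idea concrete, below is a minimal sketch of a differential-margin partial-order loss. It is an illustration of the general scheme described in the abstract, not the paper's exact formulation: the cosine distance, the two margin values, and the function names are all assumptions introduced here.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def partial_order_loss(anchor, positive, partial, negative,
                       margin_pos=0.1, margin_neg=0.3):
    """Illustrative quadruplet loss: the partial should rank between the
    positive and the negative. Two hinge terms with different (differential)
    margins enforce that ordering; the margin values are assumptions."""
    d_ap = cosine_dist(anchor, positive)
    d_aq = cosine_dist(anchor, partial)
    d_an = cosine_dist(anchor, negative)
    # positive must be closer to the anchor than the partial, by margin_pos
    l_pos = max(0.0, margin_pos + d_ap - d_aq)
    # partial must be closer to the anchor than the negative, by margin_neg
    l_neg = max(0.0, margin_neg + d_aq - d_an)
    return l_pos + l_neg
```

Compared to a plain triplet loss, the extra hinge on the partial pushes heuristically sampled "in-between" samples into the zone separating positives from negatives, which is the broader-coverage effect the abstract describes.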

Results

| Task            | Dataset      | Metric                    | Value  | Model   |
|-----------------|--------------|---------------------------|--------|---------|
| Video Retrieval | RUDDER       | text-to-video Mean Rank   | 66     | PO Loss |
| Video Retrieval | RUDDER       | text-to-video Median Rank | 153.14 | PO Loss |
| Video Retrieval | RUDDER       | text-to-video R@1         | 4.48   | PO Loss |
| Video Retrieval | RUDDER       | text-to-video R@5         | 13.47  | PO Loss |
| Video Retrieval | RUDDER       | text-to-video R@10        | 20.02  | PO Loss |
| Video Retrieval | RUDDER       | text-to-video R@50        | 42.49  | PO Loss |
| Video Retrieval | RUDDER       | video-to-text Mean Rank   | 73     | PO Loss |
| Video Retrieval | RUDDER       | video-to-text Median Rank | 151.63 | PO Loss |
| Video Retrieval | RUDDER       | video-to-text R@1         | 3.87   | PO Loss |
| Video Retrieval | RUDDER       | video-to-text R@5         | 12.13  | PO Loss |
| Video Retrieval | RUDDER       | video-to-text R@10        | 19.09  | PO Loss |
| Video Retrieval | DiDeMo       | text-to-video Mean Rank   | 40.2   | PO Loss |
| Video Retrieval | DiDeMo       | text-to-video Median Rank | 8      | PO Loss |
| Video Retrieval | DiDeMo       | text-to-video R@1         | 16.3   | PO Loss |
| Video Retrieval | DiDeMo       | text-to-video R@10        | 56.5   | PO Loss |
| Video Retrieval | DiDeMo       | video-to-text Mean Rank   | 39.6   | PO Loss |
| Video Retrieval | DiDeMo       | video-to-text Median Rank | 8      | PO Loss |
| Video Retrieval | DiDeMo       | video-to-text R@1         | 15     | PO Loss |
| Video Retrieval | DiDeMo       | video-to-text R@10        | 54.9   | PO Loss |
| Video Retrieval | Charades-STA | text-to-video Mean Rank   | 162.3  | PO Loss |
| Video Retrieval | Charades-STA | text-to-video Median Rank | 77     | PO Loss |
| Video Retrieval | Charades-STA | text-to-video R@1         | 3.6    | PO Loss |
| Video Retrieval | Charades-STA | text-to-video R@10        | 15.9   | PO Loss |
| Video Retrieval | Charades-STA | video-to-text Mean Rank   | 164.6  | PO Loss |
| Video Retrieval | Charades-STA | video-to-text Median Rank | 83     | PO Loss |
| Video Retrieval | Charades-STA | video-to-text R@1         | 3.2    | PO Loss |
| Video Retrieval | Charades-STA | video-to-text R@10        | 14.9   | PO Loss |
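The metrics in the table above (R@K, median rank, mean rank) are all derived from the rank of the correct item in each query's retrieval list. A minimal sketch of how they are typically computed from a similarity matrix is shown below; it assumes the standard benchmark convention that the ground-truth match for query i is item i, and is not code from the paper.

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute rank-based retrieval metrics from a [queries x items]
    similarity matrix, assuming item i is the ground truth for query i.
    Returns R@K as percentages plus median and mean rank (1-indexed)."""
    n = sim.shape[0]
    # rank of the correct item = 1 + number of items scored strictly higher
    ranks = np.array([1 + np.sum(sim[i] > sim[i, i]) for i in range(n)])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in (1, 5, 10, 50)}
    metrics["Median Rank"] = float(np.median(ranks))
    metrics["Mean Rank"] = float(np.mean(ranks))
    return metrics
```

Note that higher is better for R@K, while lower is better for mean and median rank, which is why the stronger DiDeMo numbers show small ranks alongside large R@K values.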

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)