
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval

Hao Li, Jingkuan Song, Lianli Gao, Xiaosu Zhu, Heng Tao Shen

Published: 2023-09-29 · NeurIPS 2023

Tasks: Uncertainty Quantification, Cross-Modal Retrieval, Video Retrieval, Image-Text Matching, Video-Text Retrieval, Text Retrieval, Text-to-Video Retrieval, Image-to-Text Retrieval, Retrieval, Video-to-Text Retrieval

Paper · PDF · Code (official)

Abstract

Cross-modal retrieval methods build similarity relations between the vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity. Concretely, we first construct a set of diverse learnable prototypes for each modality to represent the entire semantic subspace. Then, Dempster-Shafer Theory and Subjective Logic Theory are utilized to build an evidential theoretical framework by associating evidence with Dirichlet distribution parameters. The PAU model induces accurate uncertainty estimates and reliable predictions for cross-modal retrieval. Extensive experiments are performed on four major benchmark datasets, MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
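The evidential recipe the abstract describes (prototype similarities → non-negative evidence → Dirichlet parameters → uncertainty mass) follows the standard Subjective Logic construction. Below is a minimal PyTorch sketch of that general pipeline for one modality; the class name, shapes, and the softplus evidence mapping are illustrative assumptions, not the official PAU implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeUncertainty(nn.Module):
    """Sketch of prototype-based evidential uncertainty for one modality.

    Assumptions (not taken from the paper's code): K learnable prototypes
    span the modality's semantic subspace, and evidence is a softplus of
    the scaled cosine similarity between an embedding and each prototype.
    """

    def __init__(self, embed_dim: int = 512, num_prototypes: int = 8, scale: float = 10.0):
        super().__init__()
        # K learnable prototypes representing the semantic subspace.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, embed_dim))
        self.scale = scale

    def forward(self, x: torch.Tensor):
        # x: (batch, embed_dim) embeddings from the vision or language encoder.
        sims = F.normalize(x, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        # Map similarities to non-negative evidence.
        evidence = F.softplus(self.scale * sims)          # (batch, K)
        # Subjective Logic: Dirichlet parameters alpha = evidence + 1.
        alpha = evidence + 1.0
        strength = alpha.sum(dim=-1, keepdim=True)        # Dirichlet strength S
        K = alpha.shape[-1]
        # Uncertainty mass u = K / S; belief masses b_k = e_k / S.
        uncertainty = K / strength                        # (batch, 1), in (0, 1]
        belief = evidence / strength                      # (batch, K)
        return uncertainty.squeeze(-1), belief

# Usage: rank retrieval candidates as usual, but flag high-uncertainty queries.
if __name__ == "__main__":
    head = PrototypeUncertainty()
    video_emb = torch.randn(4, 512)   # stand-in for encoder outputs
    u, b = head(video_emb)
    print(u)  # per-sample uncertainty; higher = less trustworthy prediction
```

In this construction the K belief masses plus the uncertainty mass sum to one (Σ e_k / S + K / S = S / S = 1), so the uncertainty falls as total accumulated evidence grows.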

Results

All results are for the PAU model on the Video Retrieval task (recall values in %; lower is better for the rank metrics):

| Dataset | Direction | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|
| MSR-VTT-1kA | text-to-video | 48.5 | 72.7 | 82.5 | 2 | 14 |
| MSR-VTT-1kA | video-to-text | 48.3 | 73 | 83.2 | 2 | 9.7 |
| DiDeMo | text-to-video | 48.6 | 76 | 84.5 | 2 | 12.9 |
| DiDeMo | video-to-text | 48.1 | 74.2 | 85.7 | 2 | 9.8 |
| MSVD | text-to-video | 47.3 | 77.4 | 85.5 | 2 | 9.6 |
| MSVD | video-to-text | 68.9 | 93.1 | 97.1 | 1 | 2.4 |
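For readers unfamiliar with these metrics: R@K is the percentage of queries whose ground-truth item appears in the top K retrieved results, while Median/Mean Rank summarize the rank distribution of the ground truth. A minimal NumPy sketch, assuming the usual MSR-VTT-style setup in which the ground-truth match for query i is candidate i of a query-by-candidate similarity matrix:

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Compute R@K, Median Rank, and Mean Rank from a similarity matrix.

    Assumes sim[i, j] scores query i against candidate j, with the
    ground-truth match for query i at column i.
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    # 1-based rank of the ground-truth candidate for each query.
    ranks = np.where(order == np.arange(n)[:, None])[1] + 1
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "Median Rank": float(np.median(ranks)),
        "Mean Rank": float(np.mean(ranks)),
    }
```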

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Distributional Reinforcement Learning on Path-dependent Options (2025-07-16)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)