Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alex Bronstein

Published: 2020-03-06
Tasks: Question Answering · Video Retrieval · Representation Learning · Zero-Shot Video Retrieval · Density Estimation · Text-to-Video Retrieval · Video Question Answering · Noise Estimation · Retrieval · Visual Question Answering (VQA) · Visual Question Answering
Paper · PDF · Code (official)

Abstract

One of the key factors in enabling machine learning models to comprehend and solve real-world tasks is leveraging multimodal data. Unfortunately, annotating multimodal data is challenging and expensive. Recently, self-supervised multimodal methods that combine vision and language have been proposed to learn multimodal representations without annotation. However, these methods often ignore the presence of high levels of noise and thus yield sub-optimal results. In this work, we show that the problem of noise estimation for multimodal data can be reduced to a multimodal density estimation task. Using multimodal density estimation, we propose a noise estimation building block for multimodal representation learning that is based strictly on the inherent correlation between the different modalities. We demonstrate how our noise estimation can be broadly integrated and achieves results comparable to the state of the art on five benchmark datasets for two challenging multimodal tasks: Video Question Answering and Text-to-Video Retrieval. Furthermore, we provide a theoretical probabilistic error bound substantiating our empirical results and analyze failure cases. Code: https://github.com/elad-amrani/ssml.
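The abstract's core idea — scoring how noisy a (video, caption) pair is by the density of its cross-modal similarity — can be illustrated with a minimal sketch. This is not the paper's actual estimator (see the linked repository for that); the embeddings, the k-nearest-neighbor density proxy, and the noise score here are illustrative assumptions.

```python
import numpy as np

def estimate_noise_scores(video_emb, text_emb, k=3):
    """Score each (video, caption) pair by comparing its own cross-modal
    similarity to a local density estimate; higher score = likely noisier.
    Illustrative sketch only, not the paper's exact method."""
    # Normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Cross-modal similarity of every video to every caption.
    sim = v @ t.T                        # shape (n, n)
    # Density proxy: mean similarity to the k most similar captions.
    density = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # A pair whose own caption similarity falls far below the local
    # density is treated as a noisy (mismatched) pair.
    own = np.diag(sim)
    return density - own

# Toy usage: 4 pairs, one deliberately mismatched.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
t = v + 0.05 * rng.normal(size=(4, 8))  # captions close to their videos
t[2] = rng.normal(size=8)               # pair 2 is "noise"
scores = estimate_noise_scores(v, t)    # pair 2 should score highest
```

The key property mirrored from the abstract is that the score uses only the inherent correlation between modalities — no labels or annotation are required.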

Results

| Task                            | Dataset   | Metric                    | Value | Model |
|---------------------------------|-----------|---------------------------|-------|-------|
| Video Retrieval                 | MSVD      | text-to-video Median Rank | 6     | SSML  |
| Video Retrieval                 | MSVD      | text-to-video R@1         | 20.3  | SSML  |
| Video Retrieval                 | MSVD      | text-to-video R@5         | 49    | SSML  |
| Video Retrieval                 | MSVD      | text-to-video R@10        | 63.3  | SSML  |
| Visual Question Answering (VQA) | MSVD-QA   | Accuracy                  | 0.351 | SSML  |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy                  | 0.35  | SSML  |
| Zero-Shot Video Retrieval       | MSR-VTT   | text-to-video R@1         | 8     | SSML  |
| Zero-Shot Video Retrieval       | MSR-VTT   | text-to-video R@5         | 21.3  | SSML  |
| Zero-Shot Video Retrieval       | MSR-VTT   | text-to-video R@10        | 29.3  | SSML  |
| Zero-Shot Video Retrieval       | MSVD      | text-to-video R@1         | 13.66 | SSML  |
| Zero-Shot Video Retrieval       | MSVD      | text-to-video R@5         | 35.7  | SSML  |
| Zero-Shot Video Retrieval       | MSVD      | text-to-video R@10        | 47.74 | SSML  |
| Zero-Shot Video Retrieval       | LSMDC     | text-to-video R@1         | 4.2   | SSML  |
| Zero-Shot Video Retrieval       | LSMDC     | text-to-video R@5         | 11.6  | SSML  |
| Zero-Shot Video Retrieval       | LSMDC     | text-to-video R@10        | 17.1  | SSML  |

Related Papers

- Missing value imputation with adversarial random forests -- MissARF (2025-07-21)
- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)