Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alex Bronstein

Published: 2020-03-06
Tasks: Question Answering · Video Retrieval · Representation Learning · Zero-Shot Video Retrieval · Density Estimation · Text-to-Video Retrieval · Video Question Answering · Noise Estimation · Retrieval · Visual Question Answering (VQA) · Visual Question Answering
Paper · PDF · Code (official)

Abstract

One of the key factors in enabling machine learning models to comprehend and solve real-world tasks is leveraging multimodal data. Unfortunately, annotating multimodal data is challenging and expensive. Recently, self-supervised multimodal methods that combine vision and language have been proposed to learn multimodal representations without annotation. However, these methods often ignore the presence of high levels of noise and thus yield sub-optimal results. In this work, we show that the problem of noise estimation for multimodal data can be reduced to a multimodal density estimation task. Using multimodal density estimation, we propose a noise estimation building block for multimodal representation learning that is based strictly on the inherent correlation between the different modalities. We demonstrate how our noise estimation can be broadly integrated and achieves results comparable to the state of the art on five benchmark datasets for two challenging multimodal tasks: Video Question Answering and Text-to-Video Retrieval. Furthermore, we provide a theoretical probabilistic error bound substantiating our empirical results and analyze failure cases. Code: https://github.com/elad-amrani/ssml.
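The abstract's core idea — scoring how noisy a (video, caption) pair is by the density of its cross-modal similarity — can be illustrated with a minimal sketch. This is not the paper's actual estimator (see the linked repository for that); the embeddings, the k-nearest-neighbor density proxy, and the noise score here are illustrative assumptions.

```python
import numpy as np

def estimate_noise_scores(video_emb, text_emb, k=3):
    """Score each (video, caption) pair by comparing its own cross-modal
    similarity to a local density estimate; higher score = likely noisier.
    Illustrative sketch only, not the paper's exact method."""
    # Normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Cross-modal similarity of every video to every caption.
    sim = v @ t.T                        # shape (n, n)
    # Density proxy: mean similarity to the k most similar captions.
    density = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # A pair whose own caption similarity falls far below the local
    # density is treated as a noisy (mismatched) pair.
    own = np.diag(sim)
    return density - own

# Toy usage: 4 pairs, one deliberately mismatched.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
t = v + 0.05 * rng.normal(size=(4, 8))  # captions close to their videos
t[2] = rng.normal(size=8)               # pair 2 is "noise"
scores = estimate_noise_scores(v, t)    # pair 2 should score highest
```

The key property mirrored from the abstract is that the score uses only the inherent correlation between modalities — no labels or annotation are required.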

Results

| Task                            | Dataset   | Metric                    | Value | Model |
|---------------------------------|-----------|---------------------------|-------|-------|
| Video Retrieval                 | MSVD      | text-to-video Median Rank | 6     | SSML  |
| Video Retrieval                 | MSVD      | text-to-video R@1         | 20.3  | SSML  |
| Video Retrieval                 | MSVD      | text-to-video R@5         | 49    | SSML  |
| Video Retrieval                 | MSVD      | text-to-video R@10        | 63.3  | SSML  |
| Visual Question Answering (VQA) | MSVD-QA   | Accuracy                  | 0.351 | SSML  |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy                  | 0.35  | SSML  |
| Zero-Shot Video Retrieval       | MSR-VTT   | text-to-video R@1         | 8     | SSML  |
| Zero-Shot Video Retrieval       | MSR-VTT   | text-to-video R@5         | 21.3  | SSML  |
| Zero-Shot Video Retrieval       | MSR-VTT   | text-to-video R@10        | 29.3  | SSML  |
| Zero-Shot Video Retrieval       | MSVD      | text-to-video R@1         | 13.66 | SSML  |
| Zero-Shot Video Retrieval       | MSVD      | text-to-video R@5         | 35.7  | SSML  |
| Zero-Shot Video Retrieval       | MSVD      | text-to-video R@10        | 47.74 | SSML  |
| Zero-Shot Video Retrieval       | LSMDC     | text-to-video R@1         | 4.2   | SSML  |
| Zero-Shot Video Retrieval       | LSMDC     | text-to-video R@5         | 11.6  | SSML  |
| Zero-Shot Video Retrieval       | LSMDC     | text-to-video R@10        | 17.1  | SSML  |

Related Papers

- Missing value imputation with adversarial random forests -- MissARF (2025-07-21)
- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)