Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Revealing Single Frame Bias for Video-and-Language Learning

Jie Lei, Tamara L. Berg, Mohit Bansal

Published: 2022-06-07

Tasks: Question Answering, Video Retrieval, Zero-Shot Video Retrieval, Fine-grained Action Recognition, Text to Video Retrieval, Video Question Answering, Action Recognition, Retrieval, Language Modelling

Abstract

Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if so, whether the performance gain is worth the drastically increased computation and memory costs resulting from using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity
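The frame ensemble strategy mentioned in the abstract can be pictured as scoring each sampled frame independently with the single-frame model and aggregating the per-frame scores at inference time. Below is a minimal sketch, assuming a dual-encoder that produces L2-normalized text and frame embeddings; the function name and the mean/max aggregation choices are illustrative, not necessarily the exact ensembling used in the paper.

```python
import torch

def ensemble_video_score(text_emb: torch.Tensor,
                         frame_embs: torch.Tensor,
                         mode: str = "mean") -> torch.Tensor:
    """Score one (text, video) pair with a single-frame model by scoring
    each sampled frame independently, then aggregating at inference time.

    text_emb:   (D,)   L2-normalized text embedding
    frame_embs: (T, D) L2-normalized embeddings of T frames sampled from
                the video, each produced by the same single-frame encoder
    """
    per_frame = frame_embs @ text_emb          # (T,) cosine similarities
    if mode == "mean":
        return per_frame.mean()                # average the frame scores
    if mode == "max":
        return per_frame.max()                 # or keep the best frame
    raise ValueError(f"unknown ensemble mode: {mode!r}")
```

The key point is that temporal order never enters the computation: the model is trained on single frames, and multiple frames only interact through score aggregation at test time.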

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Video Question Answering | ActivityNet-QA | Accuracy | 44.1 | Singularity-temporal |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.1 | Singularity |
| Video Question Answering | MSRVTT-QA | Accuracy | 43.9 | Singularity-temporal |
| Video Question Answering | MSRVTT-QA | Accuracy | 43.5 | Singularity |
| Video Question Answering | MSRVTT-MC | Accuracy | 93.7 | Singularity-temporal |
| Video Question Answering | MSRVTT-MC | Accuracy | 92.1 | Singularity |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 41.5 | Singularity |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 68.7 | Singularity |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 77 | Singularity |
| Video Retrieval | SSv2-template retrieval | text-to-video R@1 | 77.6 | Singularity-temporal |
| Video Retrieval | SSv2-template retrieval | text-to-video R@5 | 96 | Singularity-temporal |
| Video Retrieval | SSv2-template retrieval | text-to-video R@10 | 98.9 | Singularity-temporal |
| Video Retrieval | ActivityNet | text-to-video R@1 | 47.1 | Singularity |
| Video Retrieval | ActivityNet | text-to-video R@5 | 75.5 | Singularity |
| Video Retrieval | ActivityNet | text-to-video R@10 | 85.5 | Singularity |
| Video Retrieval | SSv2-label retrieval | text-to-video R@1 | 47.4 | Singularity-temporal |
| Video Retrieval | SSv2-label retrieval | text-to-video R@5 | 75.9 | Singularity-temporal |
| Video Retrieval | SSv2-label retrieval | text-to-video R@10 | 84 | Singularity-temporal |
| Video Retrieval | DiDeMo | text-to-video R@1 | 53.9 | Singularity |
| Video Retrieval | DiDeMo | text-to-video R@5 | 79.4 | Singularity |
| Video Retrieval | DiDeMo | text-to-video R@10 | 86.9 | Singularity |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 34 | Singularity-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 56.7 | Singularity-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 66.7 | Singularity-17M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 28.4 | Singularity-5M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 50.2 | Singularity-5M |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 59.5 | Singularity-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 37.1 | Singularity-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 61.7 | Singularity-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 69.9 | Singularity-17M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 36.9 | Singularity-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 61.1 | Singularity-5M |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 69.3 | Singularity-5M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 30.8 | Singularity-temporal-5M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 55.9 | Singularity-temporal-5M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 66.3 | Singularity-temporal-5M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 30.6 | Singularity-temporal-17M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 55.6 | Singularity-temporal-17M |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 66.9 | Singularity-temporal-17M |
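Every retrieval number in the table is text-to-video Recall@K: the percentage of text queries whose ground-truth video appears among the top K videos ranked by similarity. Below is a minimal sketch of how the metric is computed, assuming a precomputed similarity matrix with ground-truth pairs on the diagonal; the function name and layout are illustrative, not taken from the paper's code.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Compute text-to-video Recall@K from a similarity matrix.

    sim: (N, N) array where sim[i, j] scores text query i against video j,
         and video i is the ground-truth match for text i.
    """
    order = np.argsort(-sim, axis=1)       # videos ranked best-first per query
    gt = np.arange(sim.shape[0])[:, None]  # ground-truth video index per row
    rank = np.argmax(order == gt, axis=1)  # position of the true video (0-based)
    return {f"R@{k}": float((rank < k).mean() * 100) for k in ks}
```

For example, an R@5 of 68.7 on MSR-VTT-1kA means that for 68.7% of the text queries, the correct video was ranked among the top five candidates.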

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)