Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Published: 2023-03-25 · CVPR 2023
Tasks: Question Answering · Video Retrieval · Representation Learning · Video Question Answering · Contrastive Learning · Retrieval · Visual Question Answering (VQA)
Links: Paper · PDF · Code (official)

Abstract

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance by pursuing semantic interaction over pre-defined video-text pairs. To move beyond this coarse-grained global interaction, we must confront the challenging, shell-breaking interactions required for fine-grained cross-modal learning. In this paper, we model video and text as game players using multivariate cooperative game theory to handle the uncertainty of fine-grained semantic interaction, which comes with diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value the possible correspondences between video frames and text words for sensitive and explainable cross-modal contrast. To realize the cooperative game among multiple video frames and multiple text words efficiently, the proposed method clusters the original video frames (text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video question answering benchmarks demonstrate the efficacy of HBI. More encouragingly, it can also serve as a visualization tool to promote the understanding of cross-modal interaction, which can have a far-reaching impact on the community. The project page is available at https://jpthu17.github.io/HBI/.
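The core quantity the abstract describes is the Banzhaf Interaction between frame and word "players". As a rough illustration only, the sketch below estimates the Banzhaf Interaction index I(i, j) between two token players via Monte Carlo sampling of coalitions. The embeddings and the coalition value function `value()` are hypothetical stand-ins (HBI derives its value function from cross-modal similarity between clustered video and text tokens), so this shows the game-theoretic quantity, not the authors' implementation.

```python
# Hypothetical sketch: Monte Carlo estimate of the Banzhaf Interaction index
#   I(i, j) = E_S[ v(S ∪ {i,j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S) ],
# where S is a uniformly random coalition of the remaining players
# (each player included independently with probability 1/2).
import random
import numpy as np

def banzhaf_interaction(value, players, i, j, num_samples=1000, seed=0):
    """Estimate the Banzhaf Interaction between players i and j."""
    rng = random.Random(seed)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for _ in range(num_samples):
        s = {p for p in others if rng.random() < 0.5}  # random coalition
        total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    return total / num_samples

# Toy value function (illustrative, not HBI's): cosine similarity between
# the mean embedding of the coalition and a fixed query embedding.
np_rng = np.random.default_rng(0)
emb = {p: np_rng.normal(size=8) for p in range(6)}  # 6 token "players"
query = np_rng.normal(size=8)

def value(coalition):
    if not coalition:
        return 0.0
    m = np.mean([emb[p] for p in coalition], axis=0)
    return float(m @ query / (np.linalg.norm(m) * np.linalg.norm(query)))

print(banzhaf_interaction(value, list(range(6)), i=0, j=1))
```

Exact computation sums over all 2^(n-2) coalitions of the remaining players, which grows exponentially; this is why, per the abstract, HBI first merges tokens by clustering so the game is played among a small number of merged players.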

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.462 | HBI |
| Video Question Answering | MSRVTT-QA | Accuracy | 46.2 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 12 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 48.6 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 74.6 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 83.4 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 8.9 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 2 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 46.8 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 74.3 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 84.3 | HBI |
| Video Retrieval | ActivityNet | text-to-video Mean Rank | 6.6 | HBI |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 2 | HBI |
| Video Retrieval | ActivityNet | text-to-video R@1 | 42.2 | HBI |
| Video Retrieval | ActivityNet | text-to-video R@5 | 73 | HBI |
| Video Retrieval | ActivityNet | text-to-video R@10 | 84.6 | HBI |
| Video Retrieval | ActivityNet | video-to-text Mean Rank | 6.5 | HBI |
| Video Retrieval | ActivityNet | video-to-text Median Rank | 2 | HBI |
| Video Retrieval | ActivityNet | video-to-text R@1 | 42.4 | HBI |
| Video Retrieval | ActivityNet | video-to-text R@5 | 73 | HBI |
| Video Retrieval | ActivityNet | video-to-text R@10 | 86 | HBI |
| Video Retrieval | DiDeMo | text-to-video Mean Rank | 12.1 | HBI |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 2 | HBI |
| Video Retrieval | DiDeMo | text-to-video R@1 | 46.9 | HBI |
| Video Retrieval | DiDeMo | text-to-video R@5 | 74.9 | HBI |
| Video Retrieval | DiDeMo | text-to-video R@10 | 82.7 | HBI |
| Video Retrieval | DiDeMo | video-to-text Mean Rank | 8.7 | HBI |
| Video Retrieval | DiDeMo | video-to-text Median Rank | 2 | HBI |
| Video Retrieval | DiDeMo | video-to-text R@1 | 46.2 | HBI |
| Video Retrieval | DiDeMo | video-to-text R@5 | 73 | HBI |
| Video Retrieval | DiDeMo | video-to-text R@10 | 82.7 | HBI |

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)