Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Published: 2023-03-25 · CVPR 2023
Tasks: Question Answering · Video Retrieval · Representation Learning · Video Question Answering · Contrastive Learning · Retrieval · Visual Question Answering (VQA)
Links: Paper · PDF · Code (official)

Abstract

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance by pursuing semantic interaction over pre-defined video-text pairs. To move beyond this coarse-grained global interaction, we must confront the challenging, shell-breaking interactions required for fine-grained cross-modal learning. In this paper, we model video and text as game players using multivariate cooperative game theory to handle the uncertainty of fine-grained semantic interaction, which comes with diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value the possible correspondences between video frames and text words for sensitive and explainable cross-modal contrast. To realize the cooperative game among multiple video frames and multiple text words efficiently, the proposed method clusters the original video frames (text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video question answering benchmarks demonstrate the efficacy of HBI. More encouragingly, it can also serve as a visualization tool to promote the understanding of cross-modal interaction, which can have a far-reaching impact on the community. The project page is available at https://jpthu17.github.io/HBI/.
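The core quantity the abstract describes is the Banzhaf Interaction between frame and word "players". As a rough illustration only, the sketch below estimates the Banzhaf Interaction index I(i, j) between two token players via Monte Carlo sampling of coalitions. The embeddings and the coalition value function `value()` are hypothetical stand-ins (HBI derives its value function from cross-modal similarity between clustered video and text tokens), so this shows the game-theoretic quantity, not the authors' implementation.

```python
# Hypothetical sketch: Monte Carlo estimate of the Banzhaf Interaction index
#   I(i, j) = E_S[ v(S ∪ {i,j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S) ],
# where S is a uniformly random coalition of the remaining players
# (each player included independently with probability 1/2).
import random
import numpy as np

def banzhaf_interaction(value, players, i, j, num_samples=1000, seed=0):
    """Estimate the Banzhaf Interaction between players i and j."""
    rng = random.Random(seed)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for _ in range(num_samples):
        s = {p for p in others if rng.random() < 0.5}  # random coalition
        total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    return total / num_samples

# Toy value function (illustrative, not HBI's): cosine similarity between
# the mean embedding of the coalition and a fixed query embedding.
np_rng = np.random.default_rng(0)
emb = {p: np_rng.normal(size=8) for p in range(6)}  # 6 token "players"
query = np_rng.normal(size=8)

def value(coalition):
    if not coalition:
        return 0.0
    m = np.mean([emb[p] for p in coalition], axis=0)
    return float(m @ query / (np.linalg.norm(m) * np.linalg.norm(query)))

print(banzhaf_interaction(value, list(range(6)), i=0, j=1))
```

Exact computation sums over all 2^(n-2) coalitions of the remaining players, which grows exponentially; this is why, per the abstract, HBI first merges tokens by clustering so the game is played among a small number of merged players.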

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.462 | HBI |
| Video Question Answering | MSRVTT-QA | Accuracy | 46.2 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 12 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 2 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 48.6 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 74.6 | HBI |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 83.4 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 8.9 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text Median Rank | 2 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 46.8 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 74.3 | HBI |
| Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 84.3 | HBI |
| Video Retrieval | ActivityNet | text-to-video Mean Rank | 6.6 | HBI |
| Video Retrieval | ActivityNet | text-to-video Median Rank | 2 | HBI |
| Video Retrieval | ActivityNet | text-to-video R@1 | 42.2 | HBI |
| Video Retrieval | ActivityNet | text-to-video R@5 | 73 | HBI |
| Video Retrieval | ActivityNet | text-to-video R@10 | 84.6 | HBI |
| Video Retrieval | ActivityNet | video-to-text Mean Rank | 6.5 | HBI |
| Video Retrieval | ActivityNet | video-to-text Median Rank | 2 | HBI |
| Video Retrieval | ActivityNet | video-to-text R@1 | 42.4 | HBI |
| Video Retrieval | ActivityNet | video-to-text R@5 | 73 | HBI |
| Video Retrieval | ActivityNet | video-to-text R@10 | 86 | HBI |
| Video Retrieval | DiDeMo | text-to-video Mean Rank | 12.1 | HBI |
| Video Retrieval | DiDeMo | text-to-video Median Rank | 2 | HBI |
| Video Retrieval | DiDeMo | text-to-video R@1 | 46.9 | HBI |
| Video Retrieval | DiDeMo | text-to-video R@5 | 74.9 | HBI |
| Video Retrieval | DiDeMo | text-to-video R@10 | 82.7 | HBI |
| Video Retrieval | DiDeMo | video-to-text Mean Rank | 8.7 | HBI |
| Video Retrieval | DiDeMo | video-to-text Median Rank | 2 | HBI |
| Video Retrieval | DiDeMo | video-to-text R@1 | 46.2 | HBI |
| Video Retrieval | DiDeMo | video-to-text R@5 | 73 | HBI |
| Video Retrieval | DiDeMo | video-to-text R@10 | 82.7 | HBI |

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)