
Paxion: Patching Action Knowledge in Video-Language Foundation Models

Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji

2023-05-18 · NeurIPS 2023 · Tasks: Object Recognition, Video Question Answering, Diagnostic, Action Understanding

Paper · PDF · Code (official)

Abstract

Action knowledge involves the understanding of textual, visual, and temporal aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench) containing two carefully designed probing tasks: Action Antonym and Video Reversal, which target the model's multimodal alignment capabilities and temporal understanding skills, respectively. Despite the impressive performance of recent video-language models (VidLMs) on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. The Paxion framework utilizes a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Because the widely used Video-Text Contrastive (VTC) loss has limitations for learning action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that Paxion and DVDM together effectively fill the gap in action knowledge understanding (~50% to 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks. The code and data will be made publicly available for research purposes at https://github.com/MikeWangWZHL/Paxion.git.
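
To make the training recipe described above concrete, below is a minimal, illustrative PyTorch sketch of how a DVDM-style objective could be combined with the standard VTC loss on top of a frozen backbone. This is not the authors' implementation: the `KnowledgePatcher` module, feature dimensions, helper names (`info_nce`, `pairwise_ranking`, `training_losses`), and the equal loss weighting are assumptions for illustration, and the Knowledge Fuser that re-integrates the Patcher into the frozen VidLM is omitted. See the official repository linked above for the actual code.

```python
# Illustrative sketch only (not the authors' code): a DVDM-style training step
# where a small trainable "Knowledge Patcher" sits on top of pooled features
# from a frozen video-language backbone and is trained to (1) align video with
# action text (VTC), (2) prefer correctly ordered frames over reversed ones,
# and (3) prefer the true action text over its antonym.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgePatcher(nn.Module):
    """Small trainable head over pooled features from a frozen VidLM (hypothetical)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) pooled video features from the frozen backbone
        return F.normalize(self.proj(feats), dim=-1)


def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """Symmetric Video-Text Contrastive (VTC) loss over a batch of paired clips and captions."""
    logits = video_emb @ text_emb.t() / temp
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def pairwise_ranking(pos_sim: torch.Tensor, neg_sim: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """Binary choice between a positive similarity and a hard-negative similarity."""
    logits = torch.stack([pos_sim, neg_sim], dim=1) / temp
    labels = torch.zeros(pos_sim.size(0), dtype=torch.long, device=pos_sim.device)
    return F.cross_entropy(logits, labels)


def training_losses(
    patcher: KnowledgePatcher,
    video_feats: torch.Tensor,           # frozen features of correctly ordered clips
    reversed_video_feats: torch.Tensor,  # frozen features of the same clips with frames reversed
    text_emb: torch.Tensor,              # frozen embeddings of the action text
    antonym_text_emb: torch.Tensor,      # frozen embeddings of the action-antonym text
) -> torch.Tensor:
    v = patcher(video_feats)
    v_rev = patcher(reversed_video_feats)
    t = F.normalize(text_emb, dim=-1)
    t_ant = F.normalize(antonym_text_emb, dim=-1)

    # (1) VTC: align each clip with its paired action text across the batch.
    loss_vtc = info_nce(v, t)

    # (2) Reversal discrimination: the action text should match the correctly
    #     ordered clip better than its frame-reversed counterpart.
    loss_rev = pairwise_ranking((v * t).sum(-1), (v_rev * t).sum(-1))

    # (3) Antonym discrimination: the clip should match the true action text
    #     better than its antonym (e.g. "opening" vs. "closing").
    loss_ant = pairwise_ranking((v * t).sum(-1), (v * t_ant).sum(-1))

    return loss_vtc + loss_rev + loss_ant
```

In this sketch only the Patcher's parameters receive gradients; the backbone features are treated as fixed inputs, mirroring the abstract's point that the frozen VidLM's existing capabilities are left untouched while new action knowledge is learned in a separate, lightweight module.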

Results

Task                     | Dataset | Metric   | Value | Model
Video Question Answering | NExT-QA | Accuracy | 56.9  | PAXION

Related Papers

Smart fault detection in satellite electrical power system (2025-07-18)
Demographic-aware fine-grained classification of pediatric wrist fractures (2025-07-17)
Trustworthy Tree-based Machine Learning by $MoS_2$ Flash-based Analog CAM with Inherent Soft Boundaries (2025-07-16)
LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer (2025-07-15)
AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) (2025-07-13)
Landmark Detection for Medical Images using a General-purpose Segmentation Model (2025-07-13)
Attend-and-Refine: Interactive keypoint estimation and quantitative cervical vertebrae analysis for bone age assessment (2025-07-10)
An Enhanced Privacy-preserving Federated Few-shot Learning Framework for Respiratory Disease Diagnosis (2025-07-10)