Zero-Shot Video Question Answering with Procedural Programs

Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, László A. Jeni

2023-12-01

Zero-Shot Video Question Answer, Question Answering, Video Editing, Multi-Object Tracking, Video Question Answering, Object Tracking, Large Language Model, Code Generation, Video Understanding, Visual Question Answering (VQA), Language Modelling

Abstract

We propose to answer zero-shot questions about videos by generating short procedural programs that derive a final answer by solving a sequence of visual subtasks. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules given in the prompt, then executes them to obtain the output. Similar procedural approaches have recently proven successful for image question answering, but videos remain challenging: we provide ProViQ with modules intended for video understanding, allowing it to generalize to a wide variety of videos. This code generation framework also enables ProViQ to perform video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023.
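The generate-then-execute loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from ProViQ: the modules `find_frames` and `count_frames`, the helper `build_prompt`, the stub `fake_llm` that stands in for a real language model, and the toy representation of a video as a list of text captions.

```python
from typing import Callable, Dict, List

# Toy stand-in: a "video" is a list of per-frame captions (assumption, not ProViQ's format).
Video = List[str]

def find_frames(video: Video, query: str) -> Video:
    """find_frames(video, query): return frames whose caption mentions query."""
    return [frame for frame in video if query in frame]

def count_frames(video: Video) -> int:
    """count_frames(video): return the number of frames."""
    return len(video)

# Hypothetical module API exposed to the generated program.
API: Dict[str, Callable] = {"find_frames": find_frames, "count_frames": count_frames}

def build_prompt(question: str) -> str:
    """Describe the available modules and the question to the language model."""
    docs = "\n".join(fn.__doc__ for fn in API.values())
    return f"Modules:\n{docs}\n\nWrite `def answer(video)` for: {question}"

def fake_llm(prompt: str) -> str:
    """Stub for a real LLM call; always returns the same hand-written program."""
    return (
        "def answer(video):\n"
        "    dog_frames = find_frames(video, 'dog')\n"
        "    return 'yes' if count_frames(dog_frames) > 0 else 'no'\n"
    )

def run_program(source: str, video: Video) -> str:
    """Execute the generated program with only the module API in scope."""
    namespace = dict(API)
    exec(source, namespace)  # defines answer() inside namespace
    return namespace["answer"](video)

if __name__ == "__main__":
    video = ["a dog runs on grass", "a cat sleeps"]
    program = fake_llm(build_prompt("Is there a dog in the video?"))
    print(run_program(program, video))
```

In this sketch the question is answered by composing two subtasks (retrieve matching frames, then count them). A real system would replace the caption lookups with vision models and would need to sandbox the execution of model-generated code.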

Results

Task	Dataset	Metric	Value	Model
Question Answering	NExT-QA	Accuracy	64.6	ProViQ
Video Question Answering	NExT-QA	Accuracy	64.6	ProViQ
