Hierarchical Conditional Relation Networks for Video Question Answering

Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran

2020-02-25CVPR 2020 6Question Answering Video Question Answering Visual Question Answering (VQA)Audio-Visual Question Answering (AVQA)

Paper PDF Code(official)

Abstract

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts. We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN) that serves as a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning. The resulting architecture for VideoQA is a CRN hierarchy whose branches represent sub-videos or clips, all sharing the same question as the contextual condition. Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	MSRVTT-QA	Accuracy	0.356	HCRN
Visual Question Answering (VQA)	MSVD-QA	Accuracy	0.361	HCRN
Video Question Answering	SUTD-TrafficQA	1/2	63.79	HCRN
Video Question Answering	SUTD-TrafficQA	1/4	36.49	HCRN

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17 Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering2025-07-17 Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It2025-07-17 City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning2025-07-17 VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17 Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16 Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility2025-07-16 MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM2025-07-16