
AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning

Kaixuan Wu, Xinde Li, Xinling Li, Chuanfei Hu, Guoliang Wu

2025-01-01 · CVPR 2025
Continual Learning · Question Answering · Scene Understanding · Audio-visual Question Answering · Knowledge Distillation · Visual Question Answering
Paper · PDF · Code (official)

Abstract

In this paper, a novel benchmark for audio-visual question answering continual learning (AVQACL) is introduced, aiming to study fine-grained scene understanding and spatial-temporal reasoning in videos under a continual learning setting. To facilitate this multimodal continual learning task, we create two audio-visual question answering continual learning datasets, named Split-AVQA and Split-MUSIC-AVQA, based on the AVQA and MUSIC-AVQA datasets, respectively. The experimental results suggest that the model exhibits limited cognitive and reasoning abilities and experiences catastrophic forgetting when processing three modalities simultaneously in a continuous data stream. To address the above challenges, we propose a novel continual learning method that incorporates question-guided cross-modal information fusion (QCIF) to focus on question-relevant details for improved feature representation, and task-specific knowledge distillation with spatial-temporal feature constraints (TKD-STFC) to preserve the spatial-temporal reasoning knowledge acquired from previous dynamic scenarios. Furthermore, a question semantic consistency constraint (QSCC) is employed to ensure that the model maintains a consistent understanding of question semantics across tasks throughout the continual learning process. Extensive experimental results on the Split-AVQA and Split-MUSIC-AVQA datasets illustrate that our method achieves state-of-the-art audio-visual question answering continual learning performance. The code is available at https://github.com/kx-wu/CVPR2025_AVQACL.
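As a rough illustration of the three components named in the abstract, the following PyTorch sketch shows a question-guided cross-modal fusion module together with two distillation-style regularizers. All module names, dimensions, and loss forms here are assumptions made for illustration only; they are not taken from the paper or its official code.

# Minimal sketch of the abstract's three components: question-guided
# cross-modal information fusion (QCIF-style), task-specific knowledge
# distillation with spatial-temporal feature constraints (TKD-STFC-style),
# and a question semantic consistency constraint (QSCC-style).
# All names, shapes, and loss choices below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedFusion(nn.Module):
    """Fuse audio and visual features using the question as the attention query."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, question, audio, visual):
        # question: (B, Lq, D); audio: (B, La, D); visual: (B, Lv, D)
        q_audio, _ = self.audio_attn(question, audio, audio)      # question attends to audio
        q_visual, _ = self.visual_attn(question, visual, visual)  # question attends to video frames
        # Concatenate the two question-conditioned views and project back to D,
        # yielding a question-relevant multimodal representation.
        return self.proj(torch.cat([q_audio, q_visual], dim=-1))


def continual_regularizers(student_feat, teacher_feat, student_q, teacher_q):
    """Hypothetical regularizers applied when training on a new task.

    teacher_* come from a frozen copy of the model trained on previous
    tasks; student_* come from the model currently being trained.
    """
    # TKD-STFC-style term: keep the student's spatial-temporal features
    # close to the frozen teacher's, preserving earlier reasoning knowledge.
    tkd_stfc = F.mse_loss(student_feat, teacher_feat.detach())
    # QSCC-style term: keep question semantics consistent across tasks
    # via cosine similarity between student and teacher question embeddings.
    qscc = 1.0 - F.cosine_similarity(student_q, teacher_q.detach(), dim=-1).mean()
    return tkd_stfc, qscc

In a continual-learning loop, these two terms would presumably be added, with tuned weights, to the standard answer-prediction loss for the current task, while the teacher remains frozen.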

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)