Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, Xiaochun Cao

Published: 2024-03-07

Tasks: Zero-Shot Video Question Answer · Question Answering · Video-based Generative Performance Benchmarking · Multimodal Large Language Model · Audio-visual Question Answering · Large Language Model · Language Modelling · Visual Question Answering · Audio-Visual Question Answering (AVQA)
Paper · PDF · Code (official)

Abstract

This paper addresses the challenge of answering questions about scenes composed of rich, dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, their responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) beyond straightforwardly bridging audio and video, we design a clue aggregator that gathers question-related clues in dynamic audio-visual scenarios, enriching the detailed knowledge available to the large language model; 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios; notably, we collect an audio-visual joint instruction dataset named AVinstruct to further strengthen CAT's ability to model cross-semantic correlations; 3) we propose AI-assisted ambiguity-aware direct preference optimization, a strategy that retrains the model to favor non-ambiguous responses and improves its ability to localize specific audio-visual objects. Extensive experiments demonstrate that CAT outperforms existing methods on multimodal tasks, especially Audio-Visual Question Answering (AVQA). The code and the collected instructions are released at https://github.com/rikeilong/Bay-CAT.
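The ambiguity-aware preference tuning described in point 3) builds on direct preference optimization (DPO). As a rough illustration of the underlying objective only (this is the standard DPO loss, not the paper's ambiguity-aware variant; the function name and the default β are illustrative), the loss for one preferred/dispreferred response pair can be sketched as:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    logp_*     : log-probability of each response under the policy being trained
    ref_logp_* : log-probability of the same response under the frozen reference model
    beta       : strength of the KL-style regularization toward the reference

    Loss = -log sigmoid(beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r)))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In the paper's setting, the "chosen" response would be the non-ambiguous answer and the "rejected" one the ambiguous answer; the loss shrinks as the policy assigns relatively more probability mass to the chosen response than the reference model does.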

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | MSRVTT-QA | Accuracy | 62.1 | CAT-7B |
| Question Answering | MSRVTT-QA | Confidence Score | 3.5 | CAT-7B |
| Question Answering | ActivityNet-QA | Accuracy | 50.2 | CAT-7B |
| Question Answering | ActivityNet-QA | Confidence Score | 3.5 | CAT-7B |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.89 | CAT-7B |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.49 | CAT-7B |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.08 | CAT-7B |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.95 | CAT-7B |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.81 | CAT-7B |
| Visual Question Answering (VQA) | VideoInstruct | Mean | 3.07 | CAT-7B |
| Video Question Answering | MSRVTT-QA | Accuracy | 62.1 | CAT-7B |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.5 | CAT-7B |
| Video Question Answering | ActivityNet-QA | Accuracy | 50.2 | CAT-7B |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.5 | CAT-7B |
| Generative Visual Question Answering | VideoInstruct | Consistency | 2.89 | CAT-7B |
| Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.49 | CAT-7B |
| Generative Visual Question Answering | VideoInstruct | Correctness of Information | 3.08 | CAT-7B |
| Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.95 | CAT-7B |
| Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.81 | CAT-7B |
| Generative Visual Question Answering | VideoInstruct | Mean | 3.07 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.89 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.49 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.08 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.95 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.81 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 3.07 | CAT-7B |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)