Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, Xiaochun Cao
This paper addresses the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, their responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) beyond directly bridging audio and video, we design a clue aggregator that gathers question-related clues from dynamic audio-visual scenarios, enriching the detailed knowledge available to the large language model; 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios; notably, we collect an audio-visual joint instruction dataset, AVinstruct, to further strengthen CAT's ability to model cross-semantic correlations; 3) we propose AI-assisted ambiguity-aware direct preference optimization, a strategy that retrains the model to favor non-ambiguous responses and improves its ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially Audio-Visual Question Answering (AVQA). The code and collected instructions are released at https://github.com/rikeilong/Bay-CAT.
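The ambiguity-aware preference optimization described above builds on the standard DPO objective, which rewards the policy for preferring the chosen (here, non-ambiguous) response more strongly than a frozen reference model does. A minimal sketch of that underlying loss for a single preference pair, assuming summed token log-probabilities as inputs (the function name and signature are illustrative, not the paper's implementation):

```python
import math

def dpo_loss(policy_logp_preferred: float, policy_logp_rejected: float,
             ref_logp_preferred: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) response pair.

    Each argument is the summed log-probability that the trainable policy
    or the frozen reference model assigns to a full response.
    """
    # Margin by which the policy favors the preferred response,
    # relative to how much the reference model already favors it.
    margin = (policy_logp_preferred - ref_logp_preferred) \
           - (policy_logp_rejected - ref_logp_rejected)
    # -log sigmoid(beta * margin): small when the policy prefers the
    # non-ambiguous response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy and reference agree (zero margin), the loss is log 2; pushing probability mass toward the preferred response drives it toward zero.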
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 62.1 | CAT-7B |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.5 | CAT-7B |
| Video Question Answering | ActivityNet-QA | Accuracy | 50.2 | CAT-7B |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.5 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.08 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.95 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.49 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.81 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.89 | CAT-7B |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 3.07 | CAT-7B |