Visual Question Answering (VQA) on ScanQA Test w/ objects

Metric: BLEU-4 (higher is better)
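
For readers unfamiliar with the metric, the following is a minimal sketch of sentence-level BLEU-4: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. This page does not state the tokenization, smoothing, or aggregation behind the leaderboard numbers, so the whitespace tokenizer, the lack of smoothing, and the function names `ngrams` and `bleu4` are illustrative assumptions, not the benchmark's scoring script. The leaderboard values appear to be on a 0 to 100 scale, whereas this sketch returns a value in [0, 1].

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Sentence-level BLEU-4 in [0, 1], uniform weights, no smoothing.

    Assumption: lowercased whitespace tokenization (hypothetical choice).
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        # Without smoothing, any zero precision makes the whole score zero.
        if clipped == 0 or total == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(log_precision_sum / 4)
```

Under this unsmoothed variant, any answer shorter than four tokens scores 0 because it contains no 4-grams, which is one reason BLEU-4 values on short-answer tasks tend to be low.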

Results

#	Model	BLEU-4	Extra Data	Paper	Date	Code
1	BridgeQA	24.06	No	Bridging the Gap between 2D and 3D Visual Questi...	2024-02-24	Code
2	LLaVA-3D	16.4	No	LLaVA-3D: A Simple yet Effective Pathway to Empo...	2024-09-26	-
3	ChatScene	14.3	No	Chat-Scene: Bridging 3D Scene and Large Language...	2023-12-13	Code
4	Chat-3D v2	14	No	Chat-Scene: Bridging 3D Scene and Large Language...	2023-12-13	Code
5	NaviLLM	13.9	No	Towards Learning a Generalist Model for Embodied...	2023-12-04	Code
6	LL3DA	13.5	No	Visual Instruction Tuning	2023-04-17	Code
7	LEO	13.2	No	An Embodied Generalist Agent in 3D World	2023-11-18	Code
8	ScanQA	12.04	No	ScanQA: 3D Question Answering for Spatial Scene ...	2021-12-20	Code
9	Scene-LLM	12	No	Scene-LLM: Extending Language Model for 3D Visua...	2024-03-18	-
10	3D-LLM (BLIP2-flant5)	11.6	No	3D-LLM: Injecting the 3D World into Large Langua...	2023-07-24	Code
11	3D-LLM (BLIP2-opt)	10.7	No	3D-LLM: Injecting the 3D World into Large Langua...	2023-07-24	Code
12	3D-VisTA	10.4	No	3D-VisTA: Pre-trained Transformer for 3D Vision ...	2023-08-08	Code
13	LLaVA-NeXT-Video	9.8	No	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
14	VideoChat2	9.6	No	MVBench: A Comprehensive Multi-modal Video Under...	2023-11-28	Code
15	3D-LLM (flamingo)	8.4	No	3D-LLM: Injecting the 3D World into Large Langua...	2023-07-24	Code
16	ScanRefer+MCAN	7.46	No	ScanQA: 3D Question Answering for Spatial Scene ...	2021-12-20	Code
17	VoteNet+MCAN	6.08	No	ScanQA: 3D Question Answering for Spatial Scene ...	2021-12-20	Code

#1 BridgeQA (SOTA)
24.06 BLEU-4 · 2024-02-24
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA [Code]
#2 LLaVA-3D
16.4 BLEU-4 · 2024-09-26
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
#3 ChatScene (SOTA)
14.3 BLEU-4 · 2023-12-13
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [Code]
#4 Chat-3D v2
14 BLEU-4 · 2023-12-13
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [Code]
#5 NaviLLM (SOTA)
13.9 BLEU-4 · 2023-12-04
Towards Learning a Generalist Model for Embodied Navigation [Code]
#6 LL3DA (SOTA)
13.5 BLEU-4 · 2023-04-17
Visual Instruction Tuning [Code]
#7 LEO
13.2 BLEU-4 · 2023-11-18
An Embodied Generalist Agent in 3D World [Code]
#8 ScanQA (SOTA)
12.04 BLEU-4 · 2021-12-20
ScanQA: 3D Question Answering for Spatial Scene Understanding [Code]
#9 Scene-LLM
12 BLEU-4 · 2024-03-18
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
#10 3D-LLM (BLIP2-flant5)
11.6 BLEU-4 · 2023-07-24
3D-LLM: Injecting the 3D World into Large Language Models [Code]
#11 3D-LLM (BLIP2-opt)
10.7 BLEU-4 · 2023-07-24
3D-LLM: Injecting the 3D World into Large Language Models [Code]
#12 3D-VisTA
10.4 BLEU-4 · 2023-08-08
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [Code]
#13 LLaVA-NeXT-Video
9.8 BLEU-4 · 2024-08-06
LLaVA-OneVision: Easy Visual Task Transfer [Code]
#14 VideoChat2
9.6 BLEU-4 · 2023-11-28
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [Code]
#15 3D-LLM (flamingo)
8.4 BLEU-4 · 2023-07-24
3D-LLM: Injecting the 3D World into Large Language Models [Code]
#16 ScanRefer+MCAN
7.46 BLEU-4 · 2021-12-20
ScanQA: 3D Question Answering for Spatial Scene Understanding [Code]
#17 VoteNet+MCAN
6.08 BLEU-4 · 2021-12-20
ScanQA: 3D Question Answering for Spatial Scene Understanding [Code]
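
The SOTA tags on the entries above are consistent with marking each model that held the best BLEU-4 as of its listed date. The page does not define the tag, so this reading is an inference. The sketch below checks it by computing a running best over the listed dates; the names `ENTRIES` and `running_best` are hypothetical, and the scores and dates are copied from the results table.

```python
from datetime import date

# (model, BLEU-4, listed date) copied from the results table.
ENTRIES = [
    ("BridgeQA", 24.06, "2024-02-24"),
    ("LLaVA-3D", 16.4, "2024-09-26"),
    ("ChatScene", 14.3, "2023-12-13"),
    ("Chat-3D v2", 14.0, "2023-12-13"),
    ("NaviLLM", 13.9, "2023-12-04"),
    ("LL3DA", 13.5, "2023-04-17"),
    ("LEO", 13.2, "2023-11-18"),
    ("ScanQA", 12.04, "2021-12-20"),
    ("Scene-LLM", 12.0, "2024-03-18"),
    ("3D-LLM (BLIP2-flant5)", 11.6, "2023-07-24"),
    ("3D-LLM (BLIP2-opt)", 10.7, "2023-07-24"),
    ("3D-VisTA", 10.4, "2023-08-08"),
    ("LLaVA-NeXT-Video", 9.8, "2024-08-06"),
    ("VideoChat2", 9.6, "2023-11-28"),
    ("3D-LLM (flamingo)", 8.4, "2023-07-24"),
    ("ScanRefer+MCAN", 7.46, "2021-12-20"),
    ("VoteNet+MCAN", 6.08, "2021-12-20"),
]


def running_best(entries):
    """Return, in date order, the models that set a new best score when listed."""
    best = float("-inf")
    leaders = []
    # Sort by date, then by descending score so that the top score
    # on a shared date is considered first.
    ordered = sorted(entries, key=lambda e: (date.fromisoformat(e[2]), -e[1]))
    for model, score, _day in ordered:
        if score > best:
            best = score
            leaders.append(model)
    return leaders


print(running_best(ENTRIES))
# prints ['ScanQA', 'LL3DA', 'NaviLLM', 'ChatScene', 'BridgeQA']
```

The five models returned are the same five that carry the SOTA tag, which supports the running-best reading without confirming how the page actually assigns it.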