Visual Question Answering (VQA) on ScanQA Test w/ objects

Metric: CIDEr (higher is better)

Results

#	Model	CIDEr	Extra Data	Paper	Date	Code	SOTA
1	LLaVA-3D	103.1	No	LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness	2024-09-26	-	SOTA
2	Video-3D LLM	102.1	No	Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding	2024-11-30	Code	-
3	LEO	101.4	No	An Embodied Generalist Agent in 3D World	2023-11-18	Code	SOTA
4	ChatScene	87.7	No	Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers	2023-12-13	Code	-
5	Chat-3D v2	87.6	No	Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers	2023-12-13	Code	-
6	BridgeQA	83.75	No	Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA	2024-02-24	Code	-
7	NaviLLM	80.77	No	Towards Learning a Generalist Model for Embodied Navigation	2023-12-04	Code	-
8	Scene-LLM	80	No	Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning	2024-03-18	-	-
9	LL3DA	76.8	No	Visual Instruction Tuning	2023-04-17	Code	SOTA
10	3D-VisTA	69.6	No	3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment	2023-08-08	Code	-
11	3D-LLM (BLIP2-flant5)	69.6	No	3D-LLM: Injecting the 3D World into Large Language Models	2023-07-24	Code	-
12	ScanQA	67.29	No	ScanQA: 3D Question Answering for Spatial Scene Understanding	2021-12-20	Code	SOTA
13	3D-LLM (BLIP2-opt)	67.1	No	3D-LLM: Injecting the 3D World into Large Language Models	2023-07-24	Code	-
14	3D-LLM (flamingo)	65.6	No	3D-LLM: Injecting the 3D World into Large Language Models	2023-07-24	Code	-
15	VoteNet+MCAN	58.23	No	ScanQA: 3D Question Answering for Spatial Scene Understanding	2021-12-20	Code	-
16	ScanRefer+MCAN	57.56	No	ScanQA: 3D Question Answering for Spatial Scene Understanding	2021-12-20	Code	-
17	VideoChat2	49.2	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code	-
18	LLaVA-NeXT-Video	46.2	No	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code	-