Visual Question Answering (VQA) on ScanQA Test w/ objects
Metric: CIDEr (higher is better)
Results
| # | Model | CIDEr | SOTA | Extra Data | Paper | Date | Code |
|---|-------|-------|------|------------|-------|------|------|
| 1 | LLaVA-3D | 103.1 | SOTA | No | LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness | 2024-09-26 | - |
| 2 | Video-3D LLM | 102.1 | | No | Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding | 2024-11-30 | Code |
| 3 | LEO | 101.4 | SOTA | No | An Embodied Generalist Agent in 3D World | 2023-11-18 | Code |
| 4 | ChatScene | 87.7 | | No | Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers | 2023-12-13 | Code |
| 5 | Chat-3D v2 | 87.6 | | No | Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers | 2023-12-13 | Code |
| 6 | BridgeQA | 83.75 | | No | Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA | 2024-02-24 | Code |
| 7 | NaviLLM | 80.77 | | No | Towards Learning a Generalist Model for Embodied Navigation | 2023-12-04 | Code |
| 8 | Scene-LLM | 80 | | No | Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning | 2024-03-18 | - |
| 9 | LL3DA | 76.8 | SOTA | No | Visual Instruction Tuning | 2023-04-17 | Code |
| 10 | 3D-VisTA | 69.6 | | No | 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | 2023-08-08 | Code |
| 11 | 3D-LLM (BLIP2-flant5) | 69.6 | | No | 3D-LLM: Injecting the 3D World into Large Language Models | 2023-07-24 | Code |
| 12 | ScanQA | 67.29 | SOTA | No | ScanQA: 3D Question Answering for Spatial Scene Understanding | 2021-12-20 | Code |
| 13 | 3D-LLM (BLIP2-opt) | 67.1 | | No | 3D-LLM: Injecting the 3D World into Large Language Models | 2023-07-24 | Code |
| 14 | 3D-LLM (flamingo) | 65.6 | | No | 3D-LLM: Injecting the 3D World into Large Language Models | 2023-07-24 | Code |
| 15 | VoteNet+MCAN | 58.23 | | No | ScanQA: 3D Question Answering for Spatial Scene Understanding | 2021-12-20 | Code |
| 16 | ScanRefer+MCAN | 57.56 | | No | ScanQA: 3D Question Answering for Spatial Scene Understanding | 2021-12-20 | Code |
| 17 | VideoChat2 | 49.2 | | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 18 | LLaVA-NeXT-Video | 46.2 | | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |