Visual Question Answering (VQA) on SQA3D

Metric: Exact Match (higher is better)
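
As a rough illustration of how an exact-match score like the one reported here can be computed, the sketch below counts a prediction as correct only when it equals the reference answer after light normalization. The normalization (lowercasing and collapsing whitespace) and the function names `normalize` and `exact_match` are assumptions made for this example; the official SQA3D evaluation protocol may normalize answers differently.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(answer.lower().split())


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions that equal their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A prediction scores only if it matches the reference string exactly
    # after normalization; partial overlap earns no credit.
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


score = exact_match(["Left", "two", "chair"], ["left", "three", "chair"])
```

In this toy call two of the three answers match, so `score` is two thirds expressed as a percentage. Because no partial credit is given, the metric is strict, which is why higher is better and differences between systems can be small.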

Results

#	Model	Exact Match	Extra Data	SOTA	Paper	Date	Code
1	LLaVA-3D	60.1	No	SOTA	LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness	2024-09-26	-
2	Video-3D LLM	58.6	No	-	Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding	2024-11-30	Code
3	Chat-3D v2	54.7	No	SOTA	Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers	2023-12-13	Code
4	ChatScene	54.6	No	-	Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers	2023-12-13	Code
5	Scene-LLM	54.2	No	-	Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning	2024-03-18	-
6	LEO	50	No	SOTA	An Embodied Generalist Agent in 3D World	2023-11-18	Code
7	LLaVA-Video	48.5	No	-	Video Instruction Tuning With Synthetic Data	2024-10-03	-
8	3D-VisTA	48.5	No	SOTA	3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment	2023-08-08	Code
9	ScanQA	47.2	No	SOTA	ScanQA: 3D Question Answering for Spatial Scene Understanding	2021-12-20	Code
10	PQ3D	47.1	No	-	Unifying 3D Vision-Language Understanding via Promptable Queries	2024-05-19	-
11	Scan2Cap	41	No	SOTA	Scan2Cap: Context-aware Dense Captioning in RGB-D Scans	2020-12-03	-
12	VideoChat2	37.3	No	-	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
13	LLaVA-NeXT-Video	34.2	No	-	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code