Temporal Relation Extraction on Vinoground

Metric: Text Score (higher is better)

Results

#	Model	Text Score	Extra Data	Paper	Date	Code
1	GPT-4o (CoT)	59.2	No	-	-	-
2	GPT-4o	54	No	-	-	-
3	Qwen2-VL-72B	50.4	No	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024-09-18	Code
4	LLaVA-OneVision-Qwen2-72B	48.4	No	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
5	LLaVA-OneVision-Qwen2-7B	41.6	No	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
6	Qwen2-VL-7B	40.2	No	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024-09-18	Code
7	Gemini-1.5-Pro (CoT)	37	No	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	2024-03-08	Code
8	VideoLLaMA2-72B	36.2	No	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024-06-11	Code
9	Gemini-1.5-Pro	35.8	No	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	2024-03-08	Code
10	Claude 3.5 Sonnet	32.8	No	-	-	-
11	MiniCPM-2.6	32.6	No	MiniCPM-V: A GPT-4V Level MLLM on Your Phone	2024-08-03	Code
12	InternLM-XC-2.5 (CoT)	30.8	No	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	2024-07-03	Code
13	InternLM-XC-2.5	28.8	No	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	2024-07-03	Code
14	LLaVA-NeXT-Video-34B (CoT)	25.8	No	-	-	-
15	Video-LLaVA-7B	24.8	No	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	2023-11-16	Code
16	Phi-3.5-Vision	24	No	-	-	-
17	MA-LMM-Vicuna-7B	23.8	No	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	2024-04-08	Code
18	LLaVA-NeXT-Video-34B	23	No	-	-	-
19	LLaVA-NeXT-Video-7B (CoT)	21.8	No	-	-	-
20	LLaVA-NeXT-Video-7B	21.8	No	-	-	-
21	VTimeLLM	19.4	No	VTimeLLM: Empower LLM to Grasp Video Moments	2023-11-30	Code
22	VideoCLIP	17	No	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code
23	LanguageBind	10.6	No	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2023-10-03	Code
24	ImageBind	9.4	No	ImageBind: One Embedding Space To Bind Them All	2023-05-09	Code

Entries marked SOTA: Qwen2-VL-72B, LLaVA-OneVision-Qwen2-72B, Gemini-1.5-Pro (CoT), Video-LLaVA-7B, VideoCLIP