Temporal Relation Extraction on Vinoground

Metric: Video Score (higher is better)

Results

#	Model	Video Score	Extra Data	Paper	Date	Code
1	GPT-4o (CoT)	51	No	-	-	-
2	GPT-4o	38.2	No	-	-	-
3	LLaVA-OneVision-Qwen2-72B	35.2	No	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
4	Qwen2-VL-72B	32.6	No	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024-09-18	Code
5	Qwen2-VL-7B	32.4	No	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024-09-18	Code
6	LLaVA-OneVision-Qwen2-7B	29.4	No	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
7	MiniCPM-2.6	29.2	No	MiniCPM-V: A GPT-4V Level MLLM on Your Phone	2024-08-03	Code
8	Claude 3.5 Sonnet	28.8	No	-	-	-
9	InternLM-XC-2.5 (CoT)	28.4	No	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	2024-07-03	Code
10	InternLM-XC-2.5	27.8	No	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	2024-07-03	Code
11	Gemini-1.5-Pro (CoT)	27.6	No	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	2024-03-08	Code
12	VTimeLLM	27	No	VTimeLLM: Empower LLM to Grasp Video Moments	2023-11-30	Code
13	LLaVA-NeXT-Video-7B (CoT)	26.2	No	-	-	-
14	Video-LLaVA-7B	25.8	No	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	2023-11-16	Code
15	MA-LMM-Vicuna-7B	25.6	No	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	2024-04-08	Code
16	LLaVA-NeXT-Video-7B	25.6	No	-	-	-
17	Gemini-1.5-Pro	22.6	No	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	2024-03-08	Code
18	Phi-3.5-Vision	22.4	No	-	-	-
19	LLaVA-NeXT-Video-34B (CoT)	22.2	No	-	-	-
20	VideoLLaMA2-72B	21.8	No	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024-06-11	Code
21	LLaVA-NeXT-Video-34B	21.2	No	-	-	-
22	LanguageBind	5	No	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	2023-10-03	Code
23	ImageBind	3.4	No	ImageBind: One Embedding Space To Bind Them All	2023-05-09	Code
24	VideoCLIP	2.8	No	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding	2021-09-28	Code
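
Five models appear in the table both with and without the (CoT) label, so the score difference for each pair can be read off directly. The following is a minimal sketch of that arithmetic, using only scores copied from the table; the names `scores`, `COT_SUFFIX`, and `cot_gains` are illustrative and not part of the leaderboard.

```python
# Video Scores copied from the leaderboard table above (model name -> score).
# Only models listed both with and without the "(CoT)" label are included.
scores = {
    "GPT-4o (CoT)": 51.0,
    "GPT-4o": 38.2,
    "InternLM-XC-2.5 (CoT)": 28.4,
    "InternLM-XC-2.5": 27.8,
    "Gemini-1.5-Pro (CoT)": 27.6,
    "Gemini-1.5-Pro": 22.6,
    "LLaVA-NeXT-Video-7B (CoT)": 26.2,
    "LLaVA-NeXT-Video-7B": 25.6,
    "LLaVA-NeXT-Video-34B (CoT)": 22.2,
    "LLaVA-NeXT-Video-34B": 21.2,
}

COT_SUFFIX = " (CoT)"  # hypothetical constant: the label used in the Model column


def cot_gains(scores):
    """Return {base model: CoT score minus plain score} for models listed both ways."""
    gains = {}
    for name, cot_score in scores.items():
        if not name.endswith(COT_SUFFIX):
            continue
        base = name[: -len(COT_SUFFIX)]
        if base in scores:
            # Round to one decimal place, matching the table's precision.
            gains[base] = round(cot_score - scores[base], 1)
    return gains


gains = cot_gains(scores)
# Print the pairs from largest to smallest difference.
for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gain:+.1f}")
```

On these numbers the largest difference is for GPT-4o (51 versus 38.2) and the smallest is 0.6 points, shared by InternLM-XC-2.5 and LLaVA-NeXT-Video-7B. The table does not report run-to-run variance, so small differences should be read with caution.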