Text-to-Video Generation on MSR-VTT

Metric: FVD (Fréchet Video Distance; lower is better)
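
For reference, FVD is a Fréchet distance between two Gaussians fitted to features of real and generated videos, so smaller values mean the generated distribution is closer to the real one. The formula below is the standard Fréchet distance. The symbols are introduced here for illustration and do not come from this page: \mu_r, \Sigma_r are the mean and covariance of real-video features, and \mu_g, \Sigma_g those of generated-video features. The features are assumed to come from a pretrained video network (I3D in the original FVD definition). The page does not state which evaluation protocol each entry used, so scores may not be strictly comparable.

```latex
\mathrm{FVD}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
      - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```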

Results

#	Model	FVD	Extra Data	Paper	Date	Code
1	Snap Video (512×288)	104	No	Snap Video: Scaled Spatiotemporal Transformers f...	2024-02-22	-
2	Snap Video (288×288)	110.4	No	Snap Video: Scaled Spatiotemporal Transformers f...	2024-02-22	-
3	Video-LaVIT	188.36	No	Video-LaVIT: Unified Video-Language Pre-training...	2024-02-05	Code
4	VideoPoet	213	No	VideoPoet: A Large Language Model for Zero-Shot ...	2023-12-21	-
5	PixelDance	381	No	Make Pixels Dance: High-Dynamic Video Generation	2023-11-18	-
6	HiGen	406	No	Hierarchical Spatio-temporal Decoupling for Text...	2023-12-07	Code
7	TF-T2V	441	No	A Recipe for Scaling up Text-to-Video Generation...	2023-12-25	Code
8	Show-1	538	No	Show-1: Marrying Pixel and Latent Diffusion Mode...	2023-09-27	Code
9	ModelScopeT2V	550	No	ModelScope Text-to-Video Technical Report	2023-08-12	Code
10	VideoComposer	580	No	VideoComposer: Compositional Video Synthesis wit...	2023-06-03	Code
11	MagicVideo	998	No	MagicVideo: Efficient Video Generation With Late...	2022-11-20	-
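
Because FVD is a distance, the best entry is the one with the lowest score. The sketch below shows one way to rank the entries from the table above in that order. The model names and scores are copied from the table; the `RESULTS` list and the `rank_by_fvd` helper are hypothetical names introduced only for this illustration.

```python
# (model, FVD) pairs copied from the leaderboard table.
RESULTS = [
    ("MagicVideo", 998.0),
    ("VideoComposer", 580.0),
    ("ModelScopeT2V", 550.0),
    ("Show-1", 538.0),
    ("TF-T2V", 441.0),
    ("HiGen", 406.0),
    ("PixelDance", 381.0),
    ("VideoPoet", 213.0),
    ("Video-LaVIT", 188.36),
    ("Snap Video (288×288)", 110.4),
    ("Snap Video (512×288)", 104.0),
]


def rank_by_fvd(results):
    """Return (rank, model, fvd) tuples, lowest (best) FVD first."""
    ordered = sorted(results, key=lambda entry: entry[1])
    return [(rank, model, fvd) for rank, (model, fvd) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for rank, model, fvd in rank_by_fvd(RESULTS):
        print(f"{rank:>2}  {model:<22} {fvd:g}")
```

Sorting ascending puts Snap Video (512×288) first and MagicVideo last. Note that this ordering only reflects the reported numbers and says nothing about whether the entries were evaluated under identical settings.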

#1 Snap Video (512×288) (SOTA)
FVD 104 · 2024-02-22
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

#2 Snap Video (288×288)
FVD 110.4 · 2024-02-22
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

#3 Video-LaVIT
FVD 188.36 · 2024-02-05
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (code available)

#4 VideoPoet
FVD 213 · 2023-12-21
VideoPoet: A Large Language Model for Zero-Shot Video Generation

#5 PixelDance
FVD 381 · 2023-11-18
Make Pixels Dance: High-Dynamic Video Generation

#6 HiGen
FVD 406 · 2023-12-07
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation (code available)

#7 TF-T2V
FVD 441 · 2023-12-25
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos (code available)

#8 Show-1
FVD 538 · 2023-09-27
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation (code available)

#9 ModelScopeT2V
FVD 550 · 2023-08-12
ModelScope Text-to-Video Technical Report (code available)

#10 VideoComposer
FVD 580 · 2023-06-03
VideoComposer: Compositional Video Synthesis with Motion Controllability (code available)

#11 MagicVideo
FVD 998 · 2022-11-20
MagicVideo: Efficient Video Generation With Latent Diffusion Models