Video Generation on UCF-101
Metric: FVD16 (lower is better)

Results (sorted by FVD16, best first)

| # | Model | FVD16 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Video-GPT | 53 | Yes | Video-GPT via Next Clip Diffusion | 2025-05-18 | Yes |
| 2 | LARP | 57 | No | LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior | 2024-10-28 | Yes |
| 2 | FAR | 57 | No | Long-Context Autoregressive Video Modeling with Next-Frame Prediction | 2025-03-25 | Yes |
| 4 | HPDM-L | 66.32 | No | Hierarchical Patch Diffusion Models for High-Resolution Video Generation | 2024-06-12 | No |
| 5 | Make-A-Video (Finetuning, 256x256, class-conditional) | 81.25 | No | Make-A-Video: Text-to-Video Generation without Text-Video Data | 2022-09-29 | Yes |
| 6 | ACDiT | 90 | No | ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer | 2024-12-10 | Yes |
| 7 | MAGVIT-v2 (AR) | 109 | No | Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | 2023-10-09 | Yes |
| 8 | REGIS-Fuse (Finetuning, 128x128, text-conditional) | 141 | No | No paper | - | Yes |
| 9 | Latte + LeanVAE | 164.45 | No | LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models | 2025-03-18 | Yes |
| 10 | VideoFusion (128x128, class-conditional) | 173 | No | VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | 2023-03-15 | Yes |
| 11 | OmniTokenizer-AR | 191 | No | OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | 2024-06-13 | Yes |
| 12 | VideoFusion (128x128, unconditional) | 220 | No | VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | 2023-03-15 | Yes |
| 13 | PixelDance (256x256, text-conditional) | 242.82 | No | Make Pixels Dance: High-Dynamic Video Generation | 2023-11-18 | No |
| 14 | W.A.L.T 3B (text-conditional) | 258.1 | No | Photorealistic Video Generation with Diffusion Models | 2023-12-11 | No |
| 15 | MAGVIT (AR) | 265 | No | MAGVIT: Masked Generative Video Transformer | 2022-12-10 | Yes |
| 16 | Video-LaVIT | 280.57 | No | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | 2024-02-05 | Yes |
| 17 | VIDM (256x256, unconditional) | 294.7 | No | VIDM: Video Implicit Diffusion Models | 2022-12-01 | Yes |
| 18 | CogVideo (128x128, class-conditional) | 305 | No | CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | 2022-05-29 | Yes |
| 19 | PYoCo (Zero-shot, 64x64, unconditional) | 310 | No | Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | 2023-05-17 | No |
| 20 | MMVG (128x128, class-conditional) | 328 | No | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | 2022-11-23 | Yes |
| 21 | TATS (128x128, class-conditional) | 332 | No | Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | 2022-04-07 | Yes |
| 22 | Lumiere (Zero-shot, 1024x1024, text-conditional) | 332.49 | No | Lumiere: A Space-Time Diffusion Model for Video Generation | 2024-01-23 | Yes |
| 23 | GridDiff (Zero-shot) | 340 | No | Grid Diffusion Models for Text-to-Video Generation | 2024-03-30 | No |
| 24 | VideoAssembler (Zero-shot, 256x256, class-conditional) | 346.84 | No | MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing | 2023-11-29 | Yes |
| 25 | VideoPoet (text-conditional) | 355 | No | VideoPoet: A Large Language Model for Zero-Shot Video Generation | 2023-12-21 | No |
| 26 | PYoCo (Zero-shot, 64x64, text-conditional) | 355.19 | No | Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | 2023-05-17 | No |
| 27 | Make-A-Video (Zero-shot, 256x256, class-conditional) | 367.23 | No | Make-A-Video: Text-to-Video Generation without Text-Video Data | 2022-09-29 | Yes |
| 28 | LVDM (256x256, unconditional) | 372 | No | Latent Video Diffusion Models for High-Fidelity Long Video Generation | 2022-11-23 | Yes |
| 29 | MMVG (128x128, unconditional) | 395 | No | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | 2022-11-23 | Yes |
| 30 | TATS (128x128, unconditional) | 420 | No | Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | 2022-04-07 | Yes |
| 31 | MeBT (128x128, unconditional) | 438 | No | Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers | 2023-03-20 | Yes |
| 32 | DIGAN (128x128, class-conditional) | 465 | No | Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | 2022-02-21 | Yes |
| 33 | LAVIE (320x512, text-conditional) | 526.3 | No | LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models | 2023-09-26 | Yes |
| 34 | Video LDM (320x512, text-conditional) | 550.61 | No | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | 2023-04-18 | Yes |
| 35 | LVDM (256x256, unconditional) | 552 | No | Latent Video Diffusion Models for High-Fidelity Long Video Generation | 2022-11-23 | Yes |
| 36 | DIGAN (128x128, unconditional) | 577 | No | Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks | 2022-02-21 | Yes |
| 37 | TATS (256x256) | 635 | No | Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | 2022-04-07 | Yes |
| 38 | MagicVideo (256x256, text-conditional) | 699 | No | MagicVideo: Efficient Video Generation With Latent Diffusion Models | 2022-11-20 | No |
| 39 | MoCoGAN-HD (256x256, unconditional) | 700 | No | A Good Image Generator Is What You Need for High-Resolution Video Synthesis | 2021-04-30 | Yes |
| 40 | MCVD (64x64) | 1143 | No | MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | 2022-05-19 | Yes |
| 41 | TGAN-v2 (128x128) | 1209 | No | Latent Video Diffusion Models for High-Fidelity Long Video Generation | 2022-11-23 | Yes |
| 42 | VDM | 1396 | No | Latent Video Diffusion Models for High-Fidelity Long Video Generation | 2022-11-23 | Yes |
| 43 | MCVD | 2460 | No | Latent Video Diffusion Models for High-Fidelity Long Video Generation | 2022-11-23 | Yes |
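FVD16 is the Fréchet Video Distance computed on 16-frame clips. Real and generated clips are embedded by a pretrained video network (I3D in the original FVD paper), each set of embeddings is summarized by a Gaussian with mean μ and covariance Σ, and the score is the Fréchet distance between the two Gaussians: ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). The distance is zero when the two distributions match and grows as they diverge, which is why lower values rank higher. The sketch below covers only this final Fréchet step, written in Python with NumPy. The feature extractor is omitted, and the function names (`frechet_distance`, `fvd_from_features`) are hypothetical illustrations rather than the API of any particular FVD codebase.

```python
import numpy as np


def _sqrtm_psd(mat: np.ndarray) -> np.ndarray:
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, 0.0, None)  # drop tiny negative eigenvalues caused by round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Fréchet distance between the Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1 = np.atleast_2d(np.asarray(sigma1, dtype=float))
    sigma2 = np.atleast_2d(np.asarray(sigma2, dtype=float))
    diff = mu1 - mu2
    # Tr(sqrt(sigma1 @ sigma2)) equals Tr(sqrt(A @ sigma2 @ A)) with A = sqrt(sigma1),
    # because the two products share eigenvalues; A @ sigma2 @ A is symmetric PSD,
    # so its eigenvalues can be computed stably with eigvalsh.
    root1 = _sqrtm_psd(sigma1)
    middle = root1 @ sigma2 @ root1
    tr_covmean = np.sqrt(np.clip(np.linalg.eigvalsh(middle), 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean)


def fvd_from_features(real_feats, gen_feats) -> float:
    """Hypothetical helper: features are arrays of shape (num_clips, feature_dim)."""
    real_feats = np.asarray(real_feats, dtype=float)
    gen_feats = np.asarray(gen_feats, dtype=float)
    return frechet_distance(
        real_feats.mean(axis=0), np.cov(real_feats, rowvar=False),
        gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False),
    )


if __name__ == "__main__":
    # Random vectors stand in for real video embeddings (an assumption for illustration).
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 8))
    close = real + 0.1   # small shift: low distance
    far = real + 2.0     # large shift: high distance
    print(fvd_from_features(real, close), fvd_from_features(real, far))
```

Computing the trace of the matrix square root through a symmetric eigendecomposition keeps the sketch NumPy-only; reference implementations often call `scipy.linalg.sqrtm` instead. Absolute values also depend on the feature network, the clip length, and the number of sampled videos, so the sketch illustrates the ordering (smaller means closer to the real data) rather than reproducing any number in the table.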