Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei

2024-03-25 | CVPR 2024
Tasks: Denoising, Super-Resolution, Video Super-Resolution, Video Denoising, Image Super-Resolution, Video Reconstruction


Abstract

Diffusion models are at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and optimizes only two deliberately designed modules, spatial feature adaptation (SFA) and temporal feature alignment (TFA), in the decoders of the UNet and VAE. SFA modulates frame features by adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA models feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between each tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
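The two modules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the residual form of the affine modulation, the projection weights `w_scale` and `w_shift`, the window size, the channel-last layout, and the use of attention without learned query/key/value projections are all assumptions made to keep the example short.

```python
import numpy as np

def spatial_feature_adaptation(feat, guide, w_scale, w_shift):
    """Per-pixel affine modulation (assumed form, not the paper's exact layer).

    feat:  (T, H, W, C) frame features from a frozen decoder.
    guide: (T, H, W, G) low-resolution guidance, already resized to H x W.
    w_scale, w_shift: (G, C) hypothetical per-pixel linear projections.
    """
    scale = guide @ w_scale  # one scale value per pixel and channel
    shift = guide @ w_shift  # one shift value per pixel and channel
    # Residual form: zero weights leave the frozen features unchanged.
    return feat * (1.0 + scale) + shift

def to_tubelets(x, t, h, w):
    """Split (T, H, W, C) into non-overlapping 3D windows: (N, t*h*w, C)."""
    T, H, W, C = x.shape
    x = x.reshape(T // t, t, H // h, h, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t * h * w, C)

def attention(q, k, v):
    """Scaled dot-product attention applied independently to each tubelet."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def temporal_feature_alignment(feat, guide, window=(2, 4, 4)):
    """Self-attention inside each tubelet, then cross-attention to the guide."""
    x = to_tubelets(feat, *window)
    g = to_tubelets(guide, *window)
    x = x + attention(x, x, x)  # feature interaction within the tubelet
    x = x + attention(x, g, g)  # queries from features, keys/values from guide
    return x

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8, 8))   # T=4, H=W=8, C=8
guide = rng.standard_normal((4, 8, 8, 8))  # G=C=8 in this toy setup
w_scale = 0.01 * rng.standard_normal((8, 8))
w_shift = 0.01 * rng.standard_normal((8, 8))

sfa_out = spatial_feature_adaptation(feat, guide, w_scale, w_shift)
tfa_out = temporal_feature_alignment(sfa_out, guide)
print(sfa_out.shape, tfa_out.shape)
```

In this toy setup the guidance is assumed to have the same spatial size and channel count as the features, so the tubelet and its low-resolution counterpart line up one to one; a real model would need a projection or resizing step to make the shapes compatible.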

Results

Task	Dataset	Metric	Value	Model
Super-Resolution	Vid4 - 4x upscaling	PSNR	27.44	SATeCo
Super-Resolution	Vid4 - 4x upscaling	SSIM	0.842	SATeCo
Video Super-Resolution	Vid4 - 4x upscaling	PSNR	27.44	SATeCo
Video Super-Resolution	Vid4 - 4x upscaling	SSIM	0.842	SATeCo
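
For readers unfamiliar with the PSNR metric in the table, the standard definition is PSNR = 10 · log10(MAX² / MSE), reported in dB. The sketch below shows that formula for 8-bit frames. The function name `psnr` and the toy frames are illustrative, and the evaluation protocol behind the reported numbers (for example, which color channel is scored or whether borders are cropped) is not stated in this listing, so this is not a reproduction of the paper's evaluation.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped frames."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy frames: every pixel differs by exactly 1, so MSE = 1.
ref = np.full((4, 4), 100, dtype=np.uint8)
est = np.full((4, 4), 101, dtype=np.uint8)
print(round(psnr(ref, est), 2))  # about 48.13 dB
```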
