Fast Fourier Inception Networks for Occluded Video Prediction

Ping Li, Chenhan Zhang, Xianghua Xu

2023-06-17 · Video Prediction

Abstract

Video prediction is a pixel-level task that generates future frames from historical frames. Videos often contain continuous complex motions, such as object overlapping and scene occlusion, which pose great challenges to this task. Previous works either fail to capture the long-term temporal dynamics well or do not handle occlusion masks. To address these issues, we develop the fully convolutional Fast Fourier Inception Networks for video prediction, termed FFINet, which includes two primary components, i.e., the occlusion inpainter and the spatiotemporal translator. The former adopts fast Fourier convolutions to enlarge the receptive field, so that the missing areas (occlusions) with complex geometric structures are filled by the inpainter. The latter employs the stacked Fourier transform inception module to learn the temporal evolution by group convolutions and the spatial movement by channel-wise Fourier convolutions, thereby capturing both local and global spatiotemporal features. This encourages generating more realistic and high-quality future frames. To optimize the model, a recovery loss is added to the objective, i.e., minimizing the mean square error between the ground-truth frame and the recovered frame. Both quantitative and qualitative experimental results on five benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, demonstrate the superiority of the proposed approach. Our code is available at GitHub.
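To make the two ideas in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single frame of shape (C, H, W) and illustrates (a) a frequency-domain channel mixing step of the kind used in fast Fourier convolutions, where every output pixel can depend on every input pixel, and (b) a mean-square-error recovery loss. The function names `spectral_transform` and `recovery_loss`, the `weight` matrix, and the tensor shapes are hypothetical choices made for illustration only.

```python
import numpy as np


def spectral_transform(x, weight):
    """Sketch of a global (frequency-domain) branch.

    x:      real array of shape (C, H, W), one frame (assumed layout).
    weight: real array of shape (2C, 2C), a pointwise channel-mixing
            matrix applied to stacked real/imaginary parts (assumption).
    """
    c, h, w = x.shape
    # Real 2-D FFT over the spatial axes -> complex array (C, H, W//2 + 1).
    freq = np.fft.rfft2(x, axes=(-2, -1), norm="ortho")
    # Stack real and imaginary parts along the channel axis -> (2C, H, W//2 + 1).
    stacked = np.concatenate([freq.real, freq.imag], axis=0)
    # Pointwise (1x1) channel mixing in the frequency domain, followed by ReLU.
    mixed = np.einsum("oc,chw->ohw", weight, stacked)
    mixed = np.maximum(mixed, 0.0)
    # Recombine into a complex spectrum and return to the spatial domain.
    out_freq = mixed[:c] + 1j * mixed[c:]
    return np.fft.irfft2(out_freq, s=(h, w), axes=(-2, -1), norm="ortho")


def recovery_loss(recovered, target):
    """Mean square error between a recovered frame and the ground truth."""
    return float(np.mean((recovered - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal((3, 16, 16))
    weight = rng.standard_normal((6, 6)) * 0.1
    out = spectral_transform(frame, weight)
    print(out.shape, recovery_loss(out, frame))
```

Because the mixing acts on Fourier coefficients, a single such layer has an image-wide receptive field, which is the property the abstract relies on for filling large occluded regions. A real model would learn `weight` and combine this branch with ordinary local convolutions.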

Results

Task	Dataset	Metric	Value	Model
Video Prediction	Human3.6M	MAE	1190	FFINet
Video Prediction	Human3.6M	MSE	233	FFINet
Video Prediction	Human3.6M	SSIM	0.912	FFINet
