Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun, Xiaogang Wang, Jifeng Dai, Hongsheng Li

2021-09-07 · ICCV 2021 · Tasks: Seeing Beyond the Visible, Video Inpainting
Paper · PDF · Code (official)

Abstract

Transformer, as a strong and flexible architecture for modelling long-range relations, has been widely explored in vision tasks. However, when applied to video inpainting, which requires fine-grained representation, existing methods still yield blurry edges in detailed regions due to hard patch splitting. We tackle this problem with FuseFormer, a Transformer model designed for video inpainting via fine-grained feature fusion, based on novel Soft Split and Soft Composition operations. Soft Split divides the feature map into many patches with a given overlapping interval; conversely, Soft Composition stitches the patches back into a whole feature map, summing the pixels in overlapping regions. These two modules are first used for tokenization before the Transformer layers and de-tokenization after them, providing an effective mapping between tokens and features. Sub-patch-level information interaction is thereby enabled, allowing more effective feature propagation between neighbouring patches and the synthesis of vivid content for hole regions in videos. Moreover, in FuseFormer we insert Soft Composition and Soft Split into the feed-forward network, giving the 1D linear layers the capability to model 2D structure and further enhancing sub-patch-level feature fusion. In both quantitative and qualitative evaluations, the proposed FuseFormer surpasses state-of-the-art methods, and we conduct detailed analyses to examine its superiority.
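The Soft Split and Soft Composition operations described in the abstract amount to an overlapping patch extraction and an overlap-add reconstruction. The following is an illustrative single-channel NumPy sketch of that idea, not the paper's implementation (which operates on multi-channel feature maps inside the network; `patch` and `stride` names here are our own):

```python
import numpy as np

def soft_split(feat, patch, stride):
    """Soft Split (sketch): cut a 2D feature map into overlapping patches.

    With stride < patch, neighbouring patches share pixels; this overlap
    is what enables the sub-patch-level interaction the paper describes.
    """
    H, W = feat.shape
    patches = []
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            patches.append(feat[i:i + patch, j:j + patch].copy())
    return np.stack(patches)

def soft_composition(patches, out_shape, patch, stride):
    """Soft Composition (sketch): stitch patches back into one feature map,
    summing pixel values wherever patches overlap."""
    out = np.zeros(out_shape)
    idx = 0
    H, W = out_shape
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            out[i:i + patch, j:j + patch] += patches[idx]
            idx += 1
    return out
```

On a 6x6 map of ones with `patch=4, stride=2`, `soft_split` yields four overlapping patches, and `soft_composition` produces larger values where more patches overlap, which is exactly the summation behaviour described above.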

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| 3D | DAVIS | Ewarp | 0.1362 | FuseFormer |
| 3D | DAVIS | PSNR | 32.54 | FuseFormer |
| 3D | DAVIS | SSIM | 0.97 | FuseFormer |
| 3D | DAVIS | VFID | 0.138 | FuseFormer |
| 3D | YouTube-VOS 2018 | Ewarp | 0.09 | FuseFormer |
| 3D | YouTube-VOS 2018 | PSNR | 33.29 | FuseFormer |
| 3D | YouTube-VOS 2018 | SSIM | 0.9681 | FuseFormer |
| 3D | YouTube-VOS 2018 | VFID | 0.053 | FuseFormer |
| 3D | HQVI (240p) | LPIPS | 0.0498 | FuseFormer |
| 3D | HQVI (240p) | PSNR | 29.92 | FuseFormer |
| 3D | HQVI (240p) | SSIM | 0.9365 | FuseFormer |
| 3D | HQVI (240p) | VFID | 0.2727 | FuseFormer |
| Video Inpainting | DAVIS | Ewarp | 0.1362 | FuseFormer |
| Video Inpainting | DAVIS | PSNR | 32.54 | FuseFormer |
| Video Inpainting | DAVIS | SSIM | 0.97 | FuseFormer |
| Video Inpainting | DAVIS | VFID | 0.138 | FuseFormer |
| Video Inpainting | YouTube-VOS 2018 | Ewarp | 0.09 | FuseFormer |
| Video Inpainting | YouTube-VOS 2018 | PSNR | 33.29 | FuseFormer |
| Video Inpainting | YouTube-VOS 2018 | SSIM | 0.9681 | FuseFormer |
| Video Inpainting | YouTube-VOS 2018 | VFID | 0.053 | FuseFormer |
| Video Inpainting | HQVI (240p) | LPIPS | 0.0498 | FuseFormer |
| Video Inpainting | HQVI (240p) | PSNR | 29.92 | FuseFormer |
| Video Inpainting | HQVI (240p) | SSIM | 0.9365 | FuseFormer |
| Video Inpainting | HQVI (240p) | VFID | 0.2727 | FuseFormer |
| Seeing Beyond the Visible | KITTI360-EX | Average PSNR | 18.91 | FuseFormer |

Related Papers

- Video Virtual Try-on with Conditional Diffusion Transformer Inpainter (2025-06-26)
- Let Your Video Listen to Your Music! (2025-06-23)
- VideoPDE: Unified Generative PDE Solving via Video Inpainting Diffusion Models (2025-06-16)
- Follow-Your-Creation: Empowering 4D Creation through Video Inpainting (2025-06-05)
- DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds (2025-05-30)
- Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios (2025-05-14)
- DiTPainter: Efficient Video Inpainting with Diffusion Transformers (2025-04-22)
- Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting (2025-04-15)