Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li
Recent progress on Transformers and multi-layer perceptron (MLP) models has provided new network architectural designs for computer vision tasks. Although these models have proved effective in many vision tasks such as image recognition, challenges remain in adapting them for low-level vision. Their inflexibility in supporting high-resolution images and the limitations of local attention are perhaps the main bottlenecks. In this work, we present a multi-axis MLP-based architecture called MAXIM that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature conditioning. Both modules are based exclusively on MLPs, yet are both global and "fully convolutional", two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement, while requiring fewer or a comparable number of parameters and FLOPs relative to competitive models. The source code and trained models will be available at https://github.com/google-research/maxim.
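
The abstract does not spell out how local and global mixing are arranged, so the following is only a minimal sketch of one way such a split could work. It assumes a local axis built from non-overlapping windows and a global axis built from evenly strided pixel groups. The function names `block_partition`, `grid_partition`, and `extent` are hypothetical, and no gating or MLP weights are modeled. The sketch only illustrates why a fixed group size can stay independent of image resolution while still spanning the whole image.

```python
def block_partition(height, width, block):
    """Split the pixel grid into non-overlapping block x block windows.

    Mixing inside one group only sees a small neighbourhood (local cue).
    Assumed partitioning scheme, not taken from the abstract.
    """
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    groups = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            groups.append([(y, x)
                           for y in range(top, top + block)
                           for x in range(left, left + block)])
    return groups


def grid_partition(height, width, grid):
    """Split the pixel grid into groups of grid x grid evenly strided pixels.

    Each group samples the whole image sparsely (global cue).
    Assumed partitioning scheme, not taken from the abstract.
    """
    if height % grid or width % grid:
        raise ValueError("height and width must be divisible by grid")
    stride_y, stride_x = height // grid, width // grid
    groups = []
    for off_y in range(stride_y):
        for off_x in range(stride_x):
            groups.append([(off_y + i * stride_y, off_x + j * stride_x)
                           for i in range(grid)
                           for j in range(grid)])
    return groups


def extent(group):
    """Height and width of the bounding box covered by one group."""
    ys = [y for y, _ in group]
    xs = [x for _, x in group]
    return (max(ys) - min(ys) + 1, max(xs) - min(xs) + 1)


if __name__ == "__main__":
    for size in (8, 16):
        local = block_partition(size, size, block=2)
        glob = grid_partition(size, size, grid=2)
        # Group size stays at 4 pixels; only the global extent grows.
        print(size, len(local[0]), extent(local[0]),
              len(glob[0]), extent(glob[0]))
```

Under these assumptions every group holds the same number of pixels at any resolution, so a mixing layer applied per group would need no resizing when the input grows, while the strided groups still reach across the full image.
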
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Deblurring | RealBlur-J | PSNR (sRGB) | 32.84 | MAXIM |
| Deblurring | RealBlur-J | Params (M) | 22.2 | MAXIM |
| Deblurring | RealBlur-J | SSIM (sRGB) | 0.935 | MAXIM |
| Deblurring | MSU BASED | PSNR | 30.65728 | MAXIM (REDS) |
| Deblurring | RealBlur-R | PSNR (sRGB) | 39.45 | MAXIM |
| Deblurring | RealBlur-R | SSIM (sRGB) | 0.961 | MAXIM-3S |
| Deblurring | GoPro | PSNR | 32.86 | MAXIM-3S |
| Deblurring | RealBlur-R (trained on GoPro) | PSNR (sRGB) | 35.78 | MAXIM |
| Deblurring | MSU BASED | ERQAv2.0 | 0.74277 | MAXIM (REDS) |
| Deblurring | MSU BASED | LPIPS | 0.07836 | MAXIM (REDS) |
| Deblurring | MSU BASED | SSIM | 0.94959 | MAXIM (REDS) |
| Deblurring | MSU BASED | Subjective | 1.0081 | MAXIM (REDS) |
| Deblurring | MSU BASED | VMAF | 67.3502 | MAXIM (REDS) |
| Deblurring | MSU BASED | LPIPS | 0.09188 | MAXIM (GoPro) |
| Deblurring | MSU BASED | PSNR | 31.36344 | MAXIM (GoPro) |
| Deblurring | MSU BASED | SSIM | 0.94386 | MAXIM (GoPro) |
| Deblurring | MSU BASED | Subjective | 0.207 | MAXIM (GoPro) |
| Deblurring | MSU BASED | VMAF | 67.7557 | MAXIM (GoPro) |
| Deblurring | HIDE | PSNR | 32.83 | MAXIM-3S |
| Deblurring | RealBlur-J (trained on GoPro) | PSNR (sRGB) | 28.83 | MAXIM |
| Deblurring | RealBlur-J (trained on GoPro) | SSIM (sRGB) | 0.875 | MAXIM |
| Deblurring | HIDE (trained on GoPro) | PSNR (sRGB) | 32.83 | MAXIM |
| Deblurring | HIDE (trained on GoPro) | Params (M) | 22.2 | MAXIM |
| Deblurring | HIDE (trained on GoPro) | SSIM (sRGB) | 0.956 | MAXIM |
| Image Enhancement | LOL | Average PSNR | 23.43 | MAXIM |
| Image Enhancement | LOL | SSIM | 0.863 | MAXIM |
| Rain Removal | Test1200 | SSIM | 0.922 | MAXIM |
| Rain Removal | Rain100H | SSIM | 0.903 | MAXIM |
| Rain Removal | Test2800 | PSNR | 33.8 | MAXIM |
| Rain Removal | Test100 | PSNR | 31.17 | MAXIM |
| Rain Removal | Test100 | SSIM | 0.922 | MAXIM |
| Rain Removal | Rain100L | SSIM | 0.977 | MAXIM |
| Dehazing | SOTS Indoor | PSNR | 38.11 | MAXIM-2S |
| Dehazing | SOTS Outdoor | PSNR | 34.19 | MAXIM-2S |
| Image Dehazing | SOTS Indoor | PSNR | 38.11 | MAXIM-2S |
| Image Dehazing | SOTS Outdoor | PSNR | 34.19 | MAXIM-2S |
| Denoising | SIDD | PSNR (sRGB) | 39.96 | MAXIM-3S |
| Denoising | SIDD | SSIM (sRGB) | 0.96 | MAXIM-3S |
| Denoising | DND | PSNR (sRGB) | 39.84 | MAXIM-3S |
| Denoising | DND | SSIM (sRGB) | 0.954 | MAXIM-3S |
| Image Denoising | SIDD | PSNR (sRGB) | 39.96 | MAXIM-3S |
| Image Denoising | SIDD | SSIM (sRGB) | 0.96 | MAXIM-3S |
| Image Denoising | DND | PSNR (sRGB) | 39.84 | MAXIM-3S |
| Image Denoising | DND | SSIM (sRGB) | 0.954 | MAXIM-3S |
| Photo Retouching | MIT-Adobe 5k | PSNR | 26.15 | MAXIM |
| Photo Retouching | MIT-Adobe 5k | SSIM | 0.945 | MAXIM |
| Image Deblurring | HIDE | SSIM | 0.956 | MAXIM-3S |
| Image Deblurring | GoPro | PSNR | 32.86 | MAXIM-3S |
| Blind Image Deblurring | RealBlur-J | PSNR (sRGB) | 32.84 | MAXIM |
| Blind Image Deblurring | RealBlur-J | Params (M) | 22.2 | MAXIM |
| Blind Image Deblurring | RealBlur-J | SSIM (sRGB) | 0.935 | MAXIM |
| Blind Image Deblurring | MSU BASED | PSNR | 30.65728 | MAXIM (REDS) |
| Blind Image Deblurring | RealBlur-R | PSNR (sRGB) | 39.45 | MAXIM |
| Blind Image Deblurring | RealBlur-R | SSIM (sRGB) | 0.961 | MAXIM-3S |
| Blind Image Deblurring | GoPro | PSNR | 32.86 | MAXIM-3S |
| Blind Image Deblurring | RealBlur-R (trained on GoPro) | PSNR (sRGB) | 35.78 | MAXIM |
| Blind Image Deblurring | MSU BASED | ERQAv2.0 | 0.74277 | MAXIM (REDS) |
| Blind Image Deblurring | MSU BASED | LPIPS | 0.07836 | MAXIM (REDS) |
| Blind Image Deblurring | MSU BASED | SSIM | 0.94959 | MAXIM (REDS) |
| Blind Image Deblurring | MSU BASED | Subjective | 1.0081 | MAXIM (REDS) |
| Blind Image Deblurring | MSU BASED | VMAF | 67.3502 | MAXIM (REDS) |
| Blind Image Deblurring | MSU BASED | LPIPS | 0.09188 | MAXIM (GoPro) |
| Blind Image Deblurring | MSU BASED | PSNR | 31.36344 | MAXIM (GoPro) |
| Blind Image Deblurring | MSU BASED | SSIM | 0.94386 | MAXIM (GoPro) |
| Blind Image Deblurring | MSU BASED | Subjective | 0.207 | MAXIM (GoPro) |
| Blind Image Deblurring | MSU BASED | VMAF | 67.7557 | MAXIM (GoPro) |
| Blind Image Deblurring | HIDE | PSNR | 32.83 | MAXIM-3S |
| Blind Image Deblurring | RealBlur-J (trained on GoPro) | PSNR (sRGB) | 28.83 | MAXIM |
| Blind Image Deblurring | RealBlur-J (trained on GoPro) | SSIM (sRGB) | 0.875 | MAXIM |
| Blind Image Deblurring | HIDE (trained on GoPro) | PSNR (sRGB) | 32.83 | MAXIM |
| Blind Image Deblurring | HIDE (trained on GoPro) | Params (M) | 22.2 | MAXIM |
| Blind Image Deblurring | HIDE (trained on GoPro) | SSIM (sRGB) | 0.956 | MAXIM |
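
Many rows above report PSNR in decibels. As a reminder of what that number measures, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), over flat lists of pixel values. The helper name `psnr` and the 8-bit peak of 255 are assumptions for illustration. Benchmark evaluation scripts may differ in color space, cropping, and per-image averaging, so this is not a reproduction of how the listed values were obtained.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    `max_value` is the assumed peak intensity (255 for 8-bit images).
    """
    if not reference or len(reference) != len(restored):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [10, 20, 30, 40]
    noisy = [11, 19, 31, 39]  # every pixel is off by exactly 1, so MSE = 1
    print(round(psnr(clean, noisy), 2))
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB, which helps when reading differences of a few tenths of a dB between models.
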