EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

2022-05-29

Tasks: Super-Resolution, Image Classification, Autonomous Driving, Semantic Segmentation, Prediction, Instance Segmentation, Zero-Shot Instance Segmentation, Object Detection, Image Segmentation

Abstract

High-resolution dense prediction enables many appealing real-world applications, such as computational photography and autonomous driving. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performance, our multi-scale linear attention achieves a global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4$\times$ speedup over Restormer while providing 0.11dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9$\times$ higher throughput on A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.
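
To make the contrast with softmax attention concrete, the following is a minimal sketch of single-scale linear attention, not the authors' implementation. The abstract does not specify the similarity kernel, so a ReLU feature map is assumed here. The function names (`linear_attention`, `quadratic_reference`) are hypothetical, and the multi-scale token aggregation is omitted. Plain Python lists keep the sketch dependency-free. The key point is that the key-value summary is computed once and reused by every query, so cost grows linearly with the number of tokens N rather than quadratically.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with an assumed ReLU feature map.

    Q, K: N x d nested lists; V: N x dv nested list.
    Cost is O(N * d * dv): no N x N similarity matrix is built.
    """
    n, d, dv = len(K), len(K[0]), len(V[0])
    Qf = [[relu(x) for x in row] for row in Q]
    Kf = [[relu(x) for x in row] for row in K]
    # Summaries shared by all queries: sum_j k_j^T v_j (d x dv) and sum_j k_j (d).
    kv = [[sum(Kf[j][a] * V[j][b] for j in range(n)) for b in range(dv)]
          for a in range(d)]
    ksum = [sum(Kf[j][a] for j in range(n)) for a in range(d)]
    out = []
    for q in Qf:
        denom = sum(q[a] * ksum[a] for a in range(d)) + eps
        out.append([sum(q[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out


def quadratic_reference(Q, K, V, eps=1e-6):
    """Same quantity computed via explicit pairwise weights, O(N^2) cost."""
    out = []
    for q in Q:
        w = [sum(relu(a) * relu(b) for a, b in zip(q, k)) for k in K]
        denom = sum(w) + eps
        out.append([sum(w[j] * V[j][b] for j in range(len(V))) / denom
                    for b in range(len(V[0]))])
    return out


Q = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]
K = [[0.3, 1.0], [-1.0, 2.0], [1.5, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

Both functions compute the same normalized weighted sum; only the order of the multiplications differs, which is what allows the linear variant to avoid the N x N matrix at high resolution.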

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	Cityscapes val	mIoU	83.2	EfficientViT-B3 (r1184x2368)
Semantic Segmentation	ADE20K	Validation mIoU	49	EfficientViT-B3 (r512)
Image Classification	ImageNet	GFLOPs	20	EfficientViT-L2 (r384)
Image Classification	ImageNet	GFLOPs	11	EfficientViT-L2 (r288)
Image Classification	ImageNet	GFLOPs	5.3	EfficientViT-L1 (r224)
Image Classification	ImageNet	GFLOPs	6.5	EfficientViT-B3 (r288)
Image Classification	ImageNet	GFLOPs	4	EfficientViT-B3 (r224)
Image Classification	ImageNet	GFLOPs	2.1	EfficientViT-B2 (r256)
