Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

2022-05-29

Tasks: Super-Resolution, Image Classification, Autonomous Driving, Semantic Segmentation, Prediction, Instance Segmentation, Zero-Shot Instance Segmentation, Object Detection, Image Segmentation

Abstract

High-resolution dense prediction enables many appealing real-world applications, such as computational photography and autonomous driving. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performance, our multi-scale linear attention achieves a global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4$\times$ speedup over Restormer while providing a 0.11 dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9$\times$ higher throughput on an A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.
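
The abstract does not give the attention formula, so the following is only a rough sketch of a generic kernel-based linear attention with a ReLU feature map. That choice is an assumption made for illustration, not a statement of the paper's exact module, and the multi-scale aggregation is left out entirely. All function names are hypothetical, and plain Python lists are used to keep the example dependency-free. The point it illustrates is why such attention is lightweight: computing K^T V first avoids the n-by-n score matrix while still letting every token attend to every other token.

```python
def relu(x):
    """Element-wise ReLU on a list-of-lists matrix."""
    return [[max(v, 0.0) for v in row] for row in x]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Naive matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear_attention(q, k, v, eps=1e-6):
    """Linear-time form: Q (K^T V) / (Q sum_j K_j).

    q, k have shape (n, d); v has shape (n, d_v). The (d, d_v) matrix
    K^T V is built once, so cost grows linearly with the token count n.
    """
    q, k = relu(q), relu(k)
    kv = matmul(transpose(k), v)             # (d, d_v)
    k_sum = [sum(col) for col in zip(*k)]    # (d,)
    out = []
    for q_i, numer in zip(q, matmul(q, kv)):
        denom = sum(a * b for a, b in zip(q_i, k_sum)) + eps
        out.append([x / denom for x in numer])
    return out


def quadratic_attention(q, k, v, eps=1e-6):
    """Reference form: builds the full (n, n) score matrix explicitly."""
    q, k = relu(q), relu(k)
    scores = matmul(q, transpose(k))         # (n, n)
    out = []
    for s_i, numer in zip(scores, matmul(scores, v)):
        denom = sum(s_i) + eps
        out.append([x / denom for x in numer])
    return out
```

Both functions compute the same normalized weighted sum of the value rows; they differ only in the order of the matrix products, which is what changes the cost from quadratic to linear in the number of tokens.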

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Semantic Segmentation | Cityscapes val | mIoU | 83.2 | EfficientViT-B3 (r1184x2368) |
| Semantic Segmentation | ADE20K | Validation mIoU | 49 | EfficientViT-B3 (r512) |
| Image Classification | ImageNet | GFLOPs | 20 | EfficientViT-L2 (r384) |
| Image Classification | ImageNet | GFLOPs | 11 | EfficientViT-L2 (r288) |
| Image Classification | ImageNet | GFLOPs | 5.3 | EfficientViT-L1 (r224) |
| Image Classification | ImageNet | GFLOPs | 6.5 | EfficientViT-B3 (r288) |
| Image Classification | ImageNet | GFLOPs | 4 | EfficientViT-B3 (r224) |
| Image Classification | ImageNet | GFLOPs | 2.1 | EfficientViT-B2 (r256) |
| 10-shot image generation | Cityscapes val | mIoU | 83.2 | EfficientViT-B3 (r1184x2368) |
| 10-shot image generation | ADE20K | Validation mIoU | 49 | EfficientViT-B3 (r512) |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving (2025-07-19)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework (2025-07-18)
SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)