
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao

Published 2021-03-29; ICCV 2021 (October 2021)
Tasks: Image Classification, Instance Segmentation, Object Detection

Abstract

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques. The first is the multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of Vision Longformer, a variant of Longformer (Beltagy et al., 2020), which was originally developed for natural language processing and achieves linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including the existing ViT models and their ResNet counterparts, and the Pyramid Vision Transformer from a concurrent work (Wang et al., 2021), on a range of vision tasks, including image classification, object detection, and segmentation. The models and source code are released at https://github.com/microsoft/vision-longformer.
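To make the linear-complexity claim concrete, the sketch below counts query-key pairs for full self-attention versus a windowed attention with a few global tokens. It is an illustration only, not the authors' implementation: the function names, the 15x15 window, and the single global token are assumptions chosen for the example, and the local count is an upper bound that ignores window clipping at the image border.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n = height * width
    return n * n


def local_attention_pairs(height: int, width: int,
                          window: int = 15, num_global: int = 1) -> int:
    """Upper bound on query-key pairs for windowed attention.

    Assumed setup (hypothetical, for illustration): each grid token attends
    to at most window*window neighbours plus num_global global tokens, and
    each global token attends to all grid and global tokens.
    """
    n = height * width
    local = n * (window * window + num_global)
    global_part = num_global * (n + num_global)
    return local + global_part


if __name__ == "__main__":
    # Doubling the side length of the token grid quadruples the token count.
    for side in (56, 112, 224):
        full = full_attention_pairs(side, side)
        local = local_attention_pairs(side, side)
        print(f"{side}x{side} tokens: full={full}, windowed<={local}")
```

With a fixed window, quadrupling the number of tokens roughly quadruples the windowed count, while the full-attention count grows by a factor of 16.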

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO minival | AP75 | 47.6 | RetinaNet (ViL-Base, multi-scale, 3x) |
| Object Detection | COCO minival | APL | 58.1 | RetinaNet (ViL-Base, multi-scale, 3x) |
| Object Detection | COCO minival | APM | 48 | RetinaNet (ViL-Base, multi-scale, 3x) |
| Object Detection | COCO minival | APS | 29.9 | RetinaNet (ViL-Base, multi-scale, 3x) |
| Object Detection | COCO minival | box AP | 44.7 | RetinaNet (ViL-Base, multi-scale, 3x) |
| Object Detection | COCO minival | AP50 | 65.5 | RetinaNet (ViL-Base) |
| Object Detection | COCO minival | AP75 | 47.1 | RetinaNet (ViL-Base) |
| Object Detection | COCO minival | APL | 58.3 | RetinaNet (ViL-Base) |
| Object Detection | COCO minival | APM | 47.9 | RetinaNet (ViL-Base) |
| Object Detection | COCO minival | APS | 28.9 | RetinaNet (ViL-Base) |
| Object Detection | COCO minival | box AP | 44.3 | RetinaNet (ViL-Base) |
| Image Classification | ImageNet | GFLOPs | 8.7 | ViL-Medium-D |
| Image Classification | ImageNet | GFLOPs | 13.4 | ViL-Base-D |
| Image Classification | ImageNet | GFLOPs | 4.86 | ViL-Small |
| Image Classification | ImageNet | GFLOPs | 6.74 | ViL-Base-W |
| Image Classification | ImageNet | GFLOPs | 1.3 | ViL-Tiny-RPB |
| Instance Segmentation | COCO minival | AP75 | 49.9 | Mask R-CNN (ViL Base, multi-scale, 3x lr) |
| Instance Segmentation | COCO minival | mask AP | 45.7 | Mask R-CNN (ViL Base, multi-scale, 3x lr) |
| Instance Segmentation | COCO minival | AP50 | 67.2 | Mask R-CNN (ViL Base, 1x lr) |
| Instance Segmentation | COCO minival | AP75 | 49.3 | Mask R-CNN (ViL Base, 1x lr) |
| Instance Segmentation | COCO minival | mask AP | 45.1 | Mask R-CNN (ViL Base, 1x lr) |
