Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao

Date: 2022-11-14
Venue: CVPR 2023
Tasks: Self-Supervised Image Classification, Image Classification, Action Classification, Representation Learning, Segmentation, Transfer Learning, Semantic Segmentation, Instance Segmentation, Action Recognition, Temporal Action Localization, Object Detection

Abstract

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and it sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset with over a thousand categories and the COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
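
The pretext task described in the abstract (regress target vision features at masked patch positions, given the visible patches) can be illustrated with a small loss computation. This is only a sketch under stated assumptions, not the released implementation: the abstract does not specify the loss, so negative cosine similarity is assumed here, and the names `sample_mask`, `masked_feature_loss`, the 40% mask ratio, and the toy feature size are all hypothetical. The ViT encoder and the image-text aligned target model are not shown; plain lists of floats stand in for their outputs.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sample_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity over masked positions only.

    predicted: per-patch features from the model being pre-trained
    target:    per-patch features from a fixed target model (assumed given)
    mask:      True where the patch was hidden from the model
    """
    losses = [
        -cosine_similarity(p, t)
        for p, t, is_masked in zip(predicted, target, mask)
        if is_masked
    ]
    if not losses:
        raise ValueError("mask selects no patches")
    return sum(losses) / len(losses)


# Toy data: 16 patches with 8-dimensional features (hypothetical sizes).
rng = random.Random(0)
target = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
mask = sample_mask(num_patches=16, mask_ratio=0.4, rng=rng)

# A prediction that matches the target exactly reaches the minimum loss.
perfect_loss = masked_feature_loss(target, target, mask)
# A prediction pointing the opposite way reaches the maximum loss.
flipped = [[-x for x in row] for row in target]
worst_loss = masked_feature_loss(flipped, target, mask)
```

Because only masked positions contribute, whatever the model outputs at visible patches does not change the loss in this sketch; the loss ranges from -1 (perfect alignment) to +1 (opposite direction).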

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-400 | Acc@1 | 89.7 | EVA |
| Semantic Segmentation | ADE20K val | mIoU | 61.5 | EVA |
| Semantic Segmentation | COCO-Stuff test | mIoU | 53.4 | EVA |
| Semantic Segmentation | ADE20K | Params (M) | 1074 | EVA |
| Semantic Segmentation | ADE20K | Validation mIoU | 62.3 | EVA |
| Object Detection | COCO test-dev | AP50 | 81.9 | EVA |
| Object Detection | COCO test-dev | AP75 | 71.7 | EVA |
| Object Detection | COCO test-dev | APL | 77.9 | EVA |
| Object Detection | COCO test-dev | APM | 67.7 | EVA |
| Object Detection | COCO test-dev | APS | 48.5 | EVA |
| Object Detection | COCO test-dev | box mAP | 64.7 | EVA |
| Object Detection | COCO-O | Average mAP | 57.8 | EVA |
| Object Detection | COCO-O | Effective Robustness | 28.86 | EVA |
| Object Detection | COCO minival | AP50 | 82.1 | EVA |
| Object Detection | COCO minival | AP75 | 70.8 | EVA |
| Object Detection | COCO minival | APL | 78.5 | EVA |
| Object Detection | COCO minival | APM | 68.4 | EVA |
| Object Detection | COCO minival | APS | 49.4 | EVA |
| Object Detection | COCO minival | box AP | 64.5 | EVA |
| Object Detection | LVIS v1.0 val | box AP | 62.2 | EVA |
| Object Detection | LVIS v1.0 val | box APr | 55.1 | EVA |
| Instance Segmentation | COCO minival | AP50 | 79.4 | EVA |
| Instance Segmentation | COCO minival | AP75 | 60.9 | EVA |
| Instance Segmentation | COCO minival | APL | 72 | EVA |
| Instance Segmentation | COCO minival | APM | 58.4 | EVA |
| Instance Segmentation | COCO minival | APS | 37.6 | EVA |
| Instance Segmentation | COCO minival | mask AP | 55 | EVA |
| Instance Segmentation | COCO test-dev | AP50 | 80 | EVA |
| Instance Segmentation | COCO test-dev | APL | 72.4 | EVA |
| Instance Segmentation | COCO test-dev | APM | 58 | EVA |
| Instance Segmentation | COCO test-dev | APS | 36.3 | EVA |
| Instance Segmentation | COCO test-dev | mask AP | 55.5 | EVA |
| Instance Segmentation | LVIS v1.0 val | mask AP | 55 | EVA |

The same Object Detection values are also filed under the task labels 3D, 2D Classification, 2D Object Detection, and 16k, and the same Semantic Segmentation values under 10-shot image generation.

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)