Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale EVA up to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA greatly stabilizes training and outperforms the training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
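To make the pretext task described above more concrete, the following is a minimal sketch, not the released implementation. It assumes that each image patch has a target feature vector from a frozen image-text aligned teacher, that a random subset of patches is masked, and that the loss is the mean negative cosine similarity between predicted and target features on the masked patches only. The abstract does not specify the loss, the mask ratio, or the masking strategy, so these choices, along with the names `sample_mask` and `masked_feature_loss`, are illustrative assumptions.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def sample_mask(num_patches, mask_ratio, rng):
    """Return a boolean list where True marks a masked patch.

    Uniform random masking is an assumption made for this sketch.
    """
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity over masked patches only.

    predicted: per-patch features produced by the student (hypothetical).
    target:    per-patch features from a frozen teacher (hypothetical).
    mask:      booleans; only positions marked True contribute to the loss.
    """
    losses = [
        -cosine_similarity(p, t)
        for p, t, m in zip(predicted, target, mask)
        if m
    ]
    if not losses:
        raise ValueError("mask selects no patches")
    return sum(losses) / len(losses)


# Toy usage: 8 patches with 4-dimensional features, 50% of patches masked.
rng = random.Random(0)
target = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
predicted = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
mask = sample_mask(num_patches=8, mask_ratio=0.5, rng=rng)
loss = masked_feature_loss(predicted, target, mask)
```

Under these assumptions the loss lies in the range [-1, 1] and reaches its minimum when every predicted feature at a masked position points in the same direction as its target. Visible patches contribute nothing to the loss; they only serve as the context the model conditions on.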
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-400 | Acc@1 | 89.7 | EVA |
| Semantic Segmentation | ADE20K val | mIoU | 61.5 | EVA |
| Semantic Segmentation | COCO-Stuff test | mIoU | 53.4 | EVA |
| Semantic Segmentation | ADE20K | Params (M) | 1074 | EVA |
| Semantic Segmentation | ADE20K | Validation mIoU | 62.3 | EVA |
| Object Detection | COCO test-dev | AP50 | 81.9 | EVA |
| Object Detection | COCO test-dev | AP75 | 71.7 | EVA |
| Object Detection | COCO test-dev | APL | 77.9 | EVA |
| Object Detection | COCO test-dev | APM | 67.7 | EVA |
| Object Detection | COCO test-dev | APS | 48.5 | EVA |
| Object Detection | COCO test-dev | box mAP | 64.7 | EVA |
| Object Detection | COCO-O | Average mAP | 57.8 | EVA |
| Object Detection | COCO-O | Effective Robustness | 28.86 | EVA |
| Object Detection | COCO minival | AP50 | 82.1 | EVA |
| Object Detection | COCO minival | AP75 | 70.8 | EVA |
| Object Detection | COCO minival | APL | 78.5 | EVA |
| Object Detection | COCO minival | APM | 68.4 | EVA |
| Object Detection | COCO minival | APS | 49.4 | EVA |
| Object Detection | COCO minival | box AP | 64.5 | EVA |
| Object Detection | LVIS v1.0 val | box AP | 62.2 | EVA |
| Object Detection | LVIS v1.0 val | box APr | 55.1 | EVA |
| Instance Segmentation | COCO minival | AP50 | 79.4 | EVA |
| Instance Segmentation | COCO minival | AP75 | 60.9 | EVA |
| Instance Segmentation | COCO minival | APL | 72 | EVA |
| Instance Segmentation | COCO minival | APM | 58.4 | EVA |
| Instance Segmentation | COCO minival | APS | 37.6 | EVA |
| Instance Segmentation | COCO minival | mask AP | 55 | EVA |
| Instance Segmentation | COCO test-dev | AP50 | 80 | EVA |
| Instance Segmentation | COCO test-dev | APL | 72.4 | EVA |
| Instance Segmentation | COCO test-dev | APM | 58 | EVA |
| Instance Segmentation | COCO test-dev | APS | 36.3 | EVA |
| Instance Segmentation | COCO test-dev | mask AP | 55.5 | EVA |
| Instance Segmentation | LVIS v1.0 val | mask AP | 55 | EVA |