Florence: A New Foundation Model for Computer Vision

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang

2021-11-22

Cross-Modal Retrieval, Zero-Shot Cross-Modal Retrieval, Video Retrieval, Image Classification, Action Classification, Zero-Shot Video Retrieval, Transfer Learning, Zero-Shot Transfer Image Classification, Action Recognition, Zero-Shot Transfer Image Classification (CN), Retrieval, Visual Question Answering (VQA), Action Recognition In Videos, Zero-Shot Learning, Object Detection, Visual Question Answering

Abstract

Automated visual understanding of our diverse and open world demands that computer vision models generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results in the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
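
The zero-shot transfer mentioned in the abstract relies on images and text living in one shared embedding space. The sketch below illustrates that general idea only: an image is assigned the class whose text prompt embedding has the highest cosine similarity to the image embedding. It is not Florence's implementation. The function names, the prompts, and the 3-dimensional vectors are all hypothetical stand-ins for the outputs of real image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Pick the class prompt whose text embedding is closest to the image.

    Returns the winning prompt and the full prompt -> similarity mapping.
    """
    scores = {
        prompt: cosine(image_embedding, text_embedding)
        for prompt, text_embedding in class_text_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Hand-made toy embeddings (assumption: a real system would obtain these
# from a trained text encoder and image encoder, not from constants).
class_text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, class_text_embeddings)
print(label)
```

No labelled training images for the target classes are needed in this scheme; adding a class only requires embedding one more text prompt.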

Results

Task	Dataset	Metric	Value	Model
Video	MSR-VTT-1kA	text-to-video R@1	37.6	Florence
Video	MSR-VTT-1kA	text-to-video R@10	72.6	Florence
Video	MSR-VTT-1kA	text-to-video R@5	63.8	Florence
Video	Kinetics-600	Top-1 Accuracy	87.8	Florence (curated FLD-900M pretrain)
Video	Kinetics-600	Top-5 Accuracy	97.9	Florence (curated FLD-900M pretrain)
Visual Question Answering (VQA)	VQA v2 test-dev	Accuracy	80.16	Florence
Visual Question Answering (VQA)	VQA v2 test-std	overall	80.36	Florence
Activity Recognition	Kinetics-600	Top-1 Accuracy	87.8	Florence
Activity Recognition	Kinetics-600	Top-5 Accuracy	97.8	Florence
Activity Recognition	Kinetics-400	Top-1 Accuracy	86.5	Florence
Activity Recognition	Kinetics-400	Top-5 Accuracy	97.3	Florence
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@1	81.8	Florence
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@5	95.2	Florence
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@1	63.2	Florence
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@5	85.7	Florence
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@1	90.9	Florence
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@5	99.1	Florence
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@1	76.7	Florence
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@5	93.6	Florence
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@1	64.7	Florence
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@5	85.9	Florence
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@1	47.2	Florence
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@5	71.4	Florence
Object Detection	COCO test-dev	box mAP	62.4	Florence-CoSwin-H
Object Detection	COCO minival	box AP	62	Florence-CoSwin-H
Image Classification	ImageNet	Top 5 Accuracy	99.02	Florence-CoSwin-H
Action Recognition	Kinetics-600	Top-1 Accuracy	87.8	Florence
Action Recognition	Kinetics-600	Top-5 Accuracy	97.8	Florence
Action Recognition	Kinetics-400	Top-1 Accuracy	86.5	Florence
Action Recognition	Kinetics-400	Top-5 Accuracy	97.3	Florence
Video Retrieval	MSR-VTT-1kA	text-to-video R@1	37.6	Florence
Video Retrieval	MSR-VTT-1kA	text-to-video R@10	72.6	Florence
Video Retrieval	MSR-VTT-1kA	text-to-video R@5	63.8	Florence
2D Object Detection	COCO test-dev	box mAP	62.4	Florence-CoSwin-H
2D Object Detection	COCO minival	box AP	62	Florence-CoSwin-H
Action Recognition In Videos	Kinetics-600	Top-1 Accuracy	87.8	Florence
Action Recognition In Videos	Kinetics-600	Top-5 Accuracy	97.8	Florence
Action Recognition In Videos	Kinetics-400	Top-1 Accuracy	86.5	Florence
Action Recognition In Videos	Kinetics-400	Top-5 Accuracy	97.3	Florence
Cross-Modal Information Retrieval	COCO 2014	Image-to-text R@1	81.8	Florence
Cross-Modal Information Retrieval	COCO 2014	Image-to-text R@5	95.2	Florence
Cross-Modal Information Retrieval	COCO 2014	Text-to-image R@1	63.2	Florence
Cross-Modal Information Retrieval	COCO 2014	Text-to-image R@5	85.7	Florence
Cross-Modal Retrieval	COCO 2014	Image-to-text R@1	81.8	Florence
Cross-Modal Retrieval	COCO 2014	Image-to-text R@5	95.2	Florence
Cross-Modal Retrieval	COCO 2014	Text-to-image R@1	63.2	Florence
Cross-Modal Retrieval	COCO 2014	Text-to-image R@5	85.7	Florence
Visual Question Answering	VQA v2 test-dev	Accuracy	80.16	Florence
Visual Question Answering	VQA v2 test-std	overall	80.36	Florence
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	37.6	Florence
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	72.6	Florence
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	63.8	Florence
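
For readers unfamiliar with the retrieval metrics in the table, R@K (recall at K) is commonly computed as the percentage of queries whose correct item appears among the top K retrieved results. The sketch below assumes exactly one relevant item per query, which is a simplification; benchmarks with several valid matches per query usually count a hit if any of them is in the top K. The function names and the toy rankings are hypothetical and do not come from the paper.

```python
def hit_at_k(ranked_ids, relevant_id, k):
    """Return 1.0 if the relevant item is among the first k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def recall_at_k(rankings, ground_truth, k):
    """Percentage of queries whose relevant item is ranked in the top k.

    rankings: query id -> list of retrieved item ids, best match first.
    ground_truth: query id -> the single relevant item id (assumption).
    """
    hits = [hit_at_k(rankings[q], ground_truth[q], k) for q in ground_truth]
    return 100.0 * sum(hits) / len(hits)


# Toy text-to-video rankings for four queries (made-up data).
rankings = {
    "q1": ["v1", "v2", "v3"],
    "q2": ["v3", "v2", "v1"],
    "q3": ["v1", "v3", "v2"],
    "q4": ["v4", "v1", "v2"],
}
ground_truth = {"q1": "v1", "q2": "v2", "q3": "v2", "q4": "v4"}

r_at_1 = recall_at_k(rankings, ground_truth, 1)
r_at_2 = recall_at_k(rankings, ground_truth, 2)
r_at_3 = recall_at_k(rankings, ground_truth, 3)
print(r_at_1, r_at_2, r_at_3)
```

Top-1 and Top-5 accuracy for classification follow the same pattern, with the ranked list holding predicted class labels instead of retrieved items.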
