Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Florence: A New Foundation Model for Computer Vision

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang

Published: 2021-11-22

Tasks: Cross-Modal Retrieval, Zero-Shot Cross-Modal Retrieval, Video Retrieval, Zero-Shot Video Retrieval, Image Classification, Action Classification, Action Recognition, Action Recognition In Videos, Transfer Learning, Zero-Shot Transfer Image Classification, Zero-Shot Transfer Image Classification (CN), Zero-Shot Learning, Retrieval, Object Detection, Visual Question Answering (VQA)

Abstract

Automated visual understanding of our diverse and open world demands that computer vision models generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical to this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results in the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
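The zero-shot transfer the abstract describes rests on mapping images and label prompts into a shared embedding space and picking the closest label. A minimal sketch of that idea, with made-up function names and toy embeddings (this is not the Florence API):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the image.

    image_emb: (d,) image embedding; text_embs: (n, d), one row per label
    prompt (e.g. "a photo of a {label}"). Illustrative only.
    """
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # (n,) cosine similarity per label
    return labels[int(np.argmax(sims))]

# Toy example with invented 4-d embeddings
labels = ["cat", "dog"]
texts = np.array([[0.9, 0.1, 0.0, 0.1],
                  [0.1, 0.8, 0.2, 0.0]])
image = np.array([0.85, 0.15, 0.05, 0.1])   # closest to the "cat" prompt
print(zero_shot_classify(image, texts, labels))  # cat
```

In practice the label set can be swapped at inference time without retraining, which is what makes the zero-shot setting possible.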

Results

Task | Dataset | Metric | Value | Model
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 37.6 | Florence
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 63.8 | Florence
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 72.6 | Florence
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 37.6 | Florence
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 63.8 | Florence
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 72.6 | Florence
Action Recognition | Kinetics-400 | Top-1 Accuracy | 86.5 | Florence
Action Recognition | Kinetics-400 | Top-5 Accuracy | 97.3 | Florence
Action Recognition | Kinetics-600 | Top-1 Accuracy | 87.8 | Florence
Action Recognition | Kinetics-600 | Top-5 Accuracy | 97.8 | Florence
Action Recognition | Kinetics-600 | Top-1 Accuracy | 87.8 | Florence (curated FLD-900M pretrain)
Action Recognition | Kinetics-600 | Top-5 Accuracy | 97.9 | Florence (curated FLD-900M pretrain)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 80.16 | Florence
Visual Question Answering (VQA) | VQA v2 test-std | Overall | 80.36 | Florence
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 81.8 | Florence
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 95.2 | Florence
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 63.2 | Florence
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 85.7 | Florence
Zero-Shot Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 64.7 | Florence
Zero-Shot Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 85.9 | Florence
Zero-Shot Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 47.2 | Florence
Zero-Shot Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 71.4 | Florence
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 90.9 | Florence
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 99.1 | Florence
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 76.7 | Florence
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 93.6 | Florence
Object Detection | COCO test-dev | box mAP | 62.4 | Florence-CoSwin-H
Object Detection | COCO minival | box AP | 62.0 | Florence-CoSwin-H
Image Classification | ImageNet | Top-5 Accuracy | 99.02 | Florence-CoSwin-H
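The retrieval rows above report Recall@K: the fraction of queries whose ground-truth match appears among the top K ranked candidates. A minimal sketch of how that metric is computed from a similarity matrix, assuming the usual paired setup where the match for query i is candidate i (toy numbers, not Florence outputs):

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K (in percent) given a query-by-candidate similarity matrix.

    sim: (n, n) array; the ground-truth match for query i is candidate i,
    as in paired text-to-video or image-to-text benchmarks.
    """
    n = sim.shape[0]
    # Indices of the k most similar candidates for each query
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = sum(i in topk[i] for i in range(n))
    return 100.0 * hits / n

# Toy 3x3 example: queries 0 and 2 rank their match first,
# query 1 ranks its match second.
sim = np.array([[0.9, 0.2, 0.1],
                [0.7, 0.5, 0.1],
                [0.2, 0.3, 0.8]])
print(recall_at_k(sim, 1))  # 66.66...
print(recall_at_k(sim, 2))  # 100.0
```

By construction R@1 ≤ R@5 ≤ R@10, which matches the pattern in the table (e.g. 37.6 / 63.8 / 72.6 on MSR-VTT).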
