Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie Zhou, Jifeng Dai

2022-11-17 · CVPR 2023
Tasks: Image Classification · Object Detection · Long-tailed Object Detection · Semantic Segmentation
Paper · PDF · Code (official)

Abstract

To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources have been proposed, including supervised, weakly-supervised, and self-supervised pre-training. It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of pre-training. It is thus desirable that these strategies be integrated in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all existing pre-training approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation. Notably, we successfully pre-train a billion-parameter image backbone and achieve state-of-the-art performance on various benchmarks. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining.
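The unified target described above is a mutual-information bound between representations of paired views or modalities. The paper's exact M3I formulation is not reproduced here; as a rough intuition only, mutual-information maximization between two modalities is commonly approximated with an InfoNCE-style contrastive objective, where each sample's paired embedding must outscore all other pairings in the batch. The sketch below uses toy hand-made "image" and "text" embeddings and dot-product scores, all of which are illustrative assumptions, not the paper's method:

```python
import math

def dot(u, v):
    # inner product of two equal-length vectors
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss (a standard lower bound on mutual
    information): for each anchor i, the i-th positive should score higher
    than every other sample in the batch. Lower loss = tighter alignment."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        # cross-entropy with the i-th positive as the correct "class"
        log_z = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# toy paired embeddings from two hypothetical modalities
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]  # each row is aligned with the same-index image
print(info_nce(imgs, txts))
```

Well-aligned pairs drive the loss toward zero, while shuffling the pairing raises it, which is what makes the bound usable as a pre-training signal.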

Results

Task                  | Dataset           | Metric          | Value | Model
Semantic Segmentation | ADE20K            | Params (M)      | 1310  | M3I Pre-training (InternImage-H)
Semantic Segmentation | ADE20K            | Validation mIoU | 62.9  | M3I Pre-training (InternImage-H)
Object Detection      | LVIS v1.0 minival | box AP          | 65.8  | M3I Pre-training (InternImage-H, single-scale)
Object Detection      | COCO test-dev     | box mAP         | 65.4  | M3I Pre-training (InternImage-H)
Object Detection      | COCO minival      | box AP          | 65.0  | M3I Pre-training (InternImage-H)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)