Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion

Hanqing Zhao, Dianmo Sheng, Jianmin Bao, Dongdong Chen, Dong Chen, Fang Wen, Lu Yuan, Ce Liu, Wenbo Zhou, Qi Chu, Weiming Zhang, Nenghai Yu

2022-12-07 · Data Augmentation · Segmentation · Semantic Segmentation · Open Vocabulary Object Detection · Instance Segmentation · Zero-Shot Learning · Object Detection

Paper · PDF · Code (official) · Code

Abstract

Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts segmentation performance, especially for rare object categories. Although more diverse, higher-quality object instances yield larger gains from Copy-Paste, previous works obtain object instances either from human-annotated instance segmentation datasets or by rendering 3D object models, and both approaches are too expensive to scale up to good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable. To make this possible, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves +2.6 box AP and +2.1 mask AP gains on all classes, and even more significant gains of +6.8 box AP and +6.5 mask AP on long-tail classes. Our code and models are available at https://github.com/yoctta/XPaste.
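The Copy-Paste idea the abstract describes — compositing an object instance, via its segmentation mask, at a random location on a background image — can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation; the function name and array conventions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def copy_paste(background, instance, mask):
    """Paste one object instance onto a background image.

    background: (H, W, 3) uint8 image.
    instance:   (h, w, 3) uint8 crop of the object.
    mask:       (h, w) binary mask of the object within the crop.
    Returns the augmented image and the pasted mask in background coordinates.
    """
    H, W, _ = background.shape
    h, w, _ = instance.shape
    # Random top-left corner such that the instance fits entirely.
    y = rng.integers(0, H - h + 1)
    x = rng.integers(0, W - w + 1)
    out = background.copy()
    region = out[y:y + h, x:x + w]
    # Keep background pixels where mask is 0, object pixels where mask is 1.
    region[mask.astype(bool)] = instance[mask.astype(bool)]
    # Record the pasted instance's mask in full-image coordinates,
    # which becomes the free segmentation label for training.
    full_mask = np.zeros((H, W), dtype=np.uint8)
    full_mask[y:y + h, x:x + w] = mask
    return out, full_mask

# Toy example: a 4x4 red square pasted onto a 16x16 black background.
bg = np.zeros((16, 16, 3), dtype=np.uint8)
obj = np.full((4, 4, 3), (255, 0, 0), dtype=np.uint8)
m = np.ones((4, 4), dtype=np.uint8)
aug, aug_mask = copy_paste(bg, obj, m)
print(aug_mask.sum())  # 16
```

X-Paste's contribution is upstream of this step: it scales the *supply* of `instance`/`mask` pairs by generating object images with a text2image model (or filtering web-crawled ones with CLIP) instead of relying on human annotation or 3D rendering.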

Results

Task | Dataset | Metric | Value | Model
Object Detection | LVIS v1.0 val | box AP | 50.9 | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste)
Object Detection | LVIS v1.0 val | box APr | 48.7 | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste)
Instance Segmentation | COCO minival | mask AP | 48.8 | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste)
Instance Segmentation | LVIS v1.0 val | mask AP | 45.4 | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste)
Instance Segmentation | LVIS v1.0 val | mask APr | 43.8 | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste)
Open Vocabulary Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 21.4 | X-Paste
Open Vocabulary Object Detection | LVIS v1.0 | AP novel (unrestricted open-vocabulary training) | 22.8 | X-Paste

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)