Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Contrastive Feature Masking Open-Vocabulary Vision Transformer

Dahun Kim, Anelia Angelova, Weicheng Kuo

2023-09-02 · ICCV 2023

Tasks: Image-text Retrieval · Text Retrieval · Contrastive Learning · Open Vocabulary Object Detection · Retrieval · Object Detection

Paper · PDF

Abstract

We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective with the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, we perform reconstruction in the joint image-text embedding space, rather than the pixel space as is customary with the classical MAE method, which causes the model to better learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED) to address scale variation between image-text pretraining and detection finetuning by randomly dropping out the positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 AP$_r$, surpassing the best approach by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
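The abstract describes Positional Embedding Dropout (PED) as randomly dropping the positional embeddings during pretraining to bridge the scale gap between pretraining and detection finetuning. The sketch below illustrates the idea in plain Python; the function names and the per-sequence drop granularity are assumptions for illustration, not the paper's actual implementation.

```python
import random

def add_positional_embeddings(patch_tokens, pos_embed,
                              drop_prob=0.5, training=True):
    """Minimal sketch of Positional Embedding Dropout (PED).

    With probability `drop_prob` during pretraining, the positional
    embeddings are skipped entirely, so the backbone learns features
    that are less tied to one input scale. At inference (or detection
    finetuning) the embeddings are always added as usual.
    Assumed helper, not the paper's code.
    """
    if training and random.random() < drop_prob:
        # Positions dropped: return the patch tokens unchanged.
        return [list(tok) for tok in patch_tokens]
    # Usual additive positional embedding, one vector per patch token.
    return [[t + p for t, p in zip(tok, pe)]
            for tok, pe in zip(patch_tokens, pos_embed)]
```

For example, with `training=False` the embeddings are always added, while `drop_prob=1.0` during training always drops them; in a real ViT this decision would typically be made per training step on batched tensors.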

Results

Task                             | Dataset   | Metric                        | Value | Model
Object Detection                 | LVIS v1.0 | AP novel (LVIS base training) | 33.9  | CFM-ViT
Object Detection                 | MSCOCO    | AP 0.5                        | 34.1  | CFM-ViT
Open Vocabulary Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 33.9  | CFM-ViT
Open Vocabulary Object Detection | MSCOCO    | AP 0.5                        | 34.1  | CFM-ViT

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)