Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

OvarNet: Towards Open-vocabulary Object Attribute Recognition

Keyan Chen, XiaoLong Jiang, Yao Hu, Xu Tang, Yan Gao, Jianqi Chen, Weidi Xie

2023-01-23 · CVPR 2023
Tasks: Open Vocabulary Attribute Detection, Attribute, Scene Understanding, Open Vocabulary Object Detection, Knowledge Distillation, Object Detection
Paper · PDF · Code (official)

Abstract

In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr, in which candidate objects are first proposed with an offline RPN and then classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN-type model end-to-end with knowledge distillation, which performs class-agnostic object proposals and classifies semantic categories and attributes with classifiers generated from a text encoder; finally, (iv) we conduct extensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets and show that recognition of semantic categories and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization to novel attributes and categories.
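
The core open-vocabulary mechanism the abstract describes, scoring region proposals against text-encoder embeddings of category names and attribute phrases, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the placeholder text encoder, and the embedding dimension are invented for the example.

```python
# Sketch of open-vocabulary category/attribute scoring for region proposals.
# Region features from a detector are compared, via cosine similarity, with
# text embeddings of category names and attribute phrases. All names and
# dimensions below are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def build_text_classifier(text_encoder, phrases, device="cpu"):
    """Encode phrases (e.g. 'a photo of a red car') into L2-normalised
    classifier weights, one row per phrase."""
    with torch.no_grad():
        embeds = torch.stack([text_encoder(p) for p in phrases]).to(device)
    return F.normalize(embeds, dim=-1)

def score_regions(region_feats, category_weights, attribute_weights, tau=0.01):
    """Return per-region category probabilities and attribute probabilities.

    region_feats:      (N, D) visual embeddings of candidate boxes (e.g. from an RPN)
    category_weights:  (C, D) text embeddings of category names
    attribute_weights: (A, D) text embeddings of attribute phrases
    """
    v = F.normalize(region_feats, dim=-1)
    cat_logits = v @ category_weights.t() / tau   # one category per box: softmax
    att_logits = v @ attribute_weights.t() / tau  # attributes can co-occur: sigmoid
    return cat_logits.softmax(dim=-1), att_logits.sigmoid()

if __name__ == "__main__":
    D = 512
    fake_text_encoder = lambda phrase: torch.randn(D)  # placeholder, not CLIP
    cats = ["a photo of a cat", "a photo of a car"]
    attrs = ["red", "furry", "metallic"]
    W_cat = build_text_classifier(fake_text_encoder, cats)
    W_att = build_text_classifier(fake_text_encoder, attrs)
    regions = torch.randn(5, D)                        # 5 candidate boxes
    cat_probs, att_probs = score_regions(regions, W_cat, W_att)
    print(cat_probs.shape, att_probs.shape)            # (5, 2) (5, 3)
```

Because the classifiers are just text embeddings, novel categories or attributes can be added at test time by encoding new phrases, which is what makes the setup open-vocabulary.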

Results

Task | Dataset | Metric | Value | Model
Object Detection | OVAD benchmark | mean average precision | 27.2 | OvarNet (ViT-B16)
3D | OVAD benchmark | mean average precision | 27.2 | OvarNet (ViT-B16)
2D Classification | OVAD benchmark | mean average precision | 27.2 | OvarNet (ViT-B16)
2D Object Detection | OVAD benchmark | mean average precision | 27.2 | OvarNet (ViT-B16)
Open Vocabulary Object Detection | OVAD benchmark | mean average precision | 27.2 | OvarNet (ViT-B16)
16k | OVAD benchmark | mean average precision | 27.2 | OvarNet (ViT-B16)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)