
GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

Yue Liao, Aixi Zhang, Miao Lu, Yongliang Wang, Xiaobo Li, Si Liu

2022-03-26 · CVPR 2022 · Human-Object Interaction Detection · Transfer Learning

Paper · PDF · Code (official)

Abstract

The task of Human-Object Interaction (HOI) detection can be divided into two core problems, i.e., human-object association and interaction understanding. In this paper, we reveal and address the disadvantages of conventional query-driven HOI detectors from these two aspects. For association, previous two-branch methods suffer from complex and costly post-matching, while single-branch methods ignore the feature distinctions between the two tasks. We propose the Guided-Embedding Network (GEN) to attain a two-branch pipeline without post-matching. In GEN, we design an instance decoder to detect humans and objects with two independent query sets, and a position Guided Embedding (p-GE) to mark the human and object at the same position as a pair. Besides, we design an interaction decoder to classify interactions, where the interaction queries are composed of instance Guided Embeddings (i-GE) generated from the outputs of each instance decoder layer. For interaction understanding, previous methods suffer from the long-tailed distribution and zero-shot discovery. This paper proposes a Visual-Linguistic Knowledge Transfer (VLKT) training strategy to enhance interaction understanding by transferring knowledge from CLIP, a visual-linguistic pre-trained model. Specifically, we extract text embeddings for all labels with CLIP to initialize the classifier, and adopt a mimic loss to minimize the visual feature distance between GEN and CLIP. As a result, GEN-VLKT outperforms the state of the art by large margins on multiple datasets, e.g., +5.05 mAP on HICO-Det. The source code is available at https://github.com/YueLiao/gen-vlkt.
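To make the VLKT idea concrete, the sketch below shows one plausible way to (a) initialize an interaction classifier from CLIP text embeddings and (b) apply a mimic loss that pulls the detector's visual features toward frozen CLIP image features. This is a minimal illustration assuming PyTorch and OpenAI's `clip` package; the toy label set, prompt template, feature dimensions, and the L1 form of the mimic loss are assumptions for demonstration, not the official GEN-VLKT configuration (see the repository for the actual implementation).

    # Sketch of the VLKT training strategy described in the abstract.
    # Assumptions: PyTorch + the `clip` package; label set, prompt template,
    # and loss choices below are illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)

    interaction_labels = ["ride bicycle", "hold cup", "kick ball"]  # toy HOI labels

    # (a) CLIP text embeddings for every interaction label become the classifier weights.
    with torch.no_grad():
        tokens = clip.tokenize(
            [f"a photo of a person {v}" for v in interaction_labels]
        ).to(device)
        text_emb = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)

    classifier = nn.Linear(text_emb.shape[1], len(interaction_labels), bias=False)
    classifier.weight.data.copy_(text_emb)  # classifier initialized from CLIP text space

    # (b) Mimic loss: align the detector's pooled visual feature with CLIP's image feature.
    def mimic_loss(det_feat, images):
        """L1 distance between detector features and frozen CLIP image features."""
        with torch.no_grad():
            clip_feat = F.normalize(clip_model.encode_image(images).float(), dim=-1)
        return F.l1_loss(F.normalize(det_feat, dim=-1), clip_feat)

    # Usage with dummy tensors: in GEN-VLKT, `det_feat` would come from the
    # interaction decoder rather than random noise.
    images = torch.randn(2, 3, 224, 224, device=device)
    det_feat = torch.randn(2, text_emb.shape[1], device=device)
    logits = classifier(det_feat)        # interaction scores over the label set
    loss = mimic_loss(det_feat, images)  # added to the detection losses during training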

Results

Task                                  Dataset    Metric   Value   Model
Human-Object Interaction Detection    HICO-DET   mAP      34.95   GEN-VLKT-R101

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
Robust-Multi-Task Gradient Boosting (2025-07-15)
RoHOI: Robustness Benchmark for Human-Object Interaction Detection (2025-07-12)
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift (2025-07-12)
The Bayesian Approach to Continual Learning: An Overview (2025-07-11)
Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection (2025-07-09)