Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Visual Affordance Grounding from Demonstration Videos

Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao

2021-08-12 · Action Recognition · Video-to-image Affordance Grounding

Abstract

Visual affordance grounding aims to segment all possible interaction regions between people and objects in an image or video, which benefits many applications such as robot grasping and action recognition. However, existing methods mainly rely on the appearance features of objects to segment each region of the image, an approach that faces two problems: (i) there are multiple possible regions in an object that people interact with, and (ii) there are multiple possible human interactions on the same object region.

To address these problems, we propose a Hand-aided Affordance Grounding Network (HAG-Net) that leverages the clues provided by the position and action of the hand in demonstration videos to eliminate the multiple possibilities and better locate the interaction regions of the object. Specifically, HAG-Net has a dual-branch structure that processes the demonstration video and the object image. In the video branch, hand-aided attention enhances the region around the hand in each video frame, and an LSTM network then aggregates the action features. In the object branch, a semantic enhancement module (SEM) makes the network focus on different parts of the object according to the action class, and a distillation loss aligns the output features of the object branch with those of the video branch, transferring knowledge from the video branch to the object branch.

Quantitative and qualitative evaluations on two challenging datasets show that our method achieves state-of-the-art results for affordance grounding. The source code will be made available to the public.
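At the time of this abstract, the authors' source code had not yet been released. As a rough illustration of the distillation idea described above (aligning the object branch's output features with those of the video branch), here is a minimal NumPy sketch; the function name, feature shapes, and the simple L2 form of the alignment loss are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def distillation_loss(video_feat, object_feat):
    """Penalize mismatch between object-branch features and the
    video-branch features they should imitate.

    A plain mean-squared alignment; HAG-Net's actual loss may differ.
    """
    assert video_feat.shape == object_feat.shape
    return float(np.mean((video_feat - object_feat) ** 2))

# Toy example (hypothetical shapes): 64-channel 7x7 feature maps
# from each branch; the object branch is a noisy copy of the video one.
rng = np.random.default_rng(0)
video_feat = rng.standard_normal((64, 7, 7))
object_feat = video_feat + 0.1 * rng.standard_normal((64, 7, 7))
loss = distillation_loss(video_feat, object_feat)
```

In training, a loss like this would be minimized jointly with the grounding objective so that knowledge from the hand-attended video branch transfers to the image-only object branch.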

Results

Task                                 Dataset        Metric  Value  Model
Video-to-image Affordance Grounding  OPRA (28x28)   AUC-J   0.81   HAG-Net (+Hand Box)
Video-to-image Affordance Grounding  OPRA (28x28)   KLD     1.41   HAG-Net (+Hand Box)
Video-to-image Affordance Grounding  OPRA (28x28)   SIM     0.37   HAG-Net (+Hand Box)
Video-to-image Affordance Grounding  EPIC-Hotspot   AUC-J   0.80   HAG-Net (+Hand Box)
Video-to-image Affordance Grounding  EPIC-Hotspot   KLD     1.21   HAG-Net (+Hand Box)
Video-to-image Affordance Grounding  EPIC-Hotspot   SIM     0.41   HAG-Net (+Hand Box)
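KLD and SIM in the table are standard saliency-map metrics: KL divergence between the predicted and ground-truth heatmaps (lower is better) and histogram intersection similarity (higher is better, at most 1.0). Assuming their usual definitions rather than the paper's exact evaluation code, a sketch on 28x28 heatmaps:

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    """KL divergence of ground truth from prediction; lower is better."""
    p = pred / (pred.sum() + eps)   # normalize heatmaps to distributions
    g = gt / (gt.sum() + eps)
    return float(np.sum(g * np.log(g / (p + eps) + eps)))

def sim(pred, gt, eps=1e-12):
    """Histogram intersection; 1.0 means identical distributions."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.minimum(p, g).sum())

# Toy 28x28 heatmaps: a ground truth and a spatially shifted prediction.
rng = np.random.default_rng(0)
gt = rng.random((28, 28))
pred = np.roll(gt, 3, axis=0)
```

Both metrics compare normalized distributions, so they are insensitive to the absolute scale of the heatmaps; AUC-Judd (AUC-J) is computed differently, from fixation-point classification, and is not sketched here.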

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
- Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
- CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
- Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
- Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
- Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)