Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Affordance Grounding from Demonstration Video to Target Image

Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou

2023-03-26 · CVPR 2023 · Video-to-image Affordance Grounding
Paper · PDF · Code (official)

Abstract

Humans excel at learning from expert demonstrations and solving their own problems. To equip intelligent robots and assistants, such as AR glasses, with this ability, it is essential to ground human hand interactions (i.e., affordances) from demonstration videos and apply them to a target image like a user's AR glass view. The video-to-image affordance grounding task is challenging due to (1) the need to predict fine-grained affordances, and (2) the limited training data, which inadequately covers video-image discrepancies and negatively impacts grounding. To tackle them, we propose Affordance Transformer (Afformer), which has a fine-grained transformer-based decoder that gradually refines affordance grounding. Moreover, we introduce Mask Affordance Hand (MaskAHand), a self-supervised pre-training technique for synthesizing video-image data and simulating context changes, enhancing affordance grounding across video-image discrepancies. Afformer with MaskAHand pre-training achieves state-of-the-art performance on multiple benchmarks, including a substantial 37% improvement on the OPRA dataset. Code is made available at https://github.com/showlab/afformer.

Results

Task: Video-to-image Affordance Grounding (all rows)

Dataset        Metric                 Value   Model
OPRA (28x28)   AUC-J                  0.89    Afformer
OPRA (28x28)   KLD                    1.05    Afformer
OPRA (28x28)   SIM                    0.53    Afformer
EPIC-Hotspot   AUC-J                  0.88    Afformer
EPIC-Hotspot   KLD                    0.97    Afformer
EPIC-Hotspot   SIM                    0.56    Afformer
OPRA           KLD                    1.51    Afformer (ViTDet-B encoder)
OPRA           Top-1 Action Accuracy  52.27   Afformer (ViTDet-B encoder)
OPRA           KLD                    1.55    Afformer (ResNet-50-FPN encoder)
OPRA           Top-1 Action Accuracy  52.14   Afformer (ResNet-50-FPN encoder)
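For readers unfamiliar with the metrics above: KLD and SIM are standard measures for comparing a predicted affordance heatmap against a ground-truth heatmap, treating both as probability distributions (KLD: lower is better; SIM: higher is better, max 1.0). The sketch below shows common definitions of these two metrics; it is illustrative only and is not taken from the Afformer codebase, whose exact evaluation details may differ.

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    # Kullback-Leibler divergence KL(gt || pred) between two
    # heatmaps, each normalized to sum to 1. Lower is better.
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.sum(g * np.log(g / (p + eps) + eps)))

def sim(pred, gt, eps=1e-12):
    # Histogram intersection between normalized heatmaps.
    # Higher is better; identical maps score 1.0.
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.minimum(p, g).sum())

# Toy 28x28 heatmaps, matching the OPRA (28x28) resolution above.
pred = np.random.rand(28, 28)
gt = np.random.rand(28, 28)
print(kld(pred, gt), sim(pred, gt))
```

AUC-J (Judd AUC) additionally requires binarized fixation points and an ROC sweep over thresholds, so it is omitted from this sketch.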

Related Papers

Learning Visual Affordance Grounding from Demonstration Videos (2021-08-12)
Grounded Human-Object Interaction Hotspots from Video (2018-12-11)
Demo2Vec: Reasoning Object Affordances From Online Videos (2018-06-01)