
Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition

Yuhang Wen, Zixuan Tang, Yunsheng Pang, Beichen Ding, Mengyuan Liu

2023-07-14 · 3D Action Recognition · Skeleton Based Action Recognition · Action Recognition · Human Interaction Recognition

Paper · PDF · Code (official)

Abstract

Recognizing interactive actions plays an important role in human-robot interaction and collaboration. Previous methods use late fusion or co-attention mechanisms to capture interactive relations, which either have limited learning capability or adapt inefficiently to a larger number of interacting entities. Because they assume that priors for each entity are already known, they also lack evaluation in a more general setting that addresses the diversity of subjects. To address these problems, we propose the Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations. Specifically, our network contains a tokenizer that partitions the input into Interactive Spatiotemporal Tokens (ISTs), a unified way to represent the motions of multiple diverse entities. By extending the entity dimension, ISTs provide better interactive representations. To jointly learn along the three dimensions of ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations. When modeling these correlations, a strict entity ordering is usually irrelevant for recognizing interactive actions. To this end, Entity Rearrangement is proposed to eliminate the orderliness in ISTs for interchangeable entities. Extensive experiments on four datasets verify the effectiveness of ISTA-Net, which outperforms state-of-the-art methods. Our code is publicly available at https://github.com/Necolizer/ISTA-Net
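The abstract describes three building blocks: a tokenizer that partitions skeleton sequences into Interactive Spatiotemporal Tokens, attention blocks integrated with 3D convolutions that learn jointly along the temporal, spatial, and entity dimensions, and Entity Rearrangement to remove dependence on entity order. The PyTorch sketch below illustrates how these pieces could fit together. It is not the authors' implementation (see the linked repository for that); all module names, tensor layouts, and window sizes here are illustrative assumptions.

```python
# Minimal sketch of the ISTA-Net ideas from the abstract, under assumed shapes:
# skeleton input x has layout (B, C, T, J, E) = batch, channels (e.g. xyz),
# frames, joints, entities. NOT the official code (github.com/Necolizer/ISTA-Net).
import torch
import torch.nn as nn

class ISTTokenizer(nn.Module):
    """Partition a multi-entity skeleton sequence into Interactive
    Spatiotemporal Tokens: a strided 3D convolution embeds each local
    (temporal, spatial, interactive) window into one token."""
    def __init__(self, in_channels=3, dim=64, window=(4, 5, 1)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, dim, kernel_size=window, stride=window)

    def forward(self, x):              # x: (B, C, T, J, E)
        return self.proj(x)            # tokens: (B, dim, T', J', E')

class ISTABlock(nn.Module):
    """Multi-head self-attention over the flattened tokens captures
    inter-token correlations; a 3D-conv feed-forward keeps the tokens'
    temporal/spatial/interactive structure."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Conv3d(dim, dim, kernel_size=3, padding=1)

    def forward(self, tokens):         # tokens: (B, dim, T', J', E')
        B, D, Tp, Jp, Ep = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)        # (B, N, dim), N = T'*J'*E'
        h = self.norm(seq)
        seq = seq + self.attn(h, h, h)[0]              # inter-token correlations
        vol = seq.transpose(1, 2).reshape(B, D, Tp, Jp, Ep)
        return vol + self.ffn(vol)                     # 3D-conv feed-forward

def entity_rearrangement(x):
    """Randomly permute the entity axis so the model cannot rely on a fixed
    entity order for interchangeable subjects (illustrative version)."""
    perm = torch.randperm(x.shape[-1])
    return x[..., perm]

# Usage: 2 clips, xyz coordinates, 64 frames, 25 joints, 2 interacting persons.
x = torch.randn(2, 3, 64, 25, 2)
out = ISTABlock()(ISTTokenizer()(entity_rearrangement(x)))
print(out.shape)                       # torch.Size([2, 64, 16, 5, 2])
```

Permuting the entity axis before tokenization mirrors the paper's observation that a strict entity ordering is irrelevant for recognizing interactions between interchangeable subjects.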

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Assembly101 | Actions Top-1 | 28.07 | ISTA-Net |
| Video | Assembly101 | Object Top-1 | 31.69 | ISTA-Net |
| Video | Assembly101 | Verbs Top-1 | 62.66 | ISTA-Net |
| Temporal Action Localization | Assembly101 | Actions Top-1 | 28.07 | ISTA-Net |
| Temporal Action Localization | Assembly101 | Object Top-1 | 31.69 | ISTA-Net |
| Temporal Action Localization | Assembly101 | Verbs Top-1 | 62.66 | ISTA-Net |
| Zero-Shot Learning | Assembly101 | Actions Top-1 | 28.07 | ISTA-Net |
| Zero-Shot Learning | Assembly101 | Object Top-1 | 31.69 | ISTA-Net |
| Zero-Shot Learning | Assembly101 | Verbs Top-1 | 62.66 | ISTA-Net |
| Activity Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 89.09 | ISTA-Net |
| Activity Recognition | Assembly101 | Actions Top-1 | 28.07 | ISTA-Net |
| Activity Recognition | Assembly101 | Object Top-1 | 31.69 | ISTA-Net |
| Activity Recognition | Assembly101 | Verbs Top-1 | 62.66 | ISTA-Net |
| Action Localization | Assembly101 | Actions Top-1 | 28.07 | ISTA-Net |
| Action Localization | Assembly101 | Object Top-1 | 31.69 | ISTA-Net |
| Action Localization | Assembly101 | Verbs Top-1 | 62.66 | ISTA-Net |
| Human Interaction Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 91.7 | ISTA-Net |
| Human Interaction Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.5 | ISTA-Net |
| 3D Action Recognition | Assembly101 | Actions Top-1 | 28.07 | ISTA-Net |
| 3D Action Recognition | Assembly101 | Object Top-1 | 31.69 | ISTA-Net |
| 3D Action Recognition | Assembly101 | Verbs Top-1 | 62.66 | ISTA-Net |
| Action Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 89.09 | ISTA-Net |
| Action Recognition | Assembly101 | Actions Top-1 | 28.07 | ISTA-Net |
| Action Recognition | Assembly101 | Object Top-1 | 31.69 | ISTA-Net |
| Action Recognition | Assembly101 | Verbs Top-1 | 62.66 | ISTA-Net |

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains · 2025-07-17
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment · 2025-07-01
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception · 2025-06-26
Feature Hallucination for Self-supervised Action Recognition · 2025-06-25
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition · 2025-06-25
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition · 2025-06-23
Adapting Vision-Language Models for Evaluating World Models · 2025-06-22
Active Multimodal Distillation for Few-shot Action Recognition · 2025-06-16