Yuhang Wen, Zixuan Tang, Yunsheng Pang, Beichen Ding, Mengyuan Liu
Recognizing interactive actions plays an important role in human-robot interaction and collaboration. Previous methods rely on late fusion or co-attention mechanisms to capture interactive relations, which either limits their learning capability or makes them inefficient to adapt to a larger number of interacting entities. Because they assume that priors for each entity are known in advance, they also lack evaluation in a more general setting that addresses the diversity of subjects. To address these problems, we propose the Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations. Specifically, our network contains a tokenizer that partitions Interactive Spatiotemporal Tokens (ISTs), a unified way to represent the motions of multiple diverse entities. By extending the entity dimension, ISTs provide better interactive representations. To jointly learn along the three dimensions of ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations. When modeling these correlations, a strict entity ordering is usually irrelevant for recognizing interactive actions. To this end, Entity Rearrangement is proposed to eliminate the ordering of interchangeable entities in ISTs. Extensive experiments on four datasets verify the effectiveness of ISTA-Net, which outperforms state-of-the-art methods. Our code is publicly available at https://github.com/Necolizer/ISTA-Net
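The core ideas in the abstract — partitioning a multi-entity skeleton sequence into interactive spatiotemporal tokens, applying self-attention across tokens, and randomly permuting interchangeable entities (Entity Rearrangement) — can be sketched in NumPy. All function names, tensor shapes, and the single-head attention below are illustrative assumptions for exposition, not the authors' implementation (which uses multi-head attention with 3D convolutions; see the repository for details).

```python
import numpy as np

def partition_ists(seq, window):
    """Split a skeleton sequence of shape (T, E, J, C) — frames, entities,
    joints, channels — into tokens of shape (T // window, window, E, J, C)."""
    T, E, J, C = seq.shape
    n = T // window
    return seq[: n * window].reshape(n, window, E, J, C)

def entity_rearrangement(tokens, rng):
    """Randomly permute the entity axis. Interactive action labels are
    invariant to entity order, so this removes any fixed ordering."""
    perm = rng.permutation(tokens.shape[2])
    return tokens[:, :, perm]

def self_attention(x):
    """Single-head scaled dot-product self-attention over flattened token
    embeddings x of shape (N, D); returns the attended features, same shape."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = rng.standard_normal((64, 2, 25, 3))      # 64 frames, 2 people, 25 joints, xyz
    tokens = partition_ists(seq, window=16)        # -> (4, 16, 2, 25, 3)
    tokens = entity_rearrangement(tokens, rng)     # order-invariance augmentation
    out = self_attention(tokens.reshape(4, -1))    # attend across the 4 tokens
    print(out.shape)
```

Because entity rearrangement only permutes an axis, the token contents are preserved exactly; only their order along the entity dimension changes, which is why it is safe for order-invariant interaction labels.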
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | Assembly101 | Actions Top-1 | 28.07 | ISTA-Net |
| Action Recognition | Assembly101 | Objects Top-1 | 31.69 | ISTA-Net |
| Action Recognition | Assembly101 | Verbs Top-1 | 62.66 | ISTA-Net |
| Action Recognition | H2O (2 Hands and Objects) | Actions Top-1 | 89.09 | ISTA-Net |
| Human Interaction Recognition | NTU RGB+D 120 | Accuracy (Cross-Subject) | 90.5 | ISTA-Net |
| Human Interaction Recognition | NTU RGB+D 120 | Accuracy (Cross-Setup) | 91.7 | ISTA-Net |