Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling

Ryo Hachiuma, Fumiaki Sato, Taiki Sekii

2023-03-27, CVPR 2023

Tasks: Action Localization, Skeleton Based Action Recognition, Data Augmentation, Violence and Weaponized Violence Detection, Spatio-Temporal Action Localization, Weakly-supervised Temporal Action Localization, Video Classification, Action Recognition, Temporal Action Localization, Activity Recognition


Abstract

This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition: skeleton detection and tracking errors, a limited variety of targeted actions, and person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to action recognition, and a unified framework along with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure (which is inherent in skeletons), such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained and tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud and extends the variety of targeted actions. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. This trick also enables our training scheme to naturally introduce a novel data augmentation, which mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against these limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.
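The idea of pooling keypoint features by the frame and instance each keypoint belongs to, and of switching the pooling kernel between training and inference, can be illustrated with a minimal sketch. This is not the authors' architecture: the learned layers between pooling stages are omitted, and the names `max_pool`, `grouped_max_pool`, `KERNELS`, and the toy feature values are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def max_pool(vectors):
    """Element-wise max over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def grouped_max_pool(points, key):
    """Max-pool keypoint features within each group defined by key(point)."""
    groups = defaultdict(list)
    for point in points:
        groups[key(point)].append(point["feat"])
    return {group: max_pool(feats) for group, feats in groups.items()}


# Hypothetical pooling kernels: each maps a keypoint to the group it is
# pooled into. "instance" groups are per frame, so no tracking is needed.
KERNELS = {
    "video": lambda p: 0,
    "frame": lambda p: p["frame"],
    "instance": lambda p: (p["frame"], p["instance"]),
}

# Toy point cloud (assumed values): every keypoint carries its frame index,
# its within-frame instance index, and a 2-D feature vector.
points = [
    {"frame": 0, "instance": 0, "feat": [0.1, 0.9]},
    {"frame": 0, "instance": 0, "feat": [0.4, 0.2]},
    {"frame": 0, "instance": 1, "feat": [0.7, 0.3]},
    {"frame": 1, "instance": 0, "feat": [0.2, 0.5]},
    {"frame": 1, "instance": 1, "feat": [0.8, 0.1]},
]

# Training-style pooling: one feature for the whole video, which is all
# that a video-level label can supervise.
video_level = grouped_max_pool(points, KERNELS["video"])[0]

# Inference-style pooling: switch to a narrower kernel to obtain
# frame-wise or person-wise features from the same keypoint features.
frame_wise = grouped_max_pool(points, KERNELS["frame"])
person_wise = grouped_max_pool(points, KERNELS["instance"])

print(video_level)  # [0.8, 0.9]
print(frame_wise)   # {0: [0.7, 0.9], 1: [0.8, 0.5]}
```

Because the grouping depends only on the indices attached to each keypoint, a missing or spurious keypoint changes one group's input set rather than the shape of the input, which is one plausible reading of why such pooling tolerates detection errors.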

Results

Task	Dataset	Metric	Value (%)	Model
Video	UCF101-24	mAP@0.2	61.8	Structured Keypoint Pooling
Video	Kinetics-Skeleton dataset	Accuracy	52.3	Structured Keypoint Pooling (PPNv2 skeletons+objects)
Video	Kinetics-Skeleton dataset	Accuracy	50.3	Structured Keypoint Pooling (HRNet skeletons)
Video	Kinetics-Skeleton dataset	Accuracy	43.1	Structured Keypoint Pooling (PPNv2 skeletons)
Video	UCF101	Accuracy	87.8	Structured Keypoint Pooling
Video	HMDB51	Accuracy	70.9	Structured Keypoint Pooling
Video	Hockey Fight Detection Dataset	Accuracy	99.5	Structured Keypoint Pooling
Temporal Action Localization	UCF101-24	mAP@0.2	61.8	Structured Keypoint Pooling
Temporal Action Localization	Kinetics-Skeleton dataset	Accuracy	52.3	Structured Keypoint Pooling (PPNv2 skeletons+objects)
Temporal Action Localization	Kinetics-Skeleton dataset	Accuracy	50.3	Structured Keypoint Pooling (HRNet skeletons)
Temporal Action Localization	Kinetics-Skeleton dataset	Accuracy	43.1	Structured Keypoint Pooling (PPNv2 skeletons)
Temporal Action Localization	UCF101	Accuracy	87.8	Structured Keypoint Pooling
Temporal Action Localization	HMDB51	Accuracy	70.9	Structured Keypoint Pooling
Zero-Shot Learning	UCF101-24	mAP@0.2	61.8	Structured Keypoint Pooling
Zero-Shot Learning	Kinetics-Skeleton dataset	Accuracy	52.3	Structured Keypoint Pooling (PPNv2 skeletons+objects)
Zero-Shot Learning	Kinetics-Skeleton dataset	Accuracy	50.3	Structured Keypoint Pooling (HRNet skeletons)
Zero-Shot Learning	Kinetics-Skeleton dataset	Accuracy	43.1	Structured Keypoint Pooling (PPNv2 skeletons)
Zero-Shot Learning	UCF101	Accuracy	87.8	Structured Keypoint Pooling
Zero-Shot Learning	HMDB51	Accuracy	70.9	Structured Keypoint Pooling
Activity Recognition	RWF-2000	Accuracy	93.4	Structured Keypoint Pooling
Activity Recognition	Skeleton-Mimetics	Accuracy	21.2	Structured Keypoint Pooling
Activity Recognition	Kinetics-Skeleton dataset	Accuracy	52.3	Structured Keypoint Pooling (PPNv2 skeletons+objects)
Activity Recognition	Kinetics-Skeleton dataset	Accuracy	50.3	Structured Keypoint Pooling (HRNet skeletons)
Activity Recognition	Kinetics-Skeleton dataset	Accuracy	43.1	Structured Keypoint Pooling (PPNv2 skeletons)
Activity Recognition	UCF101	Accuracy	87.8	Structured Keypoint Pooling
Activity Recognition	HMDB51	Accuracy	70.9	Structured Keypoint Pooling
Action Localization	UCF101-24	mAP@0.2	61.8	Structured Keypoint Pooling
Action Localization	Kinetics-Skeleton dataset	Accuracy	52.3	Structured Keypoint Pooling (PPNv2 skeletons+objects)
Action Localization	Kinetics-Skeleton dataset	Accuracy	50.3	Structured Keypoint Pooling (HRNet skeletons)
Action Localization	Kinetics-Skeleton dataset	Accuracy	43.1	Structured Keypoint Pooling (PPNv2 skeletons)
Action Localization	UCF101	Accuracy	87.8	Structured Keypoint Pooling
Action Localization	HMDB51	Accuracy	70.9	Structured Keypoint Pooling
Action Detection	Kinetics-Skeleton dataset	Accuracy	52.3	Structured Keypoint Pooling (PPNv2 skeletons+objects)
Action Detection	Kinetics-Skeleton dataset	Accuracy	50.3	Structured Keypoint Pooling (HRNet skeletons)
Action Detection	Kinetics-Skeleton dataset	Accuracy	43.1	Structured Keypoint Pooling (PPNv2 skeletons)
Action Detection	UCF101	Accuracy	87.8	Structured Keypoint Pooling
Action Detection	HMDB51	Accuracy	70.9	Structured Keypoint Pooling
3D Action Recognition	Kinetics-Skeleton dataset	Accuracy	52.3	Structured Keypoint Pooling (PPNv2 skeletons+objects)
3D Action Recognition	Kinetics-Skeleton dataset	Accuracy	50.3	Structured Keypoint Pooling (HRNet skeletons)
3D Action Recognition	Kinetics-Skeleton dataset	Accuracy	43.1	Structured Keypoint Pooling (PPNv2 skeletons)
3D Action Recognition	UCF101	Accuracy	87.8	Structured Keypoint Pooling
3D Action Recognition	HMDB51	Accuracy	70.9	Structured Keypoint Pooling
Action Recognition	Skeleton-Mimetics	Accuracy	21.2	Structured Keypoint Pooling
Action Recognition	Kinetics-Skeleton dataset	Accuracy	52.3	Structured Keypoint Pooling (PPNv2 skeletons+objects)
Action Recognition	Kinetics-Skeleton dataset	Accuracy	50.3	Structured Keypoint Pooling (HRNet skeletons)
Action Recognition	Kinetics-Skeleton dataset	Accuracy	43.1	Structured Keypoint Pooling (PPNv2 skeletons)
Action Recognition	UCF101	Accuracy	87.8	Structured Keypoint Pooling
Action Recognition	HMDB51	Accuracy	70.9	Structured Keypoint Pooling
Video Classification	Hockey Fight Detection Dataset	Accuracy	99.5	Structured Keypoint Pooling
Weakly-supervised Temporal Action Localization	UCF101-24	mAP@0.2	61.8	Structured Keypoint Pooling
