
Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living

Dominick Reilly, Srijan Das

2023-11-30 · Action Classification · Action Recognition
Paper · PDF · Code (official)

Abstract

Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that augmenting RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, the 2D Skeleton Induction Module and the 3D Skeleton Induction Module, which are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows $\pi$-ViT to discard the modules during inference. Notably, $\pi$-ViT achieves state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.
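The key design point in the abstract, plug-in pose heads that shape the RGB representation during training and are discarded at inference, can be illustrated with a short PyTorch-style sketch. This is a minimal illustration and not the authors' implementation: the names `PiViTSketch` and `SkeletonInductionHead`, the MSE regression losses standing in for the paper's pose-aware auxiliary tasks, and all tensor sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonInductionHead(nn.Module):
    """Hypothetical plug-in head: regresses a pose-derived target from RGB tokens."""
    def __init__(self, dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, target_dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Average-pool the token sequence, then predict the pose-aware target.
        return self.proj(tokens.mean(dim=1))

class PiViTSketch(nn.Module):
    """RGB action classifier plus two discardable pose-induction heads (training only)."""
    def __init__(self, backbone: nn.Module, dim: int, num_classes: int,
                 pose2d_dim: int, pose3d_dim: int):
        super().__init__()
        self.backbone = backbone          # any video transformer returning (B, N, dim) tokens
        self.classifier = nn.Linear(dim, num_classes)
        self.sim2d = SkeletonInductionHead(dim, pose2d_dim)  # stand-in for the 2D-SIM
        self.sim3d = SkeletonInductionHead(dim, pose3d_dim)  # stand-in for the 3D-SIM

    def forward(self, video, pose2d_target=None, pose3d_target=None):
        tokens = self.backbone(video)     # (B, N, dim) RGB token sequence
        logits = self.classifier(tokens.mean(dim=1))
        if self.training and pose2d_target is not None and pose3d_target is not None:
            # Pose-aware auxiliary losses shape the RGB representation during
            # training; at inference the heads are simply never called.
            aux = (F.mse_loss(self.sim2d(tokens), pose2d_target)
                   + F.mse_loss(self.sim3d(tokens), pose3d_target))
            return logits, aux
        return logits

# Toy usage with pre-tokenized input standing in for a real video backbone.
model = PiViTSketch(nn.Identity(), dim=256, num_classes=31,
                    pose2d_dim=34, pose3d_dim=51)  # 17 joints x 2D / x 3D; illustrative sizes
tokens = torch.randn(2, 196, 256)
model.train()
logits, aux_loss = model(tokens, torch.randn(2, 34), torch.randn(2, 51))
model.eval()
logits = model(tokens)                   # no pose input needed at inference
```

Because the auxiliary heads only contribute loss terms, removing them leaves the inference graph identical to a plain RGB video transformer, which is consistent with the claim that no poses or extra compute are required at test time.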

Results

Task                 | Dataset                  | Metric                   | Value | Model
---------------------|--------------------------|--------------------------|-------|-------------------
Video                | Toyota Smarthome dataset | CS                       | 72.9  | π-ViT
Video                | Toyota Smarthome dataset | CV1                      | 55.2  | π-ViT
Video                | Toyota Smarthome dataset | CV2                      | 64.8  | π-ViT
Activity Recognition | NTU RGB+D                | Accuracy (CS)            | 96.3  | π-ViT (RGB + Pose)
Activity Recognition | NTU RGB+D                | Accuracy (CV)            | 99    | π-ViT (RGB + Pose)
Activity Recognition | NTU RGB+D                | Accuracy (CS)            | 94    | π-ViT (RGB only)
Activity Recognition | NTU RGB+D                | Accuracy (CV)            | 97.9  | π-ViT (RGB only)
Activity Recognition | NTU RGB+D 120            | Accuracy (Cross-Setup)   | 96.1  | π-ViT (RGB + Pose)
Activity Recognition | NTU RGB+D 120            | Accuracy (Cross-Subject) | 95.1  | π-ViT (RGB + Pose)
Activity Recognition | NTU RGB+D 120            | Accuracy (Cross-Setup)   | 91.9  | π-ViT (RGB only)
Activity Recognition | NTU RGB+D 120            | Accuracy (Cross-Subject) | 92.9  | π-ViT (RGB only)
Action Recognition   | NTU RGB+D                | Accuracy (CS)            | 96.3  | π-ViT (RGB + Pose)
Action Recognition   | NTU RGB+D                | Accuracy (CV)            | 99    | π-ViT (RGB + Pose)
Action Recognition   | NTU RGB+D                | Accuracy (CS)            | 94    | π-ViT (RGB only)
Action Recognition   | NTU RGB+D                | Accuracy (CV)            | 97.9  | π-ViT (RGB only)
Action Recognition   | NTU RGB+D 120            | Accuracy (Cross-Setup)   | 96.1  | π-ViT (RGB + Pose)
Action Recognition   | NTU RGB+D 120            | Accuracy (Cross-Subject) | 95.1  | π-ViT (RGB + Pose)
Action Recognition   | NTU RGB+D 120            | Accuracy (Cross-Setup)   | 91.9  | π-ViT (RGB only)
Action Recognition   | NTU RGB+D 120            | Accuracy (Cross-Subject) | 92.9  | π-ViT (RGB only)

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
- Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
- CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
- Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
- Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
- Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)