Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LLMs are Good Action Recognizers

Haoxuan Qu, Yujun Cai, Jun Liu

2024-03-31 · CVPR 2024
Tasks: Skeleton Based Action Recognition · Large Language Model · Action Recognition · Language Modelling
Paper · PDF

Abstract

Skeleton-based action recognition has attracted substantial research attention. Recently, a variety of works have been proposed to build accurate skeleton-based action recognizers. Among them, some use large model architectures as the backbones of their recognizers to boost skeleton data representation capability, while others pre-train their recognizers on external data to enrich their knowledge. In this work, we observe that large language models, which have been extensively used in various natural language processing tasks, generally possess both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process that projects each input action signal (i.e., each skeleton sequence) into its "sentence format" (i.e., an "action sentence"). Moreover, we incorporate several designs into our framework to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.
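The abstract's central idea, mapping a continuous skeleton sequence into a discrete "action sentence" an LLM can consume, can be illustrated with a minimal sketch. This is a hypothetical toy quantizer, not the paper's actual linguistic projection (which involves learned designs): it simply clips each normalized joint coordinate, bins it, and emits one pseudo-word token per value. The function name, token format, and binning scheme are all illustrative assumptions.

```python
def project_to_action_sentence(skeleton_seq, num_bins=64):
    """Toy 'linguistic projection': turn a skeleton sequence into discrete tokens.

    skeleton_seq: list of frames, each frame a list of (x, y, z) joint tuples
                  with coordinates assumed normalized to [-1, 1].
    Returns a flat list of pseudo-word tokens (the 'action sentence').
    """
    tokens = []
    for frame in skeleton_seq:
        for joint in frame:
            for v in joint:
                v = max(-1.0, min(1.0, v))                    # clip to [-1, 1]
                b = round((v + 1.0) / 2.0 * (num_bins - 1))   # quantize to a bin index
                tokens.append(f"tok_{b}")                     # one pseudo-word per value
    return tokens

# 2 frames, 3 joints each, all at the origin -> 18 identical tokens
seq = [[(0.0, 0.0, 0.0)] * 3] * 2
sentence = project_to_action_sentence(seq)
```

In the paper's actual framework the projection is learned rather than a fixed quantizer, but the sketch shows the interface: a (frames × joints × coordinates) signal becomes a token sequence that can be fed to a language model like any other sentence.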

Results

| Task                         | Dataset       | Metric                   | Value | Model     |
|------------------------------|---------------|--------------------------|-------|-----------|
| Video                        | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Video                        | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Zero-Shot Learning           | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Zero-Shot Learning           | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Activity Recognition         | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Activity Recognition         | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Action Localization          | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Action Localization          | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Action Detection             | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Action Detection             | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| 3D Action Recognition        | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| 3D Action Recognition        | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Action Recognition           | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Action Recognition           | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)