Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LLMs are Good Action Recognizers

Haoxuan Qu, Yujun Cai, Jun Liu

2024-03-31 · CVPR 2024
Tasks: Skeleton Based Action Recognition · Large Language Model · Action Recognition · Language Modelling
Paper · PDF

Abstract

Skeleton-based action recognition has attracted substantial research attention. Recently, a variety of works have been proposed to build accurate skeleton-based action recognizers. Among them, some use large model architectures as the backbones of their recognizers to boost skeleton data representation capability, while others pre-train their recognizers on external data to enrich their knowledge. In this work, we observe that large language models, which have been extensively used in various natural language processing tasks, generally possess both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process that projects each input action signal (i.e., each skeleton sequence) into its "sentence format" (i.e., an "action sentence"). Moreover, we incorporate several designs into our framework to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.
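The abstract's central idea, mapping a continuous skeleton sequence into a discrete "action sentence" an LLM can consume, can be illustrated with a minimal sketch. This is a hypothetical toy quantizer, not the paper's actual linguistic projection (which involves learned designs): it simply clips each normalized joint coordinate, bins it, and emits one pseudo-word token per value. The function name, token format, and binning scheme are all illustrative assumptions.

```python
def project_to_action_sentence(skeleton_seq, num_bins=64):
    """Toy 'linguistic projection': turn a skeleton sequence into discrete tokens.

    skeleton_seq: list of frames, each frame a list of (x, y, z) joint tuples
                  with coordinates assumed normalized to [-1, 1].
    Returns a flat list of pseudo-word tokens (the 'action sentence').
    """
    tokens = []
    for frame in skeleton_seq:
        for joint in frame:
            for v in joint:
                v = max(-1.0, min(1.0, v))                    # clip to [-1, 1]
                b = round((v + 1.0) / 2.0 * (num_bins - 1))   # quantize to a bin index
                tokens.append(f"tok_{b}")                     # one pseudo-word per value
    return tokens

# 2 frames, 3 joints each, all at the origin -> 18 identical tokens
seq = [[(0.0, 0.0, 0.0)] * 3] * 2
sentence = project_to_action_sentence(seq)
```

In the paper's actual framework the projection is learned rather than a fixed quantizer, but the sketch shows the interface: a (frames × joints × coordinates) signal becomes a token sequence that can be fed to a language model like any other sentence.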

Results

| Task                         | Dataset       | Metric                   | Value | Model     |
|------------------------------|---------------|--------------------------|-------|-----------|
| Video                        | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Video                        | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Temporal Action Localization | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Zero-Shot Learning           | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Zero-Shot Learning           | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Activity Recognition         | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Activity Recognition         | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Action Localization          | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Action Localization          | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Action Detection             | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Action Detection             | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| 3D Action Recognition        | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| 3D Action Recognition        | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |
| Action Recognition           | NTU RGB+D 120 | Accuracy (Cross-Setup)   | 91.5  | Lit-llama |
| Action Recognition           | NTU RGB+D 120 | Accuracy (Cross-Subject) | 88.7  | Lit-llama |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)