
SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition

Xijun Wang, Ruiqi Xian, Tianrui Guan, Fuxiao Liu, Dinesh Manocha

2023-05-21 · Optical Flow Estimation · Action Recognition · Temporal Action Localization
Paper · PDF

Abstract

We present a new learning approach, Soft Conditional Prompt Learning (SCP), which leverages the strengths of prompt learning for aerial video action recognition. Our approach is designed to predict the action of each agent by helping the model focus on the descriptions or instructions associated with actions in the input videos, for aerial/robot visual perception. Our formulation supports various prompts, including learnable prompts, auxiliary visual information, and large vision models, to improve recognition performance. We present a soft conditional prompt method that learns to dynamically generate prompts from a pool of prompt experts under different video inputs. By sharing the same objective with the task, SCP optimizes prompts that guide the model's predictions while explicitly learning input-invariant (prompt-expert pool) and input-specific (data-dependent) prompt knowledge. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial video datasets Okutama and NECDrone, which consist of scenes with single-agent and multi-agent actions. To verify effectiveness and generalization, we further evaluate our approach on ground-camera videos and achieve a 1.0-3.6% improvement on the SSV2 dataset. We also integrate our method into ROS2.
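The abstract describes the core mechanism: a prompt is generated dynamically as a soft, input-conditioned mixture over a pool of learnable prompt experts (the input-invariant part), combined with a component projected directly from the video features (the input-specific part). Below is a minimal PyTorch sketch of that idea. This is not the authors' code; the module names, dimensions, routing scheme, and the pooled-feature input are all illustrative assumptions.

# Minimal sketch of soft conditional prompting (assumptions noted above).
import torch
import torch.nn as nn

class SoftConditionalPrompt(nn.Module):
    """Generates prompt tokens as a soft mixture of learnable experts,
    plus an input-specific component projected from video features."""
    def __init__(self, num_experts=8, prompt_len=16, dim=768, feat_dim=768):
        super().__init__()
        # Input-invariant knowledge: a pool of learnable prompt experts.
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_len, dim) * 0.02)
        # Router: maps pooled video features to soft weights over experts.
        self.router = nn.Linear(feat_dim, num_experts)
        # Input-specific knowledge: a data-dependent prompt from features.
        self.data_proj = nn.Linear(feat_dim, prompt_len * dim)
        self.prompt_len, self.dim = prompt_len, dim

    def forward(self, video_feat):                 # video_feat: (B, feat_dim)
        weights = self.router(video_feat).softmax(dim=-1)           # (B, E)
        mixed = torch.einsum('be,eld->bld', weights, self.experts)  # (B, L, D)
        specific = self.data_proj(video_feat).view(-1, self.prompt_len, self.dim)
        return mixed + specific                    # prompt tokens for the backbone

# Usage: prepend the generated prompt tokens to the video token sequence.
scp = SoftConditionalPrompt()
feat = torch.randn(4, 768)                         # e.g. pooled clip features
prompt = scp(feat)                                 # (4, 16, 768)

Because the router and experts are trained with the same task objective as the recognition model, the mixture weights can specialize per input while the expert pool accumulates knowledge shared across inputs, which is the split the abstract calls input-specific vs. input-invariant.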

Results

Task                 | Dataset                | Metric         | Value | Model
---------------------|------------------------|----------------|-------|--------------------------
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.3  | PLAR
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91    | PLAR
Activity Recognition | Okutama-Action         | Accuracy       | 75.93 | PLAR with bbox (Ours)
Activity Recognition | Okutama-Action         | Accuracy       | 71.54 | PLAR without bbox (Ours)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 67.3  | PLAR
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 91    | PLAR
Action Recognition   | Okutama-Action         | Accuracy       | 75.93 | PLAR with bbox (Ours)
Action Recognition   | Okutama-Action         | Accuracy       | 71.54 | PLAR without bbox (Ours)
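The abstract also states that the method is integrated into ROS2. The paper page gives no details, so the following rclpy node is purely a hypothetical sketch of what such an integration might look like: the topic names, clip length, and the run_scp inference helper are all assumptions, not the authors' interface.

# Hypothetical ROS2 node wrapping an action-recognition model (all names assumed).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String

class SCPActionNode(Node):
    def __init__(self):
        super().__init__('scp_action_recognition')
        # Subscribe to the drone camera stream (topic name assumed).
        self.sub = self.create_subscription(Image, '/camera/image_raw', self.on_frame, 10)
        self.pub = self.create_publisher(String, '/scp/action', 10)
        self.frames = []

    def on_frame(self, msg: Image):
        self.frames.append(msg)
        if len(self.frames) >= 16:               # clip length is an assumption
            label = self.run_scp(self.frames)    # placeholder for model inference
            self.pub.publish(String(data=label))
            self.frames.clear()

    def run_scp(self, frames):
        # Stand-in for the actual SCP model forward pass.
        return 'unknown'

def main():
    rclpy.init()
    rclpy.spin(SCPActionNode())

if __name__ == '__main__':
    main()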

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan (2025-07-11)
Learning to Track Any Points from Human Motion (2025-07-08)
TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (2025-07-07)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation (2025-06-29)