
SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition

Xijun Wang, Ruiqi Xian, Tianrui Guan, Fuxiao Liu, Dinesh Manocha

2023-05-21 · Optical Flow Estimation · Action Recognition · Temporal Action Localization
Paper · PDF

Abstract

We present a new learning approach, Soft Conditional Prompt Learning (SCP), which leverages the strengths of prompt learning for aerial video action recognition. Our approach is designed to predict the action of each agent by helping the model focus on the descriptions or instructions associated with actions in the input videos, for aerial/robot visual perception. Our formulation supports various prompts, including learnable prompts, auxiliary visual information, and large vision models, to improve recognition performance. We present a soft conditional prompt method that learns to dynamically generate prompts from a pool of prompt experts under different video inputs. By sharing the same objective with the task, SCP optimizes prompts that guide the model's predictions while explicitly learning input-invariant (prompt-expert pool) and input-specific (data-dependent) prompt knowledge. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial video datasets Okutama and NECDrone, which consist of scenes with single-agent and multi-agent actions. To verify effectiveness and generalization, we further evaluate our approach on ground-camera videos and achieve a 1.0-3.6% improvement on the SSV2 dataset. We also integrate our method into ROS2.
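The abstract describes the core mechanism: a prompt is generated dynamically as a soft, input-conditioned mixture over a pool of learnable prompt experts (the input-invariant part), combined with a component projected directly from the video features (the input-specific part). Below is a minimal PyTorch sketch of that idea. This is not the authors' code; the module names, dimensions, routing scheme, and the pooled-feature input are all illustrative assumptions.

# Minimal sketch of soft conditional prompting (assumptions noted above).
import torch
import torch.nn as nn

class SoftConditionalPrompt(nn.Module):
    """Generates prompt tokens as a soft mixture of learnable experts,
    plus an input-specific component projected from video features."""
    def __init__(self, num_experts=8, prompt_len=16, dim=768, feat_dim=768):
        super().__init__()
        # Input-invariant knowledge: a pool of learnable prompt experts.
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_len, dim) * 0.02)
        # Router: maps pooled video features to soft weights over experts.
        self.router = nn.Linear(feat_dim, num_experts)
        # Input-specific knowledge: a data-dependent prompt from features.
        self.data_proj = nn.Linear(feat_dim, prompt_len * dim)
        self.prompt_len, self.dim = prompt_len, dim

    def forward(self, video_feat):                 # video_feat: (B, feat_dim)
        weights = self.router(video_feat).softmax(dim=-1)           # (B, E)
        mixed = torch.einsum('be,eld->bld', weights, self.experts)  # (B, L, D)
        specific = self.data_proj(video_feat).view(-1, self.prompt_len, self.dim)
        return mixed + specific                    # prompt tokens for the backbone

# Usage: prepend the generated prompt tokens to the video token sequence.
scp = SoftConditionalPrompt()
feat = torch.randn(4, 768)                         # e.g. pooled clip features
prompt = scp(feat)                                 # (4, 16, 768)

Because the router and experts are trained with the same task objective as the recognition model, the mixture weights can specialize per input while the expert pool accumulates knowledge shared across inputs, which is the split the abstract calls input-specific vs. input-invariant.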

Results

Task                 | Dataset                | Metric         | Value | Model
---------------------|------------------------|----------------|-------|--------------------------
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 67.3  | PLAR
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91    | PLAR
Activity Recognition | Okutama-Action         | Accuracy       | 75.93 | PLAR with bbox (Ours)
Activity Recognition | Okutama-Action         | Accuracy       | 71.54 | PLAR without bbox (Ours)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 67.3  | PLAR
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 91    | PLAR
Action Recognition   | Okutama-Action         | Accuracy       | 75.93 | PLAR with bbox (Ours)
Action Recognition   | Okutama-Action         | Accuracy       | 71.54 | PLAR without bbox (Ours)
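The abstract also states that the method is integrated into ROS2. The paper page gives no details, so the following rclpy node is purely a hypothetical sketch of what such an integration might look like: the topic names, clip length, and the run_scp inference helper are all assumptions, not the authors' interface.

# Hypothetical ROS2 node wrapping an action-recognition model (all names assumed).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String

class SCPActionNode(Node):
    def __init__(self):
        super().__init__('scp_action_recognition')
        # Subscribe to the drone camera stream (topic name assumed).
        self.sub = self.create_subscription(Image, '/camera/image_raw', self.on_frame, 10)
        self.pub = self.create_publisher(String, '/scp/action', 10)
        self.frames = []

    def on_frame(self, msg: Image):
        self.frames.append(msg)
        if len(self.frames) >= 16:               # clip length is an assumption
            label = self.run_scp(self.frames)    # placeholder for model inference
            self.pub.publish(String(data=label))
            self.frames.clear()

    def run_scp(self, frames):
        # Stand-in for the actual SCP model forward pass.
        return 'unknown'

def main():
    rclpy.init()
    rclpy.spin(SCPActionNode())

if __name__ == '__main__':
    main()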

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan (2025-07-11)
Learning to Track Any Points from Human Motion (2025-07-08)
TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (2025-07-07)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation (2025-06-29)