
Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks

Muhammad Saif Ullah Khan, Muhammad Ferjad Naeem, Federico Tombari, Luc van Gool, Didier Stricker, Muhammad Zeshan Afzal

2024-03-11 · Age Classification · Activity Recognition · Emotion Recognition

Abstract

We present a novel LLM-based pipeline for creating contextual descriptions of human body poses in images using only auxiliary attributes. This approach facilitates the creation of the MPII Pose Descriptions dataset, which includes natural language annotations for 17,367 images containing people engaged in 410 distinct activities. We demonstrate the effectiveness of our pose descriptions in enabling zero-shot human-centric classification using CLIP. Moreover, we introduce the FocusCLIP framework, which incorporates Subject-Focused Attention (SFA) in CLIP for improved text-to-image alignment. Our models were pretrained on the MPII Pose Descriptions dataset and their zero-shot performance was evaluated on five unseen datasets covering three tasks. FocusCLIP outperformed the baseline CLIP model, achieving an average accuracy increase of 8.61% (33.65% compared to CLIP's 25.04%). Notably, our approach yielded improvements of 3.98% in activity recognition, 14.78% in age classification, and 7.06% in emotion recognition. These results highlight the potential of integrating detailed pose descriptions and subject-level guidance into general pretraining frameworks for enhanced performance in downstream tasks.
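
The zero-shot setup the abstract describes follows the standard CLIP recipe: encode the image and a set of class prompts, then pick the class whose text embedding is most similar to the image embedding. Below is a minimal sketch of that baseline using the Hugging Face CLIP API; the label names, prompt template, and image path are illustrative placeholders, not the LLM-generated pose descriptions or the SFA mechanism from the paper.

```python
# Sketch: zero-shot classification with a vanilla CLIP model (baseline setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical activity labels; the paper instead uses detailed pose descriptions as prompts.
labels = ["climbing", "riding a bike", "playing violin"]
prompts = [f"a photo of a person {label}" for label in labels]

image = Image.open("example.jpg")  # placeholder path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print("Predicted label:", labels[probs.argmax(dim=-1).item()])
```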

Results

Task | Dataset | Metric | Value | Model
Activity Recognition | Stanford40 | Top-3 Accuracy (%) | 10.47 | FocusCLIP
Activity Recognition | Stanford40 | Top-3 Accuracy (%) | 6.49 | CLIP
Emotion Recognition | EMOTIC | Top-3 Accuracy (%) | 13.73 | FocusCLIP
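
The metric reported above is top-3 accuracy: a prediction counts as correct if the ground-truth class appears among the three highest-scoring classes. A small sketch of that computation is shown below; the tensor shapes and variable names are illustrative, not taken from the paper's evaluation code.

```python
# Sketch: computing top-k accuracy from per-image class scores.
import torch

def top_k_accuracy(scores: torch.Tensor, targets: torch.Tensor, k: int = 3) -> float:
    """scores: (N, num_classes) similarity logits; targets: (N,) ground-truth class indices."""
    top_k = scores.topk(k, dim=-1).indices                # (N, k) highest-scoring classes per image
    hits = (top_k == targets.unsqueeze(-1)).any(dim=-1)   # True where the true class is in the top k
    return hits.float().mean().item()

# Example with 4 images and 5 classes (random scores for illustration).
scores = torch.randn(4, 5)
targets = torch.tensor([0, 2, 1, 4])
print(f"Top-3 accuracy: {top_k_accuracy(scores, targets, k=3):.2%}")
```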

Related Papers

Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation (2025-07-21)
Camera-based implicit mind reading by capturing higher-order semantic dynamics of human gaze within environmental context (2025-07-17)
ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs (2025-07-15)
A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition (2025-07-15)
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)
CAST-Phys: Contactless Affective States Through Physiological signals Database (2025-07-08)
Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis (2025-07-06)
SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network (2025-06-25)