COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

Honglu Zhou, Asim Kadav, Aviv Shamsian, Shijie Geng, Farley Lai, Long Zhao, Ting Liu, Mubbasir Kapadia, Hans Peter Graf

2021-12-11
Group Activity Recognition, Relational Reasoning, Activity Recognition

Abstract

Group Activity Recognition detects the activity collectively performed by a group of actors, which requires compositional reasoning over actors and objects. We approach the task by modeling the video as tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer-based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. Prior works also suffer from scene biases, which raise privacy and ethical concerns. We use only the keypoint modality, which reduces scene biases and avoids acquiring detailed visual data that may contain private or biased information about users. We improve the multiscale representations in COMPOSER by clustering the intermediate-scale representations while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and data augmentations tailored to the keypoint signals to aid model training. We demonstrate the model's strength and interpretability on two widely used datasets (Volleyball and Collective Activity). COMPOSER achieves up to +5.4% improvement with just the keypoint modality. Code is available at https://github.com/hongluzhou/composer
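The abstract mentions data augmentations tailored to keypoint signals but does not list them. As a minimal sketch of what such an augmentation can look like, the example below mirrors one actor's 2D pose horizontally. It is an illustration only, not the authors' implementation: the function name `horizontal_flip`, the 17-joint COCO ordering, and the continuous-coordinate mirroring `x -> frame_width - x` are all assumptions.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float]  # (x, y) in frame coordinates

# Assumption: COCO 17-joint ordering, where joints 1..16 come in
# (left, right) pairs: eyes, ears, shoulders, elbows, wrists, hips,
# knees, ankles. Joint 0 (nose) has no counterpart.
LEFT_RIGHT_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8),
                    (9, 10), (11, 12), (13, 14), (15, 16)]


def horizontal_flip(pose: List[Keypoint], frame_width: float) -> List[Keypoint]:
    """Mirror one actor's pose about the vertical centre line of the frame.

    Mirroring x alone would turn the actor's left wrist into a point that is
    still stored in the "left wrist" slot, so the left/right joint slots are
    swapped after the coordinates are mirrored.
    """
    flipped = [(frame_width - x, y) for x, y in pose]
    for left, right in LEFT_RIGHT_PAIRS:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped


if __name__ == "__main__":
    # Hypothetical pose: joint i sits at (i, 2 * i) in a 100-unit-wide frame.
    pose = [(float(i), float(2 * i)) for i in range(17)]
    print(horizontal_flip(pose, frame_width=100.0)[:3])
```

If the activity labels encode a side of the scene (for example left-team versus right-team classes), the label would need to be swapped together with the pose; that step is omitted here.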

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	Collective Activity	Accuracy	96.2	COMPOSER
Activity Recognition	Volleyball	Accuracy	94.69	COMPOSER
