
Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Anindya Mondal, Sauradip Nag, Joaquin M Prada, Xiatian Zhu, Anjan Dutta

2023-07-20 · Action Classification · Animal Action Recognition · Zero-Shot Action Recognition · Action Recognition · Action Recognition In Videos

Paper · PDF · Code (official)

Abstract

Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition', which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to better represent the action classes. A key advantage is the elimination of actor-specific model designs, which removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms prior actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.
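The core idea, a DETR-style decoder whose object queries are replaced by semantic queries built from class-name text embeddings fused with a visual summary, can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the authors' implementation: all names, layer sizes, and the learned stand-in for a text encoder are assumptions. See the official repository above for the real MSQNet.

```python
import torch
import torch.nn as nn

class SemanticQueryDecoder(nn.Module):
    """Sketch of a DETR-style decoder whose queries come from class-name
    text embeddings fused with a global visual embedding, so one model
    serves any actor type with no actor-specific pose estimation."""

    def __init__(self, num_classes: int, dim: int = 256, num_layers: int = 2):
        super().__init__()
        # Stand-in for frozen text embeddings of the class names
        # (e.g., from a pretrained text encoder); here simply learned.
        self.text_embed = nn.Parameter(torch.randn(num_classes, dim))
        self.fuse = nn.Linear(2 * dim, dim)  # [text ; visual] -> query
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, 1)  # one logit per class query

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, dim) spatio-temporal features from a video backbone
        B = video_tokens.size(0)
        global_vis = video_tokens.mean(dim=1, keepdim=True)        # (B, 1, dim)
        text = self.text_embed.unsqueeze(0).expand(B, -1, -1)      # (B, C, dim)
        queries = self.fuse(torch.cat(
            [text, global_vis.expand(-1, text.size(1), -1)], dim=-1))
        decoded = self.decoder(tgt=queries, memory=video_tokens)   # (B, C, dim)
        return self.classifier(decoded).squeeze(-1)  # (B, C) multi-label logits

logits = SemanticQueryDecoder(num_classes=140)(torch.randn(2, 16, 256))
print(logits.shape)  # torch.Size([2, 140])
```

Because each query corresponds to one action class, the decoder output yields one logit per class, which is what makes multi-label prediction (several concurrent actions) fall out naturally.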

Results

Task | Dataset | Metric | Value | Model
Activity Recognition | Hockey | Accuracy | 3.05 | MSQNet
Activity Recognition | HMDB51 | Accuracy | 93.25 | MSQNet
Activity Recognition | Charades | mAP | 47.57 | MSQNet
Activity Recognition | THUMOS14 | Accuracy | 83.16 | MSQNet
Activity Recognition | Animal Kingdom | mAP | 73.1 | MSQNet
Action Recognition | Hockey | Accuracy | 3.05 | MSQNet
Action Recognition | HMDB51 | Accuracy | 93.25 | MSQNet
Action Recognition | Charades | mAP | 47.57 | MSQNet
Action Recognition | THUMOS14 | Accuracy | 83.16 | MSQNet
Action Recognition | Animal Kingdom | mAP | 73.1 | MSQNet
Zero-Shot Action Recognition | Charades | mAP | 35.59 | MSQNet
Zero-Shot Action Recognition | HMDB51 | Accuracy | 69.43 | MSQNet
Zero-Shot Action Recognition | THUMOS14 | Accuracy | 75.33 | MSQNet
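The Charades and Animal Kingdom rows report mAP (mean average precision), the standard multi-label metric: average precision is computed per class and then averaged over classes. A minimal sketch of that computation, assuming scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 5))   # binary multi-label targets
y_score = rng.random((100, 5))               # per-class prediction scores

# mAP = mean over classes of the per-class average precision.
per_class_ap = [average_precision_score(y_true[:, c], y_score[:, c])
                for c in range(y_true.shape[1])]
print(f"mAP = {np.mean(per_class_ap):.4f}")
```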

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)