Temporal-Relational CrossTransformers for Few-Shot Action Recognition

Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen

2021-01-15 | CVPR 2021 | Few Shot Action Recognition, Action Recognition

Abstract

We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
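
The two ideas in the abstract, ordered frame tuples and query-specific class prototypes built by attending over all support videos, can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists as stand-in frame features, concatenates the frames of each tuple, and applies unparameterised scaled dot-product attention, so it leaves out the learned components a trained model would have. All function names (`ordered_tuples`, `tuple_features`, `query_specific_prototype`, `class_distance`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def ordered_tuples(num_frames, cardinality):
    """All index tuples i1 < i2 < ... of the given size, so temporal order is kept."""
    return list(combinations(range(num_frames), cardinality))


def tuple_features(frames, cardinality):
    """Represent each ordered tuple by concatenating its frame feature vectors."""
    feats = []
    for idx in ordered_tuples(len(frames), cardinality):
        vec = []
        for i in idx:
            vec.extend(frames[i])
        feats.append(vec)
    return feats


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_specific_prototype(query_vec, support_vecs):
    """Attention-weighted sum of support tuple vectors for one query tuple.

    Assumption: raw features act as queries, keys and values (no learned maps).
    """
    d = len(query_vec)
    scores = [sum(q * s for q, s in zip(query_vec, sv)) / math.sqrt(d)
              for sv in support_vecs]
    weights = softmax(scores)
    return [sum(w * sv[k] for w, sv in zip(weights, support_vecs))
            for k in range(d)]


def class_distance(query_frames, support_videos, cardinality=2):
    """Mean squared distance between query tuples and their class prototypes.

    Tuples from every support video of the class are pooled, so each query
    tuple can attend to sub-sequences of all support videos at once.
    """
    q_tuples = tuple_features(query_frames, cardinality)
    s_tuples = [t for video in support_videos
                for t in tuple_features(video, cardinality)]
    total = 0.0
    for q in q_tuples:
        proto = query_specific_prototype(q, s_tuples)
        total += sum((a - b) ** 2 for a, b in zip(q, proto))
    return total / len(q_tuples)


# Toy data: three "frames" per video, each a 3-D feature vector.
forward = [[3, 0, 0], [0, 3, 0], [0, 0, 3]]
backward = list(reversed(forward))  # same frames, opposite temporal order

support_set = {"forward": [forward, forward], "backward": [backward]}
distances = {name: class_distance(forward, videos)
             for name, videos in support_set.items()}
predicted = min(distances, key=distances.get)
print(predicted, distances)
```

In this toy setup the two classes contain exactly the same frames and differ only in their order, so a representation that averaged frames could not separate them, whereas the ordered-tuple comparison can. Passing a different `cardinality` (for example 3) compares longer sub-sequences; combining several cardinalities would be one way to approximate the "varying numbers of frames" described above.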

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	HMDB51	1:1 Accuracy	75.6	TRX
Activity Recognition	Kinetics-100	Accuracy	85.9	TRX
Activity Recognition	UCF101	1:1 Accuracy	96.1	TRX
Activity Recognition	Something-Something-100	1:1 Accuracy	64.6	TRX
Action Recognition	HMDB51	1:1 Accuracy	75.6	TRX
Action Recognition	Kinetics-100	Accuracy	85.9	TRX
Action Recognition	UCF101	1:1 Accuracy	96.1	TRX
Action Recognition	Something-Something-100	1:1 Accuracy	64.6	TRX
