COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers

Julien Denize, Mykola Liashuha, Jaonary Rabarisoa, Astrid Orcesi, Romain Hérault

2023-09-03

Action Detection, Self-Supervised Learning, Action Spotting, Knowledge Distillation

Abstract

We present COMEDIAN, a novel pipeline to initialize spatiotemporal transformers for action spotting, which involves self-supervised learning and knowledge distillation. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, the first two of which are initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Second, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. Finally, we fine-tune the transformers on the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
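The distillation stage described in the abstract pairs each short video segment with an aligned entry of a pre-computed feature bank. The following is a minimal sketch of what such an alignment objective could look like, not the loss used in the paper: it assumes a plain cosine-distance objective, and the names `distillation_loss`, `student_tokens`, and `bank_features`, as well as the array shapes, are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def distillation_loss(student_tokens, bank_features):
    """Mean cosine distance between each student token and its aligned
    feature-bank entry. Assumption: cosine distance stands in for the
    paper's actual distillation objective, which is not given here."""
    s = l2_normalize(student_tokens)
    t = l2_normalize(bank_features)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

rng = np.random.default_rng(0)
# Hypothetical shapes: 8 short segments, 16-dimensional teacher features.
bank = rng.normal(size=(8, 16))

# A student whose outputs point in the same direction as the bank has ~0 loss.
aligned = distillation_loss(bank * 3.0, bank)
# A student with unrelated outputs has a clearly larger loss.
unrelated = distillation_loss(rng.normal(size=(8, 16)), bank)
# A student pointing in the opposite direction reaches the maximum of ~2.
opposite = distillation_loss(-bank, bank)
```

In this sketch the loss depends only on the direction of the features, so the student is free to use a different scale than the feature bank.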

Results

Task	Dataset	Metric	Value	Model
Video	SoccerNet-v2	Average-mAP	77.6	COMEDIAN (ViSwin T ens.)
Video	SoccerNet-v2	Tight Average-mAP	73.1	COMEDIAN (ViSwin T ens.)
Video	SoccerNet-v2	Average-mAP	77.1	COMEDIAN (ViViT T ens.)
Video	SoccerNet-v2	Tight Average-mAP	72	COMEDIAN (ViViT T ens.)
Video	SoccerNet-v2	Average-mAP	76.6	COMEDIAN (ViSwin T)
Video	SoccerNet-v2	Tight Average-mAP	71.6	COMEDIAN (ViSwin T)
Video	SoccerNet-v2	Average-mAP	76.1	COMEDIAN (ViViT T)
Video	SoccerNet-v2	Tight Average-mAP	70.7	COMEDIAN (ViViT T)
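
Because action spotting predicts timestamps rather than intervals, scoring a model requires deciding when a predicted timestamp counts as a hit. The sketch below illustrates the idea with a greedy one-to-one matching inside a tolerance given in seconds. It is a simplified illustration under that assumption, not the official SoccerNet-v2 evaluation code; `match_within_tolerance` and the sample timestamps are hypothetical.

```python
def match_within_tolerance(predictions, ground_truth, tol):
    """Greedy matching of predicted timestamps to ground-truth timestamps.

    predictions: list of (timestamp_seconds, confidence) pairs.
    ground_truth: list of timestamps in seconds.
    Returns one bool per prediction, in descending confidence order,
    telling whether it matched a still-unmatched ground-truth action.
    """
    unmatched = list(ground_truth)
    hits = []
    for t, _conf in sorted(predictions, key=lambda p: -p[1]):
        # Closest ground-truth action that has not been claimed yet.
        best = min(unmatched, key=lambda g: abs(g - t), default=None)
        if best is not None and abs(best - t) <= tol:
            unmatched.remove(best)
            hits.append(True)
        else:
            hits.append(False)
    return hits

# Hypothetical predictions and annotations for a single action class.
preds = [(10.5, 0.9), (11.0, 0.6), (42.0, 0.8), (70.0, 0.3)]
truth = [10.0, 45.0]

loose = match_within_tolerance(preds, truth, tol=5.0)
tight = match_within_tolerance(preds, truth, tol=1.0)
```

With the larger tolerance two predictions match; with the smaller one only the prediction at 10.5 s does, which shows why a tight metric is the harder of the two to score well on.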

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training (2025-07-15)
Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning (2025-07-14)
KAT-V1: Kwai-AutoThink Technical Report (2025-07-11)