Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers

Julien Denize, Mykola Liashuha, Jaonary Rabarisoa, Astrid Orcesi, Romain Hérault

2023-09-03 · Action Detection · Self-Supervised Learning · Action Spotting · Knowledge Distillation

Paper · PDF · Code (official)

Abstract

We present COMEDIAN, a novel pipeline to initialize spatiotemporal transformers for action spotting, which involves self-supervised learning and knowledge distillation. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, with two initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Additionally, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. In the final step, we fine-tune the transformers to the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
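The three-step pipeline in the abstract (spatial encoding, temporal refinement distilled toward a pre-computed feature bank, then fine-tuning) can be sketched at a high level. The following is a minimal illustrative sketch in NumPy, not the authors' code: the linear stand-ins for the spatial and temporal transformers, the feature-bank shapes, and the MSE distillation loss are all assumptions made for illustration; the paper's actual architectures and losses differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Illustrative stand-ins (assumptions, not the paper's architecture) ---
D = 8                                   # embedding dimension (illustrative)
T = 4                                   # frames in one short video clip
W_spatial = rng.normal(size=(16, D))    # "spatial transformer" as a linear map
W_temporal = rng.normal(size=(D, D))    # "temporal transformer" as a linear map

def spatial_encode(frames):
    """Step 1 (after self-supervised init): per-frame spatial embeddings."""
    return frames @ W_spatial           # (T, 16) -> (T, D)

def temporal_encode(tokens):
    """Temporal transformer refines the token sequence with context."""
    return tokens @ W_temporal          # (T, D) -> (T, D)

def distillation_loss(student_out, feature_bank):
    """Step 2: distill toward pre-computed features aligned with the clip.
    MSE is an assumption here; the paper's exact loss may differ."""
    return float(np.mean((student_out - feature_bank) ** 2))

clip = rng.normal(size=(T, 16))         # one short video clip (flattened frames)
bank = rng.normal(size=(T, D))          # pre-computed feature bank for this clip

tokens = spatial_encode(clip)
refined = temporal_encode(tokens)
loss = distillation_loss(refined, bank)
print(f"distillation loss: {loss:.3f}")
# Step 3 would replace this loss with the action-spotting objective
# and fine-tune both transformers end to end.
```

The key design point the sketch mirrors is that the temporal stage is trained against features that already carry global context, so the fine-tuning stage starts from context-aware representations rather than from scratch.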

Results

| Task  | Dataset      | Metric            | Value | Model                     |
|-------|--------------|-------------------|-------|---------------------------|
| Video | SoccerNet-v2 | Average-mAP       | 77.6  | COMEDIAN (ViSwin T ens.)  |
| Video | SoccerNet-v2 | Tight Average-mAP | 73.1  | COMEDIAN (ViSwin T ens.)  |
| Video | SoccerNet-v2 | Average-mAP       | 77.1  | COMEDIAN (ViViT T ens.)   |
| Video | SoccerNet-v2 | Tight Average-mAP | 72    | COMEDIAN (ViViT T ens.)   |
| Video | SoccerNet-v2 | Average-mAP       | 76.6  | COMEDIAN (ViSwin T)       |
| Video | SoccerNet-v2 | Tight Average-mAP | 71.6  | COMEDIAN (ViSwin T)       |
| Video | SoccerNet-v2 | Average-mAP       | 76.1  | COMEDIAN (ViViT T)        |
| Video | SoccerNet-v2 | Tight Average-mAP | 70.7  | COMEDIAN (ViViT T)        |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
- Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
- DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
- HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training (2025-07-15)
- Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
- Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning (2025-07-14)
- KAT-V1: Kwai-AutoThink Technical Report (2025-07-11)