


Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization

Anurag Bagchi, Jazib Mahmood, Dolton Fernandes, Ravi Kiran Sarvadevabhatla

Published: 2021-06-27
Tasks: Action Localization · Action Recognition · Temporal Action Localization
Links: Paper · PDF · Code (official)

Abstract

State-of-the-art architectures for untrimmed video Temporal Action Localization (TAL) have only considered RGB and Flow modalities, leaving the information-rich audio modality totally unexploited. Audio fusion has been explored for the related but arguably easier problem of trimmed (clip-level) action recognition. However, TAL poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider audio and video modalities for supervised TAL. We experimentally show that our schemes consistently improve performance for state-of-the-art video-only TAL approaches. Specifically, they help achieve new state-of-the-art performance on large-scale benchmark datasets: ActivityNet-1.3 (54.34 mAP@0.5) and THUMOS14 (57.18 mAP@0.5). Our experiments include ablations involving multiple fusion schemes, modality combinations and TAL architectures. Our code, models and associated data are available at https://github.com/skelemoa/tal-hmo.
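The core idea is to fuse time-aligned audio features with the visual features that a video-only TAL pipeline already produces, before the localization head. As a rough illustration only (concatenation-style late fusion is one of several plausible schemes; the class name, feature dimensions, and layer sizes below are assumptions, not the paper's actual AVFusion configuration):

```python
# Minimal sketch of concatenation-based audio-visual fusion for TAL.
# Dimensions and layer sizes are illustrative assumptions, not the
# paper's actual configuration.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=128, fused_dim=512):
        super().__init__()
        # Project concatenated features into a joint space consumed by a
        # downstream TAL head (proposal scoring / boundary regression).
        self.proj = nn.Sequential(
            nn.Linear(video_dim + audio_dim, fused_dim),
            nn.ReLU(),
        )

    def forward(self, video_feats, audio_feats):
        # video_feats: (batch, T, video_dim) snippet-level visual features
        # audio_feats: (batch, T, audio_dim) time-aligned audio features
        fused = torch.cat([video_feats, audio_feats], dim=-1)
        return self.proj(fused)

# Example: fuse 100 temporally aligned snippets per video.
fusion = ConcatFusion()
v = torch.randn(2, 100, 2048)
a = torch.randn(2, 100, 128)
out = fusion(v, a)  # shape: (2, 100, 512)
```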

Results

All results are for the AVFusion model; the same numbers are indexed under the tasks Action Localization, Temporal Action Localization, Zero-Shot Learning, and Video.

Dataset         | Metric            | Value
ActivityNet-1.3 | mAP               | 36.82
ActivityNet-1.3 | mAP IOU@0.5       | 54.34
ActivityNet-1.3 | mAP IOU@0.75      | 37.66
ActivityNet-1.3 | mAP IOU@0.95      | 8.93
THUMOS'14       | Avg mAP (0.3:0.7) | 53.3
THUMOS'14       | mAP IOU@0.3       | 70.1
THUMOS'14       | mAP IOU@0.4       | 64.9
THUMOS'14       | mAP IOU@0.5       | 57.18
THUMOS'14       | mAP IOU@0.6       | 45.4
THUMOS'14       | mAP IOU@0.7       | 28.8
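The IOU@t metrics use temporal IoU between a predicted action segment and the ground-truth segment: a detection counts as correct at threshold t only if its temporal IoU clears t, and mAP is then averaged over thresholds for the Avg mAP column (0.3:0.7 for THUMOS'14). A minimal sketch of the matching test (helper names are my own, not from the paper's released code):

```python
# Minimal sketch: temporal IoU and the threshold test behind mAP IOU@t.
# Helper names are illustrative, not from the paper's codebase.
def temporal_iou(pred, gt):
    """IoU of two 1-D segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold):
    """A detection is correct at IOU@threshold iff its tIoU clears it."""
    return temporal_iou(pred, gt) >= threshold

# Example: prediction [2.0, 7.0]s vs. ground truth [3.0, 8.0]s has
# tIoU = 4/6 ~= 0.67, so it counts at IOU@0.5 but not at IOU@0.7.
print(temporal_iou((2.0, 7.0), (3.0, 8.0)))           # 0.666...
print(is_true_positive((2.0, 7.0), (3.0, 8.0), 0.5))  # True
print(is_true_positive((2.0, 7.0), (3.0, 8.0), 0.7))  # False
```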

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)