Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, Yuejie Zhang

Published: 2022-07-12 · Tasks: Anomaly Detection in Surveillance Videos, Multiple Instance Learning, Audio-Visual Learning

Paper · PDF · Code (official)

Abstract

Weakly-supervised audio-visual violence detection aims to distinguish snippets containing multimodal violence events using only video-level labels. Many prior works perform audio-visual integration and interaction in an early or intermediate manner, yet overlook the modality heterogeneity under the weakly-supervised setting. In this paper, we analyze the modality-asynchrony and undifferentiated-instance phenomena of the multiple instance learning (MIL) procedure, and further investigate their negative impact on weakly-supervised audio-visual learning. To address these issues, we propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy. Specifically, we leverage a lightweight two-stream network to generate audio and visual bags, in which unimodal background, violent, and normal instances are clustered into semi-bags in an unsupervised way. Then audio and visual violent semi-bag representations are assembled as positive pairs, and violent semi-bags are combined with background and normal instances in the opposite modality as contrastive negative pairs. Furthermore, a self-distillation module is applied to transfer unimodal visual knowledge to the audio-visual model, which alleviates noise and closes the semantic gap between unimodal and multimodal features. Experiments show that our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset. Results also demonstrate that our proposed approach can be used as a plug-in module to enhance other networks. Code is available at https://github.com/JustinYuu/MACIL_SD.
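The official code is linked above. As an illustration only, the contrastive pairing and self-distillation described in the abstract can be sketched as an InfoNCE-style loss over semi-bag representations plus a temperature-scaled KL term. This is a minimal NumPy sketch under assumed shapes and loss forms; the function names and the exact formulation are illustrative, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Project feature vectors onto the unit sphere for cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def contrastive_instance_loss(violent_a, violent_v, negatives, temperature=0.1):
    """InfoNCE-style sketch: the audio and visual violent semi-bag
    representations form the positive pair; `negatives` (shape (K, D)) holds
    background/normal instance features from the opposite modality.
    (Assumed loss form; not the paper's exact objective.)"""
    a = l2_normalize(violent_a)
    v = l2_normalize(violent_v)
    n = l2_normalize(negatives)
    pos = np.dot(a, v) / temperature          # similarity of the positive pair
    neg = n @ a / temperature                 # similarities to each negative
    logits = np.concatenate([[pos], neg])
    # Numerically stable log-sum-exp of the softmax denominator.
    m = logits.max()
    log_denom = np.log(np.exp(logits - m).sum()) + m
    return float(log_denom - pos)             # -log softmax of the positive

def self_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence from the unimodal visual teacher to
    the audio-visual student (a common distillation form; assumed here)."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    return float(np.sum(p_t * (np.log(p_t + 1e-8) - np.log(p_s + 1e-8))))
```

A perfectly aligned positive pair yields a lower contrastive loss than a random pairing against the same negatives, and the distillation term vanishes when student and teacher agree.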

Results

Task                | Dataset     | Metric | Value | Model
Video Understanding | XD-Violence | AP     | 83.4  | MACIL_SD
Video               | XD-Violence | AP     | 83.4  | MACIL_SD
Anomaly Detection   | XD-Violence | AP     | 83.4  | MACIL_SD

Related Papers

GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning (2025-07-09)
The Trilemma of Truth in Large Language Models (2025-06-30)
OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport (2025-06-25)
Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping (2025-06-23)
MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis (2025-06-22)
HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis (2025-06-19)
Dual-detector Re-optimization for Federated Weakly Supervised Video Anomaly Detection Via Adaptive Dynamic Recursive Mapping (2025-06-13)
BioLangFusion: Multimodal Fusion of DNA, mRNA, and Protein Language Models (2025-06-10)