Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng

2023-03-22 | CVPR 2023 | audio-visual event localization

Abstract

Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task.

Results

Task                              Dataset    Metric      Value   Model
audio-visual event localization   UnAV-100   mAP         47.8    UnAV
audio-visual event localization   UnAV-100   AP@IOU0.5   50.6    UnAV
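
The AP@IOU0.5 metric in the results depends on the temporal intersection-over-union (IoU) between a predicted event segment and a ground-truth segment. The sketch below shows how that overlap is commonly computed in temporal localization benchmarks. It is an illustration only, not the paper's evaluation code: the function names `temporal_iou` and `is_match`, the (start, end) tuple representation in seconds, and the rule that a prediction matches when its IoU is at least the threshold are all assumptions made for this example.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds); assumed representation


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Return the intersection-over-union of two 1-D time segments."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred: Segment, gt: Segment, threshold: float = 0.5) -> bool:
    """Hypothetical matching rule: a prediction counts when IoU >= threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # A prediction of 2s-8s against a ground-truth event of 0s-10s overlaps
    # for 6s out of a 10s union.
    print(temporal_iou((2.0, 8.0), (0.0, 10.0)))
    print(is_match((0.0, 10.0), (5.0, 15.0)))
```

Under this assumed rule, a prediction would also need to carry the same class label as the ground-truth event to count toward that class's average precision. Averaging AP over classes and over several IoU thresholds is a common way to report a single mAP figure, though the exact thresholds used for the 47.8 value are not stated in this listing.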

Related Papers

Audio-visual Event Localization on Portrait Mode Short Videos (2025-04-09)
Audio-Visual Semantic Graph Network for Audio-Visual Event Localization (2025-01-01)
Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration (2024-12-17)
Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization (2024-12-09)
Towards Open-Vocabulary Audio-Visual Event Localization (2024-11-18)
Multimodal Trustworthy Semantic Communication for Audio-Visual Event Localization (2024-11-04)
CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization (2024-08-04)
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing (2024-07-11)