Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

Rahul Sharma, Shrikanth Narayanan

Published: 2022-12-01
Tasks: Audio-Visual Active Speaker Detection, Active Speaker Detection
Links: Paper | PDF | Code (official)

Abstract

Active speaker detection in videos is the task of associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information for deriving such a speech-face relationship are i) visual activity and its interaction with the speech signal, and ii) co-occurrences of speakers' identities across modalities, in the form of faces and speech. Both approaches have limitations: audio-visual activity models are confused by other frequently occurring vocal activities, such as laughing and chewing, while identity-based methods are limited to videos with enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework that guides the speakers' cross-modal identity association with audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker dataset (movies) and the Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches improves active speaker detection performance.
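The "simple late fusion" described in the abstract can be illustrated as a weighted combination of per-face-track speaking scores produced independently by the two approaches. The sketch below is a minimal illustration under assumed conventions, not the paper's implementation: the weight `alpha`, the function name, and the example scores are all hypothetical.

```python
import numpy as np

def late_fusion_scores(activity_scores, identity_scores, alpha=0.5):
    """Fuse per-face-track speaking scores from two independent models.

    activity_scores: scores from an audio-visual activity model.
    identity_scores: scores from cross-modal identity association.
    alpha: fusion weight -- a hypothetical parameter, not specified in the abstract.
    """
    a = np.asarray(activity_scores, dtype=float)
    b = np.asarray(identity_scores, dtype=float)
    # Convex combination of the two score streams per face track.
    return alpha * a + (1.0 - alpha) * b

# Example: two face tracks; both models score the first track higher,
# so the fused score also marks it as the likely active speaker.
fused = late_fusion_scores([0.9, 0.2], [0.7, 0.4])
```

With equal weighting this is just the per-track mean of the two scores; because the two models fail in different situations (vocal activity look-alikes vs. missing identity cues), even this simple combination can outperform either score alone.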

Results

Task             | Dataset | Metric                 | Value | Model
Action Detection | VPCD    | mean average precision | 83.9  | GSCMIA

Related Papers

UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios (2025-05-28)
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization (2025-05-06)
Understanding Co-speech Gestures in-the-wild (2025-03-28)
LASER: Lip Landmark Assisted Speaker Detection for Robustness (2025-01-21)
ASDnB: Merging Face with Body Cues For Robust Active Speaker Detection (2024-12-11)
BIAS: A Body-based Interpretable Active Speaker Approach (2024-12-06)
How to Squeeze An Explanation Out of Your Model (2024-12-06)
FabuLight-ASD: Unveiling Speech Activity via Body Language (2024-11-20)