The Deep Evaluation of Audio Representations (DEAR) dataset is a benchmark designed to assess general-purpose audio foundation models on properties critical for hearable devices. It comprises 1,158 mono audio tracks (30 s each), created by spatially mixing proprietary anechoic speech monologues with high-quality everyday acoustic scene recordings from the HOA‑SSR library. DEAR enables controlled evaluation of acoustic context, speech sources, and room acoustics.
All tracks are down‑mixed to a single channel at 44.1 kHz (32‑bit) and split into development and test sets with no overlap in speakers, backgrounds, or impulse responses.
| Task Group    | Task                                | Type        | Metric    |
| ------------- | ----------------------------------- | ----------- | --------- |
| Context       | 5‑way environment classification    | Multi‑class | Matthews' |
|               | Indoor vs. outdoor                  | Binary      | Matthews' |
|               | Stationary vs. transient noise      | Binary      | Matthews' |
| Sources       | Speech presence (1 s segments)      | Binary      | Matthews' |
|               | Speaker count (1 s segments)        | Regression  |           |
| Acoustics     | DRR (1 s segments, 1 speaker)       | Regression  |           |
|               | RT60 (1 s segments, 1 speaker)      | Regression  |           |
|               | SNR (1 s segments, 1 speaker)       | Regression  |           |
| Retrospective | TUT2017 acoustic scene (15 classes) | Multi‑class | Matthews' |
|               | LibriCount speaker count (0–10)     | Regression  |           |
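The classification tasks above are scored with Matthews' correlation coefficient (MCC). As a minimal sketch of the metric (not the benchmark's official evaluation code, which lives in the code repository), here is a pure-Python implementation for the binary case:

```python
from math import sqrt

def mcc(y_true, y_pred):
    """Matthews' correlation coefficient for binary labels (0/1).

    Returns a value in [-1, 1]; by convention 0.0 is returned when any
    marginal count is zero (degenerate confusion matrix).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / sqrt(denom)
```

Unlike accuracy, MCC stays informative under class imbalance, which matters for tasks such as speech presence where one class can dominate a track.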
```
├── data/
│   ├── 00094903-4dbf-44a9-bf09-698fc361dbff.wav
│   └── …
├── development.csv
└── test.csv
```
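Each split CSV lists per-track metadata alongside the wav files in `data/`. A minimal sketch for pairing metadata rows with their audio paths, assuming a track-identifier column (hypothetically named `id` here) whose value matches the wav filename stem; consult the code repository for the actual column layout:

```python
import csv
from pathlib import Path

def index_split(csv_path, data_dir="data"):
    """Map each metadata row of a split CSV to its wav file path.

    Assumes the CSV has a track-identifier column named 'id' (an
    assumption; the real column names are defined by the dataset).
    Returns a list of (wav_path, row_dict) pairs.
    """
    root = Path(csv_path).parent
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return [(root / data_dir / f"{row['id']}.wav", row) for row in rows]
```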
Visit the dedicated code repository: https://github.com/DEAR-dataset/code
If you use DEAR in your research, please cite:
```bibtex
@inproceedings{groeger2025dear,
  author={Gröger, Fabian and Baumann, Pascal and Amruthalingam, Ludovic and Simon, Laurent and Giurda, Ruksana and Lionetti, Simone},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Evaluation of Deep Audio Representations for Hearables},
  year={2025},
  doi={10.1109/ICASSP49660.2025.10887737}
}
```
arXiv version: https://arxiv.org/abs/2502.06664