Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.




Zero-note samba: Self-supervised beat tracking

Dorian Desblancs, Vincent Lostanlen, Romain Hennequin

2023-07-21 · IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023
Tasks: Self-Supervised Learning · Information Retrieval · Music Information Retrieval · Beat Tracking
Paper · PDF · Code

Abstract

Supervised machine learning for music information retrieval requires a large annotated training set, and is thus an expensive and time-consuming process. To circumvent this problem, we propose to train deep neural networks to perceive beats in musical recordings despite having little or no access to human annotations. The key idea is to train two fully convolutional networks in parallel, which we name “Zero-Note Samba” (ZeroNS): the first analyzes the percussive part of a musical piece whilst the second analyzes its non-percussive part. These networks learn a self-supervised pretext task of synchrony prediction (sync-pred), which simulates the ability of musicians to groove together when playing in the same band. Sync-pred encourages the two networks to return similar outputs if the underlying musical parts are synchronized, yet dissimilar outputs if the parts are out of sync. In practice, we obtain the instrumental parts from commercial recordings via an off-the-shelf source separation system: Spleeter. After self-supervised learning with sync-pred, ZeroNS produces a sparse output that resembles a beat detection function. When used in conjunction with a dynamic Bayesian network, ZeroNS surpasses the state of the art in unsupervised beat tracking. Furthermore, fine-tuning ZeroNS to a small set of labeled data (of the order of one to ten songs) matches the performance of a fully supervised network on 96 songs. Lastly, we show that pre-training a supervised model with sync-pred mitigates dataset bias and thus improves cross-dataset generalization, at no extra annotation cost.
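The intuition behind sync-pred can be sketched in a few lines: compare the outputs of the percussive and non-percussive networks, and penalize dissimilarity for synchronized pairs while penalizing similarity for out-of-sync pairs. The sketch below is illustrative only — the function names and the exact cosine-based loss are assumptions, not the paper's actual objective.

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two output vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def sync_pred_loss(percussive_out, nonpercussive_out, synchronized):
    """Toy synchrony-prediction loss (hypothetical formulation).

    Synchronized parts should yield similar outputs (low loss when
    similarity is high); out-of-sync parts should yield dissimilar
    outputs (low loss when similarity is near zero).
    """
    sim = cosine_sim(percussive_out, nonpercussive_out)
    if synchronized:
        return 1.0 - sim          # pull matched outputs together
    return max(0.0, sim)          # push mismatched outputs apart

# Identical outputs from a synchronized pair incur no loss:
print(sync_pred_loss([1.0, 0.0], [1.0, 0.0], True))   # 0.0
# The same outputs from an out-of-sync pair are penalized:
print(sync_pred_loss([1.0, 0.0], [1.0, 0.0], False))  # 1.0
```

In practice the paper operates on frame-level outputs of two fully convolutional networks fed with Spleeter-separated stems, so the real loss is computed over time series rather than single vectors; this sketch only captures the attract/repel structure.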

Related Papers

- A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
- Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
- From Chaos to Automation: Enabling the Use of Unstructured Data for Robotic Process Automation (2025-07-15)
- Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
- Temporal Information Retrieval via Time-Specifier Model Merging (2025-07-09)
- Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
- Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers (2025-07-08)
- An analysis of vision-language models for fabric retrieval (2025-07-07)