Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark

Ondřej Cífka, Constantinos Dimitriou, Cheng-i Wang, Hendrik Schreiber, Luke Miner, Fabian-Robert Stöter

2023-11-23 | Automatic Lyrics Transcription | Rhythm

Abstract

Current automatic lyrics transcription (ALT) benchmarks focus exclusively on word content and ignore the finer nuances of written lyrics including formatting and punctuation, which leads to a potential misalignment with the creative products of musicians and songwriters as well as listeners' experiences. For example, line breaks are important in conveying information about rhythm, emotional emphasis, rhyme, and high-level structure. To address this issue, we introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset. Our contribution is twofold. Firstly, a complete revision of the transcripts, geared specifically towards ALT evaluation by following a newly created annotation guide that unifies the music industry's guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds. Secondly, a suite of evaluation metrics designed, unlike the traditional word error rate, to capture such phenomena. We hope that the proposed benchmark contributes to the ALT task, enabling more precise and reliable assessments of transcription systems and enhancing the user experience in lyrics applications such as subtitle renderings for live captioning or karaoke.
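
To make the contrast between word error rate and a formatting-aware metric concrete, here is a minimal Python sketch. It is not the benchmark's own implementation: the tokenization (lowercasing and whitespace splitting), the alignment via `difflib.SequenceMatcher`, and the helper names `words_and_breaks`, `word_error_rate`, and `line_break_f1` are all assumptions made for illustration. A line break counts as correct only if it follows a reference word that is aligned to a hypothesis word that is also followed by a line break.

```python
from difflib import SequenceMatcher


def words_and_breaks(lyrics: str) -> tuple[list[str], set[int]]:
    """Return lowercased words and the indices of words followed by a line break."""
    words: list[str] = []
    breaks: set[int] = set()
    for line in lyrics.splitlines():
        line_words = line.lower().split()
        if not line_words:
            continue  # skip empty lines (section breaks are ignored in this sketch)
        if words:
            breaks.add(len(words) - 1)  # a break follows the last word seen so far
        words.extend(line_words)
    return words, breaks


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def word_error_rate(ref: str, hyp: str) -> float:
    """WER = edit distance between word sequences / number of reference words."""
    ref_words, _ = words_and_breaks(ref)
    hyp_words, _ = words_and_breaks(hyp)
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def line_break_f1(ref: str, hyp: str) -> float:
    """Simplified line-break F1 based on aligned word positions (an assumption)."""
    ref_words, ref_breaks = words_and_breaks(ref)
    hyp_words, hyp_breaks = words_and_breaks(hyp)
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    # Map each matched reference word index to its hypothesis word index.
    aligned = {
        block.a + k: block.b + k
        for block in matcher.get_matching_blocks()
        for k in range(block.size)
    }
    true_pos = sum(1 for i in ref_breaks if aligned.get(i) in hyp_breaks)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(hyp_breaks)
    recall = true_pos / len(ref_breaks)
    return 2 * precision * recall / (precision + recall)


reference = "hello darkness\nmy old friend\nI've come to talk"
hypothesis = "Hello darkness my\nold friend\nI've come to walk"
print(word_error_rate(reference, hypothesis))  # one substitution out of nine words
print(line_break_f1(reference, hypothesis))    # one of two line breaks is placed correctly
```

In this toy example the hypothesis gets eight of nine words right, so its WER is low, yet it misplaces one of the two line breaks. That is the kind of formatting error that a word-only metric does not register.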

Results

Task: Speech Recognition. One table per dataset, one row per model. A "-" marks a metric with no value listed.

Jam-ALT

| Model | Word Error Rate (WER) | Case Error Rate | Punctuation F1 | Line break F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| AudioShake v1 | 26 | 3.4 | 50.5 | 82.3 | 72.1 | 29.4 |
| Whisper v3 | 35.5 | 4.3 | 41.6 | 73.5 | 1 | - |
| Whisper v2 | 35.7 | 4.5 | 41.7 | - | - | - |
| Whisper v2 +demucs | 44 | 5.3 | 28 | - | - | - |
| Whisper v3 +demucs | 47.9 | 3.8 | 29 | - | - | - |

Jam-ALT French

| Model | Word Error Rate (WER) | Case Error Rate | Punctuation F1 | Line break F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| Whisper v2 | 27.7 | 3.2 | 45.8 | 73.4 | 1.4 | - |
| Whisper v3 | 34.7 | 3.3 | 42.4 | 77.8 | - | - |
| AudioShake v1 | 34.9 | 2 | 45.8 | 84.9 | 72.5 | 41.3 |
| Whisper v2 +demucs | 43.3 | 3.2 | 34.9 | 66.1 | - | - |
| Whisper v3 +demucs | 44.9 | 3.2 | 30.9 | 69.4 | - | - |

Jam-ALT Spanish

| Model | Word Error Rate (WER) | Case Error Rate | Punctuation F1 | Line break F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| AudioShake v1 | 22.5 | 4.1 | 47.8 | 82.7 | 69.6 | 38 |
| Whisper v2 | 25.7 | 6.5 | 50 | - | - | - |
| Whisper v3 | 28.6 | 5 | 41.9 | 73.7 | - | - |
| Whisper v2 +demucs | 38.8 | 7.1 | 17.2 | 56.4 | - | - |
| Whisper v3 +demucs | 61.5 | 3.6 | 28.7 | 52.4 | - | - |

Jam-ALT German

| Model | Word Error Rate (WER) | Case Error Rate | Punctuation F1 | Line break F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| AudioShake v1 | 24.4 | 4.1 | 48.5 | 81.2 | 69.2 | 8.1 |
| Whisper v3 | 40.7 | 4 | 41.2 | 71.2 | 1.2 | - |
| Whisper v3 +demucs | 43.5 | 4.4 | 34 | 72 | - | - |
| Whisper v2 | 45.4 | 5.3 | 38.7 | 69.9 | - | - |
| Whisper v2 +demucs | 65.2 | 5.9 | 30.2 | 67.5 | - | - |

Jam-ALT English

| Model | Word Error Rate (WER) | Case Error Rate | Punctuation F1 | Line break F1 | Section break F1 | Parenthesis F1 |
|---|---|---|---|---|---|---|
| AudioShake v1 | 22.1 | 3.4 | 59 | 80.7 | 77.4 | 32.4 |
| LyricWhiz | 24.6 | 3.5 | 34 | 74 | 1.4 | - |
| Whisper v2 +demucs | 32.3 | 5.3 | 39.2 | 53.8 | - | - |
| Whisper v3 | 37.7 | 4.8 | 40.9 | 71.5 | 2.6 | - |
| Whisper v3 +demucs | 43 | 4.1 | 23.3 | 66.8 | - | - |
| Whisper v2 | 43.8 | 3.5 | 31.3 | 63 | 11.2 | - |

Related Papers

Exploring Adapter Design Tradeoffs for Low Resource Music Generation (2025-06-26)
CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment (2025-06-25)
Let Your Video Listen to Your Music! (2025-06-23)
From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training (2025-06-20)
DanceChat: Large Language Model-Guided Music-to-Dance Generation (2025-06-12)
Rhythm Features for Speaker Identification (2025-06-07)
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark (2025-06-05)
Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss (2025-06-03)