Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Lyrics Transcription for Humans: A Readability-Aware Benchmark

Ondřej Cífka, Hendrik Schreiber, Luke Miner, Fabian-Robert Stöter

2024-07-30 | Automatic Lyrics Transcription

Abstract

Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.

Results

Task | Dataset | Metric | Value | Model
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 20.1 | AudioShake v3
Speech Recognition | Jam-ALT | Line break F1 | 84.4 | AudioShake v3
Speech Recognition | Jam-ALT | Parenthesis F-1 | 29.4 | AudioShake v3
Speech Recognition | Jam-ALT | Punctuation F1 | 57 | AudioShake v3
Speech Recognition | Jam-ALT | Section break F1 | 73.9 | AudioShake v3
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 16.1 | AudioShake v3
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 32.6 | Whisper v2 +lang
Speech Recognition | Jam-ALT | Line break F1 | 70.4 | Whisper v2 +lang
Speech Recognition | Jam-ALT | Punctuation F1 | 45 | Whisper v2 +lang
Speech Recognition | Jam-ALT | Section break F1 | 3.7 | Whisper v2 +lang
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 27.9 | Whisper v2 +lang
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 37.2 | Whisper v3 +lang
Speech Recognition | Jam-ALT | Line break F1 | 73.9 | Whisper v3 +lang
Speech Recognition | Jam-ALT | Punctuation F1 | 43.7 | Whisper v3 +lang
Speech Recognition | Jam-ALT | Section break F1 | 0.6 | Whisper v3 +lang
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 32.6 | Whisper v3 +lang
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 39.3 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT | Line break F1 | 60.6 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT | Punctuation F1 | 39.4 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 33.5 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 39.7 | Whisper v3
Speech Recognition | Jam-ALT | Punctuation F1 | 43 | Whisper v3
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 42.1 | Whisper v2
Speech Recognition | Jam-ALT | Line break F1 | 69.3 | Whisper v2
Speech Recognition | Jam-ALT | Punctuation F1 | 44.2 | Whisper v2
Speech Recognition | Jam-ALT | Section break F1 | 3.3 | Whisper v2
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 37.8 | Whisper v2
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 49.8 | Whisper v2 +demucs
Speech Recognition | Jam-ALT | Line break F1 | 61.2 | Whisper v2 +demucs
Speech Recognition | Jam-ALT | Punctuation F1 | 41.6 | Whisper v2 +demucs
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 44.5 | Whisper v2 +demucs
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 50.4 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT | Line break F1 | 65.8 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT | Punctuation F1 | 33.7 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 46.6 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 51.6 | Whisper v3 +demucs
Speech Recognition | Jam-ALT | Line break F1 | 65.7 | Whisper v3 +demucs
Speech Recognition | Jam-ALT | Punctuation F1 | 33 | Whisper v3 +demucs
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 48 | Whisper v3 +demucs
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 72.6 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT | Line break F1 | 41.1 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT | Punctuation F1 | 20 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 66.5 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT | Case-Sensitive Word Error Rate | 75 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT | Line break F1 | 37.8 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT | Parenthesis F-1 | 0.6 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT | Punctuation F1 | 22.5 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT | Word Error Rate (WER) | 69.3 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 23.5 | AudioShake v3
Speech Recognition | Jam-ALT French | Line break F-1 | 88.6 | AudioShake v3
Speech Recognition | Jam-ALT French | Parenthesis F-1 | 3.2 | AudioShake v3
Speech Recognition | Jam-ALT French | Punctuation F-1 | 46.1 | AudioShake v3
Speech Recognition | Jam-ALT French | Section break F-1 | 69 | AudioShake v3
Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 20.8 | AudioShake v3
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 30.5 | Whisper v2 +lang
Speech Recognition | Jam-ALT French | Line break F-1 | 73.7 | Whisper v2 +lang
Speech Recognition | Jam-ALT French | Punctuation F-1 | 45.3 | Whisper v2 +lang
Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 27.1 | Whisper v2 +lang
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 31.1 | Whisper v2
Speech Recognition | Jam-ALT French | Punctuation F-1 | 45.9 | Whisper v2
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 38 | Whisper v3
Speech Recognition | Jam-ALT French | Line break F-1 | 77.9 | Whisper v3
Speech Recognition | Jam-ALT French | Punctuation F-1 | 42.5 | Whisper v3
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 38 | Whisper v3 +lang
Speech Recognition | Jam-ALT French | Line break F-1 | 77.9 | Whisper v3 +lang
Speech Recognition | Jam-ALT French | Punctuation F-1 | 42.3 | Whisper v3 +lang
Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 34.7 | Whisper v3 +lang
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 42.1 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT French | Line break F-1 | 65.6 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT French | Punctuation F-1 | 36.1 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 38.2 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 46.9 | Whisper v2 +demucs
Speech Recognition | Jam-ALT French | Line break F-1 | 66 | Whisper v2 +demucs
Speech Recognition | Jam-ALT French | Punctuation F-1 | 38 | Whisper v2 +demucs
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 48.2 | Whisper v3 +demucs
Speech Recognition | Jam-ALT French | Line break F-1 | 69.3 | Whisper v3 +demucs
Speech Recognition | Jam-ALT French | Punctuation F-1 | 32 | Whisper v3 +demucs
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 48.3 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT French | Line break F-1 | 69.3 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT French | Punctuation F-1 | 32 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 44.9 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 75.7 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT French | Line break F-1 | 36 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT French | Parenthesis F-1 | 1.9 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT French | Punctuation F-1 | 30.6 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 71.6 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT French | Case-Sensitive Word Error Rate | 82.1 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT French | Line break F-1 | 40.9 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT French | Punctuation F-1 | 22.3 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 78.5 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 17.7 | AudioShake v3
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 81.5 | AudioShake v3
Speech Recognition | Jam-ALT Spanish | Parenthesis F-1 | 4.2 | AudioShake v3
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 56.7 | AudioShake v3
Speech Recognition | Jam-ALT Spanish | Section break F-1 | 66.4 | AudioShake v3
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 12.6 | AudioShake v3
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 27.7 | Whisper v2 +lang
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 71.5 | Whisper v2 +lang
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 52.5 | Whisper v2 +lang
Speech Recognition | Jam-ALT Spanish | Section break F-1 | 3.1 | Whisper v2 +lang
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 21.9 | Whisper v2 +lang
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 28 | Whisper v3 +lang
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 74.5 | Whisper v3 +lang
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 44.5 | Whisper v3 +lang
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 22.4 | Whisper v3 +lang
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 31.5 | Whisper v2
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 71.7 | Whisper v2
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 52.8 | Whisper v2
Speech Recognition | Jam-ALT Spanish | Section break F-1 | 3.1 | Whisper v2
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 25.8 | Whisper v2
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 33.6 | Whisper v3
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 42.5 | Whisper v3
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 42.2 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 52.6 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 34.3 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 34.9 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 46.5 | Whisper v2 +demucs
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 56.6 | Whisper v2 +demucs
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 40.4 | Whisper v2 +demucs
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 39.6 | Whisper v2 +demucs
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 62.1 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 54.7 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 34.4 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 58.6 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 64.9 | Whisper v3 +demucs
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 52.3 | Whisper v3 +demucs
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 32.4 | Whisper v3 +demucs
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 76 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 33.5 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 9 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 70.8 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT Spanish | Case-Sensitive Word Error Rate | 78.5 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT Spanish | Line break F-1 | 30.2 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 8.8 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 73.3 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 17.5 | AudioShake v3
Speech Recognition | Jam-ALT German | Line break F-1 | 83.7 | AudioShake v3
Speech Recognition | Jam-ALT German | Parenthesis F-1 | 76.6 | AudioShake v3
Speech Recognition | Jam-ALT German | Punctuation F-1 | 57.1 | AudioShake v3
Speech Recognition | Jam-ALT German | Section break F-1 | 74.5 | AudioShake v3
Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 12.6 | AudioShake v3
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 26 | Whisper v2 +lang
Speech Recognition | Jam-ALT German | Line break F-1 | 71.7 | Whisper v2 +lang
Speech Recognition | Jam-ALT German | Punctuation F-1 | 48.4 | Whisper v2 +lang
Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 19.9 | Whisper v2 +lang
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 30.4 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT German | Line break F-1 | 70.6 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT German | Punctuation F-1 | 49.2 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 23.9 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 40.4 | Whisper v3 +lang
Speech Recognition | Jam-ALT German | Line break F-1 | 71.1 | Whisper v3 +lang
Speech Recognition | Jam-ALT German | Punctuation F-1 | 47.4 | Whisper v3 +lang
Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 35.9 | Whisper v3 +lang
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 44.6 | Whisper v3
Speech Recognition | Jam-ALT German | Line break F-1 | 71.1 | Whisper v3
Speech Recognition | Jam-ALT German | Punctuation F-1 | 47.3 | Whisper v3
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 44.9 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT German | Line break F-1 | 70.5 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT German | Punctuation F-1 | 46.9 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 40.8 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 47.4 | Whisper v3 +demucs
Speech Recognition | Jam-ALT German | Line break F-1 | 71.9 | Whisper v3 +demucs
Speech Recognition | Jam-ALT German | Punctuation F-1 | 45.4 | Whisper v3 +demucs
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 62 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT German | Line break F-1 | 41.4 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT German | Punctuation F-1 | 24.7 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 51.8 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 59.3 | Whisper v2
Speech Recognition | Jam-ALT German | Line break F-1 | 70 | Whisper v2
Speech Recognition | Jam-ALT German | Punctuation F-1 | 47.1 | Whisper v2
Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 54.5 | Whisper v2
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 71.8 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT German | Line break F-1 | 40.7 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT German | Punctuation F-1 | 28.6 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 63.3 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT German | Case-Sensitive Word Error Rate | 70.4 | Whisper v2 +demucs
Speech Recognition | Jam-ALT German | Line break F-1 | 67.3 | Whisper v2 +demucs
Speech Recognition | Jam-ALT German | Punctuation F-1 | 49.1 | Whisper v2 +demucs
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 20.9 | AudioShake v3
Speech Recognition | Jam-ALT English | Line break F-1 | 84.3 | AudioShake v3
Speech Recognition | Jam-ALT English | Parenthesis F-1 | 37.9 | AudioShake v3
Speech Recognition | Jam-ALT English | Punctuation F-1 | 65.3 | AudioShake v3
Speech Recognition | Jam-ALT English | Section break F-1 | 84.8 | AudioShake v3
Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 17.3 | AudioShake v3
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 28 | LyricWhiz
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 39.1 | Whisper v2 +demucs
Speech Recognition | Jam-ALT English | Line break F-1 | 53.9 | Whisper v2 +demucs
Speech Recognition | Jam-ALT English | Punctuation F-1 | 42.2 | Whisper v2 +demucs
Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 33.3 | Whisper v2 +demucs
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 41.3 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT English | Line break F-1 | 53.4 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT English | Punctuation F-1 | 41.8 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 35.6 | Whisper v2 +demucs +lang
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 41.4 | Whisper v3 +lang
Speech Recognition | Jam-ALT English | Line break F-1 | 72.5 | Whisper v3 +lang
Speech Recognition | Jam-ALT English | Punctuation F-1 | 41.8 | Whisper v3 +lang
Speech Recognition | Jam-ALT English | Section break F-1 | 2.6 | Whisper v3 +lang
Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 36.4 | Whisper v3 +lang
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 42.5 | Whisper v3
Speech Recognition | Jam-ALT English | Punctuation F-1 | 41.4 | Whisper v3
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 43.7 | Whisper v2 +lang
Speech Recognition | Jam-ALT English | Line break F-1 | 65.5 | Whisper v2 +lang
Speech Recognition | Jam-ALT English | Punctuation F-1 | 34.9 | Whisper v2 +lang
Speech Recognition | Jam-ALT English | Section break F-1 | 11.6 | Whisper v2 +lang
Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 39.7 | Whisper v2 +lang
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 47.2 | Whisper v3 +demucs
Speech Recognition | Jam-ALT English | Line break F-1 | 66.9 | Whisper v3 +demucs
Speech Recognition | Jam-ALT English | Punctuation F-1 | 25.8 | Whisper v3 +demucs
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 47.2 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT English | Line break F-1 | 66.9 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT English | Punctuation F-1 | 25.8 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 43 | Whisper v3 +demucs +lang
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 47.5 | Whisper v2
Speech Recognition | Jam-ALT English | Punctuation F-1 | 31.5 | Whisper v2
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 69.4 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT English | Line break F-1 | 47.3 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT English | Punctuation F-1 | 21.5 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 63.4 | OWSM v3.1 +demucs +lang
Speech Recognition | Jam-ALT English | Case-Sensitive Word Error Rate | 74 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT English | Line break F-1 | 42.7 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT English | Punctuation F-1 | 22.3 | OWSM v3.1 +lang
Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 68.6 | OWSM v3.1 +lang
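
The results report both a plain and a case-sensitive word error rate, and the case-sensitive figure is the higher of the two wherever both are listed. The following is a minimal sketch of why that can happen, not the benchmark's official implementation: it assumes whitespace tokenization and lowercasing as the only normalization, whereas the actual Jam-ALT metrics may tokenize and normalize differently. The function names `edit_distance` and `wer` and the example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, case_sensitive=False):
    """Word error rate in percent (assumed: whitespace tokenization only)."""
    if not case_sensitive:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


# Made-up example strings, not taken from the dataset.
reference = "Hello darkness my old friend"
hypothesis = "hello darkness my friend"

print(wer(reference, hypothesis))                       # 20.0
print(wer(reference, hypothesis, case_sensitive=True))  # 40.0
```

In this toy case the dropped word costs one error either way, while the lowercase "hello" counts as a substitution only when case is taken into account.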

Related Papers

Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss (2025-06-03)
Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model (2024-06-25)
Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark (2023-11-23)
Adapting pretrained speech model for Mandarin lyrics transcription and alignment (2023-11-21)
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT (2023-06-29)
Genre-conditioned Acoustic Models for Automatic Lyrics Transcription of Polyphonic Music (2022-04-07)
Music-robust Automatic Lyrics Transcription of Polyphonic Music (2022-04-07)
PDAugment: Data Augmentation by Pitch and Duration Adjustments for Automatic Lyrics Transcription (2021-09-16)