Ondřej Cífka, Constantinos Dimitriou, Cheng-i Wang, Hendrik Schreiber, Luke Miner, Fabian-Robert Stöter
Current automatic lyrics transcription (ALT) benchmarks focus exclusively on word content and ignore the finer nuances of written lyrics, including formatting and punctuation. This creates a potential misalignment with the creative products of musicians and songwriters as well as with listeners' experiences. For example, line breaks are important in conveying information about rhythm, emotional emphasis, rhyme, and high-level structure. To address this issue, we introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset. Our contribution is twofold. First, we completely revise the transcripts, gearing them specifically towards ALT evaluation by following a newly created annotation guide that unifies the music industry's guidelines and covers aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds. Second, we provide a suite of evaluation metrics designed, unlike the traditional word error rate, to capture such phenomena. We hope that the proposed benchmark contributes to the ALT task, enabling more precise and reliable assessments of transcription systems and enhancing the user experience in lyrics applications such as subtitle renderings for live captioning or karaoke.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Recognition | Jam-ALT | Case Error Rate | 3.4 | AudioShake v1 |
| Speech Recognition | Jam-ALT | Line break F1 | 82.3 | AudioShake v1 |
| Speech Recognition | Jam-ALT | Parenthesis F1 | 29.4 | AudioShake v1 |
| Speech Recognition | Jam-ALT | Punctuation F1 | 50.5 | AudioShake v1 |
| Speech Recognition | Jam-ALT | Section break F1 | 72.1 | AudioShake v1 |
| Speech Recognition | Jam-ALT | Word Error Rate (WER) | 26 | AudioShake v1 |
| Speech Recognition | Jam-ALT | Case Error Rate | 4.3 | Whisper v3 |
| Speech Recognition | Jam-ALT | Line break F1 | 73.5 | Whisper v3 |
| Speech Recognition | Jam-ALT | Punctuation F1 | 41.6 | Whisper v3 |
| Speech Recognition | Jam-ALT | Section break F1 | 1 | Whisper v3 |
| Speech Recognition | Jam-ALT | Word Error Rate (WER) | 35.5 | Whisper v3 |
| Speech Recognition | Jam-ALT | Case Error Rate | 4.5 | Whisper v2 |
| Speech Recognition | Jam-ALT | Punctuation F1 | 41.7 | Whisper v2 |
| Speech Recognition | Jam-ALT | Word Error Rate (WER) | 35.7 | Whisper v2 |
| Speech Recognition | Jam-ALT | Case Error Rate | 5.3 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT | Punctuation F1 | 28 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT | Word Error Rate (WER) | 44 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT | Case Error Rate | 3.8 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT | Punctuation F1 | 29 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT | Word Error Rate (WER) | 47.9 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT French | Case Error Rate | 3.2 | Whisper v2 |
| Speech Recognition | Jam-ALT French | Line break F-1 | 73.4 | Whisper v2 |
| Speech Recognition | Jam-ALT French | Punctuation F-1 | 45.8 | Whisper v2 |
| Speech Recognition | Jam-ALT French | Section break F-1 | 1.4 | Whisper v2 |
| Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 27.7 | Whisper v2 |
| Speech Recognition | Jam-ALT French | Case Error Rate | 3.3 | Whisper v3 |
| Speech Recognition | Jam-ALT French | Line break F-1 | 77.8 | Whisper v3 |
| Speech Recognition | Jam-ALT French | Punctuation F-1 | 42.4 | Whisper v3 |
| Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 34.7 | Whisper v3 |
| Speech Recognition | Jam-ALT French | Case Error Rate | 2 | AudioShake v1 |
| Speech Recognition | Jam-ALT French | Line break F-1 | 84.9 | AudioShake v1 |
| Speech Recognition | Jam-ALT French | Parenthesis F-1 | 41.3 | AudioShake v1 |
| Speech Recognition | Jam-ALT French | Punctuation F-1 | 45.8 | AudioShake v1 |
| Speech Recognition | Jam-ALT French | Section break F-1 | 72.5 | AudioShake v1 |
| Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 34.9 | AudioShake v1 |
| Speech Recognition | Jam-ALT French | Case Error Rate | 3.2 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT French | Line break F-1 | 66.1 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT French | Punctuation F-1 | 34.9 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 43.3 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT French | Case Error Rate | 3.2 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT French | Line break F-1 | 69.4 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT French | Punctuation F-1 | 30.9 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT French | Word Error Rate (WER) | 44.9 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT Spanish | Case Error Rate | 4.1 | AudioShake v1 |
| Speech Recognition | Jam-ALT Spanish | Line break F-1 | 82.7 | AudioShake v1 |
| Speech Recognition | Jam-ALT Spanish | Parenthesis F-1 | 38 | AudioShake v1 |
| Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 47.8 | AudioShake v1 |
| Speech Recognition | Jam-ALT Spanish | Section break F-1 | 69.6 | AudioShake v1 |
| Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 22.5 | AudioShake v1 |
| Speech Recognition | Jam-ALT Spanish | Case Error Rate | 6.5 | Whisper v2 |
| Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 50 | Whisper v2 |
| Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 25.7 | Whisper v2 |
| Speech Recognition | Jam-ALT Spanish | Case Error Rate | 5 | Whisper v3 |
| Speech Recognition | Jam-ALT Spanish | Line break F-1 | 73.7 | Whisper v3 |
| Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 41.9 | Whisper v3 |
| Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 28.6 | Whisper v3 |
| Speech Recognition | Jam-ALT Spanish | Case Error Rate | 7.1 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT Spanish | Line break F-1 | 56.4 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 17.2 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 38.8 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT Spanish | Case Error Rate | 3.6 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT Spanish | Line break F-1 | 52.4 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT Spanish | Punctuation F-1 | 28.7 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT Spanish | Word Error Rate (WER) | 61.5 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT German | Case Error Rate | 4.1 | AudioShake v1 |
| Speech Recognition | Jam-ALT German | Line break F-1 | 81.2 | AudioShake v1 |
| Speech Recognition | Jam-ALT German | Parenthesis F-1 | 8.1 | AudioShake v1 |
| Speech Recognition | Jam-ALT German | Punctuation F-1 | 48.5 | AudioShake v1 |
| Speech Recognition | Jam-ALT German | Section break F-1 | 69.2 | AudioShake v1 |
| Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 24.4 | AudioShake v1 |
| Speech Recognition | Jam-ALT German | Case Error Rate | 4 | Whisper v3 |
| Speech Recognition | Jam-ALT German | Line break F-1 | 71.2 | Whisper v3 |
| Speech Recognition | Jam-ALT German | Punctuation F-1 | 41.2 | Whisper v3 |
| Speech Recognition | Jam-ALT German | Section break F-1 | 1.2 | Whisper v3 |
| Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 40.7 | Whisper v3 |
| Speech Recognition | Jam-ALT German | Case Error Rate | 4.4 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT German | Line break F-1 | 72 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT German | Punctuation F-1 | 34 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 43.5 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT German | Case Error Rate | 5.3 | Whisper v2 |
| Speech Recognition | Jam-ALT German | Line break F-1 | 69.9 | Whisper v2 |
| Speech Recognition | Jam-ALT German | Punctuation F-1 | 38.7 | Whisper v2 |
| Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 45.4 | Whisper v2 |
| Speech Recognition | Jam-ALT German | Case Error Rate | 5.9 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT German | Line break F-1 | 67.5 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT German | Punctuation F-1 | 30.2 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT German | Word Error Rate (WER) | 65.2 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT English | Case Error Rate | 3.4 | AudioShake v1 |
| Speech Recognition | Jam-ALT English | Line break F-1 | 80.7 | AudioShake v1 |
| Speech Recognition | Jam-ALT English | Parenthesis F-1 | 32.4 | AudioShake v1 |
| Speech Recognition | Jam-ALT English | Punctuation F-1 | 59 | AudioShake v1 |
| Speech Recognition | Jam-ALT English | Section break F-1 | 77.4 | AudioShake v1 |
| Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 22.1 | AudioShake v1 |
| Speech Recognition | Jam-ALT English | Case Error Rate | 3.5 | LyricWhiz |
| Speech Recognition | Jam-ALT English | Line break F-1 | 74 | LyricWhiz |
| Speech Recognition | Jam-ALT English | Punctuation F-1 | 34 | LyricWhiz |
| Speech Recognition | Jam-ALT English | Section break F-1 | 1.4 | LyricWhiz |
| Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 24.6 | LyricWhiz |
| Speech Recognition | Jam-ALT English | Case Error Rate | 5.3 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT English | Line break F-1 | 53.8 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT English | Punctuation F-1 | 39.2 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 32.3 | Whisper v2 +demucs |
| Speech Recognition | Jam-ALT English | Case Error Rate | 4.8 | Whisper v3 |
| Speech Recognition | Jam-ALT English | Line break F-1 | 71.5 | Whisper v3 |
| Speech Recognition | Jam-ALT English | Punctuation F-1 | 40.9 | Whisper v3 |
| Speech Recognition | Jam-ALT English | Section break F-1 | 2.6 | Whisper v3 |
| Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 37.7 | Whisper v3 |
| Speech Recognition | Jam-ALT English | Case Error Rate | 4.1 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT English | Line break F-1 | 66.8 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT English | Punctuation F-1 | 23.3 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 43 | Whisper v3 +demucs |
| Speech Recognition | Jam-ALT English | Case Error Rate | 3.5 | Whisper v2 |
| Speech Recognition | Jam-ALT English | Line break F-1 | 63 | Whisper v2 |
| Speech Recognition | Jam-ALT English | Punctuation F-1 | 31.3 | Whisper v2 |
| Speech Recognition | Jam-ALT English | Section break F-1 | 11.2 | Whisper v2 |
| Speech Recognition | Jam-ALT English | Word Error Rate (WER) | 43.8 | Whisper v2 |
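
To illustrate why the table reports formatting-aware scores alongside WER, the following minimal Python sketch computes a plain word error rate and a toy line-break F1 on the same pair of transcripts. It is an assumption-laden illustration, not the benchmark's implementation: the function names (`tokenize`, `word_error_rate`, `line_break_f1`) are hypothetical, and matching line breaks by the number of preceding words is a simplification that only makes sense when the two word sequences are nearly identical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only word tokens (punctuation is dropped)."""
    return re.findall(r"[\w']+", text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference contains no words")
    return edit_distance(ref, hyp) / len(ref)


def line_break_positions(text):
    """Set of word counts after which a line break occurs (toy definition)."""
    positions, count = set(), 0
    lines = text.strip().split("\n")
    for line in lines[:-1]:  # no break is counted after the final line
        count += len(tokenize(line))
        positions.add(count)
    return positions


def line_break_f1(reference, hypothesis):
    """F1 over line-break positions, assuming the words themselves line up."""
    ref = line_break_positions(reference)
    hyp = line_break_positions(hypothesis)
    if not ref and not hyp:
        return 1.0
    true_positives = len(ref & hyp)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(hyp)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "I walk alone\nIn the rain\nTonight"
hypothesis = "I walk alone in the rain\nTonight"

print(word_error_rate(reference, hypothesis))          # 0.0
print(round(line_break_f1(reference, hypothesis), 3))  # 0.667
```

In this example the hypothesis has every word right, so its WER is 0, yet it misses one of the two reference line breaks, which only the line-break score reveals. A real evaluation would derive break positions from a proper alignment between reference and hypothesis tokens rather than from raw word counts.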