Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Codified audio language modeling learns useful representations for music information retrieval

Rodrigo Castellon, Chris Donahue, Percy Liang

2021-07-12

Tasks: Music Generation · Music Genre Classification · Key Detection · Information Retrieval · Retrieval · Genre classification · Music Information Retrieval · Music Tagging · Language Modelling · Emotion Recognition
Paper · PDF · Code (official)

Abstract

We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al. 2020): a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection. For key detection, we observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches. We interpret the strength of Jukebox's representations as evidence that modeling audio instead of tags provides richer representations for MIR.
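The probing setup the abstract describes — freezing a pre-trained model and training only a shallow model on its activations as input features — can be sketched in a few lines. Everything below is illustrative: the embeddings are synthetic stand-ins (the paper extracts high-dimensional activations from Jukebox's language model), and the nearest-centroid probe is just one simple choice of shallow downstream model, not the paper's exact classifier.

```python
import random

random.seed(0)

def mock_embedding(label, dim=16):
    """Stand-in for a frozen pre-trained-model activation vector.
    Each class gets a distinct mean so a shallow probe can separate them."""
    center = [float(label)] * dim
    return [c + random.gauss(0.0, 0.5) for c in center]

# Synthetic "MIR dataset": (embedding, class-id) pairs for 3 classes,
# e.g. three genres in a genre-classification probe.
train = [(mock_embedding(g), g) for g in range(3) for _ in range(50)]
test = [(mock_embedding(g), g) for g in range(3) for _ in range(20)]

# Shallow probe: nearest-centroid classifier over the frozen features.
# Only this probe is "trained"; the feature extractor stays fixed.
def fit_centroids(data):
    sums, counts = {}, {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))

centroids = fit_centroids(train)
accuracy = sum(predict(centroids, x) == y for x, y in test) / len(test)
print(f"probe accuracy: {accuracy:.2f}")
```

The key design point the paper exploits is that only the cheap probe is fit per task, so one expensive feature extraction run can be reused across tagging, genre, emotion, and key-detection probes.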

Results

Task                | Dataset  | Metric | Value | Model
Emotion Recognition | Emomusic | EmoA   | 72.1  | Jukebox (Pre-training: CALM)
Emotion Recognition | Emomusic | EmoV   | 61.7  | Jukebox (Pre-training: CALM)
Emotion Recognition | Emomusic | EmoA   | 67.8  | CLMR (Pre-training: contrastive)
Emotion Recognition | Emomusic | EmoV   | 45.8  | CLMR (Pre-training: contrastive)
