Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Spirit LM: Interleaved Spoken and Written Language Model

Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-Jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux

2024-02-08 · Language Modelling

Abstract

We introduce Spirit LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single stream of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. Spirit LM comes in two versions: a Base version that uses speech phonetic units (HuBERT) and an Expressive version that models expressivity using pitch and style units in addition to the phonetic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that Spirit LM can learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification). We make available model weights and inference code.
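The word-level interleaving described in the abstract can be illustrated with a minimal sketch. It assumes each word is already aligned to both its text subword tokens and its speech units, and that the stream switches modality only at word boundaries. The marker tokens `[TEXT]` and `[SPEECH]`, the unit names such as `[Hu1]`, the function `interleave`, and the `switch_prob` parameter are illustrative assumptions, not taken from the released code.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical modality markers; the real vocabulary may differ.
TEXT_MARK = "[TEXT]"
SPEECH_MARK = "[SPEECH]"

# One aligned word: (text subword tokens, speech unit tokens).
AlignedWord = Tuple[Sequence[str], Sequence[str]]


def interleave(
    words: Sequence[AlignedWord],
    switch_prob: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Build one token stream that switches modality only at word boundaries."""
    rng = rng or random.Random()
    stream: List[str] = []
    modality: Optional[str] = None
    for text_tokens, speech_units in words:
        if modality is None:
            # Pick the starting modality at random.
            modality = rng.choice(["text", "speech"])
            changed = True
        elif rng.random() < switch_prob:
            # Flip modality at this word boundary.
            modality = "speech" if modality == "text" else "text"
            changed = True
        else:
            changed = False
        if changed:
            # Emit a marker whenever the modality starts or changes.
            stream.append(TEXT_MARK if modality == "text" else SPEECH_MARK)
        stream.extend(text_tokens if modality == "text" else speech_units)
    return stream


# Toy aligned sentence; the unit names are made up for illustration.
example = [
    (["the"], ["[Hu1]", "[Hu2]"]),
    (["cat"], ["[Hu3]"]),
    (["sat"], ["[Hu4]", "[Hu5]"]),
]
print(interleave(example, switch_prob=0.5, rng=random.Random(0)))
```

Setting `switch_prob` to 0 yields a single-modality sequence, and setting it to 1 switches at every word; how the paper actually samples switch points is not stated in the abstract.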

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Language Modelling | 2000 HUB5 English | 10-stage average accuracy | 10 | MMLU |
| Language Modelling | SALMon | Background (Domain) Consistency | 55 | Spirit-LM (Expr.) |
| Language Modelling | SALMon | Background (Random) Consistency | 64 | Spirit-LM (Expr.) |
| Language Modelling | SALMon | Background Alignment | 59.5 | Spirit-LM (Expr.) |
| Language Modelling | SALMon | Gender Consistency | 85 | Spirit-LM (Expr.) |
| Language Modelling | SALMon | Room Consistency | 54.5 | Spirit-LM (Expr.) |
| Language Modelling | SALMon | Sentiment Alignment | 52 | Spirit-LM (Expr.) |
| Language Modelling | SALMon | Sentiment Consistency | 73.5 | Spirit-LM (Expr.) |
| Language Modelling | SALMon | Speaker Consistency | 81 | Spirit-LM (Expr.) |
| Language Modelling | SALMon | Background (Domain) Consistency | 53.5 | Spirit-LM (base) |
| Language Modelling | SALMon | Background (Random) Consistency | 55.5 | Spirit-LM (base) |
| Language Modelling | SALMon | Background Alignment | 51.5 | Spirit-LM (base) |
| Language Modelling | SALMon | Gender Consistency | 67 | Spirit-LM (base) |
| Language Modelling | SALMon | Room Consistency | 54.5 | Spirit-LM (base) |
| Language Modelling | SALMon | Sentiment Alignment | 48 | Spirit-LM (base) |
| Language Modelling | SALMon | Sentiment Consistency | 54.5 | Spirit-LM (base) |
| Language Modelling | SALMon | Speaker Consistency | 69.5 | Spirit-LM (base) |
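
The SALMon rows above pair each metric for the Expressive and base models, so the per-metric gap can be computed directly. The sketch below copies the values from the table; the names `salmon`, `deltas`, and `mean` are illustrative, and the unweighted average is only a convenience summary, not a score reported by the page.

```python
from typing import Dict, List, Tuple

# Metric -> (Spirit-LM (Expr.), Spirit-LM (base)), copied from the results table.
salmon: Dict[str, Tuple[float, float]] = {
    "Background (Domain) Consistency": (55.0, 53.5),
    "Background (Random) Consistency": (64.0, 55.5),
    "Background Alignment": (59.5, 51.5),
    "Gender Consistency": (85.0, 67.0),
    "Room Consistency": (54.5, 54.5),
    "Sentiment Alignment": (52.0, 48.0),
    "Sentiment Consistency": (73.5, 54.5),
    "Speaker Consistency": (81.0, 69.5),
}


def deltas(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Expressive minus base, per metric."""
    return {metric: expr - base for metric, (expr, base) in scores.items()}


def mean(values: List[float]) -> float:
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


gaps = deltas(salmon)
# Print metrics from largest to smallest gap.
for metric, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{metric}: {gap:+.1f}")
print("mean Expr.:", mean([e for e, _ in salmon.values()]))
print("mean base:", mean([b for _, b in salmon.values()]))
```

On these numbers the largest gaps are Sentiment Consistency (+19) and Gender Consistency (+18), while Room Consistency is identical for both models.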

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)