Spirit LM: Interleaved Spoken and Written Language Model

Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-Jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux

2024-02-08 · Language Modelling

Abstract

We introduce Spirit LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single stream of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. Spirit LM comes in two versions: a Base version that uses speech phonetic units (HuBERT) and an Expressive version that models expressivity using pitch and style units in addition to the phonetic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that Spirit LM can learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification). We make available model weights and inference code.
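The word-level interleaving described in the abstract can be illustrated with a minimal sketch. It assumes each word already has an aligned list of speech-unit IDs, and it switches modality at word boundaries with a fixed probability. The marker strings `[TEXT]` and `[SPEECH]`, the `[Hu<id>]` unit naming, the function `interleave`, and the parameters `switch_prob` and `start` are all hypothetical choices made for illustration, not taken from the abstract. Whole words also stand in for the subword BPE tokens the model actually uses.

```python
import random

# Hypothetical modality markers; the abstract does not name the real ones.
TEXT_MARK = "[TEXT]"
SPEECH_MARK = "[SPEECH]"


def interleave(words, speech_units, switch_prob=0.3, start="text", seed=0):
    """Build one token stream that mixes text and speech at word boundaries.

    words        -- list of words (stand-ins for BPE subword tokens)
    speech_units -- list of lists of unit IDs, one list per word (assumed aligned)
    switch_prob  -- chance of changing modality before each word after the first
    start        -- modality of the first word: "text" or "speech"
    """
    if len(words) != len(speech_units):
        raise ValueError("words and speech_units must be aligned one-to-one")
    if start not in ("text", "speech"):
        raise ValueError("start must be 'text' or 'speech'")

    rng = random.Random(seed)
    modality = start
    stream = [TEXT_MARK if modality == "text" else SPEECH_MARK]

    for i, (word, units) in enumerate(zip(words, speech_units)):
        # Modality may only change at a word boundary.
        if i > 0 and rng.random() < switch_prob:
            modality = "speech" if modality == "text" else "text"
            stream.append(TEXT_MARK if modality == "text" else SPEECH_MARK)
        if modality == "text":
            stream.append(word)
        else:
            stream.extend(f"[Hu{u}]" for u in units)
    return stream


if __name__ == "__main__":
    words = ["the", "cat", "sat"]
    units = [[71, 12], [5, 5, 33], [8]]  # made-up unit IDs
    print(interleave(words, units, switch_prob=0.5))
```

Because both modalities end up in one flat sequence, a standard next-token objective can be applied to the result without any architectural change, which is the appeal of the single-stream formulation.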

Results

Task	Dataset	Metric	Value	Model
Language Modelling	2000 HUB5 English	10-stage average accuracy	10	MMLU
Language Modelling	SALMon	Background (Domain) Consistency	55	Spirit-LM (Expr.)
Language Modelling	SALMon	Background (Random) Consistency	64	Spirit-LM (Expr.)
Language Modelling	SALMon	Background Alignment	59.5	Spirit-LM (Expr.)
Language Modelling	SALMon	Gender Consistency	85	Spirit-LM (Expr.)
Language Modelling	SALMon	Room Consistency	54.5	Spirit-LM (Expr.)
Language Modelling	SALMon	Sentiment Alignment	52	Spirit-LM (Expr.)
Language Modelling	SALMon	Sentiment Consistency	73.5	Spirit-LM (Expr.)
Language Modelling	SALMon	Speaker Consistency	81	Spirit-LM (Expr.)
Language Modelling	SALMon	Background (Domain) Consistency	53.5	Spirit-LM (base)
Language Modelling	SALMon	Background (Random) Consistency	55.5	Spirit-LM (base)
Language Modelling	SALMon	Background Alignment	51.5	Spirit-LM (base)
Language Modelling	SALMon	Gender Consistency	67	Spirit-LM (base)
Language Modelling	SALMon	Room Consistency	54.5	Spirit-LM (base)
Language Modelling	SALMon	Sentiment Alignment	48	Spirit-LM (base)
Language Modelling	SALMon	Sentiment Consistency	54.5	Spirit-LM (base)
Language Modelling	SALMon	Speaker Consistency	69.5	Spirit-LM (base)
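To make the gap between the two variants easier to read, the short script below computes the per-metric difference (Expressive minus base) from the SALMon rows above. The scores are copied from the table; the variable names are illustrative only.

```python
# SALMon scores copied from the results table above.
expressive = {
    "Background (Domain) Consistency": 55.0,
    "Background (Random) Consistency": 64.0,
    "Background Alignment": 59.5,
    "Gender Consistency": 85.0,
    "Room Consistency": 54.5,
    "Sentiment Alignment": 52.0,
    "Sentiment Consistency": 73.5,
    "Speaker Consistency": 81.0,
}
base = {
    "Background (Domain) Consistency": 53.5,
    "Background (Random) Consistency": 55.5,
    "Background Alignment": 51.5,
    "Gender Consistency": 67.0,
    "Room Consistency": 54.5,
    "Sentiment Alignment": 48.0,
    "Sentiment Consistency": 54.5,
    "Speaker Consistency": 69.5,
}

# Difference in points, Expressive minus base, for every metric.
gains = {metric: expressive[metric] - base[metric] for metric in expressive}

# Print metrics from largest to smallest gain.
for metric, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{metric:<34} {gain:+.1f}")
```

On these numbers the largest differences are on Sentiment Consistency (+19.0) and Gender Consistency (+18.0), while Room Consistency is identical for both variants.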
