Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Textually Pretrained Speech Language Models

Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi

2023-05-22 · NeurIPS 2023 11 · Language Modelling

Abstract

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from pretrained textual language models. We show, using both automatic and human evaluations, that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ .
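
The difference between a warm start and a cold start can be illustrated with a small sketch. The abstract does not spell out the mechanism, so the following rests on an assumption: the transformer body is copied from the text model, while the embedding table and output head are re-initialized for the speech-unit vocabulary. All names here (`make_toy_text_lm`, `warm_start_speech_lm`, `cold_start_speech_lm`) and the toy sizes are hypothetical, and plain Python lists stand in for real weight tensors.

```python
import copy
import random


def init_matrix(rows, cols, rng):
    """Small random matrix standing in for a freshly initialized weight table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def make_toy_text_lm(text_vocab, hidden, rng):
    """Hypothetical stand-in for a pretrained textual LM (not a real checkpoint)."""
    return {
        "embed": init_matrix(text_vocab, hidden, rng),  # text-token lookup table
        "body": init_matrix(hidden, hidden, rng),       # stands in for transformer layers
        "head": init_matrix(text_vocab, hidden, rng),   # output projection over text tokens
    }


def warm_start_speech_lm(text_lm, num_speech_units, rng):
    """Assumed warm start: reuse the body, re-initialize vocabulary-sized tables."""
    hidden = len(text_lm["body"])
    return {
        "embed": init_matrix(num_speech_units, hidden, rng),
        "body": copy.deepcopy(text_lm["body"]),          # inherited from the text LM
        "head": init_matrix(num_speech_units, hidden, rng),
    }


def cold_start_speech_lm(num_speech_units, hidden, rng):
    """Cold start: every parameter is randomly initialized."""
    return {
        "embed": init_matrix(num_speech_units, hidden, rng),
        "body": init_matrix(hidden, hidden, rng),
        "head": init_matrix(num_speech_units, hidden, rng),
    }


rng = random.Random(0)
text_lm = make_toy_text_lm(text_vocab=1000, hidden=8, rng=rng)
warm = warm_start_speech_lm(text_lm, num_speech_units=500, rng=rng)
cold = cold_start_speech_lm(num_speech_units=500, hidden=8, rng=rng)
```

In this sketch both models would then be trained on speech tokens in the same way; only the starting point of the body differs.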

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Language Modelling | SALMon | Background (Domain) Consistency | 55 | TWIST 7B |
| Language Modelling | SALMon | Background (Random) Consistency | 60.5 | TWIST 7B |
| Language Modelling | SALMon | Background Alignment | 54.5 | TWIST 7B |
| Language Modelling | SALMon | Gender Consistency | 70 | TWIST 7B |
| Language Modelling | SALMon | Room Consistency | 62 | TWIST 7B |
| Language Modelling | SALMon | Sentiment Alignment | 51.5 | TWIST 7B |
| Language Modelling | SALMon | Sentiment Consistency | 61.5 | TWIST 7B |
| Language Modelling | SALMon | Speaker Consistency | 71 | TWIST 7B |
| Language Modelling | SALMon | Background (Domain) Consistency | 55.5 | TWIST 1.3B |
| Language Modelling | SALMon | Background (Random) Consistency | 60.5 | TWIST 1.3B |
| Language Modelling | SALMon | Background Alignment | 56.5 | TWIST 1.3B |
| Language Modelling | SALMon | Gender Consistency | 69.5 | TWIST 1.3B |
| Language Modelling | SALMon | Room Consistency | 59 | TWIST 1.3B |
| Language Modelling | SALMon | Sentiment Alignment | 53 | TWIST 1.3B |
| Language Modelling | SALMon | Sentiment Consistency | 61.5 | TWIST 1.3B |
| Language Modelling | SALMon | Speaker Consistency | 69 | TWIST 1.3B |
| Language Modelling | SALMon | Background (Domain) Consistency | 54 | TWIST 350M |
| Language Modelling | SALMon | Background (Random) Consistency | 61.5 | TWIST 350M |
| Language Modelling | SALMon | Background Alignment | 56.5 | TWIST 350M |
| Language Modelling | SALMon | Gender Consistency | 68 | TWIST 350M |
| Language Modelling | SALMon | Room Consistency | 59 | TWIST 350M |
| Language Modelling | SALMon | Sentiment Alignment | 51.5 | TWIST 350M |
| Language Modelling | SALMon | Sentiment Consistency | 59 | TWIST 350M |
| Language Modelling | SALMon | Speaker Consistency | 69.5 | TWIST 350M |
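
To compare the three model sizes at a glance, the SALMon values listed above can be averaged per model. The unweighted mean computed here is only an illustrative summary and is not a figure reported on this page; the dictionary layout and the name `average_scores` are assumptions made for the sketch.

```python
from statistics import mean

# Metric order: Background (Domain) Consistency, Background (Random) Consistency,
# Background Alignment, Gender Consistency, Room Consistency,
# Sentiment Alignment, Sentiment Consistency, Speaker Consistency.
SALMON_SCORES = {
    "TWIST 7B": [55, 60.5, 54.5, 70, 62, 51.5, 61.5, 71],
    "TWIST 1.3B": [55.5, 60.5, 56.5, 69.5, 59, 53, 61.5, 69],
    "TWIST 350M": [54, 61.5, 56.5, 68, 59, 51.5, 59, 69.5],
}


def average_scores(scores):
    """Return the unweighted mean of the listed SALMon values for each model."""
    return {model: mean(values) for model, values in scores.items()}


averages = average_scores(SALMON_SCORES)
for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

Because the eight metrics measure different properties, an unweighted mean should be read as a rough orientation rather than a ranking.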
