Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang

2023-08-22
Tasks: Speech-to-Speech Translation, Machine Translation, Automatic Speech Recognition, Speech-to-Text Translation, Speech-to-Text, Text to Speech, Translation, text-to-speech

Abstract

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Machine Translation | flores95-devtest X-eng | ChrF++ | 60.8 | SeamlessM4T Large |
| Machine Translation | flores95-devtest X-eng | ChrF++ | 60.7 | SeamlessM4T-NLLB-1.3B |
| Machine Translation | flores95-devtest X-eng | ChrF++ | 55.4 | SeamlessM4T Medium |
| Machine Translation | flores95-devtest eng-X | ChrF++ | 50.9 | SeamlessM4T Large |
| Machine Translation | flores95-devtest eng-X | ChrF++ | 49.6 | SeamlessM4T-NLLB-1.3B |
| Machine Translation | flores95-devtest eng-X | ChrF++ | 48.4 | SeamlessM4T Medium |
| Machine Translation | FLoRes-200 | BLEU | 37.5 | SeamlessM4T-Large-V1 |
| Speech-to-Text Translation | FLEURS X-eng | BLEU | 24 | SeamlessM4T Large |
| Speech-to-Text Translation | FLEURS X-eng | BLEU | 20.9 | SeamlessM4T Medium |
| Speech-to-Text Translation | FLEURS eng-X | BLEU | 21.5 | SeamlessM4T Large |
| Speech-to-Text Translation | FLEURS eng-X | BLEU | 19.2 | SeamlessM4T Medium |
| Speech-to-Text Translation | CoVoST 2 X-eng | BLEU | 34.1 | SeamlessM4T Large |
| Speech-to-Text Translation | CoVoST 2 X-eng | BLEU | 29.8 | SeamlessM4T Medium |
| Speech-to-Text Translation | CoVoST 2 eng-X | BLEU | 30.6 | SeamlessM4T Large |
| Speech-to-Text Translation | CoVoST 2 eng-X | BLEU | 26.6 | SeamlessM4T Medium |
| Speech-to-Speech Translation | FLEURS X-eng | ASR-BLEU | 29.4 | SeamlessM4T LargeV2 |
| Speech-to-Speech Translation | FLEURS X-eng | ASR-BLEU | 25.8 | SeamlessM4T Large |
| Speech-to-Speech Translation | FLEURS X-eng | ASR-BLEU | 20.4 | SeamlessM4T Medium |
| Speech-to-Speech Translation | CVSS | ASR-BLEU | 36.5 | SeamlessM4T Large |
| Speech-to-Speech Translation | CVSS | ASR-BLEU | 28.1 | SeamlessM4T Medium |
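
To make the Large versus Medium comparison in these results easier to read, the following is a minimal Python sketch that computes the score gap per benchmark. The numbers are copied from the results listed above; the dictionary layout, the short keys `"Large"` and `"Medium"`, and the helper name `size_gap` are illustrative assumptions, not part of any SeamlessM4T tooling.

```python
# Scores copied from the results listed above.
# Keys are (task, dataset, metric); "Large" and "Medium" are shorthand
# for the SeamlessM4T Large and SeamlessM4T Medium rows (hypothetical labels).
RESULTS = {
    ("Machine Translation", "flores95-devtest X-eng", "ChrF++"): {"Large": 60.8, "Medium": 55.4},
    ("Machine Translation", "flores95-devtest eng-X", "ChrF++"): {"Large": 50.9, "Medium": 48.4},
    ("Speech-to-Text Translation", "FLEURS X-eng", "BLEU"): {"Large": 24.0, "Medium": 20.9},
    ("Speech-to-Text Translation", "FLEURS eng-X", "BLEU"): {"Large": 21.5, "Medium": 19.2},
    ("Speech-to-Text Translation", "CoVoST 2 X-eng", "BLEU"): {"Large": 34.1, "Medium": 29.8},
    ("Speech-to-Text Translation", "CoVoST 2 eng-X", "BLEU"): {"Large": 30.6, "Medium": 26.6},
    ("Speech-to-Speech Translation", "FLEURS X-eng", "ASR-BLEU"): {"Large": 25.8, "Medium": 20.4},
    ("Speech-to-Speech Translation", "CVSS", "ASR-BLEU"): {"Large": 36.5, "Medium": 28.1},
}


def size_gap(scores: dict[str, float]) -> float:
    """Return the Large-minus-Medium score difference, rounded to one decimal."""
    return round(scores["Large"] - scores["Medium"], 1)


# Gap per benchmark, in the units of that benchmark's own metric.
GAPS = {key: size_gap(scores) for key, scores in RESULTS.items()}

if __name__ == "__main__":
    for (task, dataset, metric), gap in GAPS.items():
        print(f"{task} | {dataset} | {metric}: +{gap}")
```

The gaps are expressed in each benchmark's own metric (ChrF++, BLEU, or ASR-BLEU), so they are comparable within a metric but not across metrics.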

Related Papers

Hear Your Code Fail, Voice-Assisted Debugging for Python (2025-07-20)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge (2025-07-15)
Function-to-Style Guidance of LLMs for Code Translation (2025-07-15)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments (2025-07-14)