Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multimodal Machine Translation through Visuals and Speech

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann

Published: 2019-11-28
Tasks: Speech Recognition, Machine Translation, Multimodal Machine Translation, Video Captioning, Translation, Image Captioning
Paper · PDF

Abstract

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.

Results

Task | Dataset | Metric | Value | Model
Machine Translation | Multi30K | BLEU (EN-DE) | 39.4 | Caglayan
Machine Translation | Multi30K | Meteor (EN-DE) | 58.7 | Caglayan
Multimodal Machine Translation | Multi30K | BLEU (EN-DE) | 39.4 | Caglayan
Multimodal Machine Translation | Multi30K | Meteor (EN-DE) | 58.7 | Caglayan
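The BLEU scores reported above measure n-gram overlap between a system translation and a reference. As a minimal illustration of the idea (a simplified sentence-level sketch with clipped n-gram precision and a brevity penalty, not the sacrebleu implementation the evaluation campaigns actually use), one could write:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sketch of BLEU: geometric mean of clipped n-gram precisions
    for orders 1..max_n, scaled by a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = max(sum(hyp_counts.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)

print(bleu("a man rides a bicycle", "a man rides a bicycle"))  # identical -> 1.0
```

Production evaluations (e.g. the WMT multimodal tasks) compute this at the corpus level with standardized tokenization, which is why published numbers should come from a tool like sacrebleu rather than an ad-hoc reimplementation.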

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Function-to-Style Guidance of LLMs for Code Translation (2025-07-15)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation (2025-07-09)