Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multimodal Machine Translation through Visuals and Speech

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann

Published: 2019-11-28
Tasks: Speech Recognition, Machine Translation, Multimodal Machine Translation, Video Captioning, Translation, Image Captioning
Paper · PDF

Abstract

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.

Results

Task | Dataset | Metric | Value | Model
Machine Translation | Multi30K | BLEU (EN-DE) | 39.4 | Caglayan
Machine Translation | Multi30K | Meteor (EN-DE) | 58.7 | Caglayan
Multimodal Machine Translation | Multi30K | BLEU (EN-DE) | 39.4 | Caglayan
Multimodal Machine Translation | Multi30K | Meteor (EN-DE) | 58.7 | Caglayan
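The BLEU scores reported above measure n-gram overlap between a system translation and a reference. As a minimal illustration of the idea (a simplified sentence-level sketch with clipped n-gram precision and a brevity penalty, not the sacrebleu implementation the evaluation campaigns actually use), one could write:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sketch of BLEU: geometric mean of clipped n-gram precisions
    for orders 1..max_n, scaled by a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = max(sum(hyp_counts.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)

print(bleu("a man rides a bicycle", "a man rides a bicycle"))  # identical -> 1.0
```

Production evaluations (e.g. the WMT multimodal tasks) compute this at the corpus level with standardized tokenization, which is why published numbers should come from a tool like sacrebleu rather than an ad-hoc reimplementation.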

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Function-to-Style Guidance of LLMs for Code Translation (2025-07-15)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation (2025-07-09)