Multimodal Transformer for Unaligned Multimodal Language Sequences

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

2019-06-01 | ACL 2019 | Time Series, Time Series Analysis, Multimodal Sentiment Analysis

Abstract

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, modeling such multimodal human language time-series data poses two major challenges: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism in MulT can capture correlated crossmodal signals.
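To make the idea of directional crossmodal attention concrete, the following is a minimal single-head sketch, assuming the standard scaled dot-product form in which queries come from the target modality and keys and values come from the source modality. It is not the authors' implementation: the function names, the toy dimensions, and the random projection matrices are all hypothetical, and positional embeddings, multiple heads, residual connections, and layer normalization are omitted. What it illustrates is that the two sequences may have different lengths, so no explicit alignment is required.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source, w_q, w_k, w_v):
    """Attend from `target` (T_t steps) to `source` (T_s steps).

    Queries are projected from the target modality; keys and values are
    projected from the source modality. The output has one row per target
    step, whatever the source length is.
    """
    q = matmul(target, w_q)                      # T_t x d_k
    k = matmul(source, w_k)                      # T_s x d_k
    v = matmul(source, w_v)                      # T_s x d_v
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]         # d_k x T_s
    scores = matmul(q, k_t)                      # T_t x T_s
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a matrix of standard-normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Toy, unaligned inputs: 5 text steps (dim 4) and 12 audio steps (dim 3).
text = random_matrix(5, 4, rng)
audio = random_matrix(12, 3, rng)
# Hypothetical projection weights mapping both modalities to dimension 2.
w_q = random_matrix(4, 2, rng)
w_k = random_matrix(3, 2, rng)
w_v = random_matrix(3, 2, rng)

# "Audio -> text" direction: each text step gathers information from all audio steps.
out, weights = crossmodal_attention(text, audio, w_q, w_k, w_v)
print(len(out), len(out[0]))          # rows follow the target (text) length
print(len(weights), len(weights[0]))  # one weight per (text step, audio step) pair
```

Swapping the `target` and `source` arguments (with matching projection shapes) gives the opposite direction, which is why the attention is described as directional and pairwise.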

Results

Task	Dataset	Metric	Value	Model
Sentiment Analysis	MOSI	Accuracy	83	MulT
Sentiment Analysis	MOSI	F1 score	82.8	MulT
