Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment

Tianshu Yu, Haoyu Gao, Ting-En Lin, Min Yang, Yuchuan Wu, Wentao Ma, Chao Wang, Fei Huang, Yongbin Li

Date: 2023-05-19
Tasks: Emotion Recognition in Conversation, Multimodal Intent Recognition, Cross-Modal Alignment, Multimodal Sentiment Analysis

Abstract

Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are tailored to one or two specific tasks and do not generalize to a wide range of speech-text tasks. In addition, existing speech-text pre-training methods do not exploit the contextual information within a dialogue to enrich utterance representations. In this paper, we propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first-ever speech-text dialog pre-training model. Concretely, to account for the temporality of the speech modality, we design a novel temporal position prediction task to capture the speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to speech-text dialog pre-training scenarios. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
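
The two pre-training objectives named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the loss function or how negative responses are drawn, so the squared-error loss, the normalization by audio length, the random-distractor sampling, and every function and parameter name below are assumptions made for illustration. This is not the SPECTRA implementation, and it works on plain text and timestamps rather than on speech and text encoder outputs.

```python
import random


def temporal_position_loss(pred_spans, gold_spans, audio_len):
    """Mean squared error over per-word (start, end) times, in seconds.

    Assumption: errors are divided by the audio length so the loss does not
    depend on how long the utterance is. The abstract only says that the
    start and end time of each word is predicted.
    """
    if len(pred_spans) != len(gold_spans):
        raise ValueError("one predicted span per word is required")
    total = 0.0
    for (pred_start, pred_end), (gold_start, gold_end) in zip(pred_spans, gold_spans):
        total += ((pred_start - gold_start) / audio_len) ** 2
        total += ((pred_end - gold_end) / audio_len) ** 2
    # Two targets (start and end) per word.
    return total / (2 * len(gold_spans))


def response_selection_examples(dialog, distractor_pool, rng):
    """Build (context, response, label) triples from one multi-turn dialog.

    For every turn after the first, the preceding turns form the context.
    The true next turn gets label 1; a randomly drawn distractor gets label 0.
    Assumption: distractors are sampled uniformly from a hypothetical pool.
    """
    examples = []
    for i in range(1, len(dialog)):
        context, response = dialog[:i], dialog[i]
        examples.append((context, response, 1))
        candidates = [u for u in distractor_pool if u != response]
        if candidates:
            examples.append((context, rng.choice(candidates), 0))
    return examples


# Toy data: two words with gold alignments, and a three-turn dialog.
gold = [(0.0, 0.5), (0.5, 1.0)]
pred = [(0.0, 0.5), (0.5, 2.0)]
loss = temporal_position_loss(pred, gold, audio_len=2.0)

dialog = ["hi", "hello", "how are you"]
pool = ["bye", "thanks"]
examples = response_selection_examples(dialog, pool, random.Random(0))
```

In this toy setup a classifier would be trained to tell label 1 from label 0 given the dialog context, while the regression loss pushes word representations toward their positions in the audio.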

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Reading Comprehension | MIntRec | Accuracy (20 classes) | 73.48 | SPECTRA |
| Emotion Recognition | IEMOCAP | Accuracy | 67.94 | SPECTRA |
| Sentiment Analysis | MOSI | Accuracy | 87.5 | SPECTRA |
| Sentiment Analysis | CMU-MOSEI | Accuracy | 87.34 | SPECTRA |
| Sentiment Analysis | CMU-MOSI | Acc-2 | 87.5 | SPECTRA |
| Intent Recognition | MIntRec | Accuracy (20 classes) | 73.48 | SPECTRA |

Related Papers

Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation (2025-07-21)
Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
CATVis: Context-Aware Thought Visualization (2025-07-15)
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
Skywork-R1V3 Technical Report (2025-07-08)
RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models (2025-07-08)