The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems

Ryan Lowe, Nissan Pow, Iulian Serban, Joelle Pineau

2015-06-30 | WS 2015 9 | Answer Selection, Conversational Response Selection


Abstract

This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter. We also describe two neural learning architectures suitable for analyzing this dataset, and provide benchmark performance on the task of selecting the best next response.

Results

Task	Dataset	Metric	Value	Model
Conversational Response Selection	Ubuntu Dialogue (v1, Ranking)	R10@1	0.604	Dual-LSTM
Conversational Response Selection	Ubuntu Dialogue (v1, Ranking)	R10@2	0.745	Dual-LSTM
Conversational Response Selection	Ubuntu Dialogue (v1, Ranking)	R10@5	0.926	Dual-LSTM
Conversational Response Selection	Ubuntu Dialogue (v1, Ranking)	R2@1	0.878	Dual-LSTM
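The metrics above are conventionally read as recall at k among n candidates: Rn@k is the fraction of test contexts for which the true response is ranked in the top k of n candidate responses. A minimal sketch of that computation follows. It is not the authors' evaluation code; the function name `recall_at_k`, the score layout, and the convention that the true response sits at index 0 of each candidate list are assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(
    scores: Sequence[Sequence[float]], k: int, true_index: int = 0
) -> float:
    """Fraction of examples whose true response is ranked in the top k.

    scores[i][j] is a model's score for candidate j of example i.
    Assumption: the true response is at position `true_index` in every
    candidate list, and every list has the same length n.
    """
    if not scores:
        raise ValueError("scores must contain at least one example")
    hits = 0
    for candidate_scores in scores:
        # Candidate indices ordered from highest to lowest score.
        ranked = sorted(
            range(len(candidate_scores)),
            key=lambda j: candidate_scores[j],
            reverse=True,
        )
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy example with n = 3 candidates per context (made-up scores).
toy_scores = [
    [0.9, 0.1, 0.3],  # true response ranked 1st
    [0.2, 0.8, 0.5],  # true response ranked 3rd
    [0.4, 0.6, 0.1],  # true response ranked 2nd
]
r3_at_1 = recall_at_k(toy_scores, k=1)
r3_at_2 = recall_at_k(toy_scores, k=2)
```

With n = 10 and k = 1, the same function would yield the quantity reported as R10@1 in the table, given per-candidate scores from a response-selection model.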
