WikiHow: A Large Scale Text Summarization Dataset

Mahnaz Koupaee, William Yang Wang

2018-10-18 | Text Summarization


Abstract

Sequence-to-sequence models have recently achieved state-of-the-art performance in summarization. However, few large-scale, high-quality datasets are available, and almost all of them consist mainly of news articles written in a specific style. Moreover, abstractive, human-style systems that describe content at a deeper level require data with higher levels of abstraction. In this paper, we present WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and therefore represent a high diversity of styles. We evaluate the performance of existing methods on WikiHow to present its challenges and to set baselines for future work to improve on.

Results

Task	Dataset	Metric	Value	Model
Text Summarization	WikiHow	ROUGE-1	28.53	Pointer-generator + coverage
Text Summarization	WikiHow	ROUGE-2	9.23	Pointer-generator + coverage
Text Summarization	WikiHow	ROUGE-L	26.54	Pointer-generator + coverage
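
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how a ROUGE-N score can be computed from n-gram overlap between a reference summary and a generated one. It is a simplified illustration, not the evaluation code behind the reported numbers: the function names `ngrams` and `rouge_n` are hypothetical, tokenization is plain whitespace splitting, and no stemming or stopword handling is applied. ROUGE-L, which is based on the longest common subsequence, is not covered here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two strings.

    Assumption: lowercase whitespace tokenization, which official
    ROUGE tooling does not use as-is.
    """
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())

    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n(reference, candidate, n=1))
    print(rouge_n(reference, candidate, n=2))
```

The function returns fractions between 0 and 1; result tables such as the one above commonly list these values multiplied by 100.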
