SubSumE
Texts | CC-BY-4.0 License | Introduced 2021-11-01
SubSumE Dataset
This repository contains the SubSumE dataset for subjective document summarization. See the paper and the talk for details on dataset creation. Also check out our work SuDocu on example-based document summarization.
Dataset Files
Download the dataset from here.
The dataset contains:
- Simplified text from 48 Wikipedia pages of the states in the US. Additionally, all the sentences in these documents are put together in a single file, `processed_state_sentences.csv`, and each is assigned a unique sentence id that is used in the summary JSON files.
- Intent-based summaries created by human annotators. Each datapoint file in the directory `user_summary_jsons` contains a JSON object with summaries of the Wikipedia pages of eight states, with the following keys:
  - intent: Summarization intent provided to human annotators for generating the summary
  - summaries: List of summary JSON objects for the eight states assigned to the annotator. Each object in the list contains the following keys:
    - state_name: Name of the state
    - sentence_ids: Global ids of sentences (with respect to `processed_state_sentences.csv`) present in the summary
    - sentences: List of sentences present in the summary
    - use_keywords: Keywords used by the annotator to search the document when creating summaries
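
The key layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the released code: the helper names `load_user_summary` and `summarize_record` are hypothetical, the `example` record is made up for illustration, and the sketch assumes `use_keywords` is a list (it falls back to an empty list if the key is absent).

```python
import json


def load_user_summary(path):
    """Read one annotator file from user_summary_jsons into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_record(record):
    """Return (intent, rows), with one row per state summary in the record."""
    rows = []
    for summary in record["summaries"]:
        rows.append({
            "state_name": summary["state_name"],
            "num_sentences": len(summary["sentence_ids"]),
            # Assumption: use_keywords is a list; default to empty if missing.
            "keywords": summary.get("use_keywords", []),
        })
    return record["intent"], rows


# Made-up record that only mirrors the documented keys; not real dataset content.
example = {
    "intent": "hypothetical intent text",
    "summaries": [
        {
            "state_name": "Ohio",
            "sentence_ids": [101, 102],
            "sentences": ["First sentence.", "Second sentence."],
            "use_keywords": ["river"],
        }
    ],
}

intent, rows = summarize_record(example)
```

To run this on the real data, pass the result of `load_user_summary` for any file in `user_summary_jsons` to `summarize_record`. The column names of `processed_state_sentences.csv` are not listed here, so the sketch does not attempt to join against that file.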