# OpenAsp Dataset

OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and Multi-News summarization datasets.

## Dataset Access

To generate OpenAsp, you require access to the DUC dataset from which OpenAsp is derived.

Steps:

- Request access to the DUC dataset by following the NIST instructions here.
  - you should receive two user-password pairs (for DUC01-02 and DUC06-07)
  - you should receive a file named `fwdrequestingducdata.zip`
- Clone this repository by running the following command:

  ```bash
  git clone https://github.com/liatschiff/OpenAsp.git
  ```

- Optionally create a `conda` or `virtualenv` environment:

  ```bash
  conda create -n openasp 'python>3.10,<3.11'
  conda activate openasp
  ```

- Install the python requirements; currently requires python3.8-3.10 (later python versions have issues with `spacy`):

  ```bash
  pip install -r requirements.txt
  ```

- Copy `fwdrequestingducdata.zip` into the `OpenAsp` repo directory.
- Run the prepare script command:

  ```bash
  python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'
  ```
- Load the dataset using huggingface `datasets`:

  ```python
  from glob import glob
  import os
  import gzip
  import shutil

  from datasets import load_dataset

  openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')

  data_files = {
      os.path.basename(fname).split('.')[0]: fname
      for fname in glob(openasp_files)
  }

  # decompress each .jsonl.gz split into a plain .jsonl file
  for ftype, fname in data_files.copy().items():
      with gzip.open(fname, 'rb') as gz_file:
          with open(fname[:-3], 'wb') as output_file:
              shutil.copyfileobj(gz_file, output_file)
      data_files[ftype] = fname[:-3]

  # load OpenAsp as huggingface's dataset
  openasp = load_dataset('json', data_files=data_files)

  # print the first sample from every split
  for split in ['train', 'valid', 'test']:
      sample = openasp[split][0]

      # print title, aspect_label, summary and documents for the sample
      title = sample['title']
      aspect_label = sample['aspect_label']
      summary = '\n'.join(sample['summary_text'])
      input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

      print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
      print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
      print(f'\naspect-based summary:\n {summary}')
      print('\ninput documents:\n')
      for i, doc_txt in enumerate(input_docs_text):
          print(f'---- doc #{i} ----')
          print(doc_txt[:256] + '...')
      print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *\n\n\n')
  ```
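If you want to inspect a split without huggingface `datasets`, the same decompress-and-parse pattern can be applied directly with the standard library. The sketch below demonstrates it on a synthetic record written to a temporary file; the field names mirror the sample fields used above, but the record content is made up for illustration:

```python
import gzip
import json
import os
import tempfile

# a synthetic record mimicking the OpenAsp sample fields used above
record = {
    'title': 'example topic',
    'aspect_label': 'example aspect',
    'summary_text': ['First summary sentence.', 'Second summary sentence.'],
    'documents': [{'text': ['First doc sentence.', 'Second doc sentence.']}],
}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'test.jsonl.gz')

# write one JSON object per line, gzip-compressed (the .jsonl.gz layout)
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write(json.dumps(record) + '\n')

# read it back, decompressing on the fly instead of writing a .jsonl copy
with gzip.open(path, 'rt', encoding='utf-8') as f:
    samples = [json.loads(line) for line in f]

print(samples[0]['aspect_label'])  # -> example aspect
```

Reading with `gzip.open(..., 'rt')` avoids leaving decompressed `.jsonl` copies on disk, which is convenient for a quick look at the data.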
## Troubleshooting

1. Dataset failed loading with `load_dataset()` - you may want to delete the huggingface `datasets` cache folder.
2. `401 Client Error: Unauthorized` - your DUC credentials are incorrect; please verify them (case sensitive, no extra spaces, etc.).
3. Dataset created but prints a warning about content verification - you may be using a different version of the `NLTK` or `spacy` model, which affects the sentence tokenization process. You must use the exact versions pinned in `requirements.txt`.
4. `IndexError: list index out of range` - similar to (3); try reinstalling the requirements with exact package versions.
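Since the last two issues usually stem from package-version drift, a quick way to audit your environment is to query the installed package metadata and compare against the pins. The package list below is illustrative; `requirements.txt` is the authoritative source:

```python
from importlib.metadata import version, PackageNotFoundError

# packages whose versions commonly affect tokenization/loading here;
# this list is illustrative -- compare against requirements.txt
packages = ['nltk', 'spacy', 'datasets']

for name in packages:
    try:
        print(f'{name}=={version(name)}')
    except PackageNotFoundError:
        print(f'{name} is NOT installed')
```

If any printed version differs from the pin in `requirements.txt`, reinstall with `pip install -r requirements.txt` before regenerating the dataset.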
## Under The Hood

The `prepare_openasp_dataset.py` script downloads the DUC and Multi-News source files, uses the `sacrerouge` package to prepare the datasets, and uses the `openasp_v1_dataset_metadata.json` file to extract the relevant aspect summaries and compile the final OpenAsp dataset.
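As a rough illustration of the final compilation step, extracting an aspect-based summary amounts to selecting the sentences of a generic summary that are annotated for a given aspect. Note this is only a sketch: the field names (`aspect_label`, `sentence_indices`) are hypothetical and do not reflect the actual schema of `openasp_v1_dataset_metadata.json`:

```python
# hypothetical sketch: 'aspect_label' and 'sentence_indices' are assumed
# field names, NOT the real openasp_v1_dataset_metadata.json schema
generic_summary = [
    'The storm made landfall on Monday.',
    'Damages are estimated at $2 billion.',
    'Relief agencies began distributing supplies.',
]

aspect_annotations = [
    {'aspect_label': 'Economic Impact', 'sentence_indices': [1]},
    {'aspect_label': 'Relief Efforts', 'sentence_indices': [2]},
]

def compile_aspect_summaries(summary_sentences, annotations):
    """Select, per aspect, the annotated sentences from the generic summary."""
    return {
        ann['aspect_label']: [summary_sentences[i] for i in ann['sentence_indices']]
        for ann in annotations
    }

print(compile_aspect_summaries(generic_summary, aspect_annotations))
```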
## License

This repository, including the `openasp_v1_dataset_metadata.json` and `prepare_openasp_dataset.py` files, is released under the Apache License.

The OpenAsp summaries and source documents for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset - the Multi-News license and the DUC license.