# OpenAsp Dataset

OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and Multi-News summarization datasets.

## Dataset Access

To generate OpenAsp, you require access to the DUC dataset from which OpenAsp is derived.

Steps:

- Request access to the DUC dataset by following the NIST instructions here.
  - you should receive two user-password pairs (for DUC01-02 and DUC06-07)
  - you should receive a file named `fwdrequestingducdata.zip`
- Clone this repository by running the following command:

  ```bash
  git clone https://github.com/liatschiff/OpenAsp.git
  ```

- Optionally create a `conda` or `virtualenv` environment:

  ```bash
  conda create -n openasp 'python>3.10,<3.11'
  conda activate openasp
  ```

- Install the python requirements; currently requires python3.8-3.10 (later python versions have issues with `spacy`):

  ```bash
  pip install -r requirements.txt
  ```

- Copy `fwdrequestingducdata.zip` into the `OpenAsp` repo directory.
- Run the prepare script command:

  ```bash
  python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'
  ```
- Load the dataset using huggingface `datasets`:

  ```python
  from glob import glob
  import os
  import gzip
  import shutil

  from datasets import load_dataset

  openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')

  data_files = {
      os.path.basename(fname).split('.')[0]: fname
      for fname in glob(openasp_files)
  }

  # decompress each .jsonl.gz split into a plain .jsonl file
  for ftype, fname in data_files.copy().items():
      with gzip.open(fname, 'rb') as gz_file:
          with open(fname[:-3], 'wb') as output_file:
              shutil.copyfileobj(gz_file, output_file)
      data_files[ftype] = fname[:-3]

  # load OpenAsp as huggingface's dataset
  openasp = load_dataset('json', data_files=data_files)

  # print the first sample from every split
  for split in ['train', 'valid', 'test']:
      sample = openasp[split][0]

      # print title, aspect_label, summary and documents for the sample
      title = sample['title']
      aspect_label = sample['aspect_label']
      summary = '\n'.join(sample['summary_text'])
      input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

      print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
      print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
      print(f'\naspect-based summary:\n {summary}')
      print('\ninput documents:\n')
      for i, doc_txt in enumerate(input_docs_text):
          print(f'---- doc #{i} ----')
          print(doc_txt[:256] + '...')
      print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *\n\n\n')
  ```
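If you want to inspect a split without huggingface `datasets`, the same decompress-and-parse pattern can be applied directly with the standard library. The sketch below demonstrates it on a synthetic record written to a temporary file; the field names mirror the sample fields used above, but the record content is made up for illustration:

```python
import gzip
import json
import os
import tempfile

# a synthetic record mimicking the OpenAsp sample fields used above
record = {
    'title': 'example topic',
    'aspect_label': 'example aspect',
    'summary_text': ['First summary sentence.', 'Second summary sentence.'],
    'documents': [{'text': ['First doc sentence.', 'Second doc sentence.']}],
}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'test.jsonl.gz')

# write one JSON object per line, gzip-compressed (the .jsonl.gz layout)
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write(json.dumps(record) + '\n')

# read it back, decompressing on the fly instead of writing a .jsonl copy
with gzip.open(path, 'rt', encoding='utf-8') as f:
    samples = [json.loads(line) for line in f]

print(samples[0]['aspect_label'])  # -> example aspect
```

Reading with `gzip.open(..., 'rt')` avoids leaving decompressed `.jsonl` copies on disk, which is convenient for a quick look at the data.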
## Troubleshooting

1. Dataset failed loading with `load_dataset()` - you may want to delete the huggingface `datasets` cache folder.
2. `401 Client Error: Unauthorized` - your DUC credentials are incorrect; please verify them (case sensitive, no extra spaces, etc.).
3. Dataset created but prints a warning about content verification - you may be using a different version of the `NLTK` or `spacy` model, which affects the sentence tokenization process. You must use the exact versions pinned in `requirements.txt`.
4. `IndexError: list index out of range` - similar to (3); try reinstalling the requirements with exact package versions.
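Since the last two issues usually stem from package-version drift, a quick way to audit your environment is to query the installed package metadata and compare against the pins. The package list below is illustrative; `requirements.txt` is the authoritative source:

```python
from importlib.metadata import version, PackageNotFoundError

# packages whose versions commonly affect tokenization/loading here;
# this list is illustrative -- compare against requirements.txt
packages = ['nltk', 'spacy', 'datasets']

for name in packages:
    try:
        print(f'{name}=={version(name)}')
    except PackageNotFoundError:
        print(f'{name} is NOT installed')
```

If any printed version differs from the pin in `requirements.txt`, reinstall with `pip install -r requirements.txt` before regenerating the dataset.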
## Under The Hood

The `prepare_openasp_dataset.py` script downloads the DUC and Multi-News source files, uses the `sacrerouge` package to prepare the datasets, and uses the `openasp_v1_dataset_metadata.json` file to extract the relevant aspect summaries and compile the final OpenAsp dataset.
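As a rough illustration of the final compilation step, extracting an aspect-based summary amounts to selecting the sentences of a generic summary that are annotated for a given aspect. Note this is only a sketch: the field names (`aspect_label`, `sentence_indices`) are hypothetical and do not reflect the actual schema of `openasp_v1_dataset_metadata.json`:

```python
# hypothetical sketch: 'aspect_label' and 'sentence_indices' are assumed
# field names, NOT the real openasp_v1_dataset_metadata.json schema
generic_summary = [
    'The storm made landfall on Monday.',
    'Damages are estimated at $2 billion.',
    'Relief agencies began distributing supplies.',
]

aspect_annotations = [
    {'aspect_label': 'Economic Impact', 'sentence_indices': [1]},
    {'aspect_label': 'Relief Efforts', 'sentence_indices': [2]},
]

def compile_aspect_summaries(summary_sentences, annotations):
    """Select, per aspect, the annotated sentences from the generic summary."""
    return {
        ann['aspect_label']: [summary_sentences[i] for i in ann['sentence_indices']]
        for ann in annotations
    }

print(compile_aspect_summaries(generic_summary, aspect_annotations))
```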
## License

This repository, including the `openasp_v1_dataset_metadata.json` and `prepare_openasp_dataset.py` files, is released under the Apache License.

The OpenAsp summaries and source documents for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset - the Multi-News license and the DUC license.