
ARF

Artificial Relationships in Fiction

Texts | MIT license | Introduced 2025-05-04

Dataset Description

Artificial Relationships in Fiction (ARF) is a synthetically annotated dataset for Relation Extraction (RE) in fiction, created from a curated selection of literary texts sourced from Project Gutenberg. The dataset captures the rich, implicit relationships within fictional narratives using a novel ontology and GPT-4o for annotation. ARF is the first large-scale RE resource designed specifically for literary texts, advancing both NLP model training and computational literary analysis.

Dataset Configurations and Features

Configurations

  • fiction_books: Metadata-rich corpus of 6,322 public domain fiction books (1850–1950) with inferred author gender and thematic categorization.
  • fiction_books_in_chunks: Books segmented into 5-sentence chunks (5.96M total), preserving narrative coherence via 1-sentence overlap.
  • synthetic_relations_in_fiction_books: A subset of 95,475 text chunks annotated with 128,000+ relationships using GPT-4o and a fiction-specific ontology.

1. Configuration: fiction_books

  • Description: Contains the full text and metadata of 6,322 English-language fiction books from Project Gutenberg.
  • Features:
    • book_id: Unique Project Gutenberg ID.
    • title: Title of the book.
    • author: Author name.
    • author_birth_year / author_death_year: Author lifespan.
    • release_date: Project Gutenberg release date.
    • subjects: List of thematic topics (mapped to 51 standardized themes).
    • gender: Inferred author gender (via GPT-4o).
    • text: Cleaned full book text.
  • Use Case: Supports thematic and demographic analysis of literary texts.

2. Configuration: fiction_books_in_chunks

  • Description: Each book is segmented into overlapping five-sentence text chunks to enable granular NLP analysis.
  • Features:
    • book_id, chunk_index: Book and chunk identifiers.
    • text_chunk: Five-sentence excerpt from the book.
  • Use Case: Facilitates sequence-level tasks like coreference resolution or narrative progression modeling.
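
The segmentation described above (five-sentence chunks that share one sentence with their neighbour) can be sketched as a sliding window. This is a minimal illustration, not the authors' preprocessing code: the function name `chunk_sentences` is hypothetical, the input is assumed to be already split into sentences, and how the original pipeline handled the final, shorter window is not stated in the description.

```python
from typing import List


def chunk_sentences(sentences: List[str], size: int = 5, overlap: int = 1) -> List[str]:
    """Group sentences into windows of `size` that share `overlap` sentences."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap  # 4 new sentences per chunk with the defaults
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(sentences):
            break  # this window already reached the end of the text
    return chunks


# Nine toy sentences stand in for a book (assumption: pre-split input).
sentences = [f"S{i}." for i in range(1, 10)]
chunks = chunk_sentences(sentences)
for index, chunk in enumerate(chunks):
    print(index, chunk)
# 0 S1. S2. S3. S4. S5.
# 1 S5. S6. S7. S8. S9.
```

With the defaults, each chunk after the first contributes four new sentences, and sentence S5 appears in both chunks as the shared boundary.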

3. Configuration: synthetic_relations_in_fiction_books (ARF)

  • Description: This subset corresponds to the Artificial Relationships in Fiction (ARF) dataset proposed in the LaTeCH-CLfL 2025 paper "Artificial Relationships in Fiction: A Dataset for Advancing NLP in Literary Domains".
  • Features:
    • book_id, chunk_index: Identifiers.
    • text_chunk: Five-sentence text segment.
    • relations: A list of structured relation annotations, each containing:
      • entity1, entity2: Text spans.
      • entity1Type, entity2Type: Entity types based on ontology.
      • relation: Relationship type.
  • Use Case: Ideal for training and evaluating RE models in fictional narratives, studying character networks, and generating structured data from literary texts.

ARF Dataset Structure (config 'synthetic_relations_in_fiction_books')

Each annotated relation is formatted as:

{
  "entity1": "Head Entity text",
  "entity2": "Tail Entity text",
  "entity1Type": "Head entity type",
  "entity2Type": "Tail entity type",
  "relation": "Relation type"
}

Example:

{
  "entity1": "Vortigern",
  "entity2": "castle",
  "entity1Type": "PER",
  "entity2Type": "FAC",
  "relation": "owns"
}
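
As a rough sketch of how such a record might be consumed, the snippet below parses the example into a typed object and reduces it to a (head, relation, tail) triple. The field names come from the format above; the `Relation` class and its methods are hypothetical helpers, not part of any released tooling.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One annotated relation, mirroring the JSON keys shown above."""
    entity1: str
    entity2: str
    entity1_type: str
    entity2_type: str
    relation: str

    @classmethod
    def from_dict(cls, raw: dict) -> "Relation":
        # Map the camelCase JSON keys onto snake_case attributes.
        return cls(
            entity1=raw["entity1"],
            entity2=raw["entity2"],
            entity1_type=raw["entity1Type"],
            entity2_type=raw["entity2Type"],
            relation=raw["relation"],
        )

    def as_triple(self) -> tuple:
        # (head, relation, tail), a common shape for graph building.
        return (self.entity1, self.relation, self.entity2)


raw = json.loads("""
{
  "entity1": "Vortigern",
  "entity2": "castle",
  "entity1Type": "PER",
  "entity2Type": "FAC",
  "relation": "owns"
}
""")
rel = Relation.from_dict(raw)
triple = rel.as_triple()
print(triple)  # ('Vortigern', 'owns', 'castle')
```

Triples of this shape can be fed directly into a graph library when studying character networks.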

Entity Types (11)

| Entity Type | Description |
|-------------|-------------|
| PER | Person or group of people |
| FAC | Facility – man-made structures for human use |
| LOC | Location – natural or loosely defined geographic regions |
| WTHR | Weather – atmospheric or celestial phenomena |
| VEH | Vehicle – transport devices (e.g., ship, carriage) |
| ORG | Organization – formal groups or institutions |
| EVNT | Event – significant occurrences in narrative |
| TIME | Time – chronological or historical expressions |
| OBJ | Object – tangible items in the text |
| SENT | Sentiment – emotional states or feelings |
| CNCP | Concept – abstract ideas or motifs |

Relation Types (48)

| Relation Type | Entity 1 Type | Entity 2 Type | Description |
|----------------------|------------------|-------------------|-------------------------------------------|
| parent_father_of | PER | PER | Father relationship |
| parent_mother_of | PER | PER | Mother relationship |
| child_of | PER | PER | Child to parent |
| sibling_of | PER | PER | Sibling relationship |
| spouse_of | PER | PER | Spousal relationship |
| relative_of | PER | PER | Extended family relationship |
| adopted_by | PER | PER | Adopted by another person |
| companion_of | PER | PER | Companionship or ally |
| friend_of | PER | PER | Friendship |
| lover_of | PER | PER | Romantic relationship |
| rival_of | PER | PER | Rivalry |
| enemy_of | PER/ORG | PER/ORG | Hostile or antagonistic relationship |
| inspires | PER | PER | Inspires or motivates |
| sacrifices_for | PER | PER | Makes a sacrifice for |
| mentor_of | PER | PER | Mentorship or guidance |
| teacher_of | PER | PER | Formal teaching relationship |
| protector_of | PER | PER | Provides protection to |
| employer_of | PER | PER | Employment relationship |
| leader_of | PER | ORG | Leader of an organization |
| member_of | PER | ORG | Membership in an organization |
| lives_in | PER | FAC/LOC | Lives in a location |
| lived_in | PER | TIME | Historically lived in |
| visits | PER | FAC | Visits a facility |
| travel_to | PER | LOC | Travels to a location |
| born_in | PER | LOC | Birthplace |
| travels_by | PER | VEH | Travels by a vehicle |
| participates_in | PER | EVNT | Participates in an event |
| causes | PER | EVNT | Causes an event |
| owns | PER | OBJ | Owns an object |
| believes_in | PER | CNCP | Believes in a concept |
| embodies | PER | CNCP | Embodies a concept |
| located_in | FAC | LOC | Located in a place |
| part_of | FAC/LOC/ORG | FAC/LOC/ORG | Part of a larger entity |
| owned_by | FAC/VEH | PER | Owned by someone |
| occupied_by | FAC | PER | Occupied by someone |
| used_by | FAC | ORG | Used by an organization |
| affects | WTHR | LOC/EVNT | Weather affects location or event |
| experienced_by | WTHR | PER | Weather experienced by someone |
| travels_in | VEH | LOC | Vehicle travels in a location |
| based_in | ORG | LOC | Organization based in a location |
| attended_by | EVNT | PER | Event attended by person |
| ends_in | EVNT | TIME | Event ends at a time |
| occurs_in | EVNT | LOC/TIME | Event occurs in a place or time |
| features | EVNT | OBJ | Event features an object |
| stored_in | OBJ | LOC/FAC | Object stored in a place |
| expressed_by | SENT | PER | Sentiment expressed by person |
| used_by | OBJ | PER | Object used by person |
| associated_with | CNCP | EVNT | Concept associated with event |
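
The type constraints in the relation table lend themselves to a simple conformance check. The sketch below encodes a handful of rows as allowed (head types, tail types) signatures and flags records that match none of them. It is an illustration under stated assumptions: `ONTOLOGY` covers only a subset of the 48 relations, `conforms` is a hypothetical helper, and the second and third sample records are invented for the demonstration. How the authors measured ontology adherence is not described here.

```python
# Subset of the relation table: relation -> list of (head types, tail types).
ONTOLOGY = {
    "owns": [({"PER"}, {"OBJ"})],
    "owned_by": [({"FAC", "VEH"}, {"PER"})],
    "lives_in": [({"PER"}, {"FAC", "LOC"})],
    "enemy_of": [({"PER", "ORG"}, {"PER", "ORG"})],
    # used_by appears twice in the table, with different type signatures.
    "used_by": [({"FAC"}, {"ORG"}), ({"OBJ"}, {"PER"})],
}


def conforms(relation: dict) -> bool:
    """Return True if the record matches any listed signature for its relation."""
    signatures = ONTOLOGY.get(relation["relation"], [])
    return any(
        relation["entity1Type"] in heads and relation["entity2Type"] in tails
        for heads, tails in signatures
    )


samples = [
    # The example record from the dataset description.
    {"entity1": "Vortigern", "entity2": "castle",
     "entity1Type": "PER", "entity2Type": "FAC", "relation": "owns"},
    # Invented records, for illustration only.
    {"entity1": "castle", "entity2": "Vortigern",
     "entity1Type": "FAC", "entity2Type": "PER", "relation": "owned_by"},
    {"entity1": "sword", "entity2": "Vortigern",
     "entity1Type": "OBJ", "entity2Type": "PER", "relation": "used_by"},
]

flags = [conforms(record) for record in samples]
deviation_rate = flags.count(False) / len(flags)
print(flags)  # [False, True, True]
```

Read strictly against the table, the `owns` example with a FAC tail does not match the listed PER → OBJ signature, so a strict checker of this kind would count it as a deviation. A looser policy (for example, treating FAC as ownable) would be an equally reasonable design choice.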

Dataset Statistics

| Metric | Value |
|----------------------------|------------|
| Books | 96 |
| Authors | 91 |
| Gender Ratio (M/F) | 55% / 45% |
| Subgenres | 51 |
| Annotated Chunks | 95,475 |
| Relations per Chunk | 1.34 avg |
| Chunks with No Relations | 35,230 |
| Total Relations | ~128,000 |
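
The figures in the table can be cross-checked with a little arithmetic. The snippet below is a worked example only: it treats the approximate total of ~128,000 relations as exactly 128,000, so the derived values are approximate too, and the variable names are illustrative.

```python
annotated_chunks = 95_475
chunks_without_relations = 35_230
total_relations = 128_000  # reported as "~128,000", so results are approximate

# Average over all annotated chunks, including those with no relations.
relations_per_chunk = total_relations / annotated_chunks

# Share of annotated chunks in which no relation was found.
empty_share = chunks_without_relations / annotated_chunks

# Average over only the chunks that contain at least one relation.
chunks_with_relations = annotated_chunks - chunks_without_relations
relations_per_nonempty_chunk = total_relations / chunks_with_relations

print(f"{relations_per_chunk:.2f}")           # 1.34
print(f"{empty_share:.1%}")                   # 36.9%
print(f"{relations_per_nonempty_chunk:.2f}")  # 2.12
```

The 1.34 average therefore includes the roughly 37% of chunks with no relations; chunks that do contain relations carry about two each.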

Methodology

  • Source Texts: English-language fiction from three Project Gutenberg (PG) bookshelves: Fiction, Children & YA, and Crime/Mystery.
  • Annotation Model: GPT-4o, used with a custom prompt that embeds the entity and relation ontologies.
  • Sampling: Balanced author gender and thematic distributions.
  • Ontology Adherence: Less than 0.05% of entity labels and 2.95% of relation labels deviate from the ontology.
  • Format: Structured JSON, optimized for NLP pipelines.

Applications

  • Fine-tuning RE Models: Adapt models to literary domains with implicit, evolving relationships.
  • Computational Literary Studies: Analyze character networks, thematic evolution, and genre patterns.
  • Creative AI: Enhance AI-driven storytelling, character consistency, and world-building tools.

Citation

If you use this dataset in your research, please cite:

@inproceedings{christou-tsoumakas-2025-artificial,
    title = "Artificial Relationships in Fiction: A Dataset for Advancing {NLP} in Literary Domains",
    author = "Christou, Despina  and Tsoumakas, Grigorios",
    editor = "Kazantseva, Anna and Szpakowicz, Stan and Degaetano-Ortlieb, Stefania and Bizzoni, Yuri and Pagel, Janis",
    booktitle = "Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.latechclfl-1.13/",
    pages = "130--147",
    ISBN = "979-8-89176-241-1"
}

Tasks

Key Information Extraction, Named Entity Recognition, Relation Extraction