TF1-EN-3M

klusai/ds-tf1-en-3m

Modality: Texts · License: MIT · Introduced: 2025-04-29

TF1-EN-3M: Three Million Synthetic Moral Fables for Open Language Models

TF1-EN-3M is a large-scale synthetic dataset of 3,000,000 English-language moral fables, generated by instruction-tuned language models with no more than 8 billion parameters. The stories are aimed at child-friendly educational and moral reasoning applications and follow a consistent six-part narrative scaffold: character → trait → setting → conflict → resolution → moral.

Dataset Characteristics

  • Size: 3 million stories (~1B tokens)
  • Format: JSON Lines, with detailed metadata including prompt elements, model configuration, generation time, token counts, and costs
  • Generation Models: Evaluated across 10 open-weight LLMs; final dataset generated using LLaMA-3.1-8B-Instruct for optimal quality and cost balance
  • Story Structure: Each story ends with an explicit moral and follows a template-driven structure
  • Target Audience: Designed primarily for children aged 4–7 (age group B), with simple vocabulary and accessible narratives
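The six-part scaffold above (character → trait → setting → conflict → resolution → moral) can be sketched as a prompt template. The wording and field names below are illustrative assumptions, not the dataset's exact prompt:

```python
# Illustrative sketch of the six-part narrative scaffold.
# The template text is an assumption; TF1-EN-3M's actual prompts may differ.
SCAFFOLD = (
    "Write a short fable for children aged 4-7. "
    "The main character is {character}, who is {trait}. "
    "The story takes place in {setting}. "
    "The conflict: {conflict}. It is resolved when {resolution}. "
    "End the story with this explicit moral: \"{moral}\""
)

def build_prompt(elements: dict) -> str:
    """Fill the scaffold with one combination of narrative elements."""
    return SCAFFOLD.format(**elements)

example = {
    "character": "a young fox",
    "trait": "impatient",
    "setting": "a snowy forest",
    "conflict": "the fox refuses to wait for the frozen river to thaw",
    "resolution": "a wise owl shows the fox a safe path around the ice",
    "moral": "Patience keeps us safe.",
}
print(build_prompt(example))
```

Enumerating combinations of these six elements is what gives the dataset its scale and diversity from a fixed template.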

Motivation

Natural language processing lacks large, structured corpora of fables that combine creative storytelling with explicit moral lessons. Existing human-authored datasets like Aesop's Fables are limited in scale and diversity. TF1-EN-3M bridges this gap by:

  • Demonstrating that mid-sized open models can reliably generate coherent, instructive stories
  • Enabling research into value alignment, narrative intelligence, and low-resource model fine-tuning
  • Offering a reproducible, cost-efficient alternative to proprietary LLM pipelines

Summary of Content

Each entry in the dataset includes:

  • A structured prompt with narrative elements
  • The generated fable text
  • Metadata (model name, inference time, token usage, cost, etc.)
  • Quality assessments (via LLM-based scoring for grammar, creativity, moral clarity, and structure adherence)
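Since the dataset ships as JSON Lines, each entry can be streamed one record at a time. The record below is a hypothetical example matching the fields listed above; the actual key names in TF1-EN-3M may differ:

```python
import json

# Hypothetical record mirroring the per-entry fields described above.
# Key names are assumptions for illustration, not the dataset's schema.
record = json.loads("""{
  "prompt": {"character": "a young fox", "trait": "impatient",
             "setting": "a snowy forest",
             "conflict": "the fox will not wait for the river to thaw",
             "resolution": "a wise owl shows a safe path",
             "moral": "Patience keeps us safe."},
  "fable": "Once upon a time, in a snowy forest...",
  "meta": {"model": "LLaMA-3.1-8B-Instruct",
           "inference_time_s": 1.7, "tokens": 312, "cost_usd": 0.0001},
  "scores": {"grammar": 9, "creativity": 8,
             "moral_clarity": 9, "structure_adherence": 10}
}""")

def iter_fables(path):
    """Stream records from a JSON Lines file without loading 3M entries at once."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Streaming line by line matters at this scale: 3 million records (~1B tokens) should not be loaded into memory in one pass.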

Use Cases

TF1-EN-3M is suitable for a wide range of tasks and applications:

  • Training small or medium-sized LLMs for story generation or moral reasoning
  • Benchmarking models on tasks like moral inference, story-to-moral mapping, or story quality evaluation
  • Educational AI tools, such as interactive storytelling tutors or automated moral education platforms
  • Creative NLP research, including literary analysis and narrative generation
  • Multilingual extension, by translating or substituting the prompt elements for other languages

Citation

If you use TF1-EN-3M, please cite:

@misc{nadas2025tf1en3m,
  title={TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models},
  author={Mihai Nădaș and Laura Dioșan and Andreea Tomescu and Andrei Pișcoran},
  year={2025},
  eprint={2504.20605},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Dataset Access