TF1-EN-3M

klusai/ds-tf1-en-3m

Modality: Texts · License: MIT · Introduced: 2025-04-29

TF1-EN-3M: Three Million Synthetic Moral Fables for Open Language Models

TF1-EN-3M is a large-scale synthetic dataset of 3,000,000 English-language moral fables, generated by instruction-tuned language models with no more than 8 billion parameters. The stories are aimed at child-friendly educational and moral reasoning applications and follow a consistent six-part narrative scaffold: character → trait → setting → conflict → resolution → moral.

Dataset Characteristics

  • Size: 3 million stories (~1B tokens)
  • Format: JSON Lines, with detailed metadata including prompt elements, model configuration, generation time, token counts, and costs
  • Generation Models: Evaluated across 10 open-weight LLMs; final dataset generated using LLaMA-3.1-8B-Instruct for optimal quality and cost balance
  • Story Structure: Each story ends with an explicit moral and follows a template-driven structure
  • Target Audience: Designed primarily for children aged 4–7 (age group B), with simple vocabulary and accessible narratives
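The six-part scaffold above (character → trait → setting → conflict → resolution → moral) can be sketched as a prompt template. The wording and field names below are illustrative assumptions, not the dataset's exact prompt:

```python
# Illustrative sketch of the six-part narrative scaffold.
# The template text is an assumption; TF1-EN-3M's actual prompts may differ.
SCAFFOLD = (
    "Write a short fable for children aged 4-7. "
    "The main character is {character}, who is {trait}. "
    "The story takes place in {setting}. "
    "The conflict: {conflict}. It is resolved when {resolution}. "
    "End the story with this explicit moral: \"{moral}\""
)

def build_prompt(elements: dict) -> str:
    """Fill the scaffold with one combination of narrative elements."""
    return SCAFFOLD.format(**elements)

example = {
    "character": "a young fox",
    "trait": "impatient",
    "setting": "a snowy forest",
    "conflict": "the fox refuses to wait for the frozen river to thaw",
    "resolution": "a wise owl shows the fox a safe path around the ice",
    "moral": "Patience keeps us safe.",
}
print(build_prompt(example))
```

Enumerating combinations of these six elements is what gives the dataset its scale and diversity from a fixed template.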

Motivation

Natural language processing lacks large, structured corpora of fables that combine creative storytelling with explicit moral lessons. Existing human-authored datasets like Aesop's Fables are limited in scale and diversity. TF1-EN-3M bridges this gap by:

  • Demonstrating that mid-sized open models can reliably generate coherent, instructive stories
  • Enabling research into value alignment, narrative intelligence, and low-resource model fine-tuning
  • Offering a reproducible, cost-efficient alternative to proprietary LLM pipelines

Summary of Content

Each entry in the dataset includes:

  • A structured prompt with narrative elements
  • The generated fable text
  • Metadata (model name, inference time, token usage, cost, etc.)
  • Quality assessments (via LLM-based scoring for grammar, creativity, moral clarity, and structure adherence)
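Since the dataset ships as JSON Lines, each entry can be streamed one record at a time. The record below is a hypothetical example matching the fields listed above; the actual key names in TF1-EN-3M may differ:

```python
import json

# Hypothetical record mirroring the per-entry fields described above.
# Key names are assumptions for illustration, not the dataset's schema.
record = json.loads("""{
  "prompt": {"character": "a young fox", "trait": "impatient",
             "setting": "a snowy forest",
             "conflict": "the fox will not wait for the river to thaw",
             "resolution": "a wise owl shows a safe path",
             "moral": "Patience keeps us safe."},
  "fable": "Once upon a time, in a snowy forest...",
  "meta": {"model": "LLaMA-3.1-8B-Instruct",
           "inference_time_s": 1.7, "tokens": 312, "cost_usd": 0.0001},
  "scores": {"grammar": 9, "creativity": 8,
             "moral_clarity": 9, "structure_adherence": 10}
}""")

def iter_fables(path):
    """Stream records from a JSON Lines file without loading 3M entries at once."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Streaming line by line matters at this scale: 3 million records (~1B tokens) should not be loaded into memory in one pass.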

Use Cases

TF1-EN-3M is suitable for a wide range of tasks and applications:

  • Training small or medium-sized LLMs for story generation or moral reasoning
  • Benchmarking models on tasks like moral inference, story-to-moral mapping, or story quality evaluation
  • Educational AI tools, such as interactive storytelling tutors or automated moral education platforms
  • Creative NLP research, including literary analysis and narrative generation
  • Multilingual extension, by translating or substituting the prompt elements for other languages

Citation

If you use TF1-EN-3M, please cite:

@misc{nadas2025tf1en3m,
  title={TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models},
  author={Mihai Nădaș and Laura Dioșan and Andreea Tomescu and Andrei Pișcoran},
  year={2025},
  eprint={2504.20605},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Dataset Access