Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model

Daehee Kim, Deokhyung Kang, Sangwon Ryu, Gary Geunbae Lee

2024-09-11Knowledge Graphs Data-to-Text Generation Text Generation Graph-to-Sequence Large Language Model Language Modelling

Paper PDF Code(official)

Abstract

Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T generation datasets restricts progress in the general-domain G2T generation research. To address this issue, we introduce Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages Large Language Model (LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without relying on external ontologies. Experimental results demonstrate that PLM fine-tuned on WikiOFGraph outperforms those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.

Results

Task	Dataset	Metric	Value	Model
Text Generation	WikiOFGraph	BLEU	69.27	T5-large
Text Generation	GenWiki	BLEU	45.85	T5-large
Data-to-Text Generation	WikiOFGraph	BLEU	69.27	T5-large
Data-to-Text Generation	GenWiki	BLEU	45.85	T5-large

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment2025-07-21 DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits2025-07-18 SMART: Relation-Aware Learning of Geometric Representations for Knowledge Graphs2025-07-17 Making Language Model a Hierarchical Classifier and Generator2025-07-17 GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM2025-07-17 The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations2025-07-17 Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities2025-07-17 Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities2025-07-17