A Note on Statistically Accurate Tabular Data Generation Using Large Language Models
Andrey Sidorenko
2025-05-05
Tabular Data Generation
Abstract
Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
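The core idea described above can be sketched in a few lines: prompt an LLM for the conditional distribution of a categorical column given already-fixed feature values, parse the returned probabilities, and sample from them. This is a minimal illustration, not the paper's implementation; the prompt wording, the JSON response format, and the `fake_llm` stub (standing in for any chat-completion API) are all assumptions.

```python
import json
import random

def get_conditional_distribution(prompt_fn, column, conditions):
    """Ask an LLM for P(column | conditions) as a JSON mapping of
    category -> probability (hypothetical prompt/response format)."""
    prompt = (
        f"Given the feature values {conditions}, return a JSON object "
        f"mapping each possible value of '{column}' to its conditional "
        f"probability."
    )
    raw = prompt_fn(prompt)                 # LLM call; stubbed out below
    dist = json.loads(raw)
    total = sum(dist.values())              # renormalize, since LLM-reported
    return {k: v / total for k, v in dist.items()}  # probabilities may not sum to 1

def sample_value(dist, rng=random):
    """Draw one category according to the estimated distribution."""
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]

# Stub standing in for a real LLM call, returning a fixed JSON distribution.
def fake_llm(prompt):
    return '{"full-time": 0.62, "part-time": 0.25, "unemployed": 0.13}'

dist = get_conditional_distribution(
    fake_llm, "employment", {"age": "30-39", "education": "bachelor"}
)
synthetic_value = sample_value(dist)
print(synthetic_value)
```

Repeating this column by column, conditioning each draw on the values already sampled, yields full synthetic rows whose categorical dependencies follow the LLM-estimated conditionals rather than independent marginals.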
Related Papers
CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation (2025-06-17)
dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation (2025-05-31)
The Prompt is Mightier than the Example (2025-05-24)
Graph Conditional Flow Matching for Relational Data Generation (2025-05-21)
A Comprehensive Survey of Synthetic Tabular Data Generation (2025-04-23)
Diffusion Transformers for Tabular Data Time Series Generation (2025-04-10)
TabRep: a Simple and Effective Continuous Representation for Training Tabular Diffusion Models (2025-04-07)
Assessing Generative Models for Structured Data (2025-03-26)