A Note on Statistically Accurate Tabular Data Generation Using Large Language Models
Andrey Sidorenko
2025-05-05
Tabular Data Generation
Abstract
Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
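The core idea described above can be sketched in a few lines: prompt an LLM for the conditional distribution of a categorical column given already-fixed feature values, parse the returned probabilities, and sample from them. This is a minimal illustration, not the paper's implementation; the prompt wording, the JSON response format, and the `fake_llm` stub (standing in for any chat-completion API) are all assumptions.

```python
import json
import random

def get_conditional_distribution(prompt_fn, column, conditions):
    """Ask an LLM for P(column | conditions) as a JSON mapping of
    category -> probability (hypothetical prompt/response format)."""
    prompt = (
        f"Given the feature values {conditions}, return a JSON object "
        f"mapping each possible value of '{column}' to its conditional "
        f"probability."
    )
    raw = prompt_fn(prompt)                 # LLM call; stubbed out below
    dist = json.loads(raw)
    total = sum(dist.values())              # renormalize, since LLM-reported
    return {k: v / total for k, v in dist.items()}  # probabilities may not sum to 1

def sample_value(dist, rng=random):
    """Draw one category according to the estimated distribution."""
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]

# Stub standing in for a real LLM call, returning a fixed JSON distribution.
def fake_llm(prompt):
    return '{"full-time": 0.62, "part-time": 0.25, "unemployed": 0.13}'

dist = get_conditional_distribution(
    fake_llm, "employment", {"age": "30-39", "education": "bachelor"}
)
synthetic_value = sample_value(dist)
print(synthetic_value)
```

Repeating this column by column, conditioning each draw on the values already sampled, yields full synthetic rows whose categorical dependencies follow the LLM-estimated conditionals rather than independent marginals.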
Related Papers
CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation (2025-06-17)
dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation (2025-05-31)
The Prompt is Mightier than the Example (2025-05-24)
Graph Conditional Flow Matching for Relational Data Generation (2025-05-21)
A Comprehensive Survey of Synthetic Tabular Data Generation (2025-04-23)
Diffusion Transformers for Tabular Data Time Series Generation (2025-04-10)
TabRep: a Simple and Effective Continuous Representation for Training Tabular Diffusion Models (2025-04-07)
Assessing Generative Models for Structured Data (2025-03-26)