Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Language Models are Realistic Tabular Data Generators

Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci

2022-10-12 | Tabular Data Generation

Abstract

Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as variational autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across numerous real-world and synthetic data sets with heterogeneous feature types coming in various sizes.
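The abstract does not spell out how table rows are fed to the language model. As a minimal sketch, assume each row is serialized as comma-separated "column is value" clauses in shuffled order, so that any subset of features can serve as a prompt prefix for conditional sampling. The helper names (`encode_row`, `conditional_prompt`, `decode_row`) and the example columns are hypothetical, and the language model call itself is left out.

```python
import random

def encode_row(row, rng=None):
    """Serialize one row as 'column is value' clauses in random order.

    Assumption: shuffling the clause order is what lets a model condition
    on an arbitrary subset of features later on.
    """
    rng = rng or random.Random()
    items = list(row.items())
    rng.shuffle(items)
    return ", ".join(f"{col} is {val}" for col, val in items)

def conditional_prompt(known):
    """Build a text prefix that fixes the known features.

    An auto-regressive model would be asked to continue this prefix,
    filling in the remaining features.
    """
    return ", ".join(f"{col} is {val}" for col, val in known.items()) + ","

def decode_row(text, columns):
    """Parse generated text back into a dict of strings.

    Returns None when a column is missing, so invalid samples can be
    rejected. Values containing commas are not handled in this sketch.
    """
    row = {}
    for clause in text.split(","):
        col, sep, val = clause.strip().partition(" is ")
        if sep and col in columns:
            row[col] = val.strip()
    return row if set(row) == set(columns) else None

# Hypothetical example row (column names are illustrative only).
example = {"age": 39, "education": "Bachelors", "income": "<=50K"}
sentence = encode_row(example, random.Random(0))
restored = decode_row(sentence, example.keys())
prefix = conditional_prompt({"education": "Bachelors"})
```

In this sketch a round trip through `encode_row` and `decode_row` recovers every value as a string, so numeric columns would need to be cast back to their original types before use.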

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Tabular Data Generation | SICK | DT Accuracy | 97.72 | GReaT
Tabular Data Generation | SICK | LR Accuracy | 97.72 | GReaT
Tabular Data Generation | SICK | Parameters (M) | 355 | GReaT
Tabular Data Generation | SICK | RF Accuracy | 98.3 | GReaT
Tabular Data Generation | SICK | DT Accuracy | 95.39 | Distill-GReaT
Tabular Data Generation | SICK | LR Accuracy | 96.56 | Distill-GReaT
Tabular Data Generation | SICK | Parameters (M) | 82 | Distill-GReaT
Tabular Data Generation | SICK | RF Accuracy | 97.72 | Distill-GReaT
Tabular Data Generation | HELOC | DT Accuracy | 81.4 | Distill-GReaT
Tabular Data Generation | HELOC | LR Accuracy | 70.58 | Distill-GReaT
Tabular Data Generation | HELOC | Parameters (M) | 82 | Distill-GReaT
Tabular Data Generation | HELOC | RF Accuracy | 82.14 | Distill-GReaT
Tabular Data Generation | HELOC | DT Accuracy | 79.1 | GReaT
Tabular Data Generation | HELOC | LR Accuracy | 71.9 | GReaT
Tabular Data Generation | HELOC | Parameters (M) | 355 | GReaT
Tabular Data Generation | HELOC | RF Accuracy | 80.93 | GReaT
Tabular Data Generation | California Housing Prices | DT Mean Squared Error | 0.43 | Distill-GReaT
Tabular Data Generation | California Housing Prices | LR Mean Squared Error | 0.57 | Distill-GReaT
Tabular Data Generation | California Housing Prices | Parameters (M) | 82 | Distill-GReaT
Tabular Data Generation | California Housing Prices | RF Mean Squared Error | 0.32 | Distill-GReaT
Tabular Data Generation | California Housing Prices | DT Mean Squared Error | 0.39 | GReaT
Tabular Data Generation | California Housing Prices | LR Mean Squared Error | 0.34 | GReaT
Tabular Data Generation | California Housing Prices | Parameters (M) | 355 | GReaT
Tabular Data Generation | California Housing Prices | RF Mean Squared Error | 0.28 | GReaT
Tabular Data Generation | Travel | DT Accuracy | 83.56 | GReaT
Tabular Data Generation | Travel | LR Accuracy | 80.1 | GReaT
Tabular Data Generation | Travel | Parameters (M) | 355 | GReaT
Tabular Data Generation | Travel | RF Accuracy | 84.3 | GReaT
Tabular Data Generation | Travel | DT Accuracy | 77.38 | Distill-GReaT
Tabular Data Generation | Travel | LR Accuracy | 78.53 | Distill-GReaT
Tabular Data Generation | Travel | Parameters (M) | 82 | Distill-GReaT
Tabular Data Generation | Travel | RF Accuracy | 79.5 | Distill-GReaT
Tabular Data Generation | Diabetes | DT Accuracy | 0.5523 | GReaT
Tabular Data Generation | Diabetes | LR Accuracy | 0.5734 | GReaT
Tabular Data Generation | Diabetes | Parameters (M) | 355 | GReaT
Tabular Data Generation | Diabetes | RF Accuracy | 0.5834 | GReaT
Tabular Data Generation | Diabetes | DT Accuracy | 0.541 | Distill-GReaT
Tabular Data Generation | Diabetes | LR Accuracy | 0.5733 | Distill-GReaT
Tabular Data Generation | Diabetes | Parameters (M) | 82 | Distill-GReaT
Tabular Data Generation | Diabetes | RF Accuracy | 0.5803 | Distill-GReaT
Tabular Data Generation | Adult Census Income | DT Accuracy | 84.81 | GReaT
Tabular Data Generation | Adult Census Income | LR Accuracy | 84.77 | GReaT
Tabular Data Generation | Adult Census Income | Parameters (M) | 355 | GReaT
Tabular Data Generation | Adult Census Income | RF Accuracy | 85.42 | GReaT
Tabular Data Generation | Adult Census Income | DT Accuracy | 84.49 | Distill-GReaT
Tabular Data Generation | Adult Census Income | LR Accuracy | 84.65 | Distill-GReaT
Tabular Data Generation | Adult Census Income | Parameters (M) | 82 | Distill-GReaT
Tabular Data Generation | Adult Census Income | RF Accuracy | 85.25 | Distill-GReaT
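
The results above list the same metrics for a 355M-parameter GReaT model and an 82M-parameter Distill-GReaT model. As a rough illustration of how to compare them, the sketch below copies the RF Accuracy values for the datasets reported in percent and computes the per-dataset gap. Diabetes is left out because its values are listed on a 0 to 1 scale, and the variable and function names are illustrative only.

```python
# RF Accuracy (%) copied from the results listing above.
# Diabetes is omitted because it is reported on a 0-1 scale.
rf_accuracy = {
    "SICK": {"GReaT": 98.3, "Distill-GReaT": 97.72},
    "HELOC": {"GReaT": 80.93, "Distill-GReaT": 82.14},
    "Travel": {"GReaT": 84.3, "Distill-GReaT": 79.5},
    "Adult Census Income": {"GReaT": 85.42, "Distill-GReaT": 85.25},
}

# Parameter counts in millions, as listed in the same results.
PARAMS_M = {"GReaT": 355, "Distill-GReaT": 82}

def accuracy_gaps(table):
    """Return GReaT minus Distill-GReaT accuracy, in percentage points."""
    return {
        dataset: round(scores["GReaT"] - scores["Distill-GReaT"], 2)
        for dataset, scores in table.items()
    }

gaps = accuracy_gaps(rf_accuracy)
size_ratio = PARAMS_M["GReaT"] / PARAMS_M["Distill-GReaT"]

for dataset, gap in gaps.items():
    print(f"{dataset}: {gap:+.2f} points")
print(f"GReaT is about {size_ratio:.1f}x larger than Distill-GReaT")
```

A positive gap means the larger model scores higher on that dataset. On HELOC the gap is negative, so the smaller model is listed with the higher RF Accuracy there.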

Related Papers

CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation (2025-06-17)
dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation (2025-05-31)
The Prompt is Mightier than the Example (2025-05-24)
Graph Conditional Flow Matching for Relational Data Generation (2025-05-21)
A Note on Statistically Accurate Tabular Data Generation Using Large Language Models (2025-05-05)
A Comprehensive Survey of Synthetic Tabular Data Generation (2025-04-23)
Diffusion Transformers for Tabular Data Time Series Generation (2025-04-10)
TabRep: a Simple and Effective Continuous Representation for Training Tabular Diffusion Models (2025-04-07)