Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

A Comprehensive Survey of Synthetic Tabular Data Generation

Ruxue Shi, Yili Wang, Mengnan Du, Xu Shen, Xin Wang

2025-04-23 | Tabular Data Generation | Synthetic Data Generation | Privacy Preserving
Paper | PDF | Code (official)

Abstract

Tabular data remains one of the most prevalent and critical data formats across diverse real-world applications. However, its effective use in machine learning (ML) is often constrained by challenges such as data scarcity, privacy concerns, and class imbalance. Synthetic data generation has emerged as a promising solution, leveraging generative models to learn the distribution of real datasets and produce high-fidelity, privacy-preserving samples. Various generative paradigms have been explored, including energy-based models (EBMs), variational autoencoders (VAEs), generative adversarial networks (GANs), large language models (LLMs), and diffusion models. While several surveys have investigated synthetic tabular data generation, most focus on narrow subdomains or specific generative methods, such as GANs, diffusion models, or privacy-preserving techniques. This limited scope often results in fragmented insights that lack a comprehensive synthesis bridging diverse approaches. In particular, recent advances driven by LLMs and diffusion-based models remain underexplored. This gap hinders a holistic understanding of the field's evolution, methodological interplay, and open challenges. To address this, our survey provides a unified and systematic review of synthetic tabular data generation. Our contributions are threefold: (1) we propose a comprehensive taxonomy that organizes existing methods into traditional approaches, diffusion-based methods, and LLM-based models, and provide an in-depth comparative analysis; (2) we detail the complete pipeline for synthetic tabular data generation, including data synthesis, post-processing, and evaluation; (3) we identify major challenges, explore real-world applications, and outline open research questions and future directions to guide work in this rapidly evolving area.
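
The three pipeline stages named in the abstract (data synthesis, post-processing, evaluation) can be illustrated with a toy sketch. This is not a method from the survey: the "generator" here is an assumed stand-in that fits an independent Gaussian to each numeric column, so it ignores correlations between columns, and the table, function names, and the mean-gap metric are all hypothetical choices made for illustration.

```python
import random
import statistics


def fit_gaussian(rows):
    """Toy 'training' step: estimate mean and standard deviation per column."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def synthesize(params, n, rng):
    """Data synthesis: sample every column independently from its Gaussian."""
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]


def post_process(rows, real_rows, integer_columns):
    """Post-processing: clip to the observed range, round integer columns."""
    bounds = [(min(c), max(c)) for c in zip(*real_rows)]
    cleaned = []
    for row in rows:
        new_row = []
        for j, value in enumerate(row):
            low, high = bounds[j]
            value = min(max(value, low), high)
            if j in integer_columns:
                value = round(value)
            new_row.append(value)
        cleaned.append(new_row)
    return cleaned


def evaluate(real_rows, synthetic_rows):
    """Evaluation: absolute gap between real and synthetic column means."""
    real_cols = zip(*real_rows)
    synth_cols = zip(*synthetic_rows)
    return [abs(statistics.mean(r) - statistics.mean(s))
            for r, s in zip(real_cols, synth_cols)]


# Hypothetical toy table with columns (age, income).
real = [[23, 31000.0], [35, 52000.0], [41, 61000.0],
        [29, 40000.0], [52, 75000.0]]

rng = random.Random(0)  # fixed seed so the run is repeatable
params = fit_gaussian(real)
synthetic = post_process(synthesize(params, 200, rng), real, integer_columns={0})
gaps = evaluate(real, synthetic)
print(gaps)  # one mean gap per column; smaller suggests closer marginals
```

A realistic pipeline would replace `fit_gaussian` and `synthesize` with one of the generative families the abstract lists (for example a GAN, a diffusion model, or an LLM) and would evaluate with richer fidelity, utility, and privacy measures than a per-column mean gap.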

Related Papers

A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
Transformer-Based Person Identification via Wi-Fi CSI Amplitude and Phase Perturbations (2025-07-17)
Privacy-Preserving Fusion for Multi-Sensor Systems Under Multiple Packet Dropouts (2025-07-17)
Federated Learning in Open- and Closed-Loop EMG Decoding: A Privacy and Performance Perspective (2025-07-16)
Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)
A Privacy-Preserving Framework for Advertising Personalization Incorporating Federated Learning and Differential Privacy (2025-07-16)
ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs (2025-07-15)