Description
Unlike synthetic data generators that rely on increasingly complex and resource-heavy architectures, TabularARGN adopts a more focused and efficient model design. These design choices result in:
- High Fidelity: TabularARGN achieves synthetic data quality on par with state-of-the-art (SOTA) models.
- Privacy by Design: TabularARGN samples only from privacy-preserving value ranges and has built-in privacy protection features. It can also be trained with DP-SGD to obtain differential privacy guarantees.
- Simplicity: TabularARGN leverages existing building blocks, and thus can be easily implemented within standard deep learning frameworks.
- Compute Efficiency: With training speeds up to 100x faster, TabularARGN scales effectively, even for large and complex datasets.
- Sampling Flexibility: TabularARGN supports advanced sampling capabilities, including:
  - Conditional generation to create targeted datasets.
  - Missing value imputation to handle incomplete data seamlessly.
  - Fairness adjustments to align with ethical data synthesis goals.
  - Controlling sampling probabilities via temperature adjustments to balance rule-adherence with data diversity.
- Data Versatility: TabularARGN accommodates the heterogeneity of real-world tabular datasets, including:
  - Multi-variate, mixed-type data (categorical, numerical, date-time, geo-spatial).
  - Multi-sequence datasets with varying sequence lengths and varying time intervals.
  - Missing values.
- Robustness in Training: TabularARGN delivers high-quality synthetic data with default settings and remains consistent across several training runs.
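
The temperature adjustment mentioned under Sampling Flexibility can be illustrated with a minimal, generic sketch. This is not TabularARGN's API: the function names `temperature_probs` and `sample_category`, and the example scores, are hypothetical and chosen only for illustration. The sketch assumes the model yields one raw score (logit) per category for a column; dividing the scores by a temperature below 1 concentrates probability on the most likely categories, while a temperature above 1 flattens the distribution and increases diversity.

```python
import math
import random


def temperature_probs(logits, temperature=1.0):
    """Convert raw scores into probabilities, rescaled by a temperature.

    Hypothetical helper for illustration; not part of any library.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_category(categories, logits, temperature=1.0, rng=random):
    """Draw one category according to the temperature-scaled probabilities."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(categories, weights=probs, k=1)[0]


# Made-up scores for a three-valued categorical column.
categories = ["A", "B", "C"]
logits = [2.0, 1.0, 0.0]

sharp = temperature_probs(logits, temperature=0.5)    # favors "A" more strongly
neutral = temperature_probs(logits, temperature=1.0)  # unmodified distribution
flat = temperature_probs(logits, temperature=2.0)     # closer to uniform

drawn = sample_category(categories, logits, temperature=0.5)
```

In this sketch, lowering the temperature makes samples follow the dominant patterns more closely, and raising it spreads probability onto rarer values, which is the trade-off between rule-adherence and diversity described above.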