Designing DSIC Mechanisms for Data Sharing in the Era of Large Language Models

Seyed Moein Ayyoubzadeh, Kourosh Shahnazari, Mohammmadali Keshtparvar, Mohammadamin Fazli

2025-06-01

Privacy Preserving

Abstract

Training large language models (LLMs) requires vast amounts of high-quality data from institutions that face legal, privacy, and strategic constraints. Existing data procurement methods often rely on unverifiable trust or ignore heterogeneous provider costs. We introduce a mechanism-design framework for truthful, trust-minimized data sharing that ensures dominant-strategy incentive compatibility (DSIC), individual rationality, and weak budget balance, while rewarding data based on both quality and learning utility. We formalize a model where providers privately know their data cost and quality, and value arises solely from the data's contribution to model performance. Based on this, we propose the Quality-Weighted Marginal-Incentive Auction (Q-MIA), which ranks providers using a virtual cost metric and uses Myerson-style payments to ensure DSIC and budget feasibility. To support settings with limited liquidity or long-term incentives, we introduce the Marginal Utility Token (MUT), which allocates future rights based on marginal contributions. We unify these in Mixed-MIA, a hybrid mechanism balancing upfront payments and deferred rewards. All mechanisms support verifiable, privacy-preserving implementation. Theoretically and empirically, they outperform volume-based and trust-based baselines, eliciting higher-quality data under budget constraints while remaining robust to misreporting and collusion. This establishes a principled foundation for sustainable and fair data markets for future LLMs.
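To make the idea of ranking by a cost-per-quality score and paying threshold (Myerson-style) prices concrete, the following is a minimal sketch and not the paper's Q-MIA. It rests on assumptions that the abstract does not state: quality is treated as verifiable rather than privately reported, the buyer selects a fixed number k of providers instead of enforcing a budget, and the "virtual cost" is simply the reported cost divided by quality. The function name `threshold_auction` and its parameters are hypothetical.

```python
from typing import List, Tuple


def threshold_auction(
    costs: List[float], qualities: List[float], k: int
) -> Tuple[List[int], List[float]]:
    """Pick the k providers with the lowest cost-per-quality ratio and pay
    each winner its critical bid (illustrative sketch, not Q-MIA)."""
    n = len(costs)
    if len(qualities) != n:
        raise ValueError("costs and qualities must have the same length")
    if not 0 < k < n:
        # At least one losing provider is needed to set the threshold price.
        raise ValueError("k must satisfy 0 < k < number of providers")
    if any(q <= 0 for q in qualities):
        raise ValueError("qualities must be positive")

    # Assumed virtual cost: reported cost per unit of (verifiable) quality.
    ratios = [c / q for c, q in zip(costs, qualities)]

    # Rank ascending by ratio; break ties by provider index.
    order = sorted(range(n), key=lambda i: (ratios[i], i))
    winners = order[:k]

    # The first excluded ratio is the per-quality threshold price.
    price = ratios[order[k]]

    # A winner's payment depends only on the others' reports and on its own
    # quality, so misreporting cost cannot raise its payment.
    payments = [0.0] * n
    for i in winners:
        payments[i] = qualities[i] * price
    return sorted(winners), payments


if __name__ == "__main__":
    winners, payments = threshold_auction(
        costs=[4.0, 3.0, 10.0, 6.0], qualities=[2.0, 1.0, 2.0, 4.0], k=2
    )
    print(winners, payments)
```

Under these assumptions each winner is paid at least its reported cost (individual rationality), and a provider that inflates its cost either receives the same payment or drops out of the winning set. The sketch does not cap total payments, so it says nothing about budget feasibility or weak budget balance.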
