Probably Approximately Correct Labels
Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic
Abstract
Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such "expert" labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
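To make the idea concrete, here is a minimal illustrative sketch, not the paper's actual algorithm: hold out a small expert-labeled calibration set, then find a model-confidence threshold above which AI predictions can be auto-accepted while keeping the error rate below a target `eps` with probability at least `1 - delta`, via a Hoeffding bound. All data and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration set (illustrative only): model confidence
# scores, and whether the model's prediction matched the expert label.
n_cal = 5000
conf = rng.uniform(0.5, 1.0, n_cal)
correct = rng.uniform(0.0, 1.0, n_cal) < conf  # accuracy grows with confidence

def calibrate_threshold(conf, correct, eps=0.10, delta=0.05):
    """Return the smallest confidence threshold t such that, among
    calibration points with confidence >= t, the empirical error rate
    plus a Hoeffding buffer sqrt(log(1/delta) / (2 n)) is at most eps.
    For a fixed t, Hoeffding's inequality then gives: with probability
    at least 1 - delta, the true error rate above t is at most eps."""
    for t in np.sort(np.unique(conf)):
        accepted = conf >= t
        n = int(accepted.sum())
        if n == 0:
            continue
        err = 1.0 - correct[accepted].mean()
        if err + np.sqrt(np.log(1.0 / delta) / (2.0 * n)) <= eps:
            return float(t)
    return None  # no certifiable threshold: route everything to experts

t = calibrate_threshold(conf, correct)
# Points with confidence >= t receive the AI label; the rest go to experts.
```

Two caveats: scanning many candidate thresholds strictly requires a union bound (or a pre-fixed threshold) for the stated guarantee to hold exactly, and the paper's method may differ substantially from this sketch.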