Probably Approximately Correct Labels
Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic
Abstract
Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such "expert" labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
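To make the idea concrete, here is a minimal illustrative sketch, not the paper's actual algorithm: hold out a small expert-labeled calibration set, then find a model-confidence threshold above which AI predictions can be auto-accepted while keeping the error rate below a target `eps` with probability at least `1 - delta`, via a Hoeffding bound. All data and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration set (illustrative only): model confidence
# scores, and whether the model's prediction matched the expert label.
n_cal = 5000
conf = rng.uniform(0.5, 1.0, n_cal)
correct = rng.uniform(0.0, 1.0, n_cal) < conf  # accuracy grows with confidence

def calibrate_threshold(conf, correct, eps=0.10, delta=0.05):
    """Return the smallest confidence threshold t such that, among
    calibration points with confidence >= t, the empirical error rate
    plus a Hoeffding buffer sqrt(log(1/delta) / (2 n)) is at most eps.
    For a fixed t, Hoeffding's inequality then gives: with probability
    at least 1 - delta, the true error rate above t is at most eps."""
    for t in np.sort(np.unique(conf)):
        accepted = conf >= t
        n = int(accepted.sum())
        if n == 0:
            continue
        err = 1.0 - correct[accepted].mean()
        if err + np.sqrt(np.log(1.0 / delta) / (2.0 * n)) <= eps:
            return float(t)
    return None  # no certifiable threshold: route everything to experts

t = calibrate_threshold(conf, correct)
# Points with confidence >= t receive the AI label; the rest go to experts.
```

Two caveats: scanning many candidate thresholds strictly requires a union bound (or a pre-fixed threshold) for the stated guarantee to hold exactly, and the paper's method may differ substantially from this sketch.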