OCR-IDL

OCR Annotations for Industry Document Library Dataset

ImagesTextsIntroduced 2022-02-25

The OCR-IDL dataset comprises the OCR annotations for a subset of 26M pages of the large-scale IDL document library. These annotations have a monetary value over $20,000 and are made publicly available with the aim of advancing the Document Intelligence research field. Our motivation is two-fold: First, by making these annotations public, we aim to level the differences between research groups and companies who have big private datasets to pre/train on. And second, we make use of a commercial OCR engine to obtain high quality annotations, leading to reduce the noise provided by OCR on pretraining and downstream tasks.