RVL-CDIP_MP

RVL-CDIP multi-page

Modalities: Images, Texts | License: Apache 2.0 | Introduced: 2023-08-24

RVL-CDIP_MP is our first effort to retrieve the original documents from the IIT-CDIP test collection that were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which is why the dataset has around 500 fewer instances than RVL-CDIP. Using metadata from OCR-IDL, we matched the original IIT-CDIP identifiers and retrieved the corresponding documents from IDL using a conversion.
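The matching step can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `match_documents`, the shape of the lookup table, the lowercase/strip normalization, and the toy identifiers are all assumptions made for illustration. It only shows the general idea of looking up each RVL-CDIP identifier in an index built from metadata and keeping track of the documents that cannot be resolved.

```python
from typing import Dict, Iterable, List, Tuple


def match_documents(
    rvl_ids: Iterable[str],
    idl_index: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Map each identifier to a record key, collecting the ones that fail.

    Hypothetical inputs: `rvl_ids` stands for identifiers taken from RVL-CDIP,
    `idl_index` for a lookup table built from OCR-IDL metadata.
    """
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for doc_id in rvl_ids:
        # Assumed normalization; the real conversion is not described in the source.
        key = doc_id.strip().lower()
        if key in idl_index:
            matched[doc_id] = idl_index[key]
        else:
            unmatched.append(doc_id)
    return matched, unmatched


# Toy placeholder values, not real IIT-CDIP or IDL identifiers.
idl_index = {"doc-a": "idl-001", "doc-b": "idl-002"}
matched, unmatched = match_documents(["DOC-A", "doc-b ", "doc-c"], idl_index)
```

In this sketch, identifiers that cannot be resolved end up in `unmatched`, which is the kind of bookkeeping that would surface a shortfall such as the missing instances mentioned above.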

It has the same 16-class label taxonomy as RVL-CDIP and contains close to 400K documents in PDF format, averaging 5 pages per document.