Belfort

The Belfort dataset: Handwritten Text Recognition from Crowdsourced Annotations

Images · Texts · Creative Commons Attribution 4.0 International · Introduced 2023-06-15

The Belfort dataset

This dataset includes minutes of the Belfort municipal council drawn up between 1790 and 1946. The documents include deliberations, lists of councillors, convocations, and agendas. The dataset contains 24,105 text-line images that were automatically detected from the pages. Up to 4 transcriptions are available for each line image: two from humans, and two from automatic models.

Files are organized in three folders: Images, Transcriptions, and Partitions.

Images

The dataset includes 24,105 text-line images that were automatically detected using a generic Doc-UFCN model and resized to a fixed height of 128 pixels.

Transcriptions

Up to 4 transcriptions are available for each image, as summarized in the following table:

| Folder | N transcriptions | Description | Comments |
|:----------:|-----------------:|-----------------------------|---------------------------------------------------------------------------|
| callico_1/ | 24,105 | Human annotation n°1 | All lines have at least one human annotation |
| callico_2/ | 8,878 | Human annotation n°2 | Only 33% of lines have two different human annotations |
| dan/ | 24,102 | DAN automatic model | 3 images have empty transcriptions (no text was predicted by the model) |
| pylaia/ | 23,536 | PyLaia automatic model | 569 images have empty transcriptions (no text was predicted by the model) |
| rasa/ | 23,287 | RASA aggregation algorithm | 818 images have empty transcriptions |
| rover/ | 24,104 | ROVER aggregation algorithm | 1 image has an empty transcription |
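
Collecting the available transcriptions for one line could look like the minimal sketch below. The folder names come from the table. The one-file-per-line layout, the `.txt` extension, and the helper name `load_transcriptions` are assumptions made for illustration, not the documented layout of the archive.

```python
from pathlib import Path

# Folder names taken from the table above.
SOURCES = ["callico_1", "callico_2", "dan", "pylaia", "rasa", "rover"]


def load_transcriptions(root, line_id, sources=SOURCES):
    """Return {source: text} for every non-empty transcription of one line.

    Assumption (hypothetical layout): each source folder holds one UTF-8
    file named <line_id>.txt per transcribed line.
    """
    found = {}
    for source in sources:
        path = Path(root) / source / f"{line_id}.txt"
        if not path.is_file():
            continue  # e.g. no second human annotation for this line
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty transcriptions such as those counted in the table
            found[source] = text
    return found
```

With this layout, a line that has only one human annotation and an empty PyLaia prediction would simply come back without the `callico_2` and `pylaia` keys.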

Data partition

We provide two distinct splits, both of them containing 19,013 training images, 2,262 validation images and 2,830 test images.

  • The Agreement-based split ensures the reliability of the test set:
    • The test set includes lines with perfect agreement between human annotators (Character Error Rate = 0%);
    • The validation set includes lines with good agreement between human annotators (0% < Character Error Rate < 5%);
    • The training set includes all the other lines.
  • The Random split assigns lines to the training, validation and test sets at random.
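
The agreement-based rules above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the Character Error Rate is taken as the Levenshtein distance divided by the length of the first annotation, and lines with a single human annotation are assumed to fall under "all the other lines". The function names are hypothetical.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def agreement_cer(first, second):
    """CER between two human annotations, normalised by the first one.

    Assumption: the first annotation acts as the reference; the dataset
    description does not say which annotator is used for normalisation.
    """
    if not first:
        return 0.0 if not second else 1.0
    return edit_distance(first, second) / len(first)


def assign_subset(first, second=None):
    """Map a line to 'train', 'val' or 'test' following the rules above."""
    if second is None:
        return "train"  # assumption: single-annotation lines are "other" lines
    cer = agreement_cer(first, second)
    if cer == 0:
        return "test"   # perfect agreement
    if cer < 0.05:
        return "val"    # good agreement, 0% < CER < 5%
    return "train"
```

For example, two identical annotations send the line to the test set, while a single differing character in a 40-character line (CER of 2.5%) sends it to the validation set.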

Evaluation

Evaluation results in the paper are computed by comparing predictions to human annotations. Automatic and aggregated transcriptions are only used during model training.
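
A comparison of predictions against human annotations can be sketched with a corpus-level Character Error Rate. This section does not state the exact metric or aggregation used in the paper, so the micro-average below (total edit distance over total reference length) and the names `levenshtein` and `corpus_cer` are assumptions for illustration only.

```python
def levenshtein(ref, hyp):
    """Character-level edit distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def corpus_cer(references, predictions):
    """Micro-averaged CER: total edit distance over total reference length.

    `references` are human annotations, `predictions` are model outputs,
    aligned line by line.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must be aligned")
    errors = sum(levenshtein(r, p) for r, p in zip(references, predictions))
    length = sum(len(r) for r in references)
    if length == 0:
        raise ValueError("references contain no characters")
    return errors / length
```

Micro-averaging weights long lines more heavily than short ones. Averaging per-line CER values instead would be an equally plausible choice and gives different numbers.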