The Belfort dataset

This dataset includes minutes of the Belfort municipal council drawn up between 1790 and 1946. Documents include deliberations, lists of councillors, convocations, and agendas. The dataset contains 24,105 text-line images that were automatically detected from the pages. Up to 4 transcriptions are available for each line image: two from human annotators and two from automatic models.

Files are organized in three folders: Images, Transcriptions, and Partitions.

Images

The dataset includes 24,105 text-line images that were automatically detected using a generic Doc-UFCN model and resized to a fixed height of 128 pixels.
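
As a rough illustration of the resizing step, the sketch below computes the output size of a line image scaled to a height of 128 pixels. It assumes the aspect ratio is preserved, which the description above does not state explicitly, and the function name `resized_size` is hypothetical.

```python
def resized_size(width: int, height: int, target_height: int = 128) -> tuple:
    """Return (new_width, new_height) for a line image scaled to a fixed height.

    Assumption: the aspect ratio is preserved (not confirmed by the dataset description).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target_height / height
    # Keep at least one pixel of width for very narrow crops.
    new_width = max(1, round(width * scale))
    return new_width, target_height


# A 1500x96 crop is enlarged, a 640x256 crop is reduced.
print(resized_size(1500, 96))
print(resized_size(640, 256))
```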

Transcriptions

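Up to 4 transcriptions (two human, two automatic) are available for each image, along with two aggregated transcriptions produced by the RASA and ROVER algorithms, as summarized in the following table: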

| Folder     | N transcriptions | Description                 | Comments                                                                  |
|:----------:|-----------------:|-----------------------------|---------------------------------------------------------------------------|
| callico_1/ |           24,105 | Human annotation n°1        | All lines have at least one human annotation                              |
| callico_2/ |            8,878 | Human annotation n°2        | Only 33% of lines have two different human annotations                    |
| dan/       |           24,102 | DAN automatic model         | 3 images have empty transcriptions (no text was predicted by the model)   |
| pylaia/    |           23,536 | PyLaia automatic model      | 569 images have empty transcriptions (no text was predicted by the model) |
| rasa/      |           23,287 | RASA aggregation algorithm  | 818 images have empty transcriptions                                      |
| rover/     |           24,104 | ROVER aggregation algorithm | 1 image has an empty transcription                                        |
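
The following is a minimal sketch of how the transcription folders could be read and how empty transcriptions could be skipped. The file layout (`<root>/<folder>/<line_id>.txt`, UTF-8 encoded) is an assumption made for illustration and is not confirmed by the description above; the helper names `load_transcriptions` and `available_sources` are hypothetical.

```python
from pathlib import Path

# Folder names taken from the table above.
SOURCES = ["callico_1", "callico_2", "dan", "pylaia", "rasa", "rover"]


def load_transcriptions(root: Path) -> dict:
    """Map each source folder to {line_id: text}, skipping empty transcriptions.

    Assumed layout (hypothetical): <root>/<source>/<line_id>.txt, UTF-8 encoded.
    """
    transcriptions = {}
    for source in SOURCES:
        folder = root / source
        texts = {}
        if folder.is_dir():  # a missing folder yields no transcriptions
            for path in sorted(folder.glob("*.txt")):
                text = path.read_text(encoding="utf-8").strip()
                if text:  # empty predictions are not counted
                    texts[path.stem] = text
        transcriptions[source] = texts
    return transcriptions


def available_sources(transcriptions: dict, line_id: str) -> list:
    """List the sources that provide a non-empty transcription for a line."""
    return [source for source, texts in transcriptions.items() if line_id in texts]
```

Under these assumptions, `len(load_transcriptions(root)["dan"])` would give the number of non-empty DAN transcriptions, which can be compared with the counts in the table.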

Data partition

We provide two distinct splits, both of them containing 19,013 training images, 2,262 validation images and 2,830 test images.
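
As a quick sanity check, the short snippet below verifies that the three sets add up to the 24,105 line images and prints the share of each set. Only the counts come from the description above; the variable names are illustrative.

```python
# Split sizes as stated above (identical for both splits).
SPLIT_SIZES = {"train": 19_013, "val": 2_262, "test": 2_830}

total = sum(SPLIT_SIZES.values())
shares = {name: 100 * size / total for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size} lines ({shares[name]:.1f}% of {total})")
```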

The Agreement-based split ensures the reliability of the test set:
- The test set includes lines with perfect agreement between human annotators (Character Error Rate = 0%);
- The validation set includes lines with good agreement between human annotators (0% < Character Error Rate < 5%);
- The training set includes all the other lines.

The Random split assigns lines to the training, validation and test sets at random.
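
The agreement criteria listed above can be sketched as follows. This is an illustration only, not the procedure used to build the published partitions: it assumes that the Character Error Rate is the Levenshtein distance divided by the length of the first annotation, and that lines with a single human annotation go to the training set. The function names are hypothetical, and the sketch does not reproduce the fixed set sizes.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate in percent, normalized by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 100.0
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def assign_partition(first: str, second: Optional[str]) -> str:
    """Assign a line to a set from the agreement between its human annotations."""
    if second is None:  # assumption: single-annotation lines are training lines
        return "train"
    rate = cer(first, second)
    if rate == 0:
        return "test"   # perfect agreement
    if rate < 5:
        return "val"    # good agreement
    return "train"      # all the other lines


print(assign_partition("Séance du 3 mai 1872", "Séance du 3 mai 1872"))  # test
print(assign_partition("Le conseil municipal de Belfort",
                       "Le conseil municipal de Belford"))               # val
print(assign_partition("Budget", "Budjet"))                              # train
```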

Evaluation

Evaluation results in the paper are computed by comparing predictions to human annotations. Automatic and aggregated transcriptions are only used during model training.
