


SIMARA: a database for key-value information extraction from full pages

Solène Tarride, Mélodie Boillet, Jean-François Moufflet, Christopher Kermorvant

2023-04-26
Handwriting Recognition, Handwritten Text Recognition, Named Entity Recognition (NER), Key Information Extraction

Abstract

We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th to the 20th century. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents. Each document is annotated at page level and contains seven fields to retrieve. The location of each field is not provided, so the dataset encourages research on segmentation-free systems for information extraction. We propose a model based on the Transformer architecture trained for end-to-end information extraction, and we provide three sets for training, validation, and testing to ensure fair comparison with future work. The database is freely accessible at https://zenodo.org/record/7868059.

Results

Task                                 Dataset  Metric   Value  Model
Optical Character Recognition (OCR)  SIMARA   CER (%)   6.46  DAN
Optical Character Recognition (OCR)  SIMARA   WER (%)  14.79  DAN
Handwritten Text Recognition         SIMARA   CER (%)   6.46  DAN
Handwritten Text Recognition         SIMARA   WER (%)  14.79  DAN
Key Information Extraction           SIMARA   F1 (%)   95.05  DAN
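
For readers unfamiliar with the CER and WER metrics reported above, the following is a minimal sketch of how character and word error rates are commonly computed: total Levenshtein edit distance divided by total reference length. The page does not say how the reported figures were obtained (text normalization, tokenization, or per-page versus corpus-level aggregation), so the function names `edit_distance`, `error_rate`, `cer`, and `wer`, the whitespace tokenization, and the sample strings are all illustrative assumptions rather than the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edits over total reference length, as a percentage.

    Assumption: errors are aggregated over the whole corpus rather than
    averaged per page; the source does not specify which is used.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize(ref), tokenize(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * total_edits / total_len


def cer(refs, hyps):
    """Character error rate: every character, including spaces, is a token."""
    return error_rate(refs, hyps, list)


def wer(refs, hyps):
    """Word error rate: tokens are whitespace-separated words (an assumption)."""
    return error_rate(refs, hyps, str.split)


if __name__ == "__main__":
    # Made-up example strings, not taken from the SIMARA dataset.
    references = ["Jean Dupont 1789"]
    predictions = ["Jean Dupond 1789"]
    print(f"CER: {cer(references, predictions):.2f}%")
    print(f"WER: {wer(references, predictions):.2f}%")
```

A single wrong character costs one edit out of 16 characters for CER but one edit out of 3 words for WER, which is why WER is typically the larger of the two figures, as in the table above.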

Related Papers

Advancing Offline Handwritten Text Recognition: A Systematic Review of Data Augmentation and Generation Techniques (2025-07-08)
Flippi: End To End GenAI Assistant for E-Commerce (2025-07-08)
PaddleOCR 3.0 Technical Report (2025-07-08)
Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models (2025-06-28)
Class-Agnostic Region-of-Interest Matching in Document Images (2025-06-26)
A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features (2025-06-25)
Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition (2025-06-11)
Creating a Historical Migration Dataset from Finnish Church Records, 1800-1920 (2025-06-09)