Chong Zhang, Ya Guo, Yi Tu, Huan Chen, Jinyang Tang, Huijia Zhu, Qi Zhang, Tao Gui
Recent advances in multimodal pre-trained models have significantly improved information extraction from visually-rich documents (VrDs), in which named entity recognition (NER) is treated as a sequence-labeling task of predicting BIO entity tags for tokens, following the typical NLP setting. However, the BIO-tagging scheme relies on the correct order of model inputs, which is not guaranteed in real-world NER on scanned VrDs, where text is recognized and arranged by OCR systems. This reading-order issue prevents the BIO-tagging scheme from marking entities accurately, making it impossible for sequence-labeling methods to predict correct named entities. To address the reading-order issue, we introduce Token Path Prediction (TPP), a simple prediction head that predicts entity mentions as token sequences within documents. As an alternative to token classification, TPP models the document layout as a complete directed graph of tokens and predicts token paths within the graph as entities. For better evaluation of VrD-NER systems, we also propose two revised benchmark datasets for NER on scanned documents that reflect real-world scenarios. Experimental results demonstrate the effectiveness of our method and suggest its potential to be a universal solution to various information extraction tasks on documents.
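The contrast between BIO tagging and token paths can be illustrated with a toy example. The sketch below is not the authors' implementation: the function names, the representation of predicted links as `(source, target)` index pairs, and the assumption that each token has at most one incoming and one outgoing link are all hypothetical simplifications, and entity types and single-token entities are ignored. It assumes a two-column layout where OCR reads row by row, so the tokens of two entities arrive interleaved.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect entity mentions from BIO tags, reading tokens in the given order."""
    entities: List[List[str]] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity begins; close the one in progress, if any.
            if current:
                entities.append(current)
            current = [token]
        elif tag == "I" and current:
            # Continue the entity in progress with the next token in input order.
            current.append(token)
        else:
            if current:
                entities.append(current)
            current = []
    if current:
        entities.append(current)
    return [" ".join(entity) for entity in entities]


def decode_token_paths(tokens: List[str], links: List[Tuple[int, int]]) -> List[str]:
    """Follow directed links (i -> j) between token indices to recover mentions.

    Illustrative assumption: every token has at most one outgoing and one
    incoming link and the links contain no cycles.
    """
    successors = dict(links)
    has_incoming = {target for _, target in links}
    # A path starts at any linked token that no other token points to.
    starts = sorted(i for i in successors if i not in has_incoming)
    entities = []
    for start in starts:
        path = [start]
        while path[-1] in successors:
            path.append(successors[path[-1]])
        entities.append(" ".join(tokens[i] for i in path))
    return entities


# Left column reads "Total / Amount", right column reads "Acme / Corp";
# a row-by-row OCR pass interleaves the two entities.
ocr_tokens = ["Total", "Acme", "Amount", "Corp"]

# Even perfectly predicted BIO tags cannot express the interleaved spans.
bio_tags = ["B", "B", "I", "I"]
print(decode_bio(ocr_tokens, bio_tags))          # ['Total', 'Acme Amount Corp']

# Links between token indices are independent of the input order.
predicted_links = [(0, 2), (1, 3)]
print(decode_token_paths(ocr_tokens, predicted_links))  # ['Total Amount', 'Acme Corp']
```

With the same disordered input, the BIO decoder merges tokens from different entities, while the path decoder recovers both mentions because it only depends on which token follows which.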
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | FUNSD | F1 | 79.2 | TPP (LayoutMask) |
| Entity Linking | FUNSD | F1 | 79.2 | TPP (LayoutMask) |
| Named Entity Recognition (NER) | FUNSD-r | F1 | 80.4 | TPP (LayoutLMv3) |
| Named Entity Recognition (NER) | FUNSD-r | F1 | 78.19 | TPP (LayoutMask) |
| Named Entity Recognition (NER) | CORD-r | F1 | 91.85 | TPP (LayoutLMv3) |
| Named Entity Recognition (NER) | CORD-r | F1 | 89.34 | TPP (LayoutMask) |
| Semantic entity labeling | FUNSD | F1 | 85.16 | TPP (LayoutMask) |
| Key Information Extraction | CORD | F1 | 96.92 | TPP (LayoutMask) |
| Key Information Extraction | RFUND-EN | Key-value pair F1 | 50.27 | TPP (LayoutLMv3-base) |
| Reading Order Detection | ROOR | Segment-level F1 | 42.96 | TPP (LayoutLMv3-base) |
| Reading Order Detection | ReadingBank | Average Page-level BLEU | 98.16 | TPP (LayoutMask) |
| Reading Order Detection | ReadingBank | Average Relative Distance (ARD) | 0.37 | TPP (LayoutMask) |