SOTAB: The WDC Schema.org Table Annotation Benchmark

Keti Korini, Ralph Peeters, Christian Bizer

2023-01-09 | SemTab@ISWC 2023 | Column Type Annotation, Columns Property Annotation, Data Integration, Table Annotation

Abstract

Understanding the semantics of table elements is a prerequisite for many data integration and data discovery tasks. Table annotation is the task of labeling table elements with terms from a given vocabulary. This paper presents the WDC Schema.org Table Annotation Benchmark (SOTAB) for comparing the performance of table annotation systems. SOTAB covers the column type annotation (CTA) and columns property annotation (CPA) tasks. SOTAB provides ∼50,000 annotated tables for each of the tasks containing Schema.org data from different websites. The tables cover 17 different types of entities such as movie, event, local business, recipe, job posting, or product. The tables stem from the WDC Schema.org Table Corpus which was created by extracting Schema.org annotations from the Common Crawl. Consequently, the labels used for annotating columns in SOTAB are part of the Schema.org vocabulary. The benchmark covers 91 types for CTA and 176 properties for CPA distributed across textual, numerical and date/time columns. The tables are split into fixed training, validation and test sets. The test sets are further divided into subsets focusing on specific challenges, such as columns with missing values or different value formats, in order to allow a more fine-grained comparison of annotation systems. The evaluation of SOTAB using Doduo and TURL shows that the benchmark is difficult to solve for current state-of-the-art systems.
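To make the two tasks concrete, here is a minimal sketch of what CTA and CPA annotations could look like for a single toy table. The table values, the column identifiers, the dictionary layout, and the helper `check_annotations` are all assumptions made for illustration and are not SOTAB's actual file format or label set. The labels are written in the style of Schema.org terms.

```python
# Toy table (hypothetical values); col0 is treated as the subject column.
table = {
    "col0": ["Inception", "Parasite"],
    "col1": ["2010-07-16", "2019-05-30"],
    "col2": ["Christopher Nolan", "Bong Joon-ho"],
}

# CTA: one label per column, describing the type of the column's values.
cta_labels = {"col0": "Movie", "col1": "Date", "col2": "Person"}

# CPA: one label per (subject column, other column) pair, describing
# the relationship between the two columns.
cpa_labels = {
    ("col0", "col1"): "datePublished",
    ("col0", "col2"): "director",
}


def check_annotations(table, cta, cpa):
    """Return True if every annotation refers to a column present in the table."""
    columns = set(table)
    cta_ok = all(col in columns for col in cta)
    cpa_ok = all(src in columns and dst in columns for src, dst in cpa)
    return cta_ok and cpa_ok


print(check_annotations(table, cta_labels, cpa_labels))
```

In this sketch, CTA assigns a label to each column on its own, while CPA assigns a label to a pair of columns, which is why the two annotation sets are keyed differently.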
