Sato: Contextual Semantic Type Detection in Tables

Dan Zhang, Yoshihiko Suhara, Jinfeng Li, Madelon Hulsebos, Çağatay Demiralp, Wang-Chiew Tan

2019-11-14
Structured Prediction, Column Type Annotation, Hybrid Machine Learning, Information Retrieval, Vocal Bursts Type Prediction, Retrieval


Abstract

Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns, or rely on large sample sizes for training data. We introduce Sato, a hybrid machine learning model that automatically detects the semantic types of columns in tables, exploiting signals from the table context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.925 and 0.735, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.
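The abstract reports two averages of per-type F1, and the gap between them is easier to see with a small worked example. The following is a minimal sketch, not the authors' evaluation code: the function names, the two toy labels (`name`, `isbn`), and the predictions are all made up for illustration. Macro F1 averages per-type scores equally, while support-weighted F1 weights each type by how many columns carry it, so frequent types dominate the weighted score.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label present in y_true.

    Assumption: labels that appear only in y_pred are ignored, which is
    fine for this toy example but differs from some library defaults.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    support = Counter(y_true)
    scores = {}
    for label in sorted(support):
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores


def macro_f1(scores):
    """Unweighted mean of per-type F1."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of per-type F1, weighted by each type's support."""
    total = sum(s for _, s in scores.values())
    return sum(f1 * s for f1, s in scores.values()) / total


# Hypothetical data: a frequent type ("name") and a rare one ("isbn").
y_true = ["name"] * 8 + ["isbn"] * 2
y_pred = ["name"] * 8 + ["name", "isbn"]  # one rare column is misclassified

scores = per_class_f1(y_true, y_pred)
print(f"macro F1:    {macro_f1(scores):.3f}")     # 0.804
print(f"weighted F1: {weighted_f1(scores):.3f}")  # 0.886
```

In this toy case a single error on the rare type pulls the macro score well below the weighted one. A similar pattern, where rare types are harder to predict, is one plausible reading of why the reported macro F1 (0.735) sits below the support-weighted F1 (0.925).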

Results

Task	Dataset	Metric	Value (%)	Model
Data Integration	VizNet-Sato-Full	Macro-F1	75.6	Sato
Data Integration	VizNet-Sato-Full	Weighted-F1	90.2	Sato
Data Integration	VizNet-Sato-MultiColumn	Macro-F1	73.5	Sato
Data Integration	VizNet-Sato-MultiColumn	Weighted-F1	92.5	Sato
Table annotation	VizNet-Sato-Full	Macro-F1	75.6	Sato
Table annotation	VizNet-Sato-Full	Weighted-F1	90.2	Sato
Table annotation	VizNet-Sato-MultiColumn	Macro-F1	73.5	Sato
Table annotation	VizNet-Sato-MultiColumn	Weighted-F1	92.5	Sato
