Deep Entity Matching with Pre-Trained Language Models

Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan

2020-04-01 · Entity Resolution · Data Augmentation
Paper · PDF · Code (official)

Abstract

We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models. We cast EM as a sequence-pair classification problem and fine-tune the models, which lets us leverage them with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves matching quality, outperforming the previous state-of-the-art (SOTA) by up to 29% in F1 score on benchmark datasets. We also develop three optimization techniques that further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting pieces of the input that may be important for matching decisions. It also summarizes strings that are too long, so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples; this forces Ditto to learn "harder" and improves its matching capability. These optimizations boost Ditto's performance by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the labeled data. We further demonstrate Ditto's effectiveness on a real-world large-scale EM task: on matching two company datasets of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
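
To make the sequence-pair formulation concrete, here is a minimal sketch (not the official Ditto code) that serializes two records in the COL/VAL style described in the paper and scores them as a pair with a BERT-family classifier via Hugging Face transformers. The `serialize` helper, the checkpoint choice, and the example records are illustrative assumptions; in practice the classification head would first be fine-tuned on labeled pairs.

```python
# Hypothetical sketch of EM as sequence-pair classification (not the official Ditto code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def serialize(record: dict) -> str:
    # Ditto-style serialization: each attribute becomes "COL <name> VAL <value>".
    return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

# Assumption: any BERT-family checkpoint works here; the paper reports
# results with BERT, DistilBERT, and RoBERTa.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # match / no-match
)

left = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
right = {"title": "instant immers spanish dlux 2", "price": "36.11"}

# The two serialized records form one sequence pair; the tokenizer inserts
# the [CLS]/[SEP] structure so the model classifies the pair jointly.
inputs = tokenizer(serialize(left), serialize(right),
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
match_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(match) = {match_prob:.3f}")  # ~0.5 until the head is fine-tuned
```

The data augmentation optimization can be sketched the same way. Span deletion is one of the augmentation operators the paper describes; the implementation below is an assumed, simplified version, and it omits Ditto's MixDA step, which interpolates augmented and original examples.

```python
# Hypothetical sketch of one Ditto-style augmentation operator: span deletion.
import random

def span_deletion(serialized: str, max_len: int = 4, seed=None) -> str:
    """Delete a random short span of tokens to create a 'harder' training example."""
    rng = random.Random(seed)
    tokens = serialized.split()
    if len(tokens) <= max_len:
        return serialized
    span = rng.randint(1, max_len)
    start = rng.randint(0, len(tokens) - span)
    return " ".join(tokens[:start] + tokens[start + span:])

print(span_deletion("COL title VAL instant immersion spanish deluxe 2.0", seed=7))
```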

Results

Task               | Dataset              | Metric | Value | Model
Data Integration   | Abt-Buy              | F1 (%) | 89.33 | Ditto
Data Integration   | WDC Computers-xlarge | F1 (%) | 95.45 | Ditto
Data Integration   | WDC Watches-small    | F1 (%) | 85.12 | Ditto
Data Integration   | Amazon-Google        | F1 (%) | 75.58 | Ditto
Data Integration   | WDC Computers-small  | F1 (%) | 80.76 | Ditto
Data Integration   | WDC Watches-xlarge   | F1 (%) | 96.53 | Ditto
Entity Resolution  | Abt-Buy              | F1 (%) | 89.33 | Ditto
Entity Resolution  | WDC Computers-xlarge | F1 (%) | 95.45 | Ditto
Entity Resolution  | WDC Watches-small    | F1 (%) | 85.12 | Ditto
Entity Resolution  | Amazon-Google        | F1 (%) | 75.58 | Ditto
Entity Resolution  | WDC Computers-small  | F1 (%) | 80.76 | Ditto
Entity Resolution  | WDC Watches-xlarge   | F1 (%) | 96.53 | Ditto

Related Papers

Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Data Augmentation in Time Series Forecasting through Inverted Framework (2025-07-15)
Iceberg: Enhancing HLS Modeling with Synthetic Data (2025-07-14)
AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) (2025-07-13)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation (2025-07-08)