Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Entity Matching using Large Language Models

Ralph Peeters, Aaron Steiner, Christian Bizer

2023-10-17 · Entity Resolution · Data Integration

Paper · PDF · Code (official)

Abstract

Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity. It is a central step in most data integration pipelines. Many state-of-the-art entity matching methods rely on pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks of these models for entity matching are that (i) they require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust to out-of-distribution entities. This paper investigates using generative large language models (LLMs) as an alternative to PLM-based matchers that depends less on task-specific training data and is more robust. The study covers hosted LLMs as well as open-source LLMs that can be run locally. We evaluate these models in a zero-shot scenario and in a scenario where task-specific training data is available. We compare different prompt designs and the prompt sensitivity of the models, and show that there is no single best prompt: the prompt needs to be tuned for each model/dataset combination. We further investigate (i) the selection of in-context demonstrations, (ii) the generation of matching rules, and (iii) fine-tuning LLMs using the same pool of training data. Our experiments show that the best LLMs require no or only a few training examples to perform comparably to PLMs that were fine-tuned using thousands of examples. LLM-based matchers further exhibit higher robustness to unseen entities. We show that GPT-4 can generate structured explanations for matching decisions and can automatically identify potential causes of matching errors by analyzing explanations of wrong decisions. We demonstrate that the model can generate meaningful textual descriptions of the identified error classes, which can help data engineers improve entity matching pipelines.
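The zero-shot setup described in the abstract can be sketched in a few lines: serialize two entity descriptions into a prompt, send it to an LLM, and map the free-text answer to a binary matching decision. The prompt wording and helper names below are illustrative assumptions, not the exact prompts evaluated in the paper, and the actual model call is left out.

```python
def build_matching_prompt(entity_a: str, entity_b: str) -> str:
    """Compose a zero-shot prompt asking an LLM whether two entity
    descriptions refer to the same real-world entity.

    NOTE: this wording is a hypothetical example; the paper compares
    several prompt designs and finds no single best one.
    """
    return (
        "Do the following two entity descriptions refer to the same "
        "real-world product? Answer with 'Yes' or 'No'.\n\n"
        f"Entity 1: {entity_a}\n"
        f"Entity 2: {entity_b}\n"
        "Answer:"
    )


def parse_matching_answer(answer: str) -> bool:
    """Map a free-text LLM answer to a binary matching decision."""
    return answer.strip().lower().startswith("yes")
```

In a few-shot variant, in-context demonstrations (serialized entity pairs with gold labels) would simply be prepended to the same prompt; the paper studies how to select those demonstrations.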

Results

Task              | Dataset                       | Metric | Value | Model
------------------|-------------------------------|--------|-------|--------------------
Data Integration  | Abt-Buy                       | F1 (%) | 95.78 | gpt4-0613_zeroshot
Data Integration  | WDC Products-80%cc-seen-medium | F1 (%) | 89.61 | gpt4-0613_zeroshot
Data Integration  | Amazon-Google                 | F1 (%) | 85.21 | gpt4-0613_fewshot-10
Entity Resolution | Abt-Buy                       | F1 (%) | 95.78 | gpt4-0613_zeroshot
Entity Resolution | WDC Products-80%cc-seen-medium | F1 (%) | 89.61 | gpt4-0613_zeroshot
Entity Resolution | Amazon-Google                 | F1 (%) | 85.21 | gpt4-0613_fewshot-10
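All values above are F1 scores over binary matching decisions. As a reminder of the metric (the standard definition, not anything specific to this paper), F1 is the harmonic mean of precision and recall over the predicted matches:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 over binary matching decisions.

    tp: predicted match, actually a match
    fp: predicted match, actually a non-match
    fn: predicted non-match, actually a match
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Equivalently, F1 = 2·TP / (2·TP + FP + FN); true non-matches do not enter the score, which matters because entity matching datasets are heavily skewed toward non-matching pairs.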

Related Papers

- From Classical Machine Learning to Emerging Foundation Models: Review on Multimodal Data Integration for Cancer Research (2025-07-11)
- Empowering Digital Agriculture: A Privacy-Preserving Framework for Data Sharing and Collaborative Research (2025-06-25)
- Intelligent Operation and Maintenance and Prediction Model Optimization for Improving Wind Power Generation Efficiency (2025-06-19)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs (2025-06-17)
- Brain Imaging Foundation Models, Are We There Yet? A Systematic Review of Foundation Models for Brain Imaging and Biomedical Research (2025-06-16)
- Leveraging MIMIC Datasets for Better Digital Health: A Review on Open Problems, Progress Highlights, and Future Promises (2025-06-15)
- Enhancing Bagging Ensemble Regression with Data Integration for Time Series-Based Diabetes Prediction (2025-06-11)
- scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data (2025-06-10)