Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines

Alexander Brinkmann, Roee Shraga, Christian Bizer

Published: 2023-03-06
Tags: Entity Resolution, Contrastive Learning, Blocking

Abstract

The goal of entity resolution is to identify records in multiple datasets that represent the same real-world entity. However, comparing all records across datasets can be computationally intensive, leading to long runtimes. To reduce these runtimes, entity resolution pipelines are constructed of two parts: a blocker that applies a computationally cheap method to select candidate record pairs, and a matcher that afterwards identifies matching pairs from this set using more expensive methods. This paper presents SC-Block, a blocking method that utilizes supervised contrastive learning for positioning records in the embedding space, and nearest neighbour search for candidate set building. We benchmark SC-Block against eight state-of-the-art blocking methods. In order to relate the training time of SC-Block to the reduction of the overall runtime of the entity resolution pipeline, we combine SC-Block with four matching methods into complete pipelines. For measuring the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher. The results show that SC-Block is able to create smaller candidate sets, and that pipelines with SC-Block execute 1.5 to 2 times faster than pipelines with other blockers, without sacrificing F1 score. Blockers are often evaluated using relatively small datasets, which might lead to runtime effects resulting from a large vocabulary size being overlooked. In order to measure runtimes in a more challenging setting, we introduce a new benchmark dataset that requires large numbers of product offers to be blocked. On this large-scale benchmark dataset, pipelines utilizing SC-Block and the best-performing matcher execute 8 times faster than pipelines utilizing another blocker with the same matcher, reducing the runtime from 2.5 hours to 18 minutes and clearly compensating for the 5 minutes required for training SC-Block.
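The blocking step described in the abstract (embed records, then build the candidate set by nearest neighbour search) and the pair completeness metric can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for embeddings that SC-Block would obtain from an encoder trained with supervised contrastive learning, and the use of cosine similarity, brute-force search, and the names `knn_candidates` and `pair_completeness` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_candidates(left, right, k):
    """Pair each left record with its k most similar right records.

    Brute-force search for clarity; a real pipeline would likely use an
    approximate nearest neighbour index instead.
    """
    candidates = set()
    for i, u in enumerate(left):
        ranked = sorted(
            range(len(right)),
            key=lambda j: cosine(u, right[j]),
            reverse=True,
        )
        candidates.update((i, j) for j in ranked[:k])
    return candidates


def pair_completeness(candidates, true_matches):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


# Hypothetical toy embeddings: in SC-Block these would come from the trained encoder.
left = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
right = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8], [-1.0, 0.0]]
true_matches = {(0, 0), (1, 1), (2, 2)}

cands = knn_candidates(left, right, k=1)
pc = pair_completeness(cands, true_matches)
print(len(cands), "of", len(left) * len(right), "pairs kept; pair completeness =", pc)
```

Increasing `k` trades a larger candidate set for higher pair completeness, which is the trade-off the evaluation fixes by requiring 99.5% pair completeness before handing candidates to the matcher.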

Results

The same results are listed under both the Data Integration and the Entity Resolution tasks.

| Dataset            | Model    | Candidate Set Size | Recall (%) |
|--------------------|----------|--------------------|------------|
| Abt-Buy            | SC-Block | 5000               | 99.5       |
| Abt-Buy            | BM25     | 8000               | 94.7       |
| Amazon-Google      | SC-Block | 11000              | 99.6       |
| Amazon-Google      | BM25     | 40000              | 98.7       |
| WDC Block - medium | SC-Block | 100000             | 91.9       |
| WDC Block - medium | BM25     | 500000             | 97.8       |
| WDC Block - large  | SC-Block | 5000000            | 89.5       |
| WDC Block - large  | BM25     | 20000000           | 95.5       |
| WDC Block - small  | BM25     | 250000             | -          |
| WDC Block - small  | SC-Block | 70000              | -          |

A "-" marks a value that is not listed for that dataset.
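
To make the candidate set sizes in the results above easier to compare, the following sketch computes how many times larger the BM25 candidate set is than the SC-Block one for each dataset. The sizes are copied from the listing; the dictionary layout and the name `size_ratios` are illustrative choices, and the ratio says nothing about recall, which differs between the two blockers on several datasets.

```python
# dataset -> (SC-Block candidate set size, BM25 candidate set size),
# values copied from the results listed above
sizes = {
    "Abt-Buy": (5000, 8000),
    "Amazon-Google": (11000, 40000),
    "WDC Block - small": (70000, 250000),
    "WDC Block - medium": (100000, 500000),
    "WDC Block - large": (5000000, 20000000),
}

# How many times larger the BM25 candidate set is than the SC-Block one
size_ratios = {name: bm25 / sc for name, (sc, bm25) in sizes.items()}

for name, ratio in size_ratios.items():
    print(f"{name}: BM25 candidate set is {ratio:.2f}x the SC-Block set")
```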

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
LLM-Driven Dual-Level Multi-Interest Modeling for Recommendation (2025-07-15)
Latent Space Consistency for Sparse-View CT Reconstruction (2025-07-15)
Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding (2025-07-13)