Alexander Brinkmann, Roee Shraga, Christian Bizer
The goal of entity resolution is to identify records in multiple datasets that represent the same real-world entity. However, comparing all records across datasets can be computationally intensive, leading to long runtimes. To reduce these runtimes, entity resolution pipelines consist of two parts: a blocker that applies a computationally cheap method to select candidate record pairs, and a matcher that afterwards identifies matching pairs from this set using more expensive methods. This paper presents SC-Block, a blocking method that uses supervised contrastive learning to position records in the embedding space and nearest neighbour search to build the candidate set. We benchmark SC-Block against eight state-of-the-art blocking methods. To relate the training time of SC-Block to the reduction of the overall runtime of the entity resolution pipeline, we combine SC-Block with four matching methods into complete pipelines. To measure the overall runtime, we determine candidate sets with 99.5% pair completeness and pass them to the matcher. The results show that SC-Block creates smaller candidate sets and that pipelines with SC-Block execute 1.5 to 2 times faster than pipelines with other blockers, without sacrificing F1 score. Blockers are often evaluated on relatively small datasets, which may cause runtime effects resulting from a large vocabulary size to be overlooked. To measure runtimes in a more challenging setting, we introduce a new benchmark dataset that requires large numbers of product offers to be blocked. On this large-scale benchmark dataset, pipelines using SC-Block and the best-performing matcher execute 8 times faster than pipelines using another blocker with the same matcher, reducing the runtime from 2.5 hours to 18 minutes and clearly compensating for the 5 minutes required for training SC-Block.
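To make the blocking step and the pair completeness measure concrete, the following is a minimal sketch, not the authors' implementation. It assumes record embeddings are already available (in SC-Block they would come from a model trained with supervised contrastive learning; here they are made-up toy vectors) and uses brute-force cosine similarity in place of any particular nearest neighbour index. The function names `build_candidate_set` and `pair_completeness` are hypothetical.

```python
import numpy as np


def build_candidate_set(emb_a, emb_b, k):
    """Pair each record of dataset A with its k nearest neighbours in dataset B.

    Neighbours are ranked by cosine similarity of the (assumed, precomputed)
    record embeddings. Returns a set of (index_in_a, index_in_b) pairs.
    """
    # Normalize rows so that the dot product equals cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T
    # Indices of the k most similar records in B for every record in A.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return {(i, int(j)) for i in range(top_k.shape[0]) for j in top_k[i]}


def pair_completeness(candidates, true_matches):
    """Fraction of true matching pairs that are kept in the candidate set."""
    if not true_matches:
        return 1.0
    return len(candidates & true_matches) / len(true_matches)


# Toy embeddings (illustrative values only, not taken from the paper).
emb_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
emb_b = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
true_matches = {(0, 0), (1, 1), (2, 1)}

# Increase k until the candidate set reaches the target pair completeness.
target = 0.995
for k in range(1, emb_b.shape[0] + 1):
    candidates = build_candidate_set(emb_a, emb_b, k)
    pc = pair_completeness(candidates, true_matches)
    print(f"k={k}: {len(candidates)} candidate pairs, pair completeness {pc:.3f}")
    if pc >= target:
        break
```

In this sketch, a larger `k` raises pair completeness at the cost of a larger candidate set, which is the trade-off the matcher runtime depends on. Embeddings that place matching records closer together reach the target completeness at a smaller `k`.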
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Data Integration | Abt-Buy | Candidate Set Size | 5000 | SC-Block |
| Data Integration | Abt-Buy | Recall | 99.5 | SC-Block |
| Data Integration | Abt-Buy | Candidate Set Size | 8000 | BM25 |
| Data Integration | Abt-Buy | Recall | 94.7 | BM25 |
| Data Integration | Amazon-Google | Candidate Set Size | 11000 | SC-Block |
| Data Integration | Amazon-Google | Recall | 99.6 | SC-Block |
| Data Integration | Amazon-Google | Candidate Set Size | 40000 | BM25 |
| Data Integration | Amazon-Google | Recall | 98.7 | BM25 |
| Data Integration | WDC Block - medium | Candidate Set Size | 100000 | SC-Block |
| Data Integration | WDC Block - medium | Recall | 91.9 | SC-Block |
| Data Integration | WDC Block - medium | Candidate Set Size | 500000 | BM25 |
| Data Integration | WDC Block - medium | Recall | 97.8 | BM25 |
| Data Integration | WDC Block - large | Candidate Set Size | 5000000 | SC-Block |
| Data Integration | WDC Block - large | Recall | 89.5 | SC-Block |
| Data Integration | WDC Block - large | Candidate Set Size | 20000000 | BM25 |
| Data Integration | WDC Block - large | Recall | 95.5 | BM25 |
| Data Integration | WDC Block - small | Candidate Set Size | 250000 | BM25 |
| Data Integration | WDC Block - small | Candidate Set Size | 70000 | SC-Block |
| Entity Resolution | Abt-Buy | Candidate Set Size | 5000 | SC-Block |
| Entity Resolution | Abt-Buy | Recall | 99.5 | SC-Block |
| Entity Resolution | Abt-Buy | Candidate Set Size | 8000 | BM25 |
| Entity Resolution | Abt-Buy | Recall | 94.7 | BM25 |
| Entity Resolution | Amazon-Google | Candidate Set Size | 11000 | SC-Block |
| Entity Resolution | Amazon-Google | Recall | 99.6 | SC-Block |
| Entity Resolution | Amazon-Google | Candidate Set Size | 40000 | BM25 |
| Entity Resolution | Amazon-Google | Recall | 98.7 | BM25 |
| Entity Resolution | WDC Block - medium | Candidate Set Size | 100000 | SC-Block |
| Entity Resolution | WDC Block - medium | Recall | 91.9 | SC-Block |
| Entity Resolution | WDC Block - medium | Candidate Set Size | 500000 | BM25 |
| Entity Resolution | WDC Block - medium | Recall | 97.8 | BM25 |
| Entity Resolution | WDC Block - large | Candidate Set Size | 5000000 | SC-Block |
| Entity Resolution | WDC Block - large | Recall | 89.5 | SC-Block |
| Entity Resolution | WDC Block - large | Candidate Set Size | 20000000 | BM25 |
| Entity Resolution | WDC Block - large | Recall | 95.5 | BM25 |
| Entity Resolution | WDC Block - small | Candidate Set Size | 250000 | BM25 |
| Entity Resolution | WDC Block - small | Candidate Set Size | 70000 | SC-Block |
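As a reading aid for the table above, the short sketch below computes how much larger the listed BM25 candidate sets are than the SC-Block ones. The numbers are copied from the table; the dictionary layout and the `size_ratio` helper are illustrative assumptions. Note that the listed recall values differ between the two blockers on some datasets, so these ratios are not a comparison at equal recall.

```python
# Candidate set sizes as listed in the table: (SC-Block, BM25).
candidate_set_sizes = {
    "Abt-Buy": (5_000, 8_000),
    "Amazon-Google": (11_000, 40_000),
    "WDC Block - small": (70_000, 250_000),
    "WDC Block - medium": (100_000, 500_000),
    "WDC Block - large": (5_000_000, 20_000_000),
}


def size_ratio(sc_block_size, bm25_size):
    """How many times larger the BM25 candidate set is than SC-Block's."""
    return bm25_size / sc_block_size


ratios = {
    dataset: size_ratio(*sizes) for dataset, sizes in candidate_set_sizes.items()
}

for dataset, ratio in ratios.items():
    print(f"{dataset}: BM25 candidate set is {ratio:.1f}x the size of SC-Block's")
```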