Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs

Doohee You, Samuel Fraiberger

2024-10-02

Semantic Similarity, Semantic Textual Similarity

Abstract

This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore several pairing methods in combination with established distance measures (Levenshtein distance, cosine similarity) and an SBERT model for semantic evaluation. Across these methods, the observed semantic similarity suggests that duplicates are likely rare in the dataset. For a more conclusive assessment, we also evaluate against a human-annotated ground-truth set; the results support the findings from the NLP- and LLM-based distance metrics.
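
As an illustration of the two classical measures named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes exhaustive all-pairs comparison, whitespace tokenization with lowercasing for the cosine measure, and a hypothetical threshold of 0.9; the function names are invented for this example. The SBERT step is omitted because it depends on an external model.

```python
import math
from collections import Counter
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words count vectors (assumed tokenization)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def candidate_duplicates(titles, threshold=0.9):
    """Return index pairs whose similarity on either measure meets the
    (hypothetical) threshold. All-pairs comparison is quadratic in the
    number of titles, so this is only practical for small samples."""
    pairs = []
    for i, j in combinations(range(len(titles)), 2):
        lev = levenshtein_similarity(titles[i], titles[j])
        cos = cosine_similarity(titles[i], titles[j])
        if lev >= threshold or cos >= threshold:
            pairs.append((i, j))
    return pairs
```

On a large dataset, a blocking or other pairing strategy would be needed to avoid comparing every title with every other one; which pairing methods the study uses is not specified in the abstract.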
