Cross-lingual paraphrase identification
Inessa Fedorova, Aleksei Musatow
Abstract
The paraphrase identification task involves measuring the semantic similarity between two short sentences. It is a difficult task, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach also lets us use the model-produced embeddings for other tasks, such as semantic search. We evaluate our model on downstream tasks and additionally assess the quality of the embedding space. Its performance is comparable to that of state-of-the-art cross-encoders, with only a modest relative drop of 7-10% on the chosen dataset, while preserving decent embedding quality.
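The contrastive bi-encoder training described in the abstract can be sketched with an in-batch InfoNCE-style loss, where each paraphrase pair in the batch is a positive and all other pairings serve as negatives. The following is a minimal NumPy illustration under that assumption; the actual encoder, temperature, and training details of the paper are not specified here and this code is not the authors' implementation:

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarities between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def info_nce_loss(emb_a, emb_b, temperature=0.05):
    """In-batch contrastive loss: row i of emb_a and row i of emb_b
    form a paraphrase pair (positive); every (i, j) with j != i is a
    negative. Temperature is a hypothetical hyperparameter choice."""
    logits = cosine_sim_matrix(emb_a, emb_b) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (true pair) as the target class
    return -np.mean(np.diag(log_probs))
```

With orthogonal toy embeddings, the loss is near zero when positives are aligned and grows when a negative scores higher than the positive, which is the pressure that pushes paraphrases together and non-paraphrases apart in the embedding space.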