Generating Datasets with Pretrained Language Models

Timo Schick, Hinrich Schütze

2021-04-15 | EMNLP 2021 11 | Sentence Embeddings | Semantic Textual Similarity

Abstract

To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.
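
The generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the instruction templates, the label values, the `build_dataset` helper, and the `fake_generate` stand-in for a large PLM are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical instruction templates, one per similarity label.
# The wording and the label values are illustrative assumptions.
TEMPLATES = {
    1.0: 'Task: Write two sentences that mean the same thing.\n'
         'Sentence 1: "{x1}"\nSentence 2: "',
    0.5: 'Task: Write two sentences that are somewhat similar.\n'
         'Sentence 1: "{x1}"\nSentence 2: "',
    0.0: 'Task: Write two sentences that are on completely different topics.\n'
         'Sentence 1: "{x1}"\nSentence 2: "',
}


def build_dataset(
    inputs: List[str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str, float]]:
    """Create (sentence 1, sentence 2, label) triples from unlabeled inputs.

    `generate` is any function that maps a prompt to a text continuation,
    for example a wrapper around a large pretrained language model.
    """
    dataset = []
    for x1 in inputs:
        for label, template in TEMPLATES.items():
            continuation = generate(template.format(x1=x1))
            # Keep only the text before the closing quotation mark.
            x2 = continuation.split('"')[0].strip()
            if x2:  # skip empty generations
                dataset.append((x1, x2, label))
    return dataset


def fake_generate(prompt: str) -> str:
    """Stand-in for a real PLM so the sketch runs without any model."""
    return 'A placeholder sentence." trailing text that is discarded'


pairs = build_dataset(["A man is playing guitar."], fake_generate)
for x1, x2, label in pairs:
    print(label, "|", x1, "|", x2)
```

The resulting triples would then serve as training data for finetuning a smaller sentence-embedding model, which is the second step the abstract describes.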

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Textual Similarity | STS14 | Spearman Correlation | 0.7125 | Dino (STSb/̄🦕) |
| Semantic Textual Similarity | STS15 | Spearman Correlation | 0.8049 | Dino (STSb/) |
| Semantic Textual Similarity | SICK | Spearman Correlation | 0.7426 | Dino (STS/̄🦕) |
| Semantic Textual Similarity | SICK | Spearman Correlation | 0.6809 | Dino (STSb/̄🦕) |
| Semantic Textual Similarity | STS13 | Spearman Correlation | 0.8126 | Dino (STSb/̄🦕) |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.7782 | Dino (STSb/̄🦕) |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.7651 | Dino (STS/̄🦕) |
| Semantic Textual Similarity | STS12 | Spearman Correlation | 0.7027 | Dino (STSb/̄🦕) |
| Semantic Textual Similarity | STS16 | Spearman Correlation | 0.7718 | Dino (STSb/̄🦕) |
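
The metric reported above, Spearman correlation, is typically computed between the cosine similarities of sentence-embedding pairs and human similarity ratings. The following is a minimal pure-Python sketch of that computation; the toy vectors and gold scores are made-up illustrative values, and the helper names (`cosine`, `average_ranks`, `spearman`) are assumptions rather than anything from the paper's code.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy data: three embedding pairs and made-up gold similarity ratings.
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold_scores = [5.0, 2.5, 0.0]

predicted = [cosine(u, v) for u, v in embedding_pairs]
print(spearman(predicted, gold_scores))
```

Because the metric depends only on rank order, a model scores well whenever it orders sentence pairs the same way the human ratings do, regardless of the absolute scale of its similarity values.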

Related Papers

From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment (2025-07-20)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression (2025-07-08)
FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection (2025-07-06)
LineRetriever: Planning-Aware Observation Reduction for Web Agents (2025-06-30)
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval (2025-06-26)
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning (2025-06-26)
Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation (2025-06-25)