Generating Datasets with Pretrained Language Models

Timo Schick, Hinrich Schütze

2021-04-15 | EMNLP 2021 11 | Sentence Embeddings | Semantic Textual Similarity

Abstract

To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.
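The data-generation idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the instruction wording, the three label values, the helper names, and the `generate` callable (a stand-in for any large PLM's text-continuation function) are all assumptions introduced here. For simplicity the first sentence of each pair is passed in rather than generated.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical instruction templates, one per similarity label.
# Wording and label values are assumptions made for illustration only.
INSTRUCTIONS: Dict[float, str] = {
    1.0: "Task: Write two sentences that mean the same thing.",
    0.5: "Task: Write two sentences that are somewhat similar.",
    0.0: "Task: Write two sentences that are on completely different topics.",
}


def build_prompt(instruction: str, first_sentence: str) -> str:
    """Build a prompt that ends inside an open quote, so the PLM's
    continuation is the second sentence of the pair."""
    return f'{instruction}\nSentence 1: "{first_sentence}"\nSentence 2: "'


def generate_pairs(
    first_sentences: Iterable[str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str, float]]:
    """Return (sentence_1, sentence_2, label) triples.

    `generate` is a hypothetical callable mapping a prompt to the
    model's raw continuation; plug in any PLM wrapper of your choice.
    """
    dataset = []
    for x1 in first_sentences:
        for label, instruction in INSTRUCTIONS.items():
            continuation = generate(build_prompt(instruction, x1))
            # Keep only the text before the closing quotation mark.
            x2 = continuation.split('"')[0].strip()
            if x2:  # skip empty generations
                dataset.append((x1, x2, label))
    return dataset


def dummy_generate(prompt: str) -> str:
    """Placeholder generator so the sketch runs without a model."""
    return 'A placeholder second sentence." trailing text'


pairs = generate_pairs(["A man plays guitar."], dummy_generate)
```

The resulting triples have the shape of a labeled text-pair dataset, which could then be used to finetune a smaller sentence-embedding model with a standard regression-style similarity loss.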

Results

Task	Dataset	Metric	Value	Model
Semantic Textual Similarity	STS14	Spearman Correlation	0.7125	Dino (STSb/̄🦕)
Semantic Textual Similarity	STS15	Spearman Correlation	0.8049	Dino (STSb/)
Semantic Textual Similarity	SICK	Spearman Correlation	0.7426	Dino (STS/̄🦕)
Semantic Textual Similarity	SICK	Spearman Correlation	0.6809	Dino (STSb/̄🦕)
Semantic Textual Similarity	STS13	Spearman Correlation	0.8126	Dino (STSb/̄🦕)
Semantic Textual Similarity	STS Benchmark	Spearman Correlation	0.7782	Dino (STSb/̄🦕)
Semantic Textual Similarity	STS Benchmark	Spearman Correlation	0.7651	Dino (STS/̄🦕)
Semantic Textual Similarity	STS12	Spearman Correlation	0.7027	Dino (STSb/̄🦕)
Semantic Textual Similarity	STS16	Spearman Correlation	0.7718	Dino (STSb/̄🦕)
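
The metric in the table, Spearman correlation, is a rank correlation between a model's predicted similarity scores and human gold scores. The sketch below shows one common way to compute it for sentence embeddings, assuming cosine similarity as the scoring function (the table itself does not state the scoring function). The helper names and the toy inputs in the usage note are illustrative.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either
    input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In use, one would compute `cosine(emb_a, emb_b)` for every sentence pair in a test set and pass the list of predicted scores together with the list of gold scores to `spearman`. Because only the ranking matters, the predicted scores do not need to be on the same scale as the gold annotations.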
