Niklas Muennighoff
Decoder transformers have continued increasing in scale, reaching hundreds of billions of parameters. Due to their scale, the same decoder sets state-of-the-art results on various language tasks via prompting or fine-tuning. Yet, these large foundation models remain unusable for the related fields of semantic search and sentence embeddings. This prevents possible new state-of-the-art results and forces organizations to train and maintain separate models. To this end, we propose SGPT to use decoders for sentence embeddings and semantic search via prompting or fine-tuning. At 5.8 billion parameters, SGPT improves on the previously best sentence embeddings by a margin of 7% and outperforms a concurrent method with 175 billion parameters, as measured on the BEIR search benchmark. Code, models, and result files are freely available at https://github.com/Muennighoff/sgpt.
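The bi-encoder variant (SGPT-BE) derives a sentence embedding by pooling the decoder's last hidden states, weighting later positions more heavily since, with causal attention, later tokens have seen more context. A minimal sketch of such position-weighted mean pooling in plain Python (dummy inputs, no model library; the function name is illustrative, not from the SGPT codebase):

```python
def position_weighted_mean(hidden_states, attention_mask):
    """Pool token vectors into one sentence vector.

    hidden_states: list of token vectors (each a list of floats),
                   one per sequence position.
    attention_mask: list of 0/1 flags; padding positions get weight 0.
    Weight of position i is proportional to (i + 1), so later
    (more context-aware) tokens contribute more.
    """
    dim = len(hidden_states[0])
    weights = [(i + 1) * m for i, m in enumerate(attention_mask)]
    total = sum(weights)
    pooled = [0.0] * dim
    for w, vec in zip(weights, hidden_states):
        if w == 0:
            continue  # skip padding tokens
        for d in range(dim):
            pooled[d] += w * vec[d] / total
    return pooled


# Toy example: 3 positions, 2 dimensions, last position is padding.
hidden = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [1, 1, 0]
print(position_weighted_mean(hidden, mask))  # → [0.333..., 0.666...]
```

In practice the hidden states would come from a GPT-style decoder, and the pooled vectors would be compared with cosine similarity for search.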
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | HotpotQA (BEIR) | nDCG@10 | 0.699 | SGPT-CE-6.1B |
| Question Answering | HotpotQA (BEIR) | nDCG@10 | 0.593 | SGPT-BE-5.8B |
| Question Answering | NQ (BEIR) | nDCG@10 | 0.524 | SGPT-BE-5.8B |
| Question Answering | NQ (BEIR) | nDCG@10 | 0.401 | SGPT-CE-6.1B |
| Question Answering | FiQA-2018 (BEIR) | nDCG@10 | 0.401 | SGPT-CE-6.1B |
| Question Answering | FiQA-2018 (BEIR) | nDCG@10 | 0.372 | SGPT-BE-5.8B |
| Information Retrieval | CQADupStack | mAP@100 | 0.16 | SGPT-BE-5.8B |
| Information Retrieval | MSMARCO (BEIR) | nDCG@10 | 0.399 | SGPT-BE-5.8B |
| Information Retrieval | MSMARCO (BEIR) | nDCG@10 | 0.29 | SGPT-CE-6.1B |
| Information Retrieval | MSMARCO (BEIR) | nDCG@10 | 0.278 | SGPT-CE-2.7B |
| Biomedical Information Retrieval | NFCorpus (BEIR) | nDCG@10 | 0.362 | SGPT-BE-5.8B |
| Biomedical Information Retrieval | NFCorpus (BEIR) | nDCG@10 | 0.358 | OpenAI Search-Davinci |
| Biomedical Information Retrieval | NFCorpus (BEIR) | nDCG@10 | 0.347 | SGPT-CE-6.1B |
| Biomedical Information Retrieval | NFCorpus (BEIR) | nDCG@10 | 0.333 | SGPT-CE-2.7B |
| Biomedical Information Retrieval | BioASQ (BEIR) | nDCG@10 | 0.547 | SGPT-CE-6.1B |
| Biomedical Information Retrieval | BioASQ (BEIR) | nDCG@10 | 0.546 | SGPT-CE-2.7B |
| Biomedical Information Retrieval | BioASQ (BEIR) | nDCG@10 | 0.413 | SGPT-BE-5.8B |
| Biomedical Information Retrieval | TREC-COVID (BEIR) | nDCG@10 | 0.873 | SGPT-BE-5.8B |
| Biomedical Information Retrieval | TREC-COVID (BEIR) | nDCG@10 | 0.791 | SGPT-CE-6.1B |
| Biomedical Information Retrieval | TREC-COVID (BEIR) | nDCG@10 | 0.762 | SGPT-CE-2.7B |
| Fact Checking | CLIMATE-FEVER (BEIR) | nDCG@10 | 0.305 | SGPT-BE-5.8B |
| Fact Checking | CLIMATE-FEVER (BEIR) | nDCG@10 | 0.161 | SGPT-CE-6.1B |
| Fact Checking | FEVER (BEIR) | nDCG@10 | 0.783 | SGPT-BE-5.8B |
| Fact Checking | FEVER (BEIR) | nDCG@10 | 0.725 | SGPT-CE-6.1B |
| Fact Checking | SciFact (BEIR) | nDCG@10 | 0.747 | SGPT-BE-5.8B |
| Fact Checking | SciFact (BEIR) | nDCG@10 | 0.682 | SGPT-CE-6.1B |
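Most rows above report nDCG@10, which scores the top 10 ranked documents by graded relevance with a logarithmic position discount, normalized by the best possible ranking. A short reference implementation (standard formula, not code from the SGPT repository):

```python
import math


def ndcg_at_k(relevances, k):
    """nDCG@k for one query.

    relevances: graded relevance of each retrieved document,
                in the order the system ranked them.
    Returns DCG@k divided by the ideal DCG@k (IDCG), so a
    perfect ranking scores 1.0.
    """
    def dcg(rels):
        # Position 1 is discounted by log2(2) = 1, position 2 by log2(3), ...
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Worked example: a ranking with graded relevances 0-3.
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2], 10), 4))  # → 0.9608
```

A table value such as 0.873 on TREC-COVID is this quantity averaged over all queries in the dataset.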