Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Text and Code Embeddings by Contrastive Pre-Training

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, Lilian Weng

2022-01-24 · Passage Ranking · Natural Questions · Text Similarity · Linear-Probe Classification · TriviaQA · Code Search · Zero-shot Text Search

Abstract

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on the MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.
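The abstract describes contrastive pre-training of embedding models on paired data. A common formulation of such an objective is a symmetric InfoNCE loss with in-batch negatives, sketched below in NumPy; the function name, batch shapes, and temperature value are illustrative, not the paper's actual training code:

```python
import numpy as np

def info_nce_loss(query_emb, pos_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (query, positive) embedding pairs.

    Each query's positive is the same-index row of pos_emb; all other rows
    in the batch act as negatives ("in-batch negatives").
    """
    # L2-normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = q @ p.T / temperature      # (batch, batch) similarity matrix
    diag = np.arange(len(q))            # matching pairs sit on the diagonal

    def cross_entropy(l):
        # Row-wise softmax cross-entropy against the diagonal labels,
        # with the usual max-subtraction for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # Average the query->positive and positive->query directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss pulls each query toward its paired text or code snippet and pushes it away from the rest of the batch, which is what makes the resulting embeddings usable for semantic search via plain cosine similarity.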

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Code Search | CodeSearchNet | Go | 97.5 | cpt-code M |
| Code Search | CodeSearchNet | JS | 86.5 | cpt-code M |
| Code Search | CodeSearchNet | Java | 94.4 | cpt-code M |
| Code Search | CodeSearchNet | Overall | 93.5 | cpt-code M |
| Code Search | CodeSearchNet | PHP | 97.2 | cpt-code M |
| Code Search | CodeSearchNet | Python | 99.9 | cpt-code M |
| Code Search | CodeSearchNet | Ruby | 85.5 | cpt-code M |
| Code Search | CodeSearchNet | Go | 97.7 | cpt-code S |
| Code Search | CodeSearchNet | JS | 86 | cpt-code S |
| Code Search | CodeSearchNet | Java | 94 | cpt-code S |
| Code Search | CodeSearchNet | Overall | 93.4 | cpt-code S |
| Code Search | CodeSearchNet | PHP | 96.7 | cpt-code S |
| Code Search | CodeSearchNet | Python | 99.8 | cpt-code S |
| Code Search | CodeSearchNet | Ruby | 86.3 | cpt-code S |
| Passage Ranking | MS MARCO | MRR@10 | 44.3 | Fine-tuned SOTA |
| Passage Ranking | MS MARCO | MRR@10 | 22.7 | cpt-text XL |
| Passage Ranking | MS MARCO | MRR@10 | 21.5 | cpt-text L |
| Passage Ranking | MS MARCO | MRR@10 | 18.4 | BM25 |
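The MS MARCO rows above report MRR@10: for each query, the reciprocal of the rank of the first relevant passage within the top 10 results (0 if none appears), averaged over all queries. A minimal sketch of the metric; the function name and data layout are illustrative:

```python
def mrr_at_10(ranked_lists, relevant_ids):
    """MRR@10 over a set of queries.

    ranked_lists: per query, document IDs in ranked order.
    relevant_ids: per query, the set of relevant document IDs.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break                # queries with no hit in top 10 add 0
    return total / len(ranked_lists)
```

For example, a relevant passage at rank 2 for one query and no hit for a second query yields (0.5 + 0) / 2 = 0.25.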

Related Papers

- ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
- IRanker: Towards Ranking Foundation Model (2025-06-25)
- Constructing and Evaluating Declarative RAG Pipelines in PyTerrier (2025-06-12)
- Adding simple structure at inference improves Vision-Language Compositionality (2025-06-11)
- Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds (2025-06-03)
- MGS3: A Multi-Granularity Self-Supervised Code Search Framework (2025-05-30)
- DeepRTL2: A Versatile Model for RTL-Related Tasks (2025-05-28)
- GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models (2025-05-26)