Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Text and Code Embeddings by Contrastive Pre-Training

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, Lilian Weng

2022-01-24 · Passage Ranking · Natural Questions · Text Similarity · Linear-Probe Classification · TriviaQA · Code Search · Zero-shot Text Search

Abstract

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on the MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.
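The abstract describes contrastive pre-training of embedding models on paired data. A common formulation of such an objective is a symmetric InfoNCE loss with in-batch negatives, sketched below in NumPy; the function name, batch shapes, and temperature value are illustrative, not the paper's actual training code:

```python
import numpy as np

def info_nce_loss(query_emb, pos_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (query, positive) embedding pairs.

    Each query's positive is the same-index row of pos_emb; all other rows
    in the batch act as negatives ("in-batch negatives").
    """
    # L2-normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = q @ p.T / temperature      # (batch, batch) similarity matrix
    diag = np.arange(len(q))            # matching pairs sit on the diagonal

    def cross_entropy(l):
        # Row-wise softmax cross-entropy against the diagonal labels,
        # with the usual max-subtraction for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # Average the query->positive and positive->query directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss pulls each query toward its paired text or code snippet and pushes it away from the rest of the batch, which is what makes the resulting embeddings usable for semantic search via plain cosine similarity.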

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Code Search | CodeSearchNet | Go | 97.5 | cpt-code M |
| Code Search | CodeSearchNet | JS | 86.5 | cpt-code M |
| Code Search | CodeSearchNet | Java | 94.4 | cpt-code M |
| Code Search | CodeSearchNet | Overall | 93.5 | cpt-code M |
| Code Search | CodeSearchNet | PHP | 97.2 | cpt-code M |
| Code Search | CodeSearchNet | Python | 99.9 | cpt-code M |
| Code Search | CodeSearchNet | Ruby | 85.5 | cpt-code M |
| Code Search | CodeSearchNet | Go | 97.7 | cpt-code S |
| Code Search | CodeSearchNet | JS | 86 | cpt-code S |
| Code Search | CodeSearchNet | Java | 94 | cpt-code S |
| Code Search | CodeSearchNet | Overall | 93.4 | cpt-code S |
| Code Search | CodeSearchNet | PHP | 96.7 | cpt-code S |
| Code Search | CodeSearchNet | Python | 99.8 | cpt-code S |
| Code Search | CodeSearchNet | Ruby | 86.3 | cpt-code S |
| Passage Ranking | MS MARCO | MRR@10 | 44.3 | Fine-tuned SOTA |
| Passage Ranking | MS MARCO | MRR@10 | 22.7 | cpt-text XL |
| Passage Ranking | MS MARCO | MRR@10 | 21.5 | cpt-text L |
| Passage Ranking | MS MARCO | MRR@10 | 18.4 | BM25 |
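The MS MARCO rows above report MRR@10: for each query, the reciprocal of the rank of the first relevant passage within the top 10 results (0 if none appears), averaged over all queries. A minimal sketch of the metric; the function name and data layout are illustrative:

```python
def mrr_at_10(ranked_lists, relevant_ids):
    """MRR@10 over a set of queries.

    ranked_lists: per query, document IDs in ranked order.
    relevant_ids: per query, the set of relevant document IDs.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break                # queries with no hit in top 10 add 0
    return total / len(ranked_lists)
```

For example, a relevant passage at rank 2 for one query and no hit for a second query yields (0.5 + 0) / 2 = 0.25.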

Related Papers

- ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
- IRanker: Towards Ranking Foundation Model (2025-06-25)
- Constructing and Evaluating Declarative RAG Pipelines in PyTerrier (2025-06-12)
- Adding simple structure at inference improves Vision-Language Compositionality (2025-06-11)
- Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds (2025-06-03)
- MGS3: A Multi-Granularity Self-Supervised Code Search Framework (2025-05-30)
- DeepRTL2: A Versatile Model for RTL-Related Tasks (2025-05-28)
- GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models (2025-05-26)