BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych

2021-04-17

Question Answering, News Retrieval, Benchmarking, Text Retrieval, Duplicate-Question Retrieval, Argument Retrieval, Fact Checking, Entity Retrieval, Passage Retrieval, Tweet Retrieval, Information Retrieval, Biomedical Information Retrieval, Re-Ranking, Retrieval, Citation Prediction

Abstract

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to help researchers broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems, including lexical, sparse, dense, late-interaction and re-ranking architectures, on the BEIR benchmark. Our results show that BM25 is a robust baseline and that re-ranking and late-interaction-based models achieve the best zero-shot performance on average, but at high computational cost. In contrast, dense and sparse retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards more robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.

Results

Task	Dataset	Metric	Value	Model
Question Answering	HotpotQA (BEIR)	nDCG@10	0.707	BM25+CE
Question Answering	NQ (BEIR)	nDCG@10	0.533	BM25+CE
Question Answering	NQ (BEIR)	nDCG@10	0.524	ColBERT
Question Answering	FiQA-2018 (BEIR)	nDCG@10	0.347	BM25+CE
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.413	BM25+CE
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.408	TAS-b
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.401	ColBERT
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.388	ANCE
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.351	SPARTA
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.338	docT5query
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.296	DeepCT
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.228	BM25
Information Retrieval	MSMARCO (BEIR)	nDCG@10	0.177	DPR
Biomedical Information Retrieval	NFCorpus (BEIR)	nDCG@10	0.35	BM25+CE
Biomedical Information Retrieval	NFCorpus (BEIR)	nDCG@10	0.305	ColBERT
Biomedical Information Retrieval	BioASQ (BEIR)	nDCG@10	0.523	BM25+CE
Biomedical Information Retrieval	BioASQ (BEIR)	nDCG@10	0.514	BM25
Biomedical Information Retrieval	TREC-COVID (BEIR)	nDCG@10	0.757	BM25+CE
Biomedical Information Retrieval	TREC-COVID (BEIR)	nDCG@10	0.677	ColBERT
Fact Checking	CLIMATE-FEVER (BEIR)	nDCG@10	0.253	BM25+CE
Fact Checking	FEVER (BEIR)	nDCG@10	0.819	BM25+CE
Fact Checking	SciFact (BEIR)	nDCG@10	0.688	BM25+CE
Fact Checking	SciFact (BEIR)	nDCG@10	0.671	ColBERT
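
Every value in the table above is an nDCG@10 score. As a rough illustration of what that metric measures for a single query, here is a minimal sketch. It assumes linear gain and a log2(rank + 1) discount, which is one common convention; this page does not state the exact variant behind the reported numbers. The function names and the toy document IDs are hypothetical and are not taken from the BEIR codebase.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions.

    Assumption: linear gain and a log2(rank + 1) discount with 1-based ranks.
    enumerate() is 0-based, so the discount is written as log2(i + 2).
    """
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_doc_ids: system output, best first.
    relevance: graded relevance judgments; documents not listed count as 0.
    """
    gains = [float(relevance.get(doc_id, 0)) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant judged documents first.
    ideal_gains = sorted((float(r) for r in relevance.values()), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    # Toy judgments and ranking, for illustration only.
    qrels = {"d1": 1, "d3": 1}
    ranking = ["d1", "d2", "d3"]
    print(round(ndcg_at_k(ranking, qrels, k=10), 4))
```

A dataset-level score such as those in the table would typically be the mean of this per-query value over all test queries.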
