Urdu MsMarco

Texts · CC BY-SA · Introduced 2024-12-17

This dataset is an Urdu translation of the MS MARCO dataset, making it the first large-scale Urdu IR dataset.

Dataset Details: The MS MARCO dataset is formed by a collection of 8.8M passages, approximately 530k queries, and at least one relevant passage per query, which were selected by humans. The development set of MS MARCO comprises more than 100k queries. However, a smaller set of 6,980 queries is used for evaluation in most published works.

The triples file (triples.train.small.urdu.tsv) is around 47 GB and is split into 5 parts, named as triple file parts aa to ae. Download all five parts and combine them, in order from aa to ae, into a single file before you start working.
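One way to combine the parts is sketched below in Python. This is a minimal example under stated assumptions, not the dataset authors' procedure: the helper name combine_parts is made up for illustration, and the part file names in the usage comment are hypothetical, so substitute the names of the files you actually downloaded. It also assumes the parts are plain byte-wise splits of the original file, so that concatenating them in order restores it.

```python
import shutil
from pathlib import Path


def combine_parts(part_paths, out_path, chunk_size=16 * 1024 * 1024):
    """Concatenate split files, in the order given, into one output file.

    `combine_parts` is an illustrative helper, not part of any dataset tooling.
    """
    out_path = Path(out_path)
    with out_path.open("wb") as out:
        for part in part_paths:
            with Path(part).open("rb") as src:
                # Copy in fixed-size chunks so the 47 GB file is never
                # held in memory all at once.
                shutil.copyfileobj(src, out, chunk_size)
    return out_path


# Usage sketch. The part names below are hypothetical placeholders:
#
#   parts = ["part_aa", "part_ab", "part_ac", "part_ad", "part_ae"]
#   combine_parts(parts, "triples.train.small.urdu.tsv")
```

The order of the list matters: passing the parts out of sequence would produce a file whose lines are broken at the part boundaries. Make sure at least 47 GB of additional disk space is free for the combined file.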

This dataset was created using the IndicTrans2 translation model.