SOSD

Searching on Sorted Data

CC0 · Introduced 2019-08-01

SOSD is a collection of datasets for benchmarking the lookup performance of learned indexes.

SOSD currently includes eight datasets. Each dataset consists of 200 million 64-bit unsigned integers (keys) with very few duplicates, if any: amzn represents book sale popularity data. face is an upsampled version of a Facebook user ID dataset. logn and norm are drawn from a lognormal (0, 2) and a normal distribution, respectively. osmc consists of uniformly sampled OpenStreetMap locations represented as Google S2 CellIds. uden consists of dense integers. uspr consists of uniformly distributed sparse integers. wiki consists of Wikipedia article edit timestamps.
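
To make the synthetic datasets and the lookup task concrete, here is a minimal Python sketch. It is not the SOSD generator: the helper names `make_keys` and `lower_bound`, the seed, the small key counts, and the way floating-point samples are stretched onto the 64-bit range are all assumptions made for illustration, as is reading "(0, 2)" as (mu, sigma). It builds small sorted key sets resembling logn, norm, uden, and uspr, and uses binary search as the baseline lookup that a learned index would be compared against.

```python
import bisect
import random

U64_MAX = 2**64 - 1


def make_keys(name, n, seed=42):
    """Return n sorted keys in [0, 2**64 - 1] (hypothetical helper, illustrative scaling)."""
    rng = random.Random(seed)
    if name == "uden":
        # Dense integers: 0, 1, ..., n - 1.
        return list(range(n))
    if name == "uspr":
        # Sparse integers drawn uniformly from the whole 64-bit range.
        return sorted(rng.randrange(U64_MAX + 1) for _ in range(n))
    if name == "logn":
        # Assumption: "(0, 2)" is read as mu = 0, sigma = 2.
        samples = [rng.lognormvariate(0.0, 2.0) for _ in range(n)]
    elif name == "norm":
        # Assumption: standard normal; the source does not state parameters.
        samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    else:
        raise ValueError(f"unknown dataset name: {name}")
    # Assumption: stretch the samples linearly onto the 64-bit key range.
    lo, hi = min(samples), max(samples)
    scale = U64_MAX / (hi - lo)
    return sorted(min(U64_MAX, int((s - lo) * scale)) for s in samples)


def lower_bound(keys, key):
    """Index of the first element >= key, or len(keys) if there is none."""
    return bisect.bisect_left(keys, key)
```

A lookup benchmark in this spirit would call `lower_bound(keys, k)` for many keys `k` and time the calls; a learned index replaces the full binary search with a model prediction followed by a short local search.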

In addition, there are 32-bit versions of all datasets (except osmc and wiki) with similar CDFs. We use different parameters, (0, 1), for logn in the 32-bit case to reduce the number of duplicates.
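
The effect of the lognormal parameters on duplicates in a narrower key space can be illustrated with a small experiment. This is only a sketch under stated assumptions, not the procedure used to build SOSD: the function name `count_duplicates`, the sample size, the seed, the reading of the second parameter as sigma, and the scaling (the largest sample is mapped to the largest key) are all hypothetical. With a heavier tail, most samples are squeezed into a small part of the key range, so more of them collide after truncation to integers.

```python
import random


def count_duplicates(sigma, bits=32, n=100_000, seed=7):
    """Count keys that repeat an earlier key after scaling lognormal(0, sigma) samples."""
    rng = random.Random(seed)
    samples = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
    max_key = 2**bits - 1
    # Assumption: map the largest sample to the largest representable key.
    scale = max_key / max(samples)
    keys = [min(max_key, int(s * scale)) for s in samples]
    # Every key beyond the first occurrence of a value counts as a duplicate.
    return n - len(set(keys))
```

Under these assumptions, comparing `count_duplicates(2.0)` with `count_duplicates(1.0)` shows more collisions for the (0, 2) setting in 32 bits, which is consistent with the choice of (0, 1) described above.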