address_parser_data

Introduced 2024-04-08

This collection provides three versions of the data:

  • V0: the original sampled addresses with no augmentation.
  • V1: V0 with basic cleaning and address-structure masking applied (address parts are rearranged or removed, but no augmented parts are added).
  • V2: V0 with more advanced augmentation techniques applied (see the paper for details).
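
As a rough illustration of what "rearrangement or removal" could mean for V1, here is a minimal Python sketch. It is not the procedure used to build the datasets: the function name `mask_structure`, the `(label, text)` representation, and the `drop_prob` / `shuffle_prob` parameters are all assumptions made for this example.

```python
import random

def mask_structure(parts, drop_prob=0.3, shuffle_prob=0.5, rng=None):
    """Return a masked copy of `parts`, a list of (label, text) tuples.

    Hypothetical sketch: parts may be removed or reordered,
    but no new parts are ever added.
    """
    rng = rng or random.Random()
    if not parts:
        return []
    # Removal: drop each part independently with probability drop_prob.
    kept = [p for p in parts if rng.random() >= drop_prob]
    if not kept:
        # Keep at least one part so the address is never empty.
        kept = [rng.choice(parts)]
    # Rearrangement: sometimes shuffle the order of the remaining parts.
    if rng.random() < shuffle_prob:
        rng.shuffle(kept)
    return kept

# Illustrative input only; not taken from the datasets.
address = [
    ("house_number", "221B"),
    ("road", "Baker Street"),
    ("city", "London"),
    ("postcode", "NW1 6XE"),
]
masked = mask_structure(address, rng=random.Random(0))
print(" ".join(text for _, text in masked))
```

Because labels travel with their text, the masked output stays a valid labelled example for a parser, just with a different structure.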

For each version we have three types of dataset:

  • Train: around 3M datapoints used for training
  • Test: around 100k datapoints used for testing
  • Zero-shot: around 300k datapoints used for validation, extracted from addresses in countries that do not appear in the train or test sets
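
The zero-shot idea, holding out whole countries so that none of them appear in train or test, can be sketched as below. This is only an assumed illustration: the `country` field, the `split_zero_shot` helper, and the sample records are hypothetical and do not describe the actual file layout.

```python
def split_zero_shot(records, held_out_countries):
    """Partition records into (seen, zero_shot) by country.

    Hypothetical sketch: every record whose country is held out goes to
    the zero-shot set; all other records remain available for train/test.
    """
    held_out = set(held_out_countries)
    seen, zero_shot = [], []
    for rec in records:
        if rec["country"] in held_out:
            zero_shot.append(rec)
        else:
            seen.append(rec)
    return seen, zero_shot

# Illustrative records only; not taken from the datasets.
records = [
    {"country": "CA", "address": "350 rue des Lilas Ouest Quebec"},
    {"country": "US", "address": "1600 Pennsylvania Avenue NW Washington"},
    {"country": "IE", "address": "1 Grafton Street Dublin"},
]
seen, zero_shot = split_zero_shot(records, held_out_countries=["IE"])
```

Splitting on country rather than on individual addresses is what makes the evaluation zero-shot: no address format from a held-out country can leak into training.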