address_parser_data

Introduced 2024-04-08

This is a collection of datasets that provides the data in three versions:

V0: the original sampled addresses with no augmentation.
V1: Given V0, we apply basic cleaning and address-structure masking (rearranging or removing address parts, but never adding augmented ones).
V2: Given V0, we apply more advanced augmentation techniques (see the paper for more info).
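
As a rough illustration of the V1 idea (rearrangement or removal only, never addition), here is a minimal sketch. It is not the pipeline used to build this dataset: the function name `mask_address`, the `drop_prob` parameter, the component labels, and the sample address are all assumptions made for the example.

```python
import random

def mask_address(parts, rng, drop_prob=0.3):
    """Hypothetical V1-style masking: randomly drop and reorder address parts.

    parts: list of (label, text) tuples for one address.
    Nothing is ever added; the output is a reordered subset of the input.
    """
    # Removal: keep each part with probability 1 - drop_prob.
    kept = [p for p in parts if rng.random() >= drop_prob]
    if not kept:
        # Avoid producing an empty address.
        kept = [rng.choice(parts)]
    # Rearrangement: shuffle the surviving parts in place.
    rng.shuffle(kept)
    return kept

# Made-up example address with illustrative labels.
parts = [
    ("house_number", "221B"),
    ("road", "Baker Street"),
    ("city", "London"),
    ("postcode", "NW1 6XE"),
]
rng = random.Random(0)  # seeded for reproducibility
masked = mask_address(parts, rng)
print(" ".join(text for _, text in masked))
```

Because each part keeps its label through masking, the labels stay aligned with the text, which is what a parser trained on such data would need.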

For each version we have three types of dataset:

Train: around 3M datapoints used for training
Test: around 100k datapoints used for testing
Zero-shot: around 300k datapoints used for validation, extracted from addresses in countries not seen in the train and test sets
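
Three versions times three dataset types gives nine subsets in total. The sketch below simply enumerates them with the approximate sizes stated above. The key names (`train`, `test`, `zero_shot`) and the helper `list_subsets` are assumptions for illustration and do not necessarily match the actual file or configuration names.

```python
from itertools import product

VERSIONS = ("V0", "V1", "V2")

# Approximate sizes as stated in this README; key names are hypothetical.
SPLIT_SIZES = {
    "train": 3_000_000,
    "test": 100_000,
    "zero_shot": 300_000,
}

def list_subsets():
    """Return every (version, split, approximate_size) combination."""
    return [
        (version, split, size)
        for version, (split, size) in product(VERSIONS, SPLIT_SIZES.items())
    ]

for version, split, size in list_subsets():
    print(f"{version:<3} {split:<10} ~{size:,} datapoints")
```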