address_parser_data
Introduced 2024-04-08
This is a set of datasets containing three versions of data:
- V0: the original sampled addresses with no augmentation.
- V1: derived from V0 by applying basic cleaning and address-structure masking (rearranging or removing address parts, without adding any augmented parts).
- V2: derived from V0 by applying more advanced augmentation techniques (see the paper for details).
For each version there are three datasets:
- Train: around 3M datapoints used for training
- Test: around 100k datapoints used for testing
- Zero-shot: around 300k datapoints used for validation, extracted from addresses in countries not seen in Train or Test
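
The layout above (three versions, each with three datasets) can be sketched in Python as follows. This is only an illustration: the split labels `train`, `test`, and `zero_shot`, the function names, and the idea of comparing country codes are assumptions made for this example, not the actual file names or schema of the release.

```python
from itertools import product

# Version and split labels as described above; the exact on-disk
# names are an assumption made for this sketch.
VERSIONS = ("V0", "V1", "V2")
SPLITS = ("train", "test", "zero_shot")


def dataset_keys():
    """Return every (version, split) pair: 3 versions x 3 splits."""
    return list(product(VERSIONS, SPLITS))


def unseen_countries_only(train_countries, test_countries, zero_shot_countries):
    """Return True if no zero-shot country also appears in train or test.

    The arguments are iterables of country identifiers (hypothetical;
    the real data may encode countries differently).
    """
    seen = set(train_countries) | set(test_countries)
    return seen.isdisjoint(zero_shot_countries)
```

For example, `dataset_keys()` yields nine pairs, and `unseen_countries_only` can serve as a sanity check that a zero-shot dataset really contains only countries absent from the other two.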