ShopTC-100K Dataset

The ShopTC-100K dataset was collected using TermMiner, an open-source data collection and topic modeling pipeline introduced in the paper:

Harmful Terms and Where to Find Them: Measuring and Modeling Unfavorable Financial Terms and Conditions in Shopping Websites at Scale

If you find this dataset or the related paper useful for your research, please cite our paper:

@inproceedings{tsai2025harmful,
  author = {Elisa Tsai and Neal Mangaokar and Boyuan Zheng and Haizhong Zheng and Atul Prakash},
  title = {Harmful Terms and Where to Find Them: Measuring and Modeling Unfavorable Financial Terms and Conditions in Shopping Websites at Scale},
  booktitle = {Proceedings of the ACM Web Conference 2025 (WWW '25)},
  year = {2025},
  location = {Sydney, NSW, Australia},
  publisher = {ACM},
  address = {New York, NY, USA},
  pages = {14},
  month = {April 28-May 2},
  doi = {10.1145/3696410.3714573}
}

Dataset Description

The dataset consists of sanitized terms extracted from the English-language terms and conditions of e-commerce websites. The websites were sourced from the Tranco list (as of April 2024).