TermMiner

Data collection and topic modeling pipeline introduced in the paper:

Harmful Terms and Where to Find Them: Measuring and Modeling Unfavorable Financial Terms and Conditions in Shopping Websites at Scale

If you find this pipeline or the ShopTC-100K dataset useful for your research, please cite our paper:

@inproceedings{tsai2025harmful,
  author = {Elisa Tsai and Neal Mangaokar and Boyuan Zheng and Haizhong Zheng and Atul Prakash},
  title = {Harmful Terms and Where to Find Them: Measuring and Modeling Unfavorable Financial Terms and Conditions in Shopping Websites at Scale},
  booktitle = {Proceedings of the ACM Web Conference 2025 (WWW ’25)},
  year = {2025},
  location = {Sydney, NSW, Australia},
  publisher = {ACM},
  address = {New York, NY, USA},
  pages = {14},
  month = {April 28-May 2},
  doi = {10.1145/3696410.3714573}
}

Table of Contents

Installation
Configuration
- API Token
- File Structure
Data Collection
- Tranco List
- Fetch Terms and Conditions
Topic Modeling
ShopTC-100K Dataset

Installation

Create a Python virtual environment, for example with Conda:

conda env create -f environment.yml && conda activate tc

Configuration

API Token

If your OpenAI API key is stored in ./api_token, update the corresponding entries in configs/measurement.yaml to point to it.
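Keeping the key in a separate file keeps it out of the source tree. As a minimal sketch of that pattern, the snippet below reads a key from a file and strips surrounding whitespace. The helper name `read_api_token` and its default path are illustrative assumptions; how TermMiner itself loads the key is determined by config.py and configs/measurement.yaml.

```python
from pathlib import Path


def read_api_token(path: str = "./api_token") -> str:
    """Return the API key stored in `path`, without surrounding whitespace.

    Hypothetical helper for illustration; not part of TermMiner.
    """
    token = Path(path).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"API token file {path!r} is empty")
    return token
```

Stripping whitespace matters because editors often append a trailing newline, which would otherwise end up inside the key.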

File Structure

├── config.py
├── configs
│   ├── measurement.yaml
│   └── prompts
│       ├── comparison_prompt.txt
│       ├── financial_term_binary_classification_prompt.txt
│       ├── financial_term_classification_prompt.txt
│       ├── gpt_3.5_shopping_website_classification.txt
│       ├── gpt_4o_shopping_website_classification.txt
│       ├── malicious_financial_term_classification_prompt.txt
│       ├── malicious_financial_term_taxonomy.txt
│       ├── payment_page_classification.txt
│       └── term_annotation_scheme.txt
├── measurement
│   ├── 1tranco.py
│   ├── 2fetch_terms.py
│   ├── 3sanitize_terms.py
│   ├── 4cluster.py
│   ├── README.md
│   ├── tc_locator.py
│   └── website_classification.py
├── stats.py
├── terms.py
└── utils.py

The following folders are automatically generated when running the measurement code:

data/tranco/
├── clusters
├── embeddings
├── non_shopping_terms
├── output
├── sanitized
├── screenshots
├── shopping_terms
└── stats

To customize the default file save paths, update the settings in configs/measurement.yaml.

Data Collection

💡 For detailed instructions on running the code, refer to measurement/README.md.

Tranco List

Download the most recent Tranco list from the Tranco website and store it as top-1m.csv.
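To illustrate how a rank range can be read from the downloaded file, here is a minimal sketch. It assumes the standard Tranco CSV layout (one `rank,domain` pair per row, no header) and treats the range as 0-based and half-open; the function name `load_tranco_slice` is hypothetical, and the exact semantics of `--start` and `--end` in the measurement scripts may differ.

```python
import csv
from itertools import islice


def load_tranco_slice(path: str, start: int, end: int) -> list[str]:
    """Return domains at 0-based row positions [start, end) of a Tranco CSV.

    Assumes one `rank,domain` pair per row and no header row.
    Hypothetical helper for illustration; not part of TermMiner.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = islice(csv.reader(f), start, end)
        # Skip malformed rows that lack a domain column.
        return [row[1] for row in rows if len(row) >= 2]
```

Using `islice` avoids loading all one million rows when only a small range is needed.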

Fetch Terms and Conditions

Run:

python measurement/1tranco.py --start=0 --end=1000

to fetch the list of websites (the Tranco top 1,000), check their accessibility and language, take screenshots, classify them as shopping or non-shopping websites using a vision-based classifier, and save the results.

Run:

python measurement/2fetch_terms.py --start=0 --end=1000

to fetch the T&C pages from the English shopping websites.

Run:

python measurement/3sanitize_terms.py --start=0 --end=1000 --target=sanitized_split0.csv

to sanitize the fetched terms and split them into paragraphs.
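As a rough illustration of paragraph splitting, the sketch below breaks text on blank lines, collapses internal whitespace, and drops very short fragments. The function name `split_into_paragraphs` and the `min_chars` threshold are assumptions made for this example; the rules actually applied by 3sanitize_terms.py may differ.

```python
import re


def split_into_paragraphs(text: str, min_chars: int = 20) -> list[str]:
    """Split text on blank lines and drop fragments shorter than min_chars.

    Hypothetical helper for illustration; not part of TermMiner.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        cleaned = " ".join(block.split())  # collapse runs of whitespace
        if len(cleaned) >= min_chars:
            paragraphs.append(cleaned)
    return paragraphs
```

Dropping short fragments is one simple way to filter out menu items and stray headings that survive HTML extraction.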

Topic Modeling

Run:

python measurement/4cluster.py --split=0 --cluster=True --chunk-num=5 --is-financial=True --eps=0.21

to cluster and topic model the terms.
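The `--eps` flag suggests a density-based method such as DBSCAN, although this README does not name the algorithm. Under that assumption, the following dependency-free sketch shows how an `eps` radius over cosine distance groups embedding vectors and marks isolated ones as noise. The function names and the `min_samples` default are illustrative and are not taken from 4cluster.py.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(points, eps, min_samples=2):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Precompute eps-neighborhoods; each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not yet visited"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels
```

A smaller `eps` yields tighter, more numerous clusters and more noise points. This pure-Python version is quadratic in the number of points; for real embedding sets, a library implementation such as scikit-learn's `DBSCAN` with `metric="cosine"` would be the practical choice.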

ShopTC-100K Dataset

The ShopTC-100K dataset contains 1.8 million sanitized terms extracted from the terms and conditions of 7,777 shopping websites among the Tranco top 100K most popular websites (as of April 2024).

Download ShopTC-100K from Hugging Face: 🤗 https://huggingface.co/datasets/eltsai/ShopTC-100K
