PrivacyScraper

NSF Privacy Question Answering Project - Principal Investigator Norman Sadeh, CMU
04/21/2024, Tong Jiao

Overview

Given a URL or a CSV file containing URLs to privacy policy pages, this scraper can retrieve and save full privacy policy texts from the given URLs.

Objective of Designing This Scraper

Although app developers are required to provide a URL to their privacy policy on the app store, the page at that URL does not always contain the full privacy policy text. It may only link to the actual policy, or it may link to additional related information. This software aims to retrieve all privacy-related information starting from the given URL.

For the detailed design and analysis of this scraper, its performance on popular and random apps in the iOS app store, the resources and data used, and its limitations, see https://docs.google.com/presentation/d/10oUdr1Rszi4xrfLsKUewKoWFaeEmQ7Kc067tvIjfc7o/edit?usp=sharing.

Code Organization

chatgpt_utils.py: Utilities related to ChatGPT API calls
config.py: Configurations used throughout the scraper
download_text_genai.py: Functions for downloading text from URLs using generative AI tools
get_websites.py: Retrieves a list of websites to download from the provided CSV file
main.py: Runs the entire extraction process. Use it if you do not want to call the download_text or download_text_save method from elsewhere.
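
To illustrate the kind of work get_websites.py is described as doing, here is a minimal sketch that reads app names and policy URLs from a CSV file. The function name `read_policy_links`, the sample app name, and the column headers `app_name` and `privacy_policy_url` are assumptions made for illustration; the actual module may be organized differently.

```python
import csv
import io


def read_policy_links(csv_file, policy_col_name, app_id_col_name):
    """Return (app_id, url) pairs from an open CSV file, skipping rows without a URL."""
    links = []
    for row in csv.DictReader(csv_file):
        url = (row.get(policy_col_name) or "").strip()
        if url:
            app_id = (row.get(app_id_col_name) or "").strip()
            links.append((app_id, url))
    return links


# Hypothetical sample data; the real column names are set in config.py.
sample = io.StringIO(
    "app_name,privacy_policy_url\n"
    "Example App,http://pbskids.org/privacy\n"
    "App Without Policy,\n"
)
links = read_policy_links(sample, "privacy_policy_url", "app_name")
print(links)
```

Passing the column names as arguments mirrors the policy_col_name and app_id_col_name settings, so the same reader would work with differently labeled input files.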

Example Input, Output Files, and Usage

Example input file: popular_apps.csv (contains 100 URLs to privacy policies of popular apps on the iOS app store, accessed on 10/14/2023)
Output folders:
- saved_policies: Contains texts from privacy policies
- saved_non_policies: Contains texts from pages that are not privacy policies
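Usage as an API imported by another module:

# Import the download function and the shared configuration
from download_text_genai import download_text
from config import config

example_url = "http://pbskids.org/privacy"
example_app_name = ''
# download_text returns the page text and a flag indicating whether it is a privacy policy
policy_text, is_policy_page = download_text(example_url)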
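
As a rough sketch of what a caller might do with the two return values, the example below writes the text to one of two folders depending on the flag. It is not the repository's download_text_save implementation: `save_result` and `fake_download_text` are hypothetical names, and the stand-in downloader exists only so the sketch runs without network or API access.

```python
from pathlib import Path
import tempfile


def save_result(app_name, text, is_policy_page, policy_dir, nonpolicy_dir):
    """Write text to the policy or non-policy folder and return the file path."""
    target_dir = Path(policy_dir if is_policy_page else nonpolicy_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep only filesystem-safe characters in the file name.
    safe_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in app_name)
    path = target_dir / f"{safe_name or 'unnamed'}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def fake_download_text(url):
    # Stand-in for download_text (assumption): returns (text, is_policy_page).
    return "Example policy text for " + url, True


with tempfile.TemporaryDirectory() as tmp:
    text, is_policy = fake_download_text("http://pbskids.org/privacy")
    saved = save_result(
        "pbskids", text, is_policy,
        Path(tmp) / "saved_policies", Path(tmp) / "saved_non_policies",
    )
    saved_parent = saved.parent.name
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
```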

Usage Instructions for `main.py`

Prepare Input:
- Get a list of websites to scrape and save the app names and URLs in a CSV file.
Specify Configurations in config.py:
Explanation of each parameter:
- openai_api_key: The API key for using OpenAI services
- link_csv_path: The CSV file containing links to privacy policies
- policy_col_name: The name of the column specifying the URL to the privacy policy page of each app
- app_id_col_name: The name of the column specifying the name of each app
- chatgpt_model: The ChatGPT model used in GenAI steps
- output_path_policy: The path of a folder to save texts from privacy policies (as determined by the scraper through GenAI)
- output_path_nonpolicy: The path of a folder to save texts from pages that are not privacy policies (as determined by the scraper through GenAI)
- headless_driver: Whether the Selenium driver runs in headless mode. For servers without a GUI, this should be set to True
- chatgpt_api_timeout: Seconds to wait before retrying a ChatGPT API call
- chatgpt_api_retries: Maximum number of attempts for a single ChatGPT API call
- initial_prompt: The initial prompt given to ChatGPT
- analyze_anchor_text_prompt_beginning: Beginning of the prompt when asking ChatGPT to find a link to the correct privacy policy page. It gives context.
- analyze_anchor_text_prompt_ending: Ending of the prompt when asking ChatGPT to find a link to the correct privacy policy page. It describes the task.
- if_policy_page_prompt_beginning: Beginning of the prompt when asking ChatGPT to determine if the content in a webpage is a privacy policy. It gives context.
- if_policy_page_prompt_ending: Ending of the prompt when asking ChatGPT to determine if the content in a webpage is a privacy policy. It describes the task.
- if_policy_page_prompt_extract_answer: The prompt used to ask GenAI for a one-word answer. When ChatGPT is asked to determine whether the content of a webpage is a privacy policy, Chain of Thought prompting is used, and this prompt extracts the final answer from GenAI's initial response.
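
The two-step flow described for the if_policy_page prompts can be sketched as follows: the first call asks for reasoning, and the second asks for a one-word verdict based on that reasoning. This is only an illustration under stated assumptions. `check_policy_page`, `fake_ask`, and the prompt wording are hypothetical, and the way the real scraper assembles its messages may differ.

```python
def check_policy_page(page_text, ask, prompts):
    """Two-step check: request reasoning first, then a one-word verdict."""
    question = (
        prompts["if_policy_page_prompt_beginning"]
        + page_text
        + prompts["if_policy_page_prompt_ending"]
    )
    reasoning = ask(question)
    # Feed the reasoning back with the extraction prompt to get a short answer.
    verdict = ask(reasoning + "\n" + prompts["if_policy_page_prompt_extract_answer"])
    return verdict.strip().strip(".").lower() == "yes"


# Placeholder prompts (assumption); the real wording lives in config.py.
prompts = {
    "if_policy_page_prompt_beginning": "Here is the text of a webpage:\n",
    "if_policy_page_prompt_ending": "\nIs this a privacy policy? Think step by step.",
    "if_policy_page_prompt_extract_answer": "Answer with one word: yes or no.",
}


def fake_ask(prompt):
    # Stand-in for a ChatGPT call so the sketch runs offline.
    if prompt.endswith(prompts["if_policy_page_prompt_extract_answer"]):
        return "Yes."
    return "The page describes data collection and sharing, so it reads like a policy."


result = check_policy_page("We collect your email address...", fake_ask, prompts)
```

Separating the reasoning step from the verdict step keeps the final answer easy to parse while still letting the model explain itself first.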
Run the Scraper (Execute main.py):
- After running main.py, privacy policies (determined by GenAI) will be saved in "output_path_policy" and non-policies will be saved in "output_path_nonpolicy".
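
The chatgpt_api_timeout and chatgpt_api_retries settings described above suggest a retry loop around each API call. A minimal sketch of such a loop, assuming the timeout is a fixed wait between attempts, is shown here; `call_with_retries` and `flaky_call` are hypothetical names and not part of the repository.

```python
import time


def call_with_retries(call, retries, timeout_seconds, sleep=time.sleep):
    """Try call() up to `retries` times, waiting timeout_seconds after each failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return call()
        except Exception as error:
            last_error = error
            if attempt < retries:
                sleep(timeout_seconds)
    raise RuntimeError(f"call failed after {retries} attempts") from last_error


attempts = []


def flaky_call():
    # Simulated API call (assumption): fails twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated API failure")
    return "ok"


waits = []
# Record the waits instead of sleeping so the example finishes instantly.
outcome = call_with_retries(flaky_call, retries=3, timeout_seconds=5, sleep=waits.append)
```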
