diff --git a/README.md b/README.md
index 45e2e4af87..73c460c5e8 100644
--- a/README.md
+++ b/README.md
@@ -14,10 +14,9 @@
- Trackers
- · Websites
- · Blog
- · Explorer
+ Trackers
+ · Websites
+ · Explorer
@@ -29,96 +28,36 @@
-This repository contains:
-
-- data on trackers and websites as shown on [whotracks.me](https://whotracks.me/) (WTM)
-- database mapping tracker domains to companies
-- code to render the [whotracks.me](https://whotracks.me/) site
-
----
-
-> :warning: **Upcoming changes**
->
-> We are in the process of migrating the website to [ghostery.com/whotracksme](https://ghostery.com/whotracksme):
->
-> * https://www.ghostery.com/whotracksme
-> * https://www.ghostery.com/whotracksme/trackers
-> * https://www.ghostery.com/whotracksme/websites
-> * https://www.ghostery.com/whotracksme/explorer
->
-> The following documentation has not been updated yet and still points to the old website.
-> You can already try out the new website with the links above. If you have any feedback,
-> about missing functionality or if you spot inconsistencies, feel free to create a
-> [Github issue](https://github.com/whotracksme/whotracks.me/issues).
->
-> Monthly data dumps will not be affected by these changes. The licensing also remains unchanged.
->
-> For more information, see https://github.com/whotracksme/whotracks.me/issues/367
->
-> For now, if you came from the Ghostery website and have question on the data itself
-> (e.g. how to download or looking for a technical documentation of the fields),
-> it is best to start here:
->
-> https://github.com/whotracksme/whotracks.me/blob/master/whotracksme/data/Readme.md
->
-> If it does not answer your questions, also feel free to create a
-> [Github issue](https://github.com/whotracksme/whotracks.me/issues).
-> It helps us to improve the missing parts of the documentation.
-
----
-
-# Installation
-
-Python 3.11 is needed to build the site. We recommend creating a
-[virtualenv](http://docs.python-guide.org/en/latest/dev/virtualenvs/)
-to install the dependencies, or use `pipenv` or )
-
-to
-
-```sh
-python -m venv venv
-. venv/bin/activate
-```
-
-After the initial setup, you can proceed with installing whotracks.me.
+# Downloading the data
-For nushell:
+Each month, we release a new version of the web site. The data from the last month can be directly [accessed through the website](https://www.ghostery.com/whotracksme/explorer).
-```nushell
-python -m virtualenv venv
-overlay use venv/bin/activate.nu
-```
+The raw data, from which the graphs have been computed, is also available as an open data set (updated every month). You can also
+download historical data. More information on the raw data can be found [here](whotracksme/data/Readme.md).
-## With Pip
+WhoTracks.me also builds heavily on another open source project called [TrackerDB](https://github.com/ghostery/trackerdb);
+all meta data (e.g. company descriptions) is maintained there.
-```sh
-$ python -m pip install git+https://github.com/ghostery/whotracks.me.git
-```
-
-## From source
+# Using the data
-After cloning the repository:
+You can directly use the [raw data](whotracksme/data/Readme.md), which are all text files. As an alternative, you can also
+download it locally and use the Python API:
-```sh
-$ python -m pip install -r requirements.txt
-$ python -m pip install -e .
+```
+python3.11 -m venv venv
+. venv/bin/activate
+pip install git+https://github.com/ghostery/whotracks.me.git
```
-That’s all you need to get started\!
-
-# Downloading the data
-
-Each month, we release a new version of the web site. The raw data, from which the
-graphs have been computed, are also available as an open data set (updated every month).
-
-The data from month can be also directly [accessed through the website](https://whotracks.me/explorer.html).
-
-More information on the raw data can be found in `whotracksme/data/Readme.md`.
+... or, if you have checked out the repository locally:
-# Using the data
+```
+python3.11 -m venv venv
+. venv/bin/activate
+pip install -r requirements.txt
+```
-To get started with the data, everything you need can be found in
-`whotracksme.data`:
+The Python API can now be accessed as follows (make sure you have already downloaded data):
```python
from whotracksme.data.loader import DataSource
@@ -131,7 +70,7 @@ data.companies
data.sites
```
-A whitepaper for whotracks.me is available at https://arxiv.org/abs/1804.08959, and here's a BibTeX entry that you can use to cite it in a publication:
+A whitepaper for WhoTracks.me is available at https://arxiv.org/abs/1804.08959, and here's a BibTeX entry that you can use to cite it in a publication:
```
@misc{whotracksme,
@@ -144,56 +83,24 @@ A whitepaper for whotracks.me is available at https://arxiv.org/abs/1804.08959,
}
```
-# Building the site
-
-Building the site requires a few extra dependencies, not installed by
-default to not make the installation heavier than it needs to be. You
-will need to install `whotracksme` from the repository, because not all
-assets are packaged with `whotracksme` released on pypi:
-
-```sh
-$ python -m pip install -r requirements-dev.txt
-$ python -m pip install -e '.[dev]'
-```
-
-Once this is done, you will have access to a `whotracksme` entry point
-that can be used this way:
-
-```sh
-$ whotracksme website [serve]
-```
-
-The `serve` part is optional and can be used while making changes on the
-website.
-
-All generated artifacts can be found in the `_site/` folder.
-
-> If you debug the website generator, the parallel execution can be
-> disabled by setting the environment variable DEBUG=1.
-
-## Tests
-
-To run tests, you will need `pytest`, or simply install `whotracksme`
-with the `dev` extra:
-
-```sh
-$ python -m pip install -e '.[dev]'
-$ pytest
-```
-
# Contributing
-We are happy to take contributions on:
+We rely on contributions from the community to keep the quality of this project high. If you want, you can support us in multiple ways:
+* Do you see inconsistencies in the data? Please open a GitHub issue [here](https://github.com/whotracksme/whotracks.me/issues). We will have a look!
+* Do you see wrong company descriptions? Did we put something in the wrong category? Please check out the [TrackerDB project](https://github.com/ghostery/trackerdb), where all the meta data is kept, and open an [issue](https://github.com/ghostery/trackerdb/issues), or send us a pull request.
+* Do you have any feedback on the [WhoTracks.me homepage](https://www.ghostery.com/whotracksme) or about the documentation? Please, let us know, so we can improve.
-- Guest articles for our blog in the topics of tracking, privacy and security. Feel free to use the data in this repository if you need inspiration.
-- Feature requests that are doable using the WTM database.
-- Curating our database of tracker profiles. Open an issue if you spot anything odd.
+You can also contact us via email at [info@whotracks.me](mailto:info@whotracks.me).
# Right to Amend
Please read our [Guideline for 3rd parties](https://github.com/ghostery/whotracks.me/blob/master/RIGHT_TO_AMEND.md) wanting to suggest
corrections to their data.
+# Local builds
+
+[Readme on local builds](docs/local-build.md) (this is mostly relevant for the maintainer of this project)
+
# License
The content of this project itself is licensed under the [Creative
diff --git a/docs/local-build.md b/docs/local-build.md
new file mode 100644
index 0000000000..047e05883c
--- /dev/null
+++ b/docs/local-build.md
@@ -0,0 +1,40 @@
+# Local development
+
+## Building the website (the static HTML based version) and the internal API
+
+The code to build the website on https://www.ghostery.com/whotracksme is not public;
+but for local testing, you can still use the code for the [previous version](https://web.archive.org/web/20240501140903/https://whotracks.me/).
+
+Python 3.11 is needed to build the site (Python 3.12 is currently not supported):
+
+```sh
+python3.11 -m venv venv
+. venv/bin/activate
+pip install -r requirements-dev.txt
+pip install -e '.[dev]'
+```
+
+If you have not done so already, make sure that you have downloaded the data (see [Data Readme](../whotracksme/data/Readme.md)).
+
+```
+whotracksme website
+```
+
+It will generate static HTML files in the `_site` directory. It will also create JSON files
+in the `_site/api/` directory. Use them at your own risk, since the format is expected to change over
+time. If there is interest in stabilizing the API files, let us know. Currently, they are only used internally
+within Ghostery to power the new website.
+
+> Hint: if you debug the website generator, the parallel execution can be
+> disabled by setting the environment variable DEBUG=1.
+
+## Tests
+
+To run the unit tests:
+
+```sh
+python3.11 -m venv venv
+. venv/bin/activate
+python -m pip install -e '.[dev]'
+pytest
+```
diff --git a/whotracksme/data/Readme.md b/whotracksme/data/Readme.md
index 96c9e37851..a8aefa704a 100644
--- a/whotracksme/data/Readme.md
+++ b/whotracksme/data/Readme.md
@@ -2,16 +2,17 @@
The data for the whotracks.me site is provided here as JSON files, with a SQL database containing tracker information. This document describes the format of the data provided in the `assets` directory.
+> Note: Besides the monthly data dump, there is also a separate project for the meta data. It is an open source project called [TrackerDB](https://github.com/ghostery/trackerdb).
+
## How to get the data
You have two options to work with the raw data:
-1. Explore the raw data from last month through the web site
+1. Explore the raw data from last month [through the web site](https://www.ghostery.com/whotracksme/explorer)
1. Download the data locally (including historic data)
### Use the Explorer on the whotracks.me website
-The last month can be directly accessed from website:
-https://whotracks.me/explorer.html
+Data for the last month can be accessed directly from the website at https://www.ghostery.com/whotracksme/explorer
> Note: The meaning of the column in the explorer in explained in this document (see below).
@@ -43,30 +44,32 @@ aws s3 sync --no-sign-request s3://data.whotracks.me/ .
## Tracker database
-The tracker database is provided in the `assets/trackerdb.sql` file. This is a dump of a SQLite3 database containing the following tables:
+Generally, it is recommended to get the tracker database directly from the upstream project: https://github.com/ghostery/trackerdb
+
+For consistency, a snapshot of the tracker database used to generate the monthly data set is also provided in the `assets/trackerdb.sql` file. This is a dump of a SQLite3 database containing the following tables:
* `categories`: Categories for trackers (e.g. `advertising`, `social_media`).
* `companies`: Metadata on companies: name, description, and various links.
* `trackers`: Metadata on trackers: name, description, category, website and an optional link to a parent company.
* `tracker_domains`: Table linking trackers to domain names.
-This SQL database links third-party domains to trackers, which are then linked to unique companies operating those trackers. This is similar to [Disconnect's Tracker List](https://github.com/disconnectme/disconnect-tracking-protection) and [webXray's Domain Owner List](https://github.com/timlib/webXray_Domain_Owner_List#webxray-domain-owner-list). These parties are already categorized accordingly within the WhoTracks.me datasets below.
-
## WhoTracks.me datasets
-WhoTracks.me datasets are provided monthly in the `assets//{month}/{country}/{file}.csv` format. A glimpse of the datasets is available in [Explorer](https://whotracks.me/explorer.html) section on the website.
+WhoTracks.me datasets are provided monthly in the `assets//{month}/{country}/{file}.csv` format. A glimpse of the datasets is available in [Explorer](https://www.ghostery.com/whotracksme/explorer) section on the website.
### Data collection
-Data was collected from May 2017 from users that used Cliqz browser extension. In Feb 2018, 70% of the data came from German users according to [this](https://whotracks.me/blog/update_feb_2018.html) blog post. Then in March 2018, users of Ghostery FireFox extension - and Ghostery extension available for other browsers (Safari, Chrome, Opera and Edge) from users that opted-in to *HumanWeb* data collection - were added to the dataset. This caused a slight decrease in the avg. no. of trackers in April 2018, since Ghostery users were blocking more trackers. This is explained in [this](https://whotracks.me/blog/where_is_the_data_from.html) and [this](https://whotracks.me/blog/update_apr_2018.html) blog posts.
+Nowadays, the data comes exclusively from users of the [Ghostery Extension](https://github.com/ghostery/ghostery-extension/). Precise user counts are difficult to determine due to the nature of the data collection, but the user base is estimated to be on the order of a few million devices per month, spread around the world but mostly in Europe and the US. The methodology builds on the concept of k-Anonymity and is described in the paper [Tracking the Trackers](https://0x65.dev/static/docs/studies/TrackingTheTrackers.pdf). The WhoTracks.me monthly data sets are derived from the same data that also powers the anti-tracking protection in Ghostery; it is also described in [this blog post](https://www.0x65.dev/blog/2019-12-19/blocking-tracking-without-blocking-trackers.html). The code can be found [here](https://github.com/whotracksme/webextension-packages/tree/main/packages/reporting/src/request).
-[This](https://whotracks.me/blog/update_apr_2018.html) blog post illustrates where the traffic came from in April 2018: Germany and USA being most representative.
+Before 2018, all traffic came from users of the Cliqz Browser and the Cliqz extension (now both discontinued). Around April 2018, Ghostery users started to contribute to the data set. This both increased the user base and made it more international (Cliqz was mostly used in German-speaking regions). Here is some historical information from the Cliqz side (to help understand data sets before 2020):
-[This](https://cliqz.com/en/magazine/government-websites-leak-data-to-google-co) blog post notes that WhoTracks.me does not collect data for pages with no trackers; in other words, collected data for all sites contains some number of third-parties and tracking.
+* Data was collected from May 2017 from users of the Cliqz browser extension. In Feb 2018, 70% of the data came from German users according to [this](https://web.archive.org/web/20240121094157/https://whotracks.me/blog/update_feb_2018.html) blog post. Then in March 2018, users of the Ghostery Firefox extension, and of the Ghostery extension for other browsers (Safari, Chrome, Opera and Edge) who opted in to HumanWeb data collection, were added to the dataset. This caused a slight decrease in the avg. no. of trackers in April 2018, since Ghostery users were blocking more trackers. This is explained in [this](https://web.archive.org/web/20240430053538/https://whotracks.me/blog/where_is_the_data_from.html) and [this](https://web.archive.org/web/20240121094145/https://whotracks.me/blog/update_apr_2018.html) blog posts.
+* [This](https://web.archive.org/web/20240121094145/https://whotracks.me/blog/update_apr_2018.html) blog post illustrates where the traffic came from in April 2018, with Germany and the USA being the most represented.
+* [This](https://cliqz.com/en/magazine/government-websites-leak-data-to-google-co) blog post notes that WhoTracks.me does not collect data for pages with no trackers; in other words, collected data for all sites contains some number of third-parties and tracking.
### Datasets
-There are 5 main datasets on - unlike 4 datasets available in [Explorer](https://whotracks.me/explorer.html) section on the website:
+There are five main datasets (unlike the four datasets available in the [explorer](https://www.ghostery.com/whotracksme/explorer) section on the website):
* `sites.csv`: Stats for number of trackers seen on popular websites.
* `site_trackers.csv`: Stats for each tracker on each site.
@@ -78,7 +81,7 @@ There are 5 main datasets on - unlike 4 datasets available in [Explorer](https:/
### Variable descriptions
-The data is created by aggregating data about page loads at several different levels. Therefore, all 5 above datasets share similar aggregated variables. The difference therefore, lies in the *perspective* of each dataset. Variable descriptions ("contexts" are added to variables for groupings) are given below:
+The data is created by aggregating data about page loads at several different levels. Therefore, all five datasets above share similar aggregated variables. The difference lies in the *perspective* of each dataset. Variable descriptions ("contexts" are added to variables for groupings) are given below:
**General context**:
@@ -90,9 +93,9 @@ The data is created by aggregating data about page loads at several different le
* `category` - site's category (in `sites.csv`). Descriptions of website categories (first-parties) are provided [here](https://arxiv.org/pdf/1804.08959v1.pdf#page=14) in Appendix A. String.
- * `tracker_category` - tracker's category (in `sites_trackers.csv`). Descriptions of tracker categories are provided [here](https://whotracks.me/blog/tracker_categories.html). String.
+ * `tracker_category` - tracker's category (in `sites_trackers.csv`). Descriptions of tracker categories are provided [here](https://github.com/ghostery/trackerdb/blob/main/docs/categories.md). String.
- * `popularity` - the relative amount of traffic compared to the most popular site (described [here](https://whotracks.me/blog/updating_our_tracking_prevalence_metrics.html)). Float between 0 and 1.
+ * `popularity` - the relative amount of traffic compared to the most popular site (described [here](https://web.archive.org/web/20240121094155/https://whotracks.me/blog/updating_our_tracking_prevalence_metrics.html)). Float between 0 and 1.
**Utilised tracking context (stateful)** – generates more persistant tracking ID by trackers:
@@ -128,15 +131,15 @@ The data is created by aggregating data about page loads at several different le
**Tracker's blocking context** – how often the tracker is affected by blocklist-based blockers:
- * `requests_failed` - average number of requests make to the tracker per page which do not succeed. In other words, avg. number of failed requests per page load (for comparison with `requests` to get an idea of how aggressive the blocking is). This is an approximate measure of blocking from external sources (i.e. adblocking extensions or firewalls). Measure [added](https://whotracks.me/blog/update_dec_2017.html) in Dec 2017. Positive float.
+ * `requests_failed` - average number of requests made to the tracker per page which do not succeed. In other words, avg. number of failed requests per page load (for comparison with `requests` to get an idea of how aggressive the blocking is). This is an approximate measure of blocking from external sources (i.e. adblocking extensions or firewalls). Measure [added](https://web.archive.org/web/20240121094211/https://whotracks.me/blog/update_dec_2017.html) in Dec 2017. Positive float.
- * `has_blocking` - proportion of pages where some kind of external blocking of the tracker was detected.Measure [added](https://whotracks.me/blog/update_dec_2017.html) in Dec 2017. Float between 0 and 1.
+ * `has_blocking` - proportion of pages where some kind of external blocking of the tracker was detected. Measure [added](https://web.archive.org/web/20240121094211/https://whotracks.me/blog/update_dec_2017.html) in Dec 2017. Float between 0 and 1.
> "These signals [`requests_failed` and `has_blocking`] should be able to tell us something about the impact of blocking on different trackers in the ecosystem. For example, we see evidence of blocking 40% of the time for Google Analytics and Facebook [in Dec 2017], and between 10% and 20% of requests failing. Thus, anyone using these services to measure activity and conversions on their sites must reckon with error rates in these orders. We also can see how new entrants can initially avoid the effects of blocking - for Tru Optik and Digitrust who we mentioned earlier, we measure only 5 and 1% of pages which may be affected by blocking."
**Tracker's content loading context** – proportion of page loads where specific resource types were loaded by the tracker (e.g. scripts, iframes, plugins)
-Signals for the frequency with which certain resource types are loaded by third-parties (measures [added](https://whotracks.me/blog/update_feb_2018.html) in Feb 2018):
+Signals for the frequency with which certain resource types are loaded by third-parties (measures [added](https://web.archive.org/web/20240121094157/https://whotracks.me/blog/update_feb_2018.html) in Feb 2018):
* `script` - JavaScript code (via a `