This repo contains examples using Common Crawl's host index. The host index for a crawl is a table which contains a single row for each web host, combining aggregated information from the columnar index, the web graph, and our raw crawler logs.
This document describes the testing v2 version of this dataset -- it will change before its final release. The host index can help answer questions like:
- What's our history of crawling a particular website, or group of websites?
- What popular websites have a lot of non-English content?
- What popular websites seem to have so little content that we might need to execute javascript to crawl them?
- What's the full list of websites where more than half of the webpages are primarily not English?
- What popular websites end our crawls with most of their crawl budget unused?
- What websites have a lot of content in particular languages?
- What websites have a lot of content with particular Unicode scripts?
- there is a Hive-style partition on `crawl`, which is a crawl name like `CC-MAIN-2025-13`
- there is one row for every web host in the web graph, even if we didn't crawl that website in that particular crawl
- the primary key is `surt_host_name`, which is quirky (commoncrawl.org -> org,commoncrawl; see the sketch after this list)
- there is also `url_host_tld`, which we recommend that you use whenever possible ("org" for commoncrawl.org)
- there are counts of what we stored in our archive (warc, crawldiagnostics, robots):
  - `fetch_200, fetch_3xx, fetch_4xx, fetch_5xx, fetch_gone, fetch_notModified, fetch_other, fetch_redirPerm, fetch_redirTemp`
  - `robots_200, robots_3xx, robots_4xx, robots_5xx, robots_gone, robots_notModified, robots_other, robots_redirPerm, robots_redirTemp`
- there is ranking information from the web graph: harmonic centrality, page rank, and both normalized to a 0-10 scale:
  - `hcrank`, `prank`, `hcrank10`, `prank10`
- there is a language summary (for now, just the count of languages other than English (LOTE)):
  - `fetch_200_lote`, `fetch_200_lote_pct`
- for a subset of the numbers, there is a `foo_pct` column, an integer 0-100, which can help you avoid doing math in SQL
- there are raw numbers from our crawler logs, which also reveal the crawl budget and whether we exhausted it:
  - `nutch_fetched, nutch_gone, nutch_notModified, nutch_numRecords, nutch_redirPerm, nutch_redirTemp, nutch_unfetched`
  - `nutch_fetched_pct, nutch_gone_pct, nutch_notModified_pct, nutch_redirPerm_pct, nutch_redirTemp_pct, nutch_unfetched_pct`
- there is a size summary (average and median compressed size):
  - `warc_record_length_median`, `warc_record_length_av` (will be renamed to `_avg` in v3)
- the full schema is at `athena_schema.v2.sql`
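If you need to build `surt_host_name` values yourself, the SURT host form is just the hostname's labels reversed and joined with commas. A minimal sketch, ignoring edge cases like IP addresses (the helper names are ours, not something shipped in this repo):

```python
def host_to_surt(host: str) -> str:
    """commoncrawl.org -> org,commoncrawl"""
    return ",".join(reversed(host.lower().split(".")))


def surt_to_host(surt: str) -> str:
    """org,commoncrawl -> commoncrawl.org"""
    return ".".join(reversed(surt.split(",")))


# url_host_tld is simply the last label of the hostname.
assert host_to_surt("commoncrawl.org") == "org,commoncrawl"
assert surt_to_host("org,commoncrawl") == "commoncrawl.org"
assert "commoncrawl.org".split(".")[-1] == "org"
```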
US Federal government websites in the *.gov domain (about 1,400 domains, y-axis scale is millions):
See all graphs from this dataset
Fetches for commoncrawl.org. You can see that we revamped our website in CC-MAIN-2023-14, which caused a lot of permanent redirects to be crawled in the next few crawls:
See all graphs from this dataset
The host index can either be used in place at AWS, or you can download it and use it from local disk. The size is about 7 gigabytes per crawl, and the most recent 24 crawls are currently indexed (testing-v2).
To use it in place, install the Python requirements and fetch the paths file. Over HTTPS:

```bash
pip install -r requirements.txt
wget https://data.commoncrawl.org/projects/host-index-testing/v2.paths.gz
export HOST_INDEX=v2.paths.gz
```

Or via S3, also pointing `HOST_INDEX_BUCKET` at the bucket:

```bash
aws s3 cp s3://commoncrawl/projects/host-index-testing/v2.paths.gz .
export HOST_INDEX=v2.paths.gz
export HOST_INDEX_BUCKET=s3://commoncrawl/
```
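If you'd rather query in place directly from Python instead of using the included scripts, here is a minimal duckdb sketch. It assumes the paths file follows the usual Common Crawl convention of listing parquet keys relative to the bucket root, and it reads only the first shard to keep the example cheap (that shard may or may not contain any particular host):

```python
import gzip

import duckdb

# Read the list of parquet files from the paths file downloaded above.
with gzip.open("v2.paths.gz", "rt") as f:
    paths = [line.strip() for line in f if line.strip()]

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Query a single shard over HTTPS.
url = "https://data.commoncrawl.org/" + paths[0]
rows = con.execute(f"""
    SELECT surt_host_name, url_host_tld, fetch_200, hcrank10
    FROM read_parquet('{url}')
    WHERE url_host_tld = 'org'
    ORDER BY hcrank10 DESC
    LIMIT 10
""").fetchall()
for row in rows:
    print(row)
```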
If you have enough local disk space (180 gigabytes), install cc-downloader and then:

```bash
wget https://data.commoncrawl.org/projects/host-index-testing/v2.paths.gz
cc-downloader download v2.paths.gz .
```

Wherever you move the downloaded files, point `HOST_INDEX` at the top directory:

```bash
export HOST_INDEX=/home/cc-pds/commoncrawl/projects/host-index-testing/v2/
```
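A minimal duckdb sketch against the local copy, assuming the download preserves the Hive-style `crawl=...` directory layout under that top directory:

```python
import duckdb

HOST_INDEX = "/home/cc-pds/commoncrawl/projects/host-index-testing/v2/"

con = duckdb.connect()
# hive_partitioning=true recovers the crawl column from the directory names.
rows = con.execute(f"""
    SELECT crawl, fetch_200, fetch_200_lote, prank10, hcrank10
    FROM read_parquet('{HOST_INDEX}**/*.parquet', hive_partitioning=true)
    WHERE surt_host_name = 'org,commoncrawl'
      AND url_host_tld = 'org'
    ORDER BY crawl ASC
""").fetchall()
for row in rows:
    print(row)
```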
To query the host index with AWS Athena, first create a database:

```sql
CREATE DATABASE cchost_index_testing_v2
```

Select the new database in the GUI, paste the contents of `athena_schema.v2.sql` into a query, and run it. Then:

```sql
MSCK REPAIR TABLE host_index_testing_v2
```

Now check that it's working:

```sql
SELECT COUNT(*) FROM cchost_index_testing_v2
```
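There is no Athena example script in this repo yet (contributions welcome, see below), but a hedged sketch of running the sanity check from Python with boto3 looks roughly like this; the results bucket is a placeholder you must replace with one you own:

```python
import time

import boto3

# The commoncrawl bucket lives in us-east-1.
athena = boto3.client("athena", region_name="us-east-1")

# Placeholder: Athena writes query results to an S3 location you own.
RESULTS = "s3://YOUR-BUCKET/athena-results/"

qid = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM cchost_index_testing_v2",
    QueryExecutionContext={"Database": "cchost_index_testing_v2"},
    ResultConfiguration={"OutputLocation": RESULTS},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if status["State"] == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=qid)
    # The first row is the header; the count is in the second row.
    print(result["ResultSet"]["Rows"][1]["Data"][0]["VarCharValue"])
else:
    print("query failed:", status.get("StateChangeReason", status["State"]))
```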
The included script `graph.py` knows how to make CSVs, PNG images, and webpages containing these images. It runs in 3 styles:

- one web host: `python ./graph.py example.com`
- wildcarded subdomains: `python ./graph.py *.example.com`
- a list of hosts: `python ./graph.py -f list_of_hosts.txt`
Yes, these commands take a while to run, because Parquet is a bad choice for looking up a single row. On my machine the example.com graph takes 5.5 minutes of CPU time.
This repo also has `duckdb-to-csv.py`, which you can use to run a single SQL command and get CSV output.
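If you want the same CSV-from-one-query behavior without the script, a minimal duckdb sketch (reusing the local HOST_INDEX layout from above) is:

```python
import duckdb

HOST_INDEX = "/home/cc-pds/commoncrawl/projects/host-index-testing/v2/"

con = duckdb.connect()
# COPY (...) TO writes the query result straight to a CSV file.
con.execute(f"""
    COPY (
        SELECT crawl, fetch_200, fetch_200_lote, prank10, hcrank10
        FROM read_parquet('{HOST_INDEX}**/*.parquet', hive_partitioning=true)
        WHERE surt_host_name = 'org,commoncrawl'
          AND url_host_tld = 'org'
        ORDER BY crawl ASC
    ) TO 'commoncrawl_org.csv' (HEADER, DELIMITER ',')
""")
```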
Host fetches, non-English count, and ranking. Note the use of `url_host_tld`, which is recommended to make the SQL query optimizer's life easier.
```sql
SELECT
  crawl, fetch_200, fetch_200_lote, prank10, hcrank10
FROM cchost_index_testing_v2
WHERE surt_host_name = 'org,commoncrawl'
  AND url_host_tld = 'org'
ORDER BY crawl ASC
```
Counts of web captures. This includes captures that you will find in the crawldiagnostics warcs.
```sql
SELECT
  crawl, fetch_200, fetch_gone, fetch_redirPerm, fetch_redirTemp, fetch_notModified, fetch_3xx, fetch_4xx, fetch_5xx, fetch_other
FROM cchost_index_testing_v2
WHERE surt_host_name = 'org,commoncrawl'
  AND url_host_tld = 'org'
ORDER BY crawl ASC
```
Per-host page size stats, average and median.
```sql
SELECT
  crawl, warc_record_length_av, warc_record_length_median
FROM cchost_index_testing_v2
WHERE surt_host_name = 'org,commoncrawl'
  AND url_host_tld = 'org'
ORDER BY crawl ASC
```
Raw crawler log counts. `nutch_numRecords` gives an idea of what the crawl budget was, and whether it was exhausted.
```sql
SELECT
  crawl, nutch_numRecords, nutch_fetched, nutch_unfetched, nutch_gone, nutch_redirTemp, nutch_redirPerm, nutch_notModified
FROM cchost_index_testing_v2
WHERE surt_host_name = 'org,commoncrawl'
  AND url_host_tld = 'org'
ORDER BY crawl ASC
```
Robots.txt fetch details.
```sql
SELECT
  crawl, robots_200, robots_gone, robots_redirPerm, robots_redirTemp, robots_notModified, robots_3xx, robots_4xx, robots_5xx, robots_other
FROM cchost_index_testing_v2
WHERE surt_host_name = 'org,commoncrawl'
  AND url_host_tld = 'org'
ORDER BY crawl ASC
```
Top 10 Vatican websites from crawl CC-MAIN-2025-13 that are > 50% languages other than English (LOTE).
```sql
SELECT
  crawl, surt_host_name, hcrank10, fetch_200_lote_pct, fetch_200_lote
FROM cchost_index_testing_v2
WHERE crawl = 'CC-MAIN-2025-13'
  AND url_host_tld = 'va'
  AND fetch_200_lote_pct > 50
ORDER BY hcrank10 DESC
LIMIT 10
```
# | crawl | surt_host_name | hcrank10 | fetch_200_lote_pct | fetch_200_lote |
---|---|---|---|---|---|
1 | CC-MAIN-2025-13 | va,vaticannews | 5.472 | 89 | 18872 |
2 | CC-MAIN-2025-13 | va,vatican | 5.164 | 73 | 14549 |
3 | CC-MAIN-2025-13 | va,museivaticani | 4.826 | 77 | 568 |
4 | CC-MAIN-2025-13 | va,vatican,press | 4.821 | 67 | 3804 |
5 | CC-MAIN-2025-13 | va,clerus | 4.813 | 79 | 68 |
6 | CC-MAIN-2025-13 | va,osservatoreromano | 4.783 | 98 | 3305 |
7 | CC-MAIN-2025-13 | va,vaticanstate | 4.738 | 73 | 509 |
8 | CC-MAIN-2025-13 | va,migrants-refugees | 4.732 | 67 | 2055 |
9 | CC-MAIN-2025-13 | va,iubilaeum2025 | 4.73 | 85 | 672 |
10 | CC-MAIN-2025-13 | va,cultura | 4.724 | 67 | 80 |
Top 10 websites from crawl CC-MAIN-2025-13 that are > 90% languages other than English (LOTE).
```sql
SELECT
  crawl, surt_host_name, hcrank10, fetch_200_lote_pct, fetch_200_lote
FROM cchost_index_testing_v2
WHERE crawl = 'CC-MAIN-2025-13'
  AND fetch_200_lote_pct > 90
ORDER BY hcrank10 DESC
LIMIT 10
```
# | crawl | surt_host_name | hcrank10 | fetch_200_lote_pct | fetch_200_lote |
---|---|---|---|---|---|
1 | CC-MAIN-2025-13 | org,wikipedia,fr | 5.885 | 99 | 55334 |
2 | CC-MAIN-2025-13 | org,wikipedia,es | 5.631 | 100 | 48527 |
3 | CC-MAIN-2025-13 | com,chrome,developer | 5.62 | 92 | 29298 |
4 | CC-MAIN-2025-13 | ar,gob,argentina | 5.613 | 95 | 16580 |
5 | CC-MAIN-2025-13 | fr,ebay | 5.579 | 100 | 24633 |
6 | CC-MAIN-2025-13 | org,wikipedia,ja | 5.55 | 100 | 49008 |
7 | CC-MAIN-2025-13 | ru,gosuslugi | 5.535 | 100 | 1560 |
8 | CC-MAIN-2025-13 | org,wikipedia,de | 5.508 | 100 | 48223 |
9 | CC-MAIN-2025-13 | com,acidholic | 5.477 | 100 | 356 |
10 | CC-MAIN-2025-13 | ph,telegra | 5.455 | 92 | 57153 |
- When the S3 bucket is under heavy use, AWS Athena will sometimes throw 503 errors. We have yet to figure out how to increase the retry limit.
- duckdb's https retries don't seem to work as advertised, so duckdb talking to AWS is also affected during periods of heavy usage (see the sketch after this list for the relevant settings).
- Hint: https://status.commoncrawl.org/ has graphs of S3 performance for the last day, week, and month.
- The sort order is a bit messed up, so database queries take more time than they should.
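If you hit those throttling errors from duckdb, the httpfs extension does expose retry knobs you can try raising, though as noted above they haven't been reliable for us. A sketch:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Retry settings for duckdb's httpfs extension; the defaults are modest.
con.execute("SET http_retries = 10")          # retries on HTTP I/O errors
con.execute("SET http_retry_wait_ms = 1000")  # initial wait between retries
con.execute("SET http_retry_backoff = 2")     # exponential backoff factor
```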
Changes planned for future versions:

- `warc_record_length_av` will be renamed to `_avg` (that was a typo)
- more `_pct` columns
- count truncations: length, time, disconnect, unspecified
- addition of indegree and outdegree from the web graph
- improve language details to be more than only LOTE and LOTE_pct
- add unicode block information, similar to languages
- `prank10` needs its power law touched up (`hcrank10` might change too)
- there's a sort problem where .com shards have a smattering of not-.com hosts. This hurts performance.
- add domain prank/hcrank
- CI running against S3
- `robots_digest_count_distinct`
- `robots_digest` (if there is exactly 1 robots digest)
- summarize `fetch_redirect`: same surt, same surt host, other
We'd love to get testing and code contributions! Here are some clues:
- We'd love to hear if you tried it out, and what your comments are
- We'd love to have python examples using Athena, similar to duckdb
- We'd love to have more python examples
- Please use pyarrow whenever possible