This project publishes a daily export of box office revenues scraped from Box Office Mojo. Each daily export contains all revenue data from January 1st, 2000 up to the current day.
Data published for a specific day is available under the releases tab.
To download the latest version of the raw dataset, use the releases/latest URL shown below.
For example, to load it with pandas:
import pandas as pd
url = 'https://github.com/tjwaterman99/boxofficemojo-scraper/releases/latest/download/revenues_per_day.csv.gz'
df = pd.read_csv(url, parse_dates=['date'], index_col='id')
df.head()
id | date | title | revenue | theaters | distributor |
---|---|---|---|---|---|
362a6861-2040-4257-b414-b932f5c69f10 | 2018-03-08 00:00:00 | Black Panther | 4251525 | 4084 | Walt Disney Studios Motion Pictures |
25320541-0e30-e62b-2573-284863c73e4a | 2018-03-08 00:00:00 | Red Sparrow | 1270235 | 3056 | Twentieth Century Fox |
08f98020-cf73-de6b-4803-2213649f9ea0 | 2018-03-08 00:00:00 | Game Night | 931272 | 3502 | Warner Bros. |
4a9c0497-0a38-540f-30b2-a06d16dfa784 | 2018-03-08 00:00:00 | Death Wish | 860755 | 2847 | Metro-Goldwyn-Mayer (MGM) |
e7986901-67fc-537d-9407-c3fc4c7a2faf | 2018-03-08 00:00:00 | Peter Rabbit | 620538 | 3607 | Sony Pictures Entertainment (SPE) |
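If pandas is not available, the same export can in principle be read with the standard library alone. The sketch below is a rough illustration, not part of the project: it assumes the CSV uses the column names shown in the table above, that dates are stored as plain `YYYY-MM-DD` strings, and that `revenue` is always an integer. The helper name `revenue_by_date` is hypothetical. To stay self-contained it builds a small gzip payload in memory from three rows of the table instead of downloading the real file.

```python
import csv
import gzip
import io
from collections import defaultdict


def revenue_by_date(gz_bytes):
    """Sum the revenue column per date from gzip-compressed CSV bytes."""
    totals = defaultdict(int)
    # Decompress in memory and read as text so the csv module can parse it.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            # Assumption: revenue is a whole number of dollars in every row.
            totals[row["date"]] += int(row["revenue"])
    return dict(totals)


# Three rows taken from the table above; the raw date format is an assumption.
sample_csv = (
    "id,date,title,revenue,theaters,distributor\n"
    "362a6861-2040-4257-b414-b932f5c69f10,2018-03-08,Black Panther,4251525,4084,Walt Disney Studios Motion Pictures\n"
    "25320541-0e30-e62b-2573-284863c73e4a,2018-03-08,Red Sparrow,1270235,3056,Twentieth Century Fox\n"
    "08f98020-cf73-de6b-4803-2213649f9ea0,2018-03-08,Game Night,931272,3502,Warner Bros.\n"
)
sample_gz = gzip.compress(sample_csv.encode("utf-8"))

totals = revenue_by_date(sample_gz)
print(totals)  # {'2018-03-08': 6453032}
```

For the real export, the bytes of the downloaded `revenues_per_day.csv.gz` file would take the place of `sample_gz`.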
Development requires Python 3.6+ and access to a PostgreSQL database.
Create a virtual environment.
virtualenv venv --python=python3
Install the requirements.
pip install -r requirements.txt
Set the PG environment variables. dbt uses these during the build steps.
export PGHOST=127.0.0.1
export PGPORT=5432
export PGUSER=postgres
export PGPASSWORD=postgres
export PGDATABASE=postgres
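A missing variable usually only surfaces later as a connection error, so it can help to check the environment before building. The following is a minimal sketch under that assumption. The helper `missing_pg_vars` and the constant `REQUIRED_PG_VARS` are hypothetical names, not part of this project. The variable list simply mirrors the exports above.

```python
import os

# The five variables exported in the setup steps above.
REQUIRED_PG_VARS = ("PGHOST", "PGPORT", "PGUSER", "PGPASSWORD", "PGDATABASE")


def missing_pg_vars(env=None):
    """Return the names of required PG variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_PG_VARS if not env.get(name)]


# Example with an incomplete environment passed in as a plain dict.
example_env = {"PGHOST": "127.0.0.1", "PGPORT": "5432", "PGUSER": "postgres"}
print(missing_pg_vars(example_env))  # ['PGPASSWORD', 'PGDATABASE']
```

Calling `missing_pg_vars()` with no argument inspects the current process environment instead.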
Create the schema on the postgres database.
psql -c "create schema raw;"
psql -f schema.sql
Load the current data.
psql -c "\copy raw.boxofficemojo_revenues from $PWD/parsed.json"
Build the dbt models.
dbt run --project-dir $PWD/dbt --profiles-dir $PWD/.dbt
Parsed data is saved each day in the parsed.json file by a GitHub Actions workflow. To rebuild the project with new data, fetch the most recent commits on the main branch.
git pull
Then rebuild the raw schema, reload the parsed.json data, and rebuild and test the dbt models.
psql -f schema.sql
psql -c "\copy raw.boxofficemojo_revenues from $PWD/parsed.json"
dbt run --project-dir $PWD/dbt --profiles-dir $PWD/.dbt
dbt test --project-dir $PWD/dbt --profiles-dir $PWD/.dbt
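The rebuild steps above could be wrapped in a small script so they always run in the same order. This is only a sketch: `rebuild_commands` and `rebuild` are hypothetical helpers, not files in this repository, and the commands are copied from the steps listed above with `$PWD` replaced by an explicit project root. By default it performs a dry run that prints the commands without executing anything.

```python
import os
import shlex
import subprocess


def rebuild_commands(project_root):
    """Return the rebuild steps as argument lists, in the order listed above."""
    copy_sql = f"\\copy raw.boxofficemojo_revenues from {project_root}/parsed.json"
    dbt_args = [
        "--project-dir", f"{project_root}/dbt",
        "--profiles-dir", f"{project_root}/.dbt",
    ]
    return [
        ["git", "pull"],
        ["psql", "-f", "schema.sql"],
        ["psql", "-c", copy_sql],
        ["dbt", "run", *dbt_args],
        ["dbt", "test", *dbt_args],
    ]


def rebuild(project_root, dry_run=True):
    """Print each step; when dry_run is False, also run it and stop on failure."""
    for args in rebuild_commands(project_root):
        print(" ".join(shlex.quote(arg) for arg in args))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(args, check=True, cwd=project_root)


# Dry run against the current directory: prints the five commands only.
rebuild(os.getcwd())
```

Passing `dry_run=False` would execute the steps, which requires `git`, `psql`, and `dbt` on the `PATH` and the PG variables set as described earlier.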
To rebuild the parsed.json file from scratch, use the parser.py script.
python parser.py parse-all > parsed.json