Building an ETL pipeline that extracts data from Amazon Web Services S3, stages them in Redshift Cluster, and transforms data into a set of dimensional tables for the analytics teams. The Redshift cluster is created and maintained using the Infrastructure as Code paradigm and policies.
βββ README.md
βββ dwh.cfg
βββ run.sh
βββ analysis
βββ Exploratory Data Analysis.ipynb
βββ Performance Measure.ipynb
βββ modules
βββ create_database.py
βββ create_tables.py
βββ sql_queries.py
βββ etl.py
βββ __init__.py
There is a shell script in the root directory that will call upon the required python scripts. Just use the shell file to run the program, logging will aid you in every step of execution.
Run: sh run.sh
table | count |
---|---|
time | 16112 |
staging events | 8056 |
songplays | 320 |
staging songs | 14896 |
users | 7770 |
artists | 14896 |
songs | 14896 |
location | count |
---|---|
San Francisco-Oakland-Hayward, CA | 41 |
Portland-South Portland, ME | 31 |
Lansing-East Lansing, MI | 29 |
Waterloo-Cedar Falls, IA | 20 |
Tampa-St. Petersburg-Clearwater, FL | 18 |
Sacramento--Roseville--Arden-Arcade, CA | 17 |
Atlanta-Sandy Springs-Roswell, GA | 17 |
Chicago-Naperville-Elgin, IL-IN-WI | 15 |
Lake Havasu City-Kingman, AZ | 13 |
Janesville-Beloit, WI | 11 |
platforms | count |
---|---|
Macintosh | 139 |
Windows NT 5.1 | 53 |
X11 | 46 |
Windows NT 6.1 | 40 |
Windows NT 6.3 | 32 |
iPhone | 9 |
compatible | 1 |
artist | total_album_minutes |
---|---|
Jean Grae | 3006 |
Freddie Hubbard | 2882 |
Aphex Twin | 2657 |
Enigma | 2644 |
Madonna | 2513 |
Opeth | 2479 |
Mortiis | 2445 |
Herbie Hancock | 2421 |
Simple Minds | 2343 |
Stephan Micus | 2298 |
week | count |
---|---|
44 | 972 |
45 | 2982 |
46 | 4564 |
47 | 4074 |
48 | 3520 |
level | count |
---|---|
paid | 262 |
free | 58 |
gender | level | sum |
---|---|---|
F | paid | 10972270 |
F | free | 318801 |
M | free | 386760 |
M | paid | 993345 |
location | artist_name |
---|---|
Atlanta-Sandy Springs-Roswell, GA | Dr. Alban |
Atlanta-Sandy Springs-Roswell, GA | The Smiths |
Atlanta-Sandy Springs-Roswell, GA | Binary Star |
Atlanta-Sandy Springs-Roswell, GA | The Human League |
Atlanta-Sandy Springs-Roswell, GA | The Presets |
Atlanta-Sandy Springs-Roswell, GA | Limi-T 21 |
Atlanta-Sandy Springs-Roswell, GA | Goldfrapp |
Atlanta-Sandy Springs-Roswell, GA | Arctic Monkeys |
Atlanta-Sandy Springs-Roswell, GA | Fergie |
Atlanta-Sandy Springs-Roswell, GA | Jack Johnson |
year | average_album_length |
---|---|
0 | 251.89 |
1927 | 185.00 |
1944 | 142.00 |
1952 | 133.00 |
1954 | 234.67 |
Table | W/ Distribution Style | W/O Distribution Style
------------------------------------------------------------------------------
staging_events 15.15 1.92
staging_songs 169.99 181.38
songs 0.73 0.79
users 0.65 0.67
artists 0.67 0.68
time 0.51 0.78
songplays 6.08 1.21
Table | W/ Distribution Style | W/O Distribution Style
------------------------------------------------------------------------------
QUERY 1 261 437
QUERY 2 269 415
QUERY 3 276 556
QUERY 4 256 4680
QUERY 5 263 360
QUERY 6 264 508
QUERY 7 285 564
QUERY 8 251 413