In this group project you will use D3.js to visualize a complex dataset. Below you will find multiple datasets that you can use for the project. In addition to the datasets that we provide, we encourage you to combine them with other sources of data to create more insightful and information-rich visualizations. Some points to help you get started:
- Discuss with the group which dataset you would like to use for the project. Once you have made a decision, let the teaching assistant know of your choice. Note that at most two groups can use the same dataset. First come, first served.
- Remember that you can find the course schedule on Blackboard (slides from the first practical session and course information). This week you will explore the datasets, think about what you want to do with the data, look for additional data sources, create a plan with the group and start working on the code. Next week you will hand in your mid-term report and give a presentation to all students, so that the group receives feedback that will help you improve the final result. For the presentation next week we expect you to have some preliminary visualizations to show. The presentations should be at most 7 minutes long, with an additional 3 minutes for questions. The mid-term report and slides should be uploaded to Blackboard. Please bring a USB stick with the slides on the day of the presentation, including a PDF backup in case PowerPoint does not work properly (e.g. with transitions, images or fonts).
- The D3.js example gallery might serve as inspiration for what is possible in D3. Of course, you are free to use other sources of inspiration as well. Remember that D3.js and JavaScript are excellent for creating interactive visualizations. If you feel that you need more basic D3.js knowledge, have a look at the Dashing D3.js tutorial posted on Blackboard.
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.
Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-term study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.
Given this complexity, there are a range of organizations that collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCRUT.
We have repackaged the data from a newer compilation put together by Berkeley Earth, which is affiliated with the Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.
- Data and Description Source: Kaggle
- Data Format: Multiple CSV files (as .zip file)
- External URL: Browse, Download, or Download
The Global Terrorism Database (GTD) is an open-source database including information on terrorist attacks around the world from 1970 through 2015 (with annual updates planned for the future). The GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 150,000 cases. The database is maintained by researchers at the National Consortium for the Study of Terrorism and Responses to Terrorism (START), headquartered at the University of Maryland.
- Data and Description Source: Kaggle
- Data Format: One CSV file (as .zip file)
- External URL: Browse, Download, or Download
Ranking universities is a difficult, political, and controversial practice. There are hundreds of different national and international university ranking systems, many of which disagree with each other. This dataset contains three global university rankings from very different places.
The Times Higher Education World University Ranking is widely regarded as one of the most influential and widely observed university measures. Founded in the United Kingdom in 2010, it has been criticized for its commercialization and for undermining non-English-instructing institutions.
The Academic Ranking of World Universities, also known as the Shanghai Ranking, is an equally influential ranking. It was founded in China in 2003 and has been criticized for focusing on raw research power and for undermining humanities and quality of instruction.
The Center for World University Rankings is a less well-known listing from Saudi Arabia; it was founded in 2012.
- Data and Description Source: Kaggle
- Data Format: Multiple CSV files (as .zip file)
- External URL: Browse, Download, or Download
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.
- Data and Description Source: Kaggle
- Data Format: Multiple CSV files (as .zip file)
- External URL: Browse, Download, or Download
Prices of owner-occupied houses (excluding new constructions) were on average 7.4 percent higher in April 2017 than in April last year. This is the most substantial price increase since April 2002. Residential property prices have risen since June 2013, according to the price index of owner-occupied houses, a joint publication by Statistics Netherlands (CBS) and the Land Registry Office.
This dataset covers the change in house prices by region in the Netherlands over the last seven years, from Q1 2010 to Q1 2017. The price changes are reported quarterly and annually.
- Data and Description Source: Centraal Bureau voor Statistiek (CBS)
- Data Format: One CSV file
- External URL: Browse, Download
The Basisbestand Gebieden Amsterdam (BBGA) contains key figures at the level of the most commonly used area divisions in Amsterdam: city districts (stadsdelen), the 22 areas of area-based working (gebiedsgericht werken), districts (wijken), neighbourhoods (buurten), shopping and working areas, and two alternative neighbourhood divisions of the city districts.
The BBGA contains more than 500 variables, grouped into the following themes: population, age, housing, public space, traffic, liveability, safety, economic activity, sport and recreation, wellbeing and care, education, income, and participation.
- Data and Description Source: Amsterdam Open Data
- Data Format: Two CSV files (as .zip file)
- External URL: Browse, Download
The dataset Energielabels in Amsterdam contains raw data on the energy labels issued for buildings in Amsterdam. For each energy label, the available information is: building information (the postcode, house number, any house number suffix, and a free field that can be used for additional building identification), the dwelling type or main usage function, the certificate number, the date of inspection and registration by the advisor, the label value (energy index) and label class, the source of the inspection, the calculated building-related energy consumption in MJ, and, where available, m3 of gas, kWh of electricity, MJ of heat, and the floor area of the building in m2.
- Data and Description Source: Amsterdam Open Data
- Data Format: One CSV file
- External URL: Browse, Download, or Download
If you feel that you need other resources, you can use the websites from which the data originates. For example, if you think that GDP depends on population, you may need data on the population of each country. Additionally, you can use other sources, such as Google Trends, to get regional data on what people search for in the Netherlands and Europe. We encourage students to explore and combine other data, as creativity and information-rich visualizations will positively contribute towards your final grade.
You are free to take different quantities from these datasets in order to investigate their correlation. For example, is the GDP growth of European countries related to unemployment?
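As a minimal sketch of how such a check could look (using the Pandas library introduced below, and assuming two hypothetical CSV files with country, year and the respective quantity as columns; adjust the file and column names to whatever data you actually download):
import pandas as pd

# Hypothetical input files and column names, purely for illustration
gdp = pd.read_csv("./data/gdp_growth.csv")              # columns: country, year, gdp_growth
unemployment = pd.read_csv("./data/unemployment.csv")   # columns: country, year, unemployment_rate

# Combine the two tables on their shared country/year keys
merged = gdp.merge(unemployment, on=["country", "year"])

# A quick first check: the Pearson correlation between the two quantities
print(merged["gdp_growth"].corr(merged["unemployment_rate"]))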
While D3.js is great for building visually appealing and interactive figures, sometimes it can be hard to read the datasets in a convenient way. If you are more familiar with Python, you could inspect, prepare and pre-process the data using NumPy, Pandas or SciPy. The code snippet below should help to get you started reading and processing a TSV (tab-separated values) file.
Read a .tsv file in Python
import csv

with open("source_file.tsv") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    for line in tsvreader:
        print(line)
        # Do some preprocessing here, such as:
        # - handle missing data
        # - correct the data
        # - etc.
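As an illustration of the kind of preprocessing meant in the comment above, the sketch below (assuming, purely as an example, that missing values appear as empty strings) drops incomplete rows while reading:
import csv

cleaned_rows = []
with open("source_file.tsv") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    header = next(tsvreader)  # skip the header line, if the file has one
    for line in tsvreader:
        # Treat empty fields as missing data and skip those rows entirely;
        # you could also substitute a sensible default value instead.
        if any(field.strip() == "" for field in line):
            continue
        cleaned_rows.append(line)

print(len(cleaned_rows), "complete rows kept")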
For reading big data files in Python you could use the Pandas library. It makes things a lot easier and faster. Also, this library is very helpful in selecting rows and columns of your data based on conditions (see the filtering sketch after the snippet below).
Read a large .csv file in Python using Pandas
import pandas as pd

# relative path of the data file
data_path = './data/flights.csv'

# read the .csv file with Pandas
data = pd.read_csv(data_path)

# for debugging purposes, only consider the first 100 rows
# if your data is really large (100s of thousands of rows)
rows = data.head(100)

# index all the rows, and only the 8th column (zero-based index 7);
# .iloc is the positional indexer (the older .ix is deprecated)
column_seven = rows.iloc[:, 7]

# now, get the values in that column as a NumPy array
values = column_seven.values
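The condition-based selection of rows and columns mentioned above could look like the sketch below; the column names (MONTH, AIRLINE, ORIGIN_AIRPORT, ARRIVAL_DELAY) are assumptions about the flights data, so inspect data.columns first:
import pandas as pd

data = pd.read_csv('./data/flights.csv')

# Boolean indexing: keep only the rows that satisfy a condition
# (the column names are assumptions; check data.columns first)
delayed = data[data['ARRIVAL_DELAY'] > 15]

# Combine multiple conditions and select a subset of columns at the same time
subset = data.loc[(data['MONTH'] == 1) & (data['ARRIVAL_DELAY'] > 15),
                  ['AIRLINE', 'ORIGIN_AIRPORT', 'ARRIVAL_DELAY']]
print(subset.head())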
For reading Excel spreadsheet files (.xls) in Python you could use the xlrd package, which can be downloaded from pypi.python.org. After reading and processing the data, the json package can be used for exporting the data to a JSON file, which is suitable for reading in D3.js. Make sure to look at the original dataset table to be sure that you are loading the data correctly.
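A minimal sketch of reading such a spreadsheet with xlrd (assuming a hypothetical file source_file.xls with the data on the first sheet) could look like this:
import xlrd

# open the workbook and take its first sheet
workbook = xlrd.open_workbook("source_file.xls")
sheet = workbook.sheet_by_index(0)

# collect all rows as lists of cell values
data = [sheet.row_values(row_idx) for row_idx in range(sheet.nrows)]
print(data[0])  # the header row, if the sheet has one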
Write to a .json file in Python
import json

with open("target_file.json", "w") as outfile:
    json.dump(data, outfile)
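Note that json.dump expects plain Python structures (lists, dicts, numbers, strings). If your data lives in a Pandas DataFrame, convert it first, for example to a list of records, which D3.js can read as an array of objects; the sketch below assumes the same flights file as above:
import json
import pandas as pd

data = pd.read_csv('./data/flights.csv')

# a DataFrame is not directly JSON-serializable; convert it to a list of
# dicts (one per row) first, which d3.json() reads as an array of objects
records = data.head(100).to_dict(orient="records")
with open("target_file.json", "w") as outfile:
    json.dump(records, outfile)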