-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdata-viz-notes_healy_2020.rmd
440 lines (275 loc) · 15.7 KB
/
data-viz-notes_healy_2020.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
---
title: "Data Visualization (Healy, 2020) Notes"
author: "Austin Moss"
date: "2/9/2021"
output:
pdf_document: default
html_document:
keep_md: yes
---
## Data Visualization Notes
This is a starter RMarkdown template to accompany *Data Visualization* (Princeton University Press, 2019). You can use it to take notes, write your code, and produce a good-looking, reproducible document that records the work you have done. At the very top of the file is a section of *metadata*, or information about what the file is and what it does. The metadata is delimited by three dashes at the start and another three at the end. You should change the title, author, and date to the values that suit you. Keep the `output` line as it is for now, however. Each line in the metadata has a structure. First the *key* ("title", "author", etc), then a colon, and then the *value* associated with the key.
## This is an RMarkdown File
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. A *code chunk* is a specially delimited section of the file. You can add one by moving the cursor to a blank line choosing Code > Insert Chunk from the RStudio menu. When you do, an empty chunk will appear:
```{r}
```
Code chunks are delimited by three backticks (found to the left of the 1 key on US and UK keyboards) at the start and end. The opening backticks also have a pair of braces and the letter `r`, to indicate what language the chunk is written in. You write your code inside the code chunks. Write your notes and other material around them, as here.
## Before you Begin
To install the tidyverse, make sure you have an Internet connection. Then *manually* run the code in the chunk below. If you knit the document if will be skipped. We do this because you only need to install these packages once, not every time you run this file. Either knit the chunk using the little green "play" arrow to the right of the chunk area, or copy and paste the text into the console window.
```{r install, eval = FALSE}
## This code will not be evaluated automatically.
## (Notice the eval = FALSE declaration in the options section of the
## code chunk)
my_packages <- c("tidyverse", "broom", "coefplot", "cowplot",
"gapminder", "GGally", "ggrepel", "ggridges", "gridExtra",
"here", "interplot", "margins", "maps", "mapproj",
"mapdata", "MASS", "quantreg", "rlang", "scales",
"survey", "srvyr", "viridis", "viridisLite", "devtools")
install.packages(my_packages, repos = "http://cran.rstudio.com")
```
## Set Up Your Project and Load Libraries
To begin we must load some libraries we will be using. If we do not load them, R will not be able to find the functions contained in these libraries. The tidyverse includes ggplot and other tools. We also load the socviz and gapminder libraries.
```{r setup, include=FALSE}
## By default, show code for all chunks in the knitted document,
## as well as the output. To override for a particular chunk
## use echo = FALSE in its options.
knitr::opts_chunk$set(echo = TRUE)
## Set the default size of figures
knitr::opts_chunk$set(fig.width=8, fig.height=5)
## Load the libraries we will be using
library(gapminder)
library(here)
library(socviz)
library(tidyverse)
```
Notice that here, the braces at the start of the code chunk have some additional options set in them. There is the language, `r`, as before. This is required. Then there is the word `setup`, which is a label for your code chunk. Labels are useful to briefly say what the chunk does. Label names must be unique (no two chunks in the same document can have the same label) and cannot contain spaces. Then, after the comma, an option is set: `include=FALSE`. This tells R to run this code but not to include the output in the final document.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r look}
gapminder
```
The remainder of this document contains the chapter headings for the book, and an empty code chunk in each section to get you started. Try knitting this document now by clicking the "Knit" button in the RStudio toolbar, or choosing File > Knit Document from the RStudio menu.
## Look at Data
```{r}
```
## Get Started
Everything in R is an *object* that has a *name*, including variables (like `x` or `y`), datasets (like `mydata`), and functions (like `mean()`). The code directly below creates an object named `mynumbers` that is a vector of numbers.
```{r}
mynumbers <- c(1, 2, 3, 1, 3, 5, 25)
yournumbers <- c(5, 31, 71, 1, 3, 21, 6)
```
STOPPED AT PAGE 41 ON 2021-01-28
BEGINNING AT PAGE 41 ON 2021-02-09
Use the summary function to summarize the vector of numbers named `mynumbers'.
```{r}
mysummary <- summary(mynumbers)
mysummary
```
If you are not sure what an object is, ask for its class
```{r}
class(mynumbers)
class(mysummary)
class(summary)
```
R can store datasets in many different ways, but the most popular (and most often my use case) is the data frame. A data frame has columns representing variables and rows representing observations.
The $ operator allows us to reference a specific variable from a dataframe.
```{r}
titanic
class(titanic)
titanic$percent
```
The tidyverse libraries make extensive use of *tibbles*, which are very similar to dataframes.(Although the data.frame output above looks the same as the tibble output... Maybe an update to R? Anyways, should probably use Tibbles.)
```{r}
titanictb <- as_tibble(titanic)
titanictb
```
#Input data
The `read_csv()' function in the `readr' package is used to input CSV files. Other packages that can input STATA, SAS, and other software datasets directly can be found in the `haven' package.
STOPPED AT PAGE 50 ON 2021-02-09
```{r}
url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"
organs <- read_csv(file = url)
```
## Make a Plot
View the Gapminder data to see what we are working with.
```{r}
gapminder
```
Create a basic scatter plot of the Gapminder data.
```{r}
p <- ggplot(data=gapminder,
mapping=aes(x=gdpPercap, y=lifeExp))
p + geom_point()
```
###Create another scatterplot using Gapminder data. Will elaborate on this in more detail.
```{r}
#First, tell ggplot the data we are using.
p <- ggplot(data=gapminder)
#Second, tell ggplot which variables in data should be represented by which visual elements in the plot.
p <- ggplot(data=gapminder,
mapping=aes(x=gdpPercap,
y=lifeExp))
```
So far, we have given the ggplot function two arguments: `data` and `mapping`.
The `mapping` argument is itself a FUNCTION (i.e., the `mapping` function is an argument to the `ggplot` function). The `mapping = aes(...)` argument
links variables to things you will see on the plot. `x` and `y` values are obvious. Other aesthetic mappings include color, shape, size, line type, etc. **IMPORTANT** A mapping does not say what particular color, shape, line type, etc. will be on the plot. It simply says which *variables* in the data will be *represented* by these aesthetics (e.g., % of fake news is represented by color).
At this point, we have created the `p` object using the `ggplot()` function and given it some basic information (i.e., `data` and `mapping`). `ggplot()` has also created `p` using a lot of other information as defaults. To see how much default information ask for `str(p)`.
Before a plot can be made, we must tell `ggplot()` what type of plot to draw. This is called adding a *layer* to the plot. This simply means picking a `geom_` function. The `geom_point()` function creates a scatterplot.
```{r}
#Create a scatterplot
p + geom_point()
```
### The basic recipe to create a plot with ggplot is:
1. Tell the ggplot() function what our data is.
2. Tell ggplot() what relationships we want to see. For convenience we will put the results of the first two steps in an object called p.
3. Tell ggplot how we want to see the relationships in our data.
4. Layer on geoms as needed, by adding them to the p object one at a time.
5. Use some additional functions to adjust scales, labels, tick marks, titles. We’ll learn more about some of these functions shortly.
### Let's try adding some additional layers to our scatterplot
```{r}
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y=lifeExp))
#Create a smoothed line with standard error shading
p + geom_smooth()
#Smoothed line PLUS scatterplot
p + geom_point() + geom_smooth()
#Smoothed line fit using a linear model PLUS scatterplot
p + geom_point() + geom_smooth(method = "lm")
```
Notice that `geom_point()` and `geom_smooth` inherited the information from object `p` as default. We can give geoms separate instructions that they will follow instead.
In our Gapminder data, the data is quite bunched to the left side. The plot may look better if we transformed the x-axis from a linear scale to a log scale. Additionally, test out what happens if we switch the order of the `geom_` functions... It seems that the plot is created in order of the functions specified, so functions later in the list are "on top" of earlier functions.
```{r}
p + geom_point() +
geom_smooth(method = "gam") +
scale_x_log10()
p + geom_smooth(method = "gam") +
geom_point() +
scale_x_log10()
```
### Let's clean up our scatterplot
We likely want to polish our plot with nicer axis labels, a title, and different x-axis notation (from scientific to dollars). Let's only worry about the x-axis notation for now.
```{r}
p + geom_point() +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar)
```
Notice that the scale of the x-axis is changed by passing an argument to the `scale_` functions.The `scales` library contains useful pre-made formatting functions. If a library is not already loaded then a specific function can be grabbed from that library using the syntax `thelibrary::thefunction`. Otherwise, load the library using `library(scales)`.
### Mapping vs setting aesthetics
Let's map the variable `continent` to be represented by color.
```{r}
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent))
#Create a plot that represents each continent in our data with a different color
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
```
If we want to *set* a property (i.e., change the scatterpoints to be a different color -- say, purple), we do this within the `geom_` function.
```{r}
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
#Create our original scatterplot and smoothed line using the color purple
p + geom_point(color = "purple") +
geom_smooth(method = "loess") +
scale_x_log10()
```
Show some additional adjustments that can be made to plots by giving the `geom_` functions various arguments. *alpha* changes the transparency of the objects. *se* turns off the standard errors shading. *size* changes the size of the line.
```{r}
p + geom_point(alpha = 0.3) +
geom_smooth(color = "orange",
se = FALSE,
size = 8,
method = "lm") +
scale_x_log10()
```
### Let's add some labels
Let's continue to polish our scatterplot by adding appropriate labels.
```{r}
p + geom_point(alpha = 0.3) +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP Per Capita",
y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder.")
```
Let's go back to mapping the continent variable to color.
```{r}
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent))
#Create a plot that represents each continent in our data with a different color
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
```
We should shade the standard error ribbon for each line the same color as the line. The color of the standard error ribbon is controlled by the `fill` aesthetic. **whereas** the `color` aesthetic affects the appearance of lines and points, `fill` affects appearance of the filled areas of bars, polygons, and, in this case, the interior of the smoother's standard error ribbon.
```{r}
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent,
fill = continent))
#Create a plot that represents each continent in our data with a different color
p + geom_point() +
geom_smooth(method = "loess") +
scale_x_log10()
```
Let's say that we think the above plot is too busy with five fitted lines. Instead, we want only one fitted line but to still shade each data point according to its continent.
Remember, by default, geoms inherit their mappings from the `ggplot()` function. However, we can change this by mapping the aesthetics we want only to the `geom_` functions that we want them to apply to. We use the same `mapping = aes(...)` syntax that is in the `ggplot()` function, but now we write it directly in the `geom_` function.
```{r}
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
#Create a plot--directly map continent to color within the geom_point() function, so it only applies to the scatterpoints
p + geom_point(mapping = aes(color = continent)) +
geom_smooth(method = "loess") +
scale_x_log10()
```
We can also map continuous variables to color. It will present the data as a gradient of the color and provide a discretized scale for interpretation. Additionally, notice that we can transform variables directly within the `aes(...)` statement.
```{r}
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p + geom_point(mapping = aes(color = log(pop))) +
scale_x_log10()
```
### Customizing plot export options
To change the size of your plots, you can either change the default settings for the entire `.Rmd` document by setting an option in your first code chunk. The syntax tells R to make 8X5 figures: `knitr::opts_chunk$set(fig.width=8, fig.height=5)`
More practically, we will want to customize the size of specific plots. This can be done by adding options to the `R` code chunk as follows:
```{r, fig.width = 12, fig.height = 9}
p + geom_point(mapping = aes(color = log(pop))) +
scale_x_log10()
```
To save a plot, the most convenient method is the `ggsave()` function. `ggsave()` will save the most recently displayed figure. Syntax: `ggsave(filename = "my_figure.png")`. Formats other than `.png` are available as well, most notably `.pdf`.
Instead of outputting our plots when we call our *layers*, we can assign a plot to an object just like any other thing in R. We can then save this plot at any point by giving the `plot` argument to the `ggsave()` function. For example:
```{r}
p1 <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp))
p1_out <- p1 + geom_point(mapping = aes(color = log(pop))) +
scale_x_log10()
ggsave("my_figure.pdf", plot = p1_out)
```
STOPPED AT PAGE 72 ON 2021-03-17
## Show the Right Numbers
```{r}
```
## Graph Tables, Make Labels, Add Notes
```{r}
```
## Work with Models
```{r}
```
## Draw Maps
```{r}
```
## Refine your Plots
```{r}
```