-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdataviz.Rmd
executable file
·209 lines (181 loc) · 10.2 KB
/
dataviz.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
---
title: "Data visualization / Visualisation de données"
output:
html_document:
code_folding: show
#toc: TRUE
---
<br>
<br>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide", message = FALSE)
```
## Tidy Tuesdays {.tabset}
Below, you will find some of the graphs (and accompanying code) I made as part of the [R4DS](https://twitter.com/R4DScommunity) weekly visualization challenges. All of the entries I made for #TidyTuesday can be found on my [Github](https://github.com/florencevdubois/MyTidyTuesdays) page.
### 25 February 2020
On this week's TidyTuesday challenge, we worked with data on vaccination rates across the USA. I decided to make a graph of vaccination rates by county. See below for the final product. Start by loading the necessary packages and the measles data.
```{r}
# packages
library(tidyverse)
library(pacman)
library(tidyverse)
# data
measles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-25/measles.csv')
```
Let's explore the data.
```{r}
summary(measles$mmr) # distribution of vaccination rates
summary(measles$overall) # distribution of vaccination rates
weird <- measles %>%
filter(overall == -1 | mmr == -1)# who are the "-1" ? Apparently they're missing data.
table(weird$state) # some have missing data for "mmr", some have missing data for "overall" -- we will use this later
length(unique(measles$county)) # how many unique counties?
table(measles$year, useNA = "always") # which years?
```
Then, we need to do some data clean-up.
```{r}
measles_plot_viz <- measles %>%
distinct(state, name, county, overall, mmr, lat, lng, enroll) %>% # remove duplicates
mutate(drop = ifelse(mmr >= 0 & overall >= 0, "drop", "keep")) %>%
filter(lng < 0) %>% # remove schools outside of the continent
gather(type, value, overall:mmr) %>%
filter(value >= 0) %>%
filter(!(drop == "drop" & type == "overall"))
```
Finally, let's draw the map! I will be using the `maps` package to do so. I had help from Kieran Healy's [book](https://socviz.co/maps.html#maps) during this step, as well as Timo Grossenbacher's [blog](https://timogrossenbacher.ch/2016/12/beautiful-thematic-maps-with-ggplot2-only/).
```{r}
# load packages
library(maps)
library(mapproj)
# 1. load the county data
us_counties <- map_data("county") %>%
mutate(state_county = str_c(region, subregion, sep = "_"))
head(us_counties)
# 2. check the distribution of vacc. rate
summary(measles_plot_viz$value)
# 3. draw a plain county map to see if it's working
p <- ggplot(mapping = aes(x = long, y = lat, group = group), data = us_counties)
p1 <- p + geom_polygon(color = "white", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
guides(fill = FALSE) +
theme_minimal()
p1
```
Now that we know the map is working at the vaccination data makes sense, we can plot vaccination rates across the U.S. But first, for the map's caption, we need to check which type of vaccination rate is plotted in which state:
```{r results = "markup"}
table(measles_plot_viz$type, measles_plot_viz$state)
```
Now we can plot:
```{r results = "markup", fig.height = 15, fig.width = 15}
# 4. draw map with vacc. data
p <- ggplot()
p1 <- p +
geom_polygon(mapping = aes(x = long, y = lat, group = group),
data = us_counties,
fill = "#f5f5f2",
color = "black",
size = 0.1) + # add county lines
geom_point(mapping = aes(x = lng, y = lat, color = value),
data = subset(measles_plot_viz, value > 50),
size = 0.6) + # add school dots
coord_map(projection = "albers", lat0 = 39, lat1 = 45) + # make the map more "curvy"
theme_void() + # remove grid and axes
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5,
vjust = 0.5,
size = 15),
plot.background = element_rect(fill = "#f5f5f2",
color = NA),
plot.caption = element_text(hjust = 0.5,
size = 10),
plot.margin = margin(t = 1, r = 1, b = 1, l = 1,
unit = "cm"),
legend.margin = margin(t = 2, r = 2, b = 5, l = 2,
unit = "mm")) + # modify plot title/caption/background/legend
scale_color_viridis_c(option = "plasma",
direction = -1,
name = "Vaccination rate",
guide = guide_legend(direction = "horizontal",
title.position = "top",
title.hjust = 0.5)) + # change legend title
labs(title = "Schools' Vaccination Rate in 31 U.S. States",
caption = "Measles, Mumps, and Rubella (MMR) vaccination\nrate shown in FL, ID, IA, MI, NJ, NC, OK, RI, TN, WI.\nElsewhere, the overall vaccination rate is shown.") # deifne title and subtitle
p1
# save plot using ggsave(filename = "name.png", plot = last_plot(), width = 15, height = 15)
```
### 9 June 2020
For this week,s TidyTuesday, I create a graph that reports the names and achievements of African-Americans in the Arts and Entertainment. I chose a bar chart, where each bar corresponds to a decade of African-American accomplishments. Start by loading the necessary packages and the data on African-American firsts.
```{r}
library(tidyverse)
library(extrafont)
firsts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-09/firsts.csv')
# Let's explore some variables
table(firsts$category)
```
Now, we will recode the data. We will focus on the `person` variable, which contains the information on who the achiever is, and what they've accomplished. Once we've cleaned up the `person` variable, we will create a categorical variable for the decade of achievement, and collapse everyone (and their accomplishment) from the same decade together into one observation. The last thing we will do is insert a new line every 20 characters. If we do not insert the new line, the text will run too wide in the graph.
```{r}
dat <- firsts %>%
mutate(person_sh = str_remove_all(person, "\\(.*\\)"), # remove everything in parenthesis
person_sh = str_remove_all(person, "\\(.*"), # remove everything in parenthesis
person_sh = str_remove_all(person_sh, "\\[.*\\]"), # remove brackets and their content
person_sh = str_trim(person_sh, "right"), # remove white space at the end of names
person_complete = str_c(person_sh, ", ", accomplishment), # create complete name + accomplishment title
word_count = str_count(person_complete), # count number of words in complete names
count = 1,
person_complete = str_replace_all(person_complete, "African-American", "Af-Am"), # shorten for visualization purposes
twenty_yr = ifelse(year <= 1780, '1760s-70s',
ifelse(year >= 1781 & year <= 1800, "1780s-90s",
ifelse(year >= 1801 & year <= 1820, "1800s-10s",
ifelse(year >= 1821 & year <= 1840, "1820s-30s",
ifelse(year >= 1841 & year <= 1860, "1840s-50s",
ifelse(year >= 1861 & year <= 1880, "1860s-70s",
ifelse(year >= 1881 & year <= 1900, "1880s-90s",
ifelse(year >= 1901 & year <= 1920, "1900s-10s",
ifelse(year >= 1921 & year <= 1940, "1920s-30s",
ifelse(year >= 1941 & year <= 1960, "1940s-50s",
ifelse(year >= 1961 & year <= 1980, "1960s-70s",
ifelse(year >= 1981 & year <= 2000, "1980s-90s", "2000s-10s"))))))))))))) %>% # create categorical variable for twenty years
group_by(twenty_yr, category) %>%
mutate(persons_20 = paste0(person_complete, collapse = "; ")) %>% # collapse strings for each twenty years and category
ungroup() %>%
group_by(category) %>%
mutate(count_cat = cumsum(count)) %>%
ungroup() %>%
distinct(twenty_yr, count_cat, category, persons_20) %>%
group_by(twenty_yr, category) %>%
arrange(count_cat) %>%
filter(row_number() == n())
dat$persons_20 <- gsub("(.{15,}?)\\s", "\\1\n", dat$persons_20) # insert new line every 15 characters
```
Let's make the graph! I was inspired by G. Karamanis's work to do this graph. You can check out his work [here](https://github.com/gkaramanis/tidytuesday/blob/master/2020-week23/marbles-race.R). I decided to plot accomplishments in the Arts and Entertainment only, but you could do the same with science, education, sports, etc.
```{r results = "markup", fig.height = 30, fig.width = 20}
# plot, Arts only
viz <- dat %>% filter(category == "Arts & Entertainment")
p <- ggplot(viz, aes(x = twenty_yr, y = count_cat))
p1 <- p +
scale_y_continuous(name = "",
limits = c(-700,700)) +
scale_x_discrete(name = "",
expand = c(0.2, 0)) +
annotate("text", x = viz$twenty_yr, # add names
y = viz$count_cat,
label = viz$persons_20,
size = 2.5,
color = "white",
family = "Arial Narrow") +
annotate("text", x = "1760s-70s", # add title
y = 400,
label = "First\nAfrican Americans\nin Arts & Entertainment",
size = 15,
hjust = 0,
color = "white",
family = "Andale Mono") +
theme_void() +
coord_cartesian(clip = "off") +
theme(plot.background = element_rect(fill = 'grey11', colour = 'grey11'),
plot.margin = margin(200, 40, 100, 30), # play with margins
axis.text.x = element_text(colour = "gray69", # modify x axis
size = 15,
family = "Andale Mono"))
p1
```