# Case Study 3: COVID-19 Mortality in Cook County (March 2020 - March 2022) {#cook-county-covid}
*By: Sudipta Saha, Christian Testa*
## Introduction
In this case study, the outcome of interest is COVID-19 mortality in Cook County. The impact of COVID-19 has been marked by inequities across racialized/ethnic groups and socioeconomic position. Here we investigate disparities in COVID-19 mortality, focusing on comparing risks across the Black (Hispanic or non-Hispanic), White non-Hispanic, and Hispanic populations.
The Cook County Medical Examiner has made a dataset with COVID-19 related deaths publicly available with the following intent:
> Cook County Government has created the Medical Examiner COVID-19 Dashboard to provide direct, transparent access to critical information about COVID-19 deaths in the County for public health agencies, medical professionals, first responders, journalists, policymakers and residents. The Medical Examiner’s Office encourages visitors to use this data to explore trends, identify areas of concern and take appropriate action. The data can be utilized to identify communities that are most severely impacted by the virus and can inform proactive public policy.
Read more: https://datacatalog.cookcountyil.gov/stories/s/ttk4-trbu
Data Source: https://datacatalog.cookcountyil.gov/Public-Safety/Medical-Examiner-Case-Archive-COVID-19-Related-Dea/3trz-enys
The death records in this dataset include individual-level fields for "Race" and "Latino", from which we obtain social membership in racialized/ethnic groups. The dataset also includes residential ZIP Codes, which can be linked to US Census ZIP Code Tabulation Areas (ZCTAs) and thereby to area-based social measures (ABSMs) from the American Community Survey (ACS) 2015-19. We will explore how membership in racialized/ethnic groups and ABSMs are associated with COVID-19 mortality.
## Motivation, Research Questions, and Learning Objectives
This case study focuses on the following research questions:
- What are the differences in COVID-19 mortality rates by racialized group accounting for age?
- What are the gradients in overall COVID-19 mortality in relation to ABSMs measuring: racialized group composition, the ICE for racialized economic segregation, percent of population living below the poverty line, percent of people living in crowded housing, and median income?
- What are the gradients in COVID-19 mortality by racialized group in relation to the ABSMs?
- What is the spatial variation in COVID-19 mortality by racialized group?
In this case example, we'll show how you can use the `tidycensus` package to download relevant area-based population estimates and sociodemographic measures from the ACS.
:::: {.infobox .dataWrangling}
We'll show you how we cleaned the data and merged in data from the ACS in the **Cleaning the Data** section. However, we're also providing you a cleaned dataset that should allow you to pick up and follow along from the **Visualizing Your Data** section onwards.
::::
## Approach
- To look at overall differences in COVID-19 mortality by racialized group, we will model aggregate mortality rates across Cook County by racialized group.
- To look at gradients in relation to ABSMs, we will model overall ZCTA-level mortality rates for each ABSM of interest.
- To look at gradients in relation to ABSMs by racialized group, we will model ZCTA-level mortality separately for each racialized group and each ABSM.
- To explore spatial variation we will use spatial models.
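The spatial models in the last step can be sketched ahead of time. Below is a minimal, illustrative specification of a spatial Poisson model with a BYM2 random effect using R-INLA; the object names (`zcta_sf`, `zcta_id`, `adjacency_graph`, `df`) are placeholders for this sketch, not objects created by the analysis code in this chapter:

```r
# A minimal sketch of a spatial Poisson model with a BYM2 random effect.
# `zcta_sf` (an sf object of ZCTA polygons), `zcta_id` (an integer area index),
# and `df` (the analysis data) are illustrative placeholders.
library(INLA)
library(spdep)

# build a binary adjacency matrix from the ZCTA polygons
nb <- spdep::poly2nb(zcta_sf)
adjacency_graph <- spdep::nb2mat(nb, style = "B", zero.policy = TRUE)

# Poisson model for deaths with a person-time exposure and a BYM2 spatial effect
fit <- INLA::inla(
  deaths ~ 1 + f(zcta_id, model = "bym2", graph = adjacency_graph),
  family = "poisson",
  E = person_time, # exposure (person-time) for the Poisson likelihood
  data = df
)
summary(fit)
```

The BYM2 model decomposes area-level variation into a spatially structured component (neighboring ZCTAs are similar) and an unstructured component, which is why it needs the adjacency graph.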
## Dependencies
These are the packages that you will need to run the code in this case example. Once you copy and run the code below to load the dependencies, you can jump ahead to the **Visualizing Your Data** section if you want.
```{r load packages}
#| eval = FALSE,
#| warning = FALSE,
#| message = FALSE
# mission critical packages
library(tidycensus)
library(tidyverse)
library(sf)
library(tigris)
library(mapview)
library(INLA)
library(spdep)
# nice to have packages
library(magrittr) # for the %<>% and %$% pipe
library(janitor)
library(here) # for file paths used when including saved images
library(purrr)
library(Hmisc)
library(epitools)
library(leaflet)
library(scales)
```
## Cleaning the Data
### Denominator Data
Here we will use `tidycensus` to download relevant variables from the 2015-19 ACS dataset. The complete list of variables can be viewed online here: https://api.census.gov/data/2019/acs/acs5/variables.html. You will need a U.S. Census API key to download the data, which you can obtain from here: https://api.census.gov/data/key_signup.html. The key is for personal use only, so it has been redacted here.
We will be looking at a range of different ABSMs, and we also require age-stratified populations by racialized group. You can browse the variables to identify which tables contain the variables you need.
:::: {.infobox .dataWrangling}
It can be helpful to identify patterns in the variable names to efficiently query the Census API. In the code below you will see an approach that does not require us to type out all the variable names. Can you think of more efficient ways to query the API?
::::
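One answer to the infobox's question: `tidycensus::get_acs()` can fetch every variable in a table at once via its `table` argument, which avoids constructing variable names by hand. A minimal sketch under the same geography and year used below (one call per table, since `table` accepts a single table ID; the table IDs listed are the ones this analysis draws on):

```r
# Fetch whole ACS tables instead of enumerating variable names.
# Requires a Census API key to be set with tidycensus::census_api_key().
library(tidycensus)
library(dplyr)
library(purrr)

tables <- c("B01001", "B01001B", "B01001H", "B01001I")

popsizes_all <- purrr::map(tables, function(tbl) {
  tidycensus::get_acs(
    geography = "zip code tabulation area",
    state = "IL",
    year = 2019,
    table = tbl, # fetches every variable in the table
    output = "tidy",
    cache_table = TRUE
  )
}) %>%
  dplyr::bind_rows()
```

The trade-off is that whole tables include variables you may not need, so you would still filter down to the sex/age strata of interest afterwards.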
```{r cleaning the data}
#| eval = FALSE,
#| echo = TRUE
# census_api_key("Your API KEY goes here")
# get population sizes from ACS -------------------------------------------
#Get a list of the variable names
acs_vars <- tidycensus::load_variables(2019, dataset = 'acs5')
# the racialized group and age-group stratified population estimates from ACS are
# in the B01001 table.
#
# B01001_001 through _049 are the sex/gender overall population size estimates,
# and B01001A through B01001I are the "race/ethnicity" specific tables. For the
# "race/ethnicity" specific tables the age-groups suffixes range from _001 to _031
#
# in the following three steps, we programmatically construct the population
# size variables that we want to retrieve from the ACS since otherwise there are
# a lot of them to type out.
race_chars <-
c(
white = 'H', # In the 2015-19 ACS data, a suffix of H indicates non-Hispanic White, but be careful because these suffixes may change from year to year!
black = 'B',
hispanic_or_latino = 'I'
)
# for each of H, B, and I, construct B01001*_001 through _031. If you look at the variable list at the link above, you will see that the variable names constructed below correspond to total and age-sex specific population estimates for each racialized/ethnic group.
sex_race_age_vars <-
paste0(rep(paste0('B01001', race_chars, '_0'), each = 31),
stringr::str_pad(
1:31,
width = 2,
side = 'left',
pad = '0'
))
# this adds on the sex/gender overall population estimates (e.g. B01001_001 is the total population and B01001_002 is the total male population)
sex_race_age_vars %<>% c(.,
paste0(
'B01001_0',
stringr::str_pad(
1:49,
width = 2,
side = 'left',
pad = '0'
)
))
# The API gives us data whose labels are all in one string, with information separated by '!!'. We split the label to make it more useful.
acs_vars %<>%
tidyr::separate(label,
into = c('estimate', 'total', 'gender', 'age', 'subgroup'),
sep = '!!')
# clean label values (remove unnecessary colon characters and leading/trailing spaces)
acs_vars %<>% mutate_at(.vars = vars(estimate, total, gender, age, subgroup),
~ gsub(":", "", .) %>% stringr::str_trim())
# select only what we need
acs_vars %<>% dplyr::select(name, total, gender, age, concept)
# Now that we have the variable names, we use them to get sex, racialized group (incl. Hispanic or Latino), and age stratified population
# estimates
popsizes <-
tidycensus::get_acs(
geography = 'zip code tabulation area',
state = 'IL',
year = 2019,
geometry = TRUE, # This obtains the geometry / shapes of each zip-code tabulation area (ZCTA)
variables = sex_race_age_vars,
output = 'tidy',
cache_table = T #this will enable quicker loading if we load the data again in the future
)
# join in our variable names
popsizes %<>% left_join(acs_vars, by = c('variable' = 'name'))
popsizes %<>% janitor::clean_names() # clean column names
popsizes %<>% dplyr::select(-total) # remove total column that only contains 'total'
# Although we set geometry = TRUE so that we got the geometry data, we don't need it while cleaning the data, and it makes things slow, so we set the geometry aside for later use
zip_geometry <- popsizes %>% dplyr::select(geoid) %>% unique()
# ggplot(zip_geometry) + geom_sf() # zip codes of illinois
# remove geometry before data reshaping
popsizes %<>% sf::st_drop_geometry()
if ('sf' %in% class(popsizes)) {
stop("popsizes data has geometry, this can cause errors in the group_by and summarize steps using dplyr.")
}
# aggregate by sex/gender
popsizes %<>%
group_by(geoid, name, age, concept) %>%
dplyr::summarize(
estimate = sum(estimate),
# here we're using the tidycensus::moe_sum function to create margin of error estimates for
# sums of population size estimates
moe = tidycensus::moe_sum(moe = moe, estimate = estimate)
)
```
Since the Cook County Medical Examiner Case Archive data includes records on deaths where the decedents' residential ZCTAs are inside as well as outside Cook County, Illinois, we want to restrict our dataset to deaths where the county of residence was Cook County. One complication is that ZCTAs are not neatly nested within county borders: part of a ZCTA can often lie outside the county.
:::: {.infobox .dataWrangling}
How would you work with this issue of ZCTAs crossing county borders? Here we calculated the percentage of an area in a ZCTA that also falls within the Cook County borders, and retained those with 90% overlap (an arbitrary threshold). Would you use a different threshold? We also had to deal with issues of parts of Lake Michigan (bordering Cook County's east coast) being included in some shapefiles but not others.
::::
:::: {.infobox .interpretation}
Note that the unit of geography for analysis is the US census-defined Zip Code Tabulation Area (ZCTA), which is related to but not identical with the US Postal Service ZIP Code (which is a unit of mail delivery, reflecting postal carrier routes, such that a given spatial area can encompass several ZIP Codes). To create ZCTAs, the US Census assigns each census block the most frequently occurring ZIP Code within the block. For technical information about ZCTAs, see: https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html. This means that ZCTAs and zip codes may not cover the same area exactly. A discussion of pitfalls to keep in mind when using ZCTAs linked to residential Zip Code can be found in this paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1447194/ .
::::
```{r get relevant zip codes}
#| echo = TRUE,
#| eval = FALSE
# identify relevant zip codes ---------------------------------------------
# we originally downloaded the county shapefiles using tigris, but
# we found that those contained water area we wanted to remove.
#
# we tried using the tigris::erase_water function but it took a very
# long time to run for us, so we found that the alternative was easier
# to implement: downloading a county shapefile directly from the census
# and using that one which came already with the water areas removed.
#
# We got our county shapefile for Cook county in what follows from here:
# https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
########## Change to file path where the shape file is downloaded ###########
counties <- read_sf('./data/09-cook-county-covid/cb_2018_us_county_5m/cb_2018_us_county_5m.shp')
# get the map for cook county
cook_county <- counties %>% filter(
# get the 5-digit FIPS or GEOID for Cook County programmatically by filtering
# the tigris::fips_codes dataframe, or if you happen to know it's 17031
# you could code it explicitly instead
GEOID == tigris::fips_codes %>%
filter(county == 'Cook County', state == 'IL') %$%
paste0(state_code, county_code))
# get cook county sf object from tigris that is water inclusive for filtering. Since the ZIP geometry can include bits of water, we want to use the Cook County boundaries that include Lake Michigan. Otherwise some coastal ZIP codes may have less than 90% overlap and get dropped.
IL_counties <- tigris::counties(state = 'IL')
cook_county_incl_water <- IL_counties %>%
filter(
COUNTYFP == {
tigris::fips_codes %>%
filter(state == 'IL', county == 'Cook County') %>%
pull(county_code)
}
)
# Find the zip codes intersecting cook county. In the output of st_intersects, zips that intersect with cook county have length greater than 0. So here we keep Zips with any overlap
zips_with_overlap <- zip_geometry[st_intersects(zip_geometry, cook_county_incl_water) %>%
map_lgl( ~ length(.) > 0),]
# figure showing cook county and all ZCTAs intersecting it
ggplot() +
geom_sf(data = cook_county_incl_water, color = 'cadetblue', fill = 'cadetblue') +
geom_sf(data = zips_with_overlap, color = 'orange', fill = 'white', alpha = .4) +
theme_bw()
# interactive file
leaflet() %>%
addTiles() %>% #Adds the basemap
addPolygons(data = cook_county_incl_water, weight = .5) %>% #Adds the Cook county boundary
addPolygons(data = zips_with_overlap, color = 'orange', weight = 2, label = ~geoid) # Adds Zip boundaries
# calculate each zip code total area
zips_with_overlap$total_area <- st_area(zips_with_overlap)
# calculate the intersection of cook county and each zip code tabulation area
zips_with_overlap_intersection <- st_intersection(zips_with_overlap, cook_county_incl_water) #this returns polygons where county and ZCTA intersect
zips_with_overlap_intersection %<>% dplyr::select(geoid)
# calculate the area of each intersection
zips_with_overlap_intersection$overlapping_area_w_cook_cty <- st_area(zips_with_overlap_intersection)
# drop geometry so we can merge this back in
zips_with_overlap_intersection %<>% sf::st_drop_geometry()
# add in overlap calculation
zips_with_overlap %<>% left_join(
zips_with_overlap_intersection %>% dplyr::select(geoid, overlapping_area_w_cook_cty),
by = c('geoid')
)
# calculate the proportion of area for that zip code within cook county
zips_with_overlap %<>% mutate(
prop_area_overlap = as.numeric(overlapping_area_w_cook_cty / total_area))
# filter for zip codes with 90% land area inside cook county
zips_with_over_90pct_overlap <- zips_with_overlap %>%
filter(prop_area_overlap >= .9)
# Visualize Cook County and ZIPs with over 90 percent overlap
ggplot() +
geom_sf(data = cook_county, aes(color = "Cook County"), fill = 'white') + #Plots Cook county boundary
geom_sf(
data = zips_with_over_90pct_overlap, #plots zip codes
mapping = aes(color = "Zip Code Tabulation Areas"),
size = .5,
alpha = .6 #Sets transparency so we can see both layers
) +
scale_color_manual(
values = c("Cook County" = "#E69F00", "Zip Code Tabulation Areas" = "#56B4E9") #manually set colors
) +
# theme_bw() +
guides(color = guide_legend(override.aes = list(fill = c('white', 'grey95')))) +
scale_x_continuous(breaks = c(-88.3, -88.1, -87.9, -87.7, -87.5)) +
ggtitle("Cook County and Zip Code Tabulation Areas 90% Contained Within")
# save the plot
# ggsave(here("images/09-cook-county-covid/cook_county_and_zip_codes.png"), width = 10, height = 6)
```
Now that we have obtained the ZCTAs that we will keep in the study, we can now focus on these to prepare demographic and denominator data.
```{r aggregate by ZCTA}
#| echo = TRUE,
#| eval = FALSE
# group population sizes together by age groups and racialized group and zip code
zips_with_population_estimates_by_race_ethnicity_and_age <-
popsizes %>%
filter(! is.na(age) & concept != 'SEX BY AGE') %>% # keep age-specific rows from the racialized-group-specific tables (dropping the overall SEX BY AGE table)
filter(geoid %in% zips_with_over_90pct_overlap$geoid) %$% #Filter to keep only the zips selected above
left_join(zips_with_over_90pct_overlap, .) #join with the zip dataframe to add the geometry column
# Change names of the Census racialized groups
zips_with_population_estimates_by_race_ethnicity_and_age %<>% mutate(
race_ethnicity = case_when(
concept == "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)" ~ "Black, Hispanic or non-Hispanic",
concept == "SEX BY AGE (HISPANIC OR LATINO)" ~ "Hispanic",
concept == "SEX BY AGE (WHITE ALONE, NOT HISPANIC OR LATINO)" ~ "White, non-Hispanic"
))
# remove totals (across age groups) observations
zips_with_population_estimates_by_race_ethnicity_and_age %<>% filter(! is.na(age))
zips_with_population_estimates_by_race_ethnicity_and_age %<>% filter(! is.na(race_ethnicity))
# aggregate into larger age groups
zips_with_population_estimates_by_race_ethnicity_and_age %<>% mutate(
age_group = case_when(
age %in% c(
"Under 5 years",
"5 to 9 years",
"10 to 14 years",
"15 to 17 years",
"18 and 19 years",
"20 to 24 years"
) ~ "Under 25",
age %in% c(
"25 to 29 years",
"30 to 34 years"
) ~ "25 to 34 years",
age == "35 to 44 years" ~ "35 to 44 years",
age == "45 to 54 years" ~ "45 to 54 years",
age == "55 to 64 years" ~ "55 to 64 years",
age == "65 to 74 years" ~ "65 to 74 years",
age == "75 to 84 years" ~ "75 to 84 years",
age == "85 years and over" ~ "85 years and over"
))
# sum age group population sizes by racialized group
zips_with_population_estimates_by_race_ethnicity_and_age %<>% group_by(geoid, age_group, race_ethnicity) %>%
dplyr::summarize(
estimate = sum(estimate, na.rm=T),
moe = moe_sum(moe, estimate)
)
# check how large the population sizes are by racialized group and age
# in each ZCTA
zips_with_population_estimates_by_race_ethnicity_and_age %>%
ggplot(
aes(x = estimate) #estimate is the column name for the population size
) +
geom_histogram() +
facet_grid(race_ethnicity~age_group) + #creates grid of individual plots for each combination of racialized group and age group
scale_x_log10() + #x axis in log scale since distribution of population counts is skewed
xlab("Population Estimate") +
ggtitle("Population Sizes by Racialized Group and Age Group in ZCTAs") +
theme(axis.text.x = element_text(angle = 75, hjust = 1)) #rotate x-axis labels
# ggsave(here("images/09-cook-county-covid/population_count_histogram.png"), height = 8, width = 10)
# check how many have estimates less than 10
zips_with_population_estimates_by_race_ethnicity_and_age %>%
group_by(estimate < 10) %>% count()
```
```{r image cook county and zip codes}
#| echo = FALSE,
#| eval = TRUE
knitr::include_graphics(here("images/09-cook-county-covid/cook_county_and_zip_codes.png"))
```
### Case Data
Now we're ready to load the case data, and add the population size estimates for each racialized group (White non-Hispanic, Black (Hispanic or non-Hispanic), and Hispanic) and age group by ZIP Code.
```{r add denominators to case data}
#| echo = TRUE,
#| eval = FALSE
# read in your data. This has been downloaded from the Cook County Medical Examiner's office website, as mentioned above. https://datacatalog.cookcountyil.gov/Public-Safety/Medical-Examiner-Case-Archive-COVID-19-Related-Dea/3trz-enys
cook_county_deaths <-
readr::read_csv("./data/09-cook-county-covid/Medical_Examiner_Case_Archive_-_COVID-19_Related_Deaths.csv")
# merge racialized group and age-group specific denominators ----------------
# use janitor::clean_names to standardize column name syntax into snake_case
cook_county_deaths %<>% janitor::clean_names()
# check distribution of deaths if needed
#ggplot(cook_county_deaths, aes(x = age)) + geom_bar()
#ggplot(cook_county_deaths, aes(x = race)) + geom_bar()
#ggplot(cook_county_deaths, aes(x = latino)) + geom_bar() + facet_wrap(~race)
#ggplot(cook_county_deaths, aes(x = latino)) + geom_bar()
# code racialized groups into the following:
# Black (Hispanic or non-Hispanic)
# Hispanic and non-Hispanic
# White non-Hispanic
cook_county_deaths %<>%
mutate(
black = race == 'Black',
white_nh = (! latino) & race == 'White',
hispanic = latino
)
# we restrict the time period to March 2020 through March 2022, noting the
# information on the dataset stories web page:
#
# https://datacatalog.cookcountyil.gov/stories/s/ttk4-trbu
#
# Effective April 1, 2022, the Cook County Medical Examiner’s Office no longer
# takes jurisdiction over hospital, nursing home or hospice COVID-19 deaths
# unless there is another factor that falls within the Office’s jurisdiction.
#
cook_county_deaths %<>% filter(
lubridate::mdy_hms(date_of_death) >= lubridate::mdy("03/01/2020") & #convert date in character format to a date format (POSIXct). This allows evaluation of conditions such as date greater than x, etc.
lubridate::mdy_hms(date_of_death) <= lubridate::mdy("04/01/2022")
)
# Truncate Zip Codes to 5 Characters
cook_county_deaths %<>% mutate(residence_zip = stringr::str_extract(residence_zip, "^[0-9]{5}"))
# Categorize Age Group
cook_county_deaths %<>% mutate(
age_group = case_when(
age <= 24 ~ "Under 25",
age >= 25 & age <= 34 ~ "25 to 34 years",
age >= 35 & age <= 44 ~ "35 to 44 years",
age >= 45 & age <= 54 ~ "45 to 54 years",
age >= 55 & age <= 64 ~ "55 to 64 years",
age >= 65 & age <= 74 ~ "65 to 74 years",
age >= 75 & age <= 84 ~ "75 to 84 years",
age >= 85 ~ "85 years and over",
TRUE ~ NA_character_
)
)
#Convert to factor so that when plotting age groups are plotted in order
cook_county_deaths$age_group %<>% factor(
levels = c(
"Under 25",
"25 to 34 years",
"30 to 34 years",
"35 to 44 years",
"45 to 54 years",
"55 to 64 years",
"65 to 74 years",
"75 to 84 years",
"85 years and over")
)
# separate out deaths into the racialized group categories we're interested in
#
# note that these do have overlap: since the ACS variables do not include
# age tables for the Black non-Hispanic group, we are using the Black
# Hispanic or non-Hispanic population by age tables, and so there is overlap
# between them and the Hispanic group counts.
deaths_by_race_ethnicity_group <-
list(
black = cook_county_deaths %>% filter(black),
hispanic = cook_county_deaths %>% filter(hispanic),
white_nh = cook_county_deaths %>% filter(white_nh)
)
# Assign labels for the racialized groups
race_ethnicity_groups <- c(
"Black, Hispanic or non-Hispanic",
"Hispanic",
"White, non-Hispanic"
)
# for each racialized group, tabulate the number of deaths by age group and
# residence zip code.
#
# here, purrr::map is applying a function which does that tabulation to each of
# the data frames in the deaths_by_race_ethnicity_group list
deaths_by_race_ethnicity_group <-
purrr::map(1:3, function(i) {
group_by(deaths_by_race_ethnicity_group[[i]], residence_zip, age_group) %>%
count(name = 'deaths') %>%
mutate(race_ethnicity = race_ethnicity_groups[[i]])
})
# bind the tables for each of the racialized groups together
deaths_by_race_ethnicity_group %<>% bind_rows()
# check the distribution of deaths by age group and racialized group, including deaths with a missing (NA) age group
ggplot(deaths_by_race_ethnicity_group, aes(x = age_group, y = deaths)) +
geom_col() +
facet_wrap(~race_ethnicity) +
ylab("Count of Deaths") +
theme(axis.text.x = element_text(angle = 75, hjust = 1))
# remove missing age, missing deaths in racialized groups
deaths_by_race_ethnicity_group %<>% filter(! is.na(age_group))
deaths_by_race_ethnicity_group %<>% filter(! is.na(race_ethnicity))
# check distribution of deaths by age group
deaths_by_race_ethnicity_group %>%
group_by(age_group) %>%
count()
# check distribution of deaths by racial/ethnic group
deaths_by_race_ethnicity_group %>%
group_by(race_ethnicity) %>%
count()
# merge the denominators and deaths together
df <-
zips_with_population_estimates_by_race_ethnicity_and_age %>% left_join(
deaths_by_race_ethnicity_group,
by = c('geoid' = 'residence_zip', 'age_group' = 'age_group', 'race_ethnicity' = 'race_ethnicity'))
# make age_group a factor variable
df$age_group %<>% factor(
levels = c(
'Under 25',
'25 to 34 years',
'35 to 44 years',
'45 to 54 years',
'55 to 64 years',
'65 to 74 years',
'75 to 84 years',
'85 years and over'
)
)
# add person-time
observation_time_in_years <- as.integer(lubridate::mdy("04-01-2022") - lubridate::mdy("03-01-2020")) / 365
df %<>% mutate(person_time = estimate * observation_time_in_years)
df %<>% mutate(deaths = ifelse(is.na(deaths), 0, deaths))
df %<>% mutate(mortality_per100k_py = deaths / person_time * 1e5)
df %>%
group_by(age_group, race_ethnicity) %>%
dplyr::summarize(deaths = sum(deaths)) %>%
ggplot(aes(x = age_group, y = deaths)) +
geom_col() +
facet_wrap(~race_ethnicity) +
xlab("Age Group") +
ylab("Count of Deaths") +
theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
ggtitle("Age Distribution of COVID-19 Deaths in Cook County, IL")
# ggsave(here("images/09-cook-county-covid/age_distribution_deaths.png"), width = 8, height = 6)
```
```{r image age distribution of deaths}
#| echo = FALSE,
#| eval = TRUE
knitr::include_graphics(here("images/09-cook-county-covid/age_distribution_deaths.png"))
```
We can now add the ABSMs that we are interested in to the data: the Index of Concentration at the Extremes (ICE) for racialized economic segregation, the proportion of the population below the poverty line, the proportion living in crowded housing, racialized group composition, and the median income in each ZCTA.
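As a reminder of what the code computes, the ICE for racialized economic segregation contrasts the privileged and deprived extremes within each area. For ZCTA $i$:

$$\mathrm{ICE}_i = \frac{A_i - P_i}{T_i}$$

where $A_i$ is the count at the privileged extreme (here, White non-Hispanic households with income of \$100,000 or more), $P_i$ is the count at the deprived extreme (households of people of color with income below \$25,000), and $T_i$ is the total count of households with income estimates. The ICE ranges from $-1$ (everyone in the deprived extreme) to $1$ (everyone in the privileged extreme).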
```{r add absms}
#| echo = TRUE,
#| eval = FALSE
# add area based socioeconomic measures ----------------------------------
# get zip code rates for
# - poverty
# - ICEraceinc
# - median income
# We create a data dictionary for ABSMS. The first column indicates the total variable code, the second the variable name, and the third the description.
absms_dictionary <- tibble::tribble(
~var, ~varname, ~description,
# total population
"B01001_001", "total_popsize", "total population estimate",
# racial composition
'B01003_001', "race_ethnicity_total", "race_ethnicity_total",
# ICEraceinc
"B19001_001", 'hhinc_total', "total population for household income estimates",
"B19001A_002", 'hhinc_w_1', "white n.h. pop with household income <$10k",
"B19001A_003", 'hhinc_w_2', "white n.h. pop with household income $10k-14 999k",
"B19001A_004", 'hhinc_w_3', "white n.h. pop with household income $15k-19 999k",
"B19001A_005", 'hhinc_w_4', "white n.h. pop with household income $20k-24 999k",
"B19001A_014", 'hhinc_w_5', "white n.h. pop with household income $100 000 to $124 999",
"B19001A_015", 'hhinc_w_6', "white n.h. pop with household income $125k-149 999k",
"B19001A_016", 'hhinc_w_7', "white n.h. pop with household income $150k-199 999k",
"B19001A_017", 'hhinc_w_8', "white n.h. pop with household income $196k+",
"B19001_002", 'hhinc_total_1', "total pop with household income <$10k",
"B19001_003", 'hhinc_total_2', "total pop with household income $10k-14 999k",
"B19001_004", 'hhinc_total_3', "total pop with household income $15k-19 999k",
"B19001_005", 'hhinc_total_4', "total pop with household income $20k-24 999k",
# poverty
"B05010_002", 'in_poverty', "population with household income < poverty line",
"B05010_001", 'total_pop_for_poverty_estimates', "total population for poverty estimates",
# median income
"B06011_001", 'median_income', "median income estimate for total population",
# crowded housing
"B25014_005", 'owner_occupied_crowding1', 'owner occupied, 1 to 1.5 per room',
"B25014_006", 'owner_occupied_crowding2', 'owner occupied, 1.51 to 2 per room',
"B25014_007", 'owner_occupied_crowding3', 'owner occupied, 2.01 or more per room',
"B25014_011", 'renter_occupied_crowding1', 'owner occupied, 1 to 1.5 per room',
"B25014_012", 'renter_occupied_crowding2', 'owner occupied, 1.51 to 2 per room',
"B25014_013", 'renter_occupied_crowding3', 'owner occupied, 2.01 or more per room',
"B25014_001", 'crowding_total', 'total for crowding (occupants per room)',
"B01001I_001", 'total_hispanic', 'total hispanic population estimate',
"B01001B_001", 'total_black', 'total black, hispanic or non-hispanic estimate',
"B01001H_001", 'total_white_nh', 'total white, non-hispanic population estimate'
)
# We create a function which takes as a argument a vector of zip codes, and queries the Census API using those to obtain our ABSM variables for those ZCTAs, and then calculates the ABSMs of interest.
get_absms <- function(zip_codes) {
absms <- tidycensus::get_acs(
year = 2019,
geography = 'zcta',
state = 'IL',
zcta = zip_codes,
variables = absms_dictionary$var, #Get the variables indicated in the data dictionary.
geometry = FALSE # We already have the geometry so we don't need that.
)
# pivot wider so that each row corresponds to a ZCTA
absms %<>% dplyr::select(-moe) %>%
tidyr::pivot_wider(names_from = variable, values_from = estimate)
# Rename the new columns to the variable names from the dictionary
rename_vars <- setNames(absms_dictionary$var, absms_dictionary$varname)
absms <- absms %>% rename(!!!rename_vars)
absms %<>%
mutate(
# we calculate the people of color low income counts as the overall
# low income counts minus the white non-hispanic low income counts
people_of_color_low_income =
(hhinc_total_1 + hhinc_total_2 + hhinc_total_3 + hhinc_total_4) -
(hhinc_w_1 + hhinc_w_2 + hhinc_w_3 + hhinc_w_4),
# sum up the white non-hispanic high income counts
white_non_hispanic_high_income =
(hhinc_w_5 + hhinc_w_6 + hhinc_w_7 + hhinc_w_8),
# calculate the index of concentration at the extremes for racialized
# economic segregation (high income white non-hispanic vs. low income
# people of color)
ICEraceinc =
(white_non_hispanic_high_income - people_of_color_low_income) /
hhinc_total,
prop_in_poverty = in_poverty / total_pop_for_poverty_estimates,
crowding = (owner_occupied_crowding1 + owner_occupied_crowding2 + owner_occupied_crowding3 +
renter_occupied_crowding1 + renter_occupied_crowding2 + renter_occupied_crowding3) / crowding_total,
prop_black = total_black / total_popsize,
prop_hispanic = total_hispanic / total_popsize,
prop_white_nh = total_white_nh / total_popsize
) %>%
dplyr::select(GEOID, ICEraceinc, prop_in_poverty, median_income, crowding, prop_black, prop_hispanic, prop_white_nh)
return(absms)
}
absms <- get_absms(unique(df$geoid))
# merge in absm data to outcome data
df %<>% left_join(absms, by = c('geoid' = 'GEOID'))
# get cutpoints for ICEraceinc in illinois --------------------------------
# we're doing this because we want our cutpoints to be in reference to the
# state distribution for the ICEraceinc variable; other cutpoints can be used,
# but it's important to be transparent about what cutpoints are used and consider
# how the choice of cutpoints may affect the analysis.
#
illinois_absms <-
tidycensus::get_acs(
year = 2019,
geography = 'zcta',
state = 'IL', # Getting for all of IL
variables = absms_dictionary$var,
geometry = FALSE
) %>%
dplyr::select(-moe) %>%
tidyr::pivot_wider(names_from = variable, values_from = estimate)
rename_vars <- setNames(absms_dictionary$var, absms_dictionary$varname)
illinois_absms %<>% rename(!!!rename_vars)
illinois_absms %<>%
mutate(
# we calculate the people of color low income counts as the overall
# low income counts minus the white non-hispanic low income counts
people_of_color_low_income =
(hhinc_total_1 + hhinc_total_2 + hhinc_total_3 + hhinc_total_4) -
(hhinc_w_1 + hhinc_w_2 + hhinc_w_3 + hhinc_w_4),
# sum up the white non-hispanic high income counts
white_non_hispanic_high_income =
(hhinc_w_5 + hhinc_w_6 + hhinc_w_7 + hhinc_w_8),
# calculate the index of concentration at the extremes for racialized
# economic segregation (high income white non-hispanic vs. low income
# people of color)
ICEraceinc =
(white_non_hispanic_high_income - people_of_color_low_income) /
hhinc_total,
# we don't need poverty here since we're going to use pre-specified cutpoints
# of 0-5%, 5-10%, 10-20% and 20%+
# calculate crowding
crowding = (owner_occupied_crowding1 + owner_occupied_crowding2 + owner_occupied_crowding3 +
renter_occupied_crowding1 + renter_occupied_crowding2 + renter_occupied_crowding3) / crowding_total,
# racial/ethnic composition variables
prop_black = total_black / total_popsize,
prop_hispanic = total_hispanic / total_popsize,
prop_white_nh = total_white_nh / total_popsize
)
# Using the state-wide distribution of ABSMs at the ZCTA level, we select quintile cutpoints.
illinois_ICEraceinc_cutpoints <-
Hmisc::wtd.quantile(illinois_absms$ICEraceinc, #Weighted quantile of ICE
illinois_absms$total_popsize, #Weights proportional to population size
seq(0, 1, .2), # Quintile cutpoints at 0, 0.2, 0.4, 0.6, 0.8, 1
na.rm = T)
illinois_median_income_cutpoints <-
Hmisc::wtd.quantile(illinois_absms$median_income,
illinois_absms$total_popsize,
seq(0, 1, .2),
na.rm = T)
illinois_crowding_cutpoints <-
Hmisc::wtd.quantile(illinois_absms$crowding,
illinois_absms$total_popsize,
seq(0, 1, .2),
na.rm = T)
illinois_prop_hispanic_cutpoints <-
Hmisc::wtd.quantile(illinois_absms$prop_hispanic,
illinois_absms$total_popsize,
seq(0, 1, .2),
na.rm = T)
illinois_prop_black_cutpoints <-
Hmisc::wtd.quantile(illinois_absms$prop_black,
illinois_absms$total_popsize,
seq(0, 1, .2),
na.rm = T)
illinois_prop_white_nh_cutpoints <-
Hmisc::wtd.quantile(illinois_absms$prop_white_nh,
illinois_absms$total_popsize,
seq(0, 1, .2),
na.rm = T)
# create cutpoint-leveled version of key ABSM variables
df$ICEraceinc_cut <- df$ICEraceinc %>% cut(., illinois_ICEraceinc_cutpoints, include.lowest=TRUE)
df$median_income_cut <- df$median_income %>% cut(., illinois_median_income_cutpoints, include.lowest=TRUE)
df$prop_in_poverty_cut <- df$prop_in_poverty %>% cut(., c(0, .05, .1, .2, 1), include.lowest=TRUE)
df$crowding_cut <- df$crowding %>% cut(., illinois_crowding_cutpoints, include.lowest=TRUE)
df$prop_black_cut <- df$prop_black %>% cut(., illinois_prop_black_cutpoints, include.lowest=TRUE)
df$prop_hispanic_cut <- df$prop_hispanic %>% cut(., illinois_prop_hispanic_cutpoints, include.lowest=TRUE)
df$prop_white_nh_cut <- df$prop_white_nh %>% cut(., illinois_prop_white_nh_cutpoints, include.lowest=TRUE)
```
We have added discretized (cut, i.e., categorical) versions of the different ABSMs. This allows the model to fit nonlinear responses across increasing levels of these covariates. An alternative approach would be to fit the models with smoothing splines on these variables instead, but we do not include that approach here. We used the Illinois-wide distribution of ABSMs at the ZCTA level to create quintile cutpoints for all variables except poverty, for which we use the pre-specified cutpoints 0-5%, 5-10%, 10-20%, and 20%+.
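As a minimal illustration of this cutpoint approach (using made-up values, not the study variables), population-weighted quintile breaks can be computed with `Hmisc::wtd.quantile()` and then applied with `cut()`:

```r
# Hypothetical data: ICE values and population sizes for 50 ZCTAs
set.seed(1)
ice <- runif(50, min = -1, max = 1)
pop <- sample(500:5000, size = 50, replace = TRUE)

# Population-weighted quintile breaks, then discretize with cut()
breaks <- Hmisc::wtd.quantile(ice, weights = pop,
                              probs = seq(0, 1, .2), na.rm = TRUE)
ice_cut <- cut(ice, breaks, include.lowest = TRUE)
table(ice_cut)  # number of ZCTAs falling in each population-weighted quintile
```

Note that the groups are weighted by population, so each quintile contains roughly a fifth of the *population*, not necessarily a fifth of the ZCTAs.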
:::: {.infobox .interpretation}
Here we set the cutpoints for the continuous ABSMs based on the state-wide distribution of ZCTAs. How might this affect our interpretation? How might your research question and context affect how you decide what the ABSM cutpoints are?
::::
:::: {.infobox .communication}
How might the selection of cutpoints affect your visualizations? How might your cutpoints change depending on what you want to communicate through the map?
::::
We have a few remaining cleaning steps to perform before we're ready to model these data.
We have to make sure our factor variables have appropriate reference levels set. Typically we set the reference level to the most privileged group so that we can frame the results as "the _____ group has X times the mortality rate of the reference group."
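As a small illustration (with a toy factor, not the study data), `forcats::fct_relevel()` moves a chosen level to the front of the level ordering, which is the level model functions such as `glm()` treat as the reference category:

```r
library(forcats)

g <- factor(c("A", "B", "B", "C"))
levels(g)                  # "A" "B" "C": "A" is the default reference
g2 <- fct_relevel(g, "C")  # make "C" the reference level
levels(g2)                 # "C" "A" "B"
```

The data are unchanged; only the level ordering, and hence the comparison group in regression output, differs.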
```{r final cleaning steps}
#| echo = TRUE,
#| eval = FALSE
# prepare data for modeling ----------------------------------------------
# remove infinite or NA mortality rates since they will cause errors in trying
# to fit the models
df_prepped <- df %>% filter(
is.finite(mortality_per100k_py) &
! is.na(mortality_per100k_py))
# ungroup
df_prepped %<>% ungroup()
# make the most privileged the reference category
df_prepped %<>% mutate(
race_ethnicity = forcats::fct_relevel(factor(race_ethnicity), "White, non-Hispanic"),
age_group = age_group,
ICEraceinc_cut = forcats::fct_rev(ICEraceinc_cut),
median_income_cut = forcats::fct_rev(median_income_cut)
)
# rename the estimate variable to 'population_estimate' to be more clear
df_prepped %<>% rename(population_estimate = estimate)
# save data
# saveRDS(df_prepped, here("data/09-cook-county-covid/cook_county_mortality_cleaned.rds"))
```
## Visualizing Your Data
If you prefer to start from the cleaned dataset, you can begin at this point. Now that we have a clean dataset, we can create some exploratory visualizations in the code chunk below. The first series of maps shows the crude mortality rates by ZCTA, racialized group, and age group. **To see a larger version of this or any other image, right-click on it and select "Open image in new tab".**
```{r visualize data, fig.width = 8, fig.height = 8}
#| eval = FALSE,
#| echo = TRUE
#change this to point to where your downloaded file is
df <- readRDS("./data/09-cook-county-covid/cook_county_mortality_cleaned.rds")
#Read in the county boundary shapefile to plot maps
counties <- read_sf('./data/09-cook-county-covid/cb_2018_us_county_5m/cb_2018_us_county_5m.shp')
# get the map for cook county
cook_county <- counties %>% filter(
# get the 5-digit FIPS or GEOID for Cook County programmatically by filtering
# the tigris::fips_codes dataframe, or if you happen to know it's 17031
# you could code it explicitly instead
GEOID == tigris::fips_codes %>%
filter(county == 'Cook County', state == 'IL') %$%
paste0(state_code, county_code))
# let's start by mapping the crude mortality rates in each racialized group and
# age strata
df %<>% mutate(deaths = ifelse(is.na(deaths), 0, deaths)) # Convert missing cells to 0
df %<>% mutate(mortality_per100k_py = deaths / person_time * 1e5) # Calculate mortality rate per 100,000 person years
# plot raw mortality rates
df %>%
ggplot(aes(fill = mortality_per100k_py)) +
geom_sf(data = cook_county, fill = 'dimgrey') +
geom_sf(size = 0) +
facet_grid(forcats::fct_rev(race_ethnicity)~age_group) + # Create maps of mortality rates by age-group and racialized/ethnic groups
scale_fill_distiller(palette = 'Reds',
trans = scales::pseudo_log_trans(sigma = 100), #log transform since distribution is skewed
direction = 1, #Higher is darker
labels = scales::comma_format(),
na.value = 'dimgrey', #NAs as grey
breaks = c(0, 1000, 10000, 80000)) +
scale_x_continuous(breaks = c( -88.1, -87.7)) +
theme_bw() +
theme(legend.position = 'bottom',
panel.grid = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
legend.text = element_text(angle = 75, hjust=1)
) +
labs(fill = 'COVID-19 Mortality per 100,000 person years') +
ggtitle("Crude COVID-19 Mortality Rates by ZCTA, Racialized Group and Age Group",
subtitle = "March 2020 - March 2022")
# ggsave(here("images/09-cook-county-covid/raw_rates_by_zcta.png"), width = 14, height = 8)
# plot population sizes
df %>% ggplot(aes(fill = population_estimate)) +
geom_sf(data = cook_county, fill = 'dimgrey') +
geom_sf(size = 0) +
facet_grid(forcats::fct_rev(race_ethnicity)~age_group) +
scale_fill_distiller(palette = 'Greens', trans = scales::pseudo_log_trans(sigma = 100), direction = 1,
labels = scales::comma_format(), na.value = 'dimgrey',
limits = c(0, NA),
breaks = c(0, 1000, 10000, 30000)) +
scale_x_continuous(breaks = c( -88.1, -87.7)) +
theme_bw() +
theme(legend.position = 'bottom',
panel.grid = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
legend.text = element_text(angle = 75, hjust=1)
) +
labs(fill = 'Population Estimate') +
ggtitle("Population Size by ZCTA, Racialized Group and Age Group",
subtitle = "Data from ACS 2015-2019")
# ggsave("./images/09-cook-county-covid/population_size_by_zcta.png", width = 14, height = 8)
#Plot mortality rates
df %>% ggplot(aes(x = mortality_per100k_py, fill = race_ethnicity)) +
geom_histogram(alpha = .8, position = 'identity') +
facet_grid(forcats::fct_rev(race_ethnicity) ~ age_group,
scales = 'free') +
scale_x_continuous(
trans = scales::pseudo_log_trans(sigma = 10),
labels = scales::comma_format(),
breaks = c(0, 100, 1000, 5e4)
) +
xlab("Mortality Rate per 100,000 Person Years") +
ylab("Count") +
labs(fill = 'Racialized Groups') +
ggtitle("Histogram of Mortality Rates by ZCTA, Age Group and Racialized Group",
subtitle = "March 2020 - March 2022") +
theme(legend.position = 'bottom')
# ggsave(here("images/09-cook-county-covid/raw_rates_histogram.png"), width = 16, height = 8)
df %>%
ggplot(
aes(x = population_estimate,
y = mortality_per100k_py,
color = race_ethnicity)) +
geom_point(alpha = .8) +
facet_grid( forcats::fct_rev(race_ethnicity) ~ age_group) +
scale_y_continuous(
trans = scales::pseudo_log_trans(sigma = 10),
labels = scales::comma_format(),
breaks = c(0, 100, 1000, 5e4)
) +
scale_x_continuous(
trans = scales::log_trans(),
labels = scales::comma_format()
) +
ylab("Mortality Rate per 100,000 Person Years") +
xlab("Population Size") +
theme(legend.position = 'bottom') +
labs(color = 'Racialized Groups') +
ggtitle("Population Size and Mortality Rates by ZCTA, Age Group and Racialized Group")
# ggsave(here("images/09-cook-county-covid/raw_rates_scatter.png"), width = 16, height = 8)
```
This figure shows a series of maps of the COVID-19 mortality rate per 100,000 person-years from March 2020 to March 2022, by age group and racialized group.
```{r image raw rates by ZCTA}
#| eval = TRUE,
#| echo = FALSE
knitr::include_graphics(here(
"images/09-cook-county-covid/raw_rates_by_zcta.png"))
```
This figure shows a series of maps highlighting the underlying population at risk in each stratum.
```{r image population size by zcta}
#| eval = TRUE,
#| echo = FALSE
knitr::include_graphics(here(
"images/09-cook-county-covid/population_size_by_zcta.png"))
```
This figure shows a histogram of these rates to visualize the distribution of the data.
```{r image raw rates histogram}
#| eval = TRUE,
#| echo = FALSE
knitr::include_graphics(here(
"images/09-cook-county-covid/raw_rates_histogram.png"))
```
This figure shows the relationship between mortality rates and population size in each stratum.
```{r image raw rates scatter}
#| eval = TRUE,
#| echo = FALSE
knitr::include_graphics(here(
"images/09-cook-county-covid/raw_rates_scatter.png"))
```
You can see that there are a number of zero counts, and that rates are noisier where population sizes are smaller.
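A quick simulation (with an assumed rate, not the study data) illustrates why: even when the true rate is identical everywhere, observed crude rates fluctuate much more where person-time is small.

```r
set.seed(2)
true_rate <- 100 / 1e5               # assumed true rate: 100 per 100,000 py
person_years <- c(500, 5000, 50000)  # three hypothetical stratum sizes

# Standard deviation of the observed rate (per 100k py) across 1,000
# simulated strata of each size; deaths are drawn from a Poisson distribution
sapply(person_years, function(py) {
  deaths <- rpois(1000, lambda = true_rate * py)
  sd(deaths / py * 1e5)
})
```

The spread of the observed rates shrinks as person-time grows, which is the pattern visible in the scatter plot above.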
:::: {.infobox .interpretation}