-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathgeocodingAddress-tutorial.Rmd
161 lines (107 loc) · 9.72 KB
/
geocodingAddress-tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
# Geocoding Resource Locations {#geocode-points}
## Overview{#GA-research-question}
A common goal in opioid environment research is to calculate and compare access metrics to different providers of Medications for Opioid Overuse Disorder (MOUDs). Before we can run any analytics on the resource location data, we need to convert resource addresses to spatial data points, which can be then used to calculate access metrics.
**Geocoding** is the process of converting addresses (like a street address) into geographic coordinates using a known coordinate reference system. We can then use these coordinates (latitude, longitude) to spatially enable our data. This means we convert to a spatial data frame (sf) within R for spatial analysis within our R session, and then save as a shapefile (a spatial data format) for future use. In this tutorial we demonstrate how to geocode resource location addresses and convert to spatial data points that can be used for future mapping and geospatial analysis.
Our objectives are thus to:
* Geocode addresses to get geographic coordinates
* Visualize the resource locations as points on a map in R
* Transform a flat file (.CSV) to a spatially enabled shapefile (.SHP)
## Environment Setup {#GA-setup}
To replicate the code & functions illustrated in this tutorial, you’ll need to have R and RStudio downloaded and installed on your system. This tutorial assumes some familiarity with the R programming language.
### Input/Output {#GA-i-o}
Our input will be a **CSV** file that include addresses of our resources. This files can be found [here](https://github.com/GeoDaCenter/opioid-environment-toolkit/tree/master/data).
* Chicago Methadone Clinics, `chicago_methadone.csv`
We will convert these addresses to geographic coordinates using an appropriate coordinate reference system (CRS), and then spatially enable the data for mapping. We will then export the spatial dataframe as a **shapefile**.
### Load Libraries {#GA-packages1}
We will use the following packages in this tutorial:
- `sf`: to manipulate spatial data
- `tmap`: to visualize and create maps
- `tidygeocoder`: to convert addresses to geographic coordinates
Then load the libraries for use. *Note:* The messages you see about GEOS, GDAL, and PROJ refer to software libraries that allow you to work with spatial data.
```{r warning = FALSE, messages=FALSE}
library(sf)
library(tidygeocoder)
library(tmap)
```
### Load Data
We will use a CSV that includes methadone clinic addresses in Chicago as an example. We start with a small dataset to test our geocoding workflow, as best practice.
Let's take a look at the first few rows of the dataset. Our data includes addresses but not geographic coordinates.
```{r read}
methadoneClinics <- read.csv("data/chicago_methadone_nogeometry.csv")
head(methadoneClinics)
```
## Geocode addresses {#GA-geocode-addresses}
### Quality Control
Before geocoding, perform an initial quality check on the data. Note that the address, city, state, and zip code are all separated as different columns. This will make it easier to stitch together for a coherent, standard address for geocoding. Furthermore, there do not appear to be any major errors. The city name "Chicago" is spelled consistently, without missing addresses or zip codes. This will not always be the case, unfortunately. Data must be cleaned prior to loading into a geocoding service.
### Selecting Geocoding Service
To get a geographic coordinate for each site, we'll need to geocode. There are a number of geocoding options in R; here we use we the `tidygeocoder` package. It uses mutliple geocoding services, providing the user with an option to choose. It also provides the option to use a `cascade` method which queries other geocoding services incase the default method fails to provide coordinates.
When considering which geocoding service to use, consider scale and potential geocoding errors. Some geocoding services are more accurate than others, so if your coordinates were not coded precisely, try a different service. If you have thousands of addresses to geocode, you may require more complex data pipelines. The default method used here is via [US Census geocoder](https://geocoding.geo.census.gov/), which allows around 10,000 addresses to be geocoded at once. Others have varying daily limits. The Google Maps API and ESRI Geocoding service are additional high-quality geocoding services with varying cost associated.
### Test Geocoding Service
Before geocoding your entire dataset, first review the documentation for the geocoding service you'll be using. In our example we use `tidygeocoder`, with documentation found [here](https://cran.r-project.org/web/packages/tidygeocoder/vignettes/tidygeocoder.html). Let's test the service by starting with one address:
```{r}
sample <- geo("4545 North Broadway St. Chicago, IL", lat = latitude, long = longitude, method = 'cascade')
sample
```
What did the output look like? Get familiar with the input parameters, expected output, and review the documentation further if needed.
### Prepare input parameter
To apply the function to multiple addresses, we first we need ensure that we have a *character vector* of full addresses.
```{r}
str(methadoneClinics)
```
Next we convert all fields to character first to avoid issues with factors (a common peril of R!).
```{r}
methadoneClinics$fullAdd <- paste(as.character(methadoneClinics$Address),
as.character(methadoneClinics$City),
as.character(methadoneClinics$State),
as.character(methadoneClinics$Zip))
head(methadoneClinics)
```
### Batch Geocoding
Now we are ready to geocode the addresses. The "tibble" data structure below shows us the address, latitude, longitude and also the geocoding service used to get the coordinates. Note that geocoding takes a bit of time, so patience is required.
```{r geocode}
geoCodedClinics <- methadoneClinics %>%
geocode(methadoneClinics, address = 'fullAdd',
lat = latitude, long = longitude, method = 'cascade')
geoCodedClinics
```
The code worked for all addresses except the first two. We already resolved the `4545 North Broadway St.`address above but here in the dataframe we get NAs. It is pointing to some issue with the string input. These were missed in the previous quality check, but give us a clue to the types of errors we could see if geocoding more addresses. Unfortunately, such quirks are common across geocoding services in R and we just have to handle them. We manually update the full address strings to get apprpriate coordinates.
```{r}
methadoneClinics[1,'fullAdd'] <- '4453 North Broadway St.,Chicago IL 60640'
methadoneClinics[2,'fullAdd'] <- '4545 North Broadway St.,Chicago IL 60640'
```
Now we can geocode the full suite of addresses with success:
```{r}
geoCodedClinics <- methadoneClinics %>%
geocode(methadoneClinics, address = 'fullAdd',
lat = latitude, long = longitude, method = 'cascade')
geoCodedClinics
```
## Convert to Spatial Data {#GA-spatial-dataframe}
While we have geographic coordinates loaded in our data, it is still not spatially enabled. To convert to a spatial data format, we have to enable to coordinate reference system that connects the latitude and longitude recorded to actual points on Earth.
### Spatial Reference Systems
There are thousands of ways to model the Earth, and each requires a different spatial reference system. This is a very complicated domain of spatial applications (for a primer [see here](https://developers.arcgis.com/documentation/core-concepts/spatial-references)), but for our purposes, we simplify by using a geodetic CRS that uses coordinates longitude and latitude. Not all coordinates will appear as a latitude/longitude, however, so it's important to at least check for the CRS used when working with geographic data. The lat/long coordinates provided by the geocoding service we used report data using the CRS coded as **4326**, a World Geodetic System (WGS84) model also used by Google Earth and many other applications. In this system, distance is measured as degrees and distorted. So while useful for visualizing points, we will need to convert to another CRS for other types of spatial analysis.
### Enable Points
Next we convert our dataframe to a spatial data frame using the `st_as_sf()` function. The `coords` argument specifies which two columns are the X and Y for your data. We set the crs argument equal to 4326.
*Please note longitude is entered as first column rather than the latitude. It is a very common mistake.* The X, Y field actually refers to longitude, latitude.
```{r spatialdf}
methadoneSf <- st_as_sf(geoCodedClinics,
coords = c("longitude", "latitude"),
crs = 4326)
head(data.frame(methadoneSf))
```
Note that this is a data frame, but that it has a final column called “geometry” that stores the spatial information.
### Visualize Points
We can now plot the location of the methadone clinics with base R. This is a recommended step to confirm that you translated your coordinates correctly. A common mistake is switching the lat/long values, so your points could plot across the globe. If that happens, repeat the step above with flipped long/lat values.
First we switch the `tmap` mode to view so we can look at the points with a live basemap layer.
```{r view}
tmap_mode("view")
```
Next, we plot our points as dots, and add the basemap.
```{r plot}
tm_shape(methadoneSf) + tm_dots() + tm_basemap("OpenStreetMap")
```
### Save Data
Finally, we save this spatial dataframe as a shapefile which can be used for further spatial analysis.
```{r write shp,eval= FALSE}
write_sf(methadoneSf, "data/methadone_clinics.shp")
```