end of class!!!
caalo committed May 21, 2024
1 parent 16bb8f2 commit 38dee9c
Showing 387 changed files with 32,258 additions and 194 deletions.
14 changes: 6 additions & 8 deletions 02-Data_cleaning_1-exercises.Rmd
@@ -4,7 +4,7 @@

Suppose that you want to load in data "students.csv" in a CSV format, and you don't know what tools to use. You decide to see whether the package "readr" can be useful to solve your problem. Where should you look?

Many R packages are stored on CRAN (Comprehensive R Archive Network), and every CRAN package has a website that points to the reference manual (what is pulled up using the `?` command), source code, vignette examples, and dependencies on other packages. Here is [the website](https://cran.r-project.org/web/packages/readr/) for "readr".
Many R packages are stored on CRAN (Comprehensive R Archive Network), and every CRAN package has a website that points to the reference manual (what is pulled up using the `?` command), source code, vignette examples, and dependencies on other packages. Here is [the website](https://cran.r-project.org/web/packages/readr/) for "readr". Within the website, I like to look at the [URL page](https://readr.tidyverse.org/) for more documentation, or the Vignettes page, such as [this page](https://cran.r-project.org/web/packages/readr/vignettes/column-types.html), for examples.

In the package, you find some potential functions for importing your data:

@@ -51,7 +51,7 @@ We see that the only *required* argument is the `file` variable, which is docume

Load in "students.csv" via `read_csv()` function as a dataframe variable `students` and take a look at its contents via `View()`.

```{r}
```{r, message=F, warning=F}
library(tidyverse)
@@ -107,7 +107,7 @@ Recode "five" to 5 in the `age` column:

Create a new column `age_category` so that it has value "toddler" if `age` is \< 6, and "child" if `age` is \>= 6.

(Hint: You can create a new column via `mutate`, or you can directly refer to the new column via ``` student$``age_category ```.)
(Hint: You can create a new column via `mutate`, or you can directly refer to the new column via `student$age_category`.)

```{r}
@@ -165,20 +165,18 @@ Let's select a few columns of interest and give them column names that doesn't c
#names(melanoma_incidence) = c("County", "Age_adjusted_incidence_rate")
```

Take a look at the column `Age_adjusted_incidence_rate`. It has missing data coded as "\* " (notice the space after \*). Recode "\* " as `NA`.
Take a look at the column `Age_adjusted_incidence_rate`. It has missing data coded as "\* " (notice the space after \*). Recode "\* " as `NA`.

```{r}
```

Finally, notice that the data type for `Age_adjusted_incidence_rate` is character, if you run the function `is.character()` or `class()` on it. Convert it to a numeric data type.
Finally, notice that the data type of `Age_adjusted_incidence_rate` is character, as you can confirm by running `is.character()` or `class()` on it. Convert it to a numeric data type. I recommend using `melanoma_incidence$Age_adjusted_incidence_rate` to access the column.

```{r}
```
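The recode-then-convert steps above can be tried on a small vector first. This is a minimal sketch, assuming a made-up character vector `rate` in place of the real `melanoma_incidence$Age_adjusted_incidence_rate` column:

```r
# Hypothetical rates stored as text, with "* " marking missing values.
rate = c("12.3", "* ", "45.6")

# Recode the placeholder as NA; the comparison must include the trailing space.
rate[rate == "* "] = NA

# Convert from character to numeric; the NA stays NA.
rate = as.numeric(rate)
class(rate)
```

Recoding before converting avoids the coercion warning that `as.numeric()` would otherwise raise on the "* " entries.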



## Feedback!

How many hours did you spend on this exercise?
@@ -191,4 +189,4 @@ If you worked with other peers, write their names down in the following characte

```{r}
peers = c("myself")
```
```
8 changes: 4 additions & 4 deletions 02-Data_cleaning_1.Rmd
@@ -152,7 +152,7 @@ grade2 = if_else(grade > 60, TRUE, FALSE)

3. If-else_if-else

```
```{r}
grade3 = case_when(grade >= 90 ~ "A",
grade >= 80 ~ "B",
grade >= 70 ~ "C",
@@ -191,7 +191,7 @@ simple_df2 = mutate(simple_df, grade = ifelse(grade > 60, TRUE, FALSE))

3. If-else_if-else

```
```{r}
simple_df3 = simple_df
simple_df3$grade = case_when(simple_df3$grade >= 90 ~ "A",
@@ -203,7 +203,7 @@ simple_df3$grade = case_when(simple_df3$grade >= 90 ~ "A",

or

```
```{r}
simple_df3 = mutate(simple_df, grade = case_when(grade >= 90 ~ "A",
grade >= 80 ~ "B",
grade >= 70 ~ "C",
@@ -236,7 +236,7 @@ if(expression_is_TRUE) {
3. If-else_if-else:

```
if(expression_A_is_TRUE)
if(expression_A_is_TRUE) {
#code goes here
}else if(expression_B_is_TRUE) {
#other code goes here
81 changes: 80 additions & 1 deletion 04-Functions-exercises.Rmd
@@ -1 +1,80 @@
# Functions Exercises
# Functions Exercises

## Part 1: Writing your function

Create a function called `num_na` that takes in any vector and returns a single numeric value: the number of `NA`s in the vector. Use cases: `num_na(c(NA, 2, 3, 4, NA, 5)) = 2` and `num_na(c(2, 3, 4, 5)) = 0`.

Hint 1: Use `is.na()` function. Hint 2: Given a logical vector, you can count the number of `TRUE` values by using `sum()`, such as `sum(c(TRUE, TRUE, FALSE)) = 2`.
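
The two hints can be tried out on their own before writing the function. A minimal sketch, where the vector `x` is a made-up example rather than course data:

```r
# Hypothetical example vector (not from the course data).
x = c(NA, 2, 3, 4, NA, 5)

# is.na() returns a logical vector: TRUE where the element is NA.
missing_flags = is.na(x)

# sum() treats TRUE as 1 and FALSE as 0, so this counts the TRUE values.
n_missing = sum(missing_flags)
print(n_missing)
```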

Create a function called `medicaid_eligible` that takes in one argument: a numeric vector called `age`. The function returns a numeric vector with the same length as `age`, whose elements are `0` at the indices where `age` is less than 65, and `1` at the indices where `age` is 65 or higher. (Hint: This is a data recoding problem!) Use cases: `medicaid_eligible(c(30, 70)) = c(0, 1)`

Let's improve this function a little bit. What happens if the user runs `medicaid_eligible(c("hello", "there"))`? It still runs, but some kind of weird coercion happens (try it yourself!). A better design would prevent the function from running its usual logic when the input isn't a numeric data type.

We need to add the following logical structure: "if age is a numeric data type, then run the rest of the function; else, return `NA`". We can do this via **conditionals**, which look like this:

```
if (condition_is_TRUE) {
  #do this
} else {
  #do something else
}
```

In the context of this problem:

```
medicaid_eligible = function(age) {
  if (is.numeric(age)) {
    #do this
  } else {
    return(NA)
  }
}
```

Modify `medicaid_eligible()` so that if you give it a numeric input, it returns the same result as before, but if you give it any other data type, it returns `NA`. Test it yourself to see that it works.

```{r}
```
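
One possible way to combine the recoding with the type check is sketched below. It assumes `ifelse()` for the recoding step, which is only one of several options (`if_else()` or `case_when()` from dplyr would also work), so treat it as an illustration rather than the expected answer:

```r
medicaid_eligible = function(age) {
  if (is.numeric(age)) {
    # Recode: 1 where age is 65 or higher, 0 otherwise.
    return(ifelse(age >= 65, 1, 0))
  } else {
    # Any non-numeric input (character, logical, ...) is rejected.
    return(NA)
  }
}

medicaid_eligible(c(30, 70))
medicaid_eligible(c("hello", "there"))
```

Checking the type at the top of the function keeps the recoding logic unchanged and makes the failure case explicit.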

## Part 2: Functions in State Cancer Profiles

Let's look again at the analysis code we have written for State Cancer Profiles.

```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(cancerprof)
```

We load in cancer incidence rates and do some data cleaning, for females and males.

```{r}
#female
female_melanoma_WA = incidence_cancer("WA", "county", "melanoma of the skin", "all races (includes hispanic)", "females", "all ages", "all stages", "latest 5 year average")
female_melanoma_WA = select(female_melanoma_WA, County, `Age Adjusted Incidence Rate`)
names(female_melanoma_WA) = c("County", "Age_adjusted_incidence_rate")
female_melanoma_WA$Age_adjusted_incidence_rate[female_melanoma_WA$Age_adjusted_incidence_rate == "* "] = NA
female_melanoma_WA$Age_adjusted_incidence_rate = as.numeric(female_melanoma_WA$Age_adjusted_incidence_rate)
#male
male_melanoma_WA = incidence_cancer("WA", "county", "melanoma of the skin", "all races (includes hispanic)", "males", "all ages", "all stages", "latest 5 year average")
male_melanoma_WA = select(male_melanoma_WA, County, `Age Adjusted Incidence Rate`)
names(male_melanoma_WA) = c("County", "Age_adjusted_incidence_rate")
male_melanoma_WA$Age_adjusted_incidence_rate[male_melanoma_WA$Age_adjusted_incidence_rate == "* "] = NA
male_melanoma_WA$Age_adjusted_incidence_rate = as.numeric(male_melanoma_WA$Age_adjusted_incidence_rate)
```

The code for females and males is nearly identical, except that one of the input arguments for `incidence_cancer()` is "females" in one case and "males" in the other. There is a lot of redundancy as a result.

Write a function, `process_incidence_cancer()` in which the function takes in one argument: a character `sex`. The function returns a dataframe with the columns `County` and `Age_adjusted_incidence_rate` for that particular sex. Use cases: `process_incidence_cancer("females")`, and `process_incidence_cancer("males")`

```{r}
```

Can you improve `process_incidence_cancer()` so that if `sex` is "males" or "females", it runs as it has been, but if `sex` is anything else, it returns `NA`?

```{r}
```
10 changes: 9 additions & 1 deletion 04-Functions.Rmd
@@ -104,7 +104,15 @@ The function did not work as expected because we used hard-coded variables from
my_dim(penguins)
```
- Create a function called `medicaid_eligible` that takes in one argument: a numeric vector called `age`. The function returns a numeric vector with the same length as `age`, whose elements are `0` at the indices where `age` is less than 65, and `1` at the indices where `age` is 65 or higher. Use cases: `medicaid_eligible(c(30, 70)) = c(0, 1)`
- Create a function called `num_na` that takes in any vector and returns a single numeric value: the number of `NA`s in the vector. Use cases: `num_na(c(NA, 2, 3, 4, NA, 5)) = 2` and `num_na(c(2, 3, 4, 5)) = 0`. Hint 1: Use the `is.na()` function. Hint 2: Given a logical vector, you can count the number of `TRUE` values by using `sum()`, such as `sum(c(TRUE, TRUE, FALSE)) = 2`.
```{r}
num_na = function(x) {
  # Count the NAs in the input vector `x`, not in the function object itself.
  return(sum(is.na(x)))
}
```
- Create a function called `medicaid_eligible` that takes in one argument: a numeric vector called `age`. The function returns a numeric vector with the same length as `age`, whose elements are `0` at the indices where `age` is less than 65, and `1` at the indices where `age` is 65 or higher. (Hint: This is a data recoding problem!) Use cases: `medicaid_eligible(c(30, 70)) = c(0, 1)`
```{r}
medicaid_eligible = function(age) {
167 changes: 167 additions & 0 deletions 05-Iteration-exercises.Rmd
@@ -0,0 +1,167 @@
# Iteration Exercises

```{r, message=FALSE, warning=FALSE}
library(cancerprof)
library(tidyverse)
library(palmerpenguins)
```

## Part 1: Iteration warm-up

Write a function called `num_unique()` that takes any vector input, and then returns a numeric that gives the number of unique elements in the vector.

Hint: use the functions `unique()`, which gives you the unique elements of a vector, and `length()`.

```{r}
```

Test that the function works.

```{r}
```
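
A minimal sketch of one way to write and test `num_unique()`, using only the two functions from the hint. The test vectors are made up for illustration:

```r
num_unique = function(x) {
  # unique() drops repeated elements; length() counts what remains.
  return(length(unique(x)))
}

# Quick checks on small hypothetical vectors.
num_unique(c("a", "b", "a"))
num_unique(c(1, 1, 1))
```

Note that `unique()` keeps `NA` as its own element, so a vector with missing values counts `NA` as one of the unique values.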

Now, we will use this function to iterate over the `penguins` columns to get the number of unique elements for each column.

Before using the functional, let's practice writing the first iteration down:

```{r}
```

Then, to do this functionally, we think about:

- Variable we need to loop through: `penguins`

- The repeated task as a function: `num_unique()`

- The looping mechanism, and its output: ?

```{r}
```
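
The three ingredients listed above can be sketched on a small stand-in dataframe. This sketch assumes base R's `sapply()` as the looping mechanism and a made-up `toy_df` in place of `penguins`; if the course uses purrr, `map_int()` plays the same role.

```r
# One possible definition of the repeated task.
num_unique = function(x) length(unique(x))

# Hypothetical stand-in for `penguins`.
toy_df = data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  year = c(2007, 2008, 2008)
)

# First iteration, written out by hand for the first column.
num_unique(toy_df[[1]])

# A dataframe is a list of columns, so sapply() applies the function
# to each column and returns a named vector with one count per column.
unique_counts = sapply(toy_df, num_unique)
unique_counts
```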

## Part 2: Repetition in State Cancer Profiles

From the previous week's exercise, we created a function to process and clean cancer incidence data for a given sex. In this version, a new column is added to designate the sex:

```{r}
process_incidence_cancer = function(sex) {
  df = incidence_cancer("WA", "county", "melanoma of the skin", "all races (includes hispanic)", sex, "all ages", "all stages", "latest 5 year average")
  df = select(df, County, `Age Adjusted Incidence Rate`)
  names(df) = c("County", "Age_adjusted_incidence_rate")
  df$Age_adjusted_incidence_rate[df$Age_adjusted_incidence_rate == "* "] = NA
  df$Age_adjusted_incidence_rate = as.numeric(df$Age_adjusted_incidence_rate)
  df$Sex = sex #new column!
  return(df)
}
```

We are ready to use this function for females and males. Suppose we anticipate that our use of this function will scale up to many different inputs in the future: perhaps we modify the function to take in a state, and we want to analyze all states. As a proof of concept, can we iterate through "females" and "males" via a functional?

Before using the functional, let's practice writing the first iteration down:

```{r}
parameters = c("females", "males")
#first iteration:
```

Then, to do this functionally, we think about:

- Variable we need to loop through: `parameters`

- The repeated task as a function: `process_incidence_cancer()`

- The looping mechanism, and its output: ?

Store your result as: `incidence_cancer_list`. It should be a list containing two elements, and each element is a dataframe.

```{r}
```

To continue the analysis, previously we had used `full_join()` and `pivot_longer()` to get the dataframe in the right format for plotting. Here's an alternative way: because this version of the function adds a `Sex` column, if we stack both of these dataframes on top of each other, we would have the columns "County", "Age_adjusted_incidence_rate", and "Sex" ready for plotting. To do so, we use the function `rbind()` (row bind) to bind two dataframes together by rows:

```{r}
incidence_cancer_analysis = rbind(incidence_cancer_list[[1]], incidence_cancer_list[[2]])
```

Plot:

```{r}
ggplot(incidence_cancer_analysis, aes(x = County, y = Age_adjusted_incidence_rate, fill = Sex)) + geom_bar(position="dodge", stat="identity") + theme(axis.text.x = element_text(angle = 90)) + labs(title = "Melanoma", y = "Incidence Rate per 100k people")
```

## Part 3: Repetition in State Cancer Profiles for states

Now, let's change up our analysis so that we iterate through all states and get the melanoma cancer incidence rate.

Write a new function, `process_incidence_cancer_by_state()` so that it takes in the input argument `state` as a character. When calling `incidence_cancer()` within your new function, the first argument, `area`, should use your `state` input argument, and the `sex` argument should have a fixed value of your choice.

The function returns a dataframe with the following columns: `County`, `Age_adjusted_incidence_rate`, and `State`. You will have to create the `State` column to have the value of `state` near the end of the function.

```{r}
process_incidence_cancer_by_state = function(state) {
  df = incidence_cancer(state, "county", "melanoma of the skin", "all races (includes hispanic)", "females", "all ages", "all stages", "latest 5 year average")
  df = select(df, County, `Age Adjusted Incidence Rate`)
  names(df) = c("County", "Age_adjusted_incidence_rate")
  df$Age_adjusted_incidence_rate[df$Age_adjusted_incidence_rate == "* "] = NA
  df$Age_adjusted_incidence_rate = as.numeric(df$Age_adjusted_incidence_rate)
  df$State = state
  return(df)
}
```

Test that it works on a few states of choice:

```{r}
```

Let's get all the state abbreviations (except Louisiana, Alaska - no data for some reason). Here's a dataset:

```{r}
states = read_delim("classroom_data/states.txt", delim = "\t", col_names = c("fullname", "shorthand"))
head(states)
```

Now, let's run `process_incidence_cancer_by_state()` for all states: use a functional to apply `process_incidence_cancer_by_state()` on each element of `states$shorthand`, and store it as `results`.

```{r}
```

Great! Now we have a large list, in which each element is a dataframe belonging to a state. To consolidate all of these dataframes together, we could use `rbind()` to start: `rbind(results[[1]], results[[2]])`, but we would need to continue this with `rbind(rbind(results[[1]], results[[2]]), results[[3]])`, which becomes unwieldy very quickly. We wish `rbind()` could combine a whole list of dataframes more elegantly...

A solution to this is to use the `reduce()` function: it takes a list or vector of length *n* and produces a single value by calling a function with a pair of values at a time:

![](https://d33wubrfki0l68.cloudfront.net/9c239e1227c69b7a2c9c2df234c21f3e1c74dd57/eec0e/diagrams/functionals/reduce.png){width="300"}

`reduce(c(1, 2, 3, 4), f)` is equivalent to `f(f(f(1, 2), 3), 4)`.

We can think of `reduce()` as a useful way to generalise a function that works with two inputs to work with any number of inputs.
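
The pairwise folding described above can be sketched without any packages. This example assumes base R's `Reduce()`, which folds left to right in the same way as purrr's `reduce()`, and a made-up list `df_list` standing in for `results`:

```r
# Fold a numeric vector with `+`: computed as ((1 + 2) + 3) + 4.
total = Reduce(`+`, c(1, 2, 3, 4))

# Hypothetical list of small dataframes with matching columns.
df_list = list(
  data.frame(County = "A", State = "WA"),
  data.frame(County = "B", State = "OR"),
  data.frame(County = "C", State = "CA")
)

# rbind() is applied to the first two dataframes, then that result
# is combined with the third, and so on down the list.
combined = Reduce(rbind, df_list)
nrow(combined)
```

With purrr loaded, the equivalent calls would be `reduce(c(1, 2, 3, 4), `+`)` and `reduce(df_list, rbind)`.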

Try `reduce()` on `results` with `rbind`.

```{r}
```

Follow up with analysis of your choice!

```{r}
```