---
title: "Project Report - Code"
author: "Matthew Yau, ZiYing(Sophie) Chen, Xinyu Dong"
date: "2023-03-09"
output:
  pdf_document:
    latex_engine: xelatex
    fig_height: 6
    fig_width: 6
    df_print: kable
fontsize: 11pt
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Dataset Introduction
There are 14 categorical and 4 numerical variables in the dataset, and our target variable is "HeartDisease". The dataset is clean, with no missing data. Among the 319,795 observations, we removed 18,078 duplicates, so the following exploratory data analysis is performed on the remaining 301,717 observations.
## Data Processing
```{r}
library(dplyr)
data <- read.csv("heart_2020_cleaned.csv")
# head(data)
heart_2020 <- data[!duplicated(data), ] # remove duplicate rows
heart_2020$Race[heart_2020$Race != "White"] <- "non-white" # make Race binary
# We considered regrouping *AgeCategory* into only 4 groups (<40, 40-59, 60-79, 80+),
# but the analyses below rely on the original fine-grained categories,
# so the regrouping is left commented out:
# heart_2020$AgeCategory[heart_2020$AgeCategory %in% c("18-24", "25-29", "30-34", "35-39")] <- "<40"
# heart_2020$AgeCategory[heart_2020$AgeCategory %in% c("40-44", "45-49", "50-54", "55-59")] <- "40-59"
# heart_2020$AgeCategory[heart_2020$AgeCategory %in% c("60-64", "65-69", "70-74", "75-79")] <- "60-79"
heart_2020 <- heart_2020 %>% mutate_if(is.character, as.factor) # convert characters to factors
```
## Decision Tree
Through the exploratory analysis, we found that our target variable *HeartDisease* is imbalanced: most of the cases do not have heart disease.
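As a quick reference point, the class balance can be checked directly (a minimal check; roughly 9% of the deduplicated observations are positive, as noted below):
```{r}
# Proportion of each HeartDisease class in the deduplicated data
prop.table(table(heart_2020$HeartDisease))
```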
### Fitting and Tuning
This is a seemingly strange but interesting decision tree. Its most confident conclusion is that if you are **under 55 years** of age and **in good or better overall health**, it predicts you will not develop heart disease. Other factors such as mental health, race, and physical activity have little impact on the prediction once the age and overall health criteria are met.
We have thus identified the low-risk group: *people under 55 years of age whose general health is good or better*. So let's go a step further and look at the people outside this range. Refitting the tree after removing only those who satisfy both conditions still produces splits on age and general health, so we instead remove the union of the two groups: anyone who is under 55 *or* whose general health is good or better.
```{r,fig.align = 'center',out.width="50%"}
library(readr)
library(tree)
# Fit a classification tree on the full (deduplicated) data
model.trees <- tree(HeartDisease ~ ., data = heart_2020)
heart.treecv <- cv.tree(model.trees) # cross-validation over tree sizes
plot(heart.treecv)
# Prune to 5 terminal nodes based on misclassification error
heart.tree.prune <- prune.misclass(model.trees, best = 5)
plot(heart.tree.prune)
text(heart.tree.prune, pretty = 0)
```
After removing everyone who was under the age threshold or in good overall health, we are left with a sample of 31,837 (only 10.55% of the total). The proportion of people with heart disease in this subsample rises to 29.78%, compared to only 9.0% in the full sample. This is already a significant and encouraging improvement - remember that we are only using the simple conditions of being 55 or older with general health below good.
```{r,fig.align = 'center',out.width="50%"}
# Keep only observations that are 55+ AND whose general health is below "Good"
obs_not_low_risk <- subset(heart_2020,
                           !(AgeCategory %in% c('18-24', '25-29', '30-34', '35-39',
                                                '40-44', '45-49', '50-54') |
                             GenHealth %in% c('Good', 'Very good', 'Excellent')))
# Investigate the new subsample
dim(obs_not_low_risk)
nrow(obs_not_low_risk) / nrow(heart_2020)        # share of the full sample
prop.table(table(obs_not_low_risk$HeartDisease)) # heart disease rate in the subsample
prop.table(table(heart_2020$HeartDisease))       # heart disease rate overall
```
Taking the screened non-low-risk group into another round of the decision tree, we identified another important signal: **whether or not the person has had a stroke.** If there was no stroke, the tree predicts that the person will not have heart disease. In fact, if we pick out the people in our risk population who also had a stroke, the rate of heart disease rises from 29.78% to 49.65% - another significant increase. Such a result is not difficult to explain: stroke is often associated with hardening and blockage of blood vessels, which indicates poorer blood circulation, itself an indicator of heart disease. Having now largely identified the high-risk group by age, overall health status, and stroke history, let's go one step further and see whether other factors can help. We fit one more decision tree on this further-targeted high-risk group.\
```{r,fig.align = 'center',out.width="50%"}
# Fit a classification tree on the non-low-risk subsample
model.trees <- tree(HeartDisease ~ ., data = obs_not_low_risk)
heart.treecv <- cv.tree(model.trees) # cross-validation over tree sizes
plot(heart.treecv)
# Prune to 3 terminal nodes
heart.tree.prune <- prune.misclass(model.trees, best = 3)
plot(heart.tree.prune)
text(heart.tree.prune, pretty = 0)
# Restrict further to those who have had a stroke
Stroked <- subset(obs_not_low_risk, Stroke == 'Yes')
dim(Stroked)
nrow(Stroked) / nrow(heart_2020)           # share of the full sample
prop.table(table(Stroked$HeartDisease))    # heart disease rate in the subsample
prop.table(table(heart_2020$HeartDisease)) # heart disease rate overall
```
The final tree is the simplest possible: a single split on sex. It tells us that, among these people already prone to heart disease, men are at higher risk than women (56.28% for men and 43.46% for women).
```{r,fig.align = 'center',out.width="50%"}
# Fit a classification tree on the stroke subsample
model.trees <- tree(HeartDisease ~ ., data = Stroked)
# The unpruned tree is already minimal, so cross-validation and pruning are skipped
# heart.treecv <- cv.tree(model.trees)
# plot(heart.treecv)
plot(model.trees)
text(model.trees, pretty = 0)
# Heart disease rate by sex within the stroke subsample
tapply(Stroked$HeartDisease, Stroked$Sex, function(x) prop.table(table(x)))
```
The risk at each successive screening stage (non-low-risk, plus stroke, plus male) is plotted below.
```{r,fig.align = 'center',out.width="50%"}
# Heart disease rate at each successive screening stage of the tree analysis
probability <- c(0.2978, 0.4965, 0.5628)
stage <- c("55+, health below good", "+ stroke", "+ male")
barplot(probability, names.arg = stage,
        ylab = "Proportion with heart disease", ylim = c(0, 0.6))
```

```{r}
# Exploratory (kept commented out): logistic regression restricted to the
# stroke subgroup. Note that `stepwise_model` and `nullmod` referenced below
# are only fit in the logistic regression section that follows.
# Stroked <- Stroked[, -5] # drop the Stroke column (constant in this subsample)
# sapply(lapply(Stroked, unique), length)
# heart_logistic_stro <- glm(HeartDisease ~ ., data = Stroked, family = 'binomial')
# nullmod_stro <- glm(HeartDisease ~ 1, data = Stroked, family = "binomial")
#
# 1 - logLik(heart_logistic_stro) / logLik(nullmod_stro)
# 1 - logLik(stepwise_model) / logLik(nullmod)
#
# pred <- predict(heart_logistic_stro, type = "response")
# predicted.classes <- ifelse(pred > 0.5, "Yes", "No")
# table <- table(Stroked$HeartDisease, predicted.classes)
# table
# # sensitivity: true positives over all actual positives
# sensitivity <- table[2,2] / (table[2,1] + table[2,2])
# sensitivity
```
### Model Interpretation
To be organized
## Logistic Regression
Logistic regression is a statistical method for modeling the relationship between a binary dependent variable (one that takes only two values, such as 0 or 1) and one or more independent variables (also known as predictors or explanatory variables). It predicts the probability of an event occurring based on the values of the independent variables.
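Concretely, writing $p = P(\text{HeartDisease} = \text{Yes} \mid x_1, \dots, x_k)$, the model assumes the log-odds are linear in the predictors:
$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$
so each coefficient $\beta_j$ is the change in log-odds per unit change in $x_j$, holding the other predictors fixed.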
### Fitting and Tuning
```{r}
# Full logistic regression on all predictors
heart.logistic <- glm(HeartDisease ~ ., data = heart_2020, family = 'binomial')
summary(heart.logistic)
# Stepwise selection with a BIC penalty (k = log(n))
stepwise_model <- step(heart.logistic, direction = "both", k = log(nrow(heart_2020)), trace = 0)
summary(stepwise_model)
# McFadden's pseudo-R^2 for both models
nullmod <- glm(HeartDisease ~ 1, data = heart_2020, family = "binomial")
1 - logLik(heart.logistic) / logLik(nullmod)
1 - logLik(stepwise_model) / logLik(nullmod)
# Predictive power: classify with a low cutoff (0.05) because the
# positive class is rare (~9% of observations)
pred <- predict(heart.logistic, type = "response")
predicted.classes <- ifelse(pred > 0.05, "Yes", "No")
conf_table <- table(heart_2020$HeartDisease, predicted.classes)
conf_table
# Sensitivity: true positives over all actual positives
sensitivity <- conf_table[2, 2] / (conf_table[2, 1] + conf_table[2, 2])
sensitivity
```
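The quantity `1 - logLik(model)/logLik(nullmod)` computed above is McFadden's pseudo-$R^2$:
$$R^2_{\text{McFadden}} = 1 - \frac{\log \hat{L}_{\text{model}}}{\log \hat{L}_{\text{null}}},$$
which is 0 for a model no better than the intercept-only fit and approaches 1 as the fit improves.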
### Adding Interaction Terms
The pseudo-$R^2$ values of these two models are similar and both poor, so we decided to add interaction terms to see if we can predict heart disease better. An interaction term means that one variable's effect on the response depends on another: either variable alone might have little predictive power, but combined they may predict well. In case other predictors interact with *PhysicalActivity* (which was removed by the stepwise procedure), we keep it when building the model with interaction terms.
According to our EDA, there is no strong correlation among the numerical variables, so we do not include interaction terms between them.
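For example, with a Sex $\times$ AgeCategory interaction, the log-odds contribution of an age group $a$ differs by sex:
$$\operatorname{logit}(p) = \cdots + \beta_a\,\mathbf{1}[\text{Age} = a] + \beta_m\,\mathbf{1}[\text{Male}] + \beta_{a,m}\,\mathbf{1}[\text{Age} = a]\,\mathbf{1}[\text{Male}],$$
so the age-$a$ effect is $\beta_a$ for women and $\beta_a + \beta_{a,m}$ for men.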
```{r}
# In an R formula, Race*AgeCategory expands to both main effects plus their interaction
formula <- HeartDisease ~ BMI + Smoking + AlcoholDrinking + Stroke + MentalHealth +
  DiffWalking + Sex + AgeCategory + Race + Diabetic + GenHealth + SleepTime +
  Asthma + KidneyDisease + SkinCancer + PhysicalActivity +
  Race * AgeCategory + Sex * AgeCategory
heart_logistic_interaction <- glm(formula = formula, data = heart_2020, family = 'binomial')
summary(heart_logistic_interaction)
# McFadden's pseudo-R^2 for the interaction model
1 - logLik(heart_logistic_interaction) / logLik(nullmod)
```
### Analysis of the Model
After stepwise selection, the predictors *PhysicalHealth* and *PhysicalActivity* were removed from the heart.logistic model. We can apply a likelihood-ratio (chi-squared) test to compare the reduced and full models. The p-value of 0.0004292 is small at the 0.05 significance level, so the dropped predictors do carry statistically significant signal; the BIC-penalized stepwise search nevertheless prefers the reduced model because the improvement in deviance is small relative to the added complexity.
```{r}
anova(stepwise_model, heart.logistic, test = "Chisq")
1 - pchisq(14.476, 2) # 14.476 is the deviance difference on 2 degrees of freedom
stepwise_model$formula
```
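Under the null hypothesis that the reduced model is adequate, the deviance difference follows a chi-squared distribution with degrees of freedom equal to the number of dropped parameters:
$$D_{\text{reduced}} - D_{\text{full}} \;\sim\; \chi^2_{2}.$$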
### Model Checking
We apply the bootstrap to check the stability of the fitted model: the logistic regression is refit on resampled data, and we examine the bootstrap distribution of each coefficient. A coefficient whose bootstrap distribution is very wide, or far from the original estimate, indicates a part of the model that is not estimated reliably.
```{r}
library(boot)
set.seed(92573948)
# Statistic for boot(): refit the stepwise-selected model on a resampled
# data set and return its coefficients
logit_test <- function(data, indices) {
  d <- data[indices, ]
  fit <- glm(HeartDisease ~ BMI + Smoking + AlcoholDrinking + Stroke + MentalHealth +
               DiffWalking + Sex + AgeCategory + Race + Diabetic + GenHealth +
               SleepTime + Asthma + KidneyDisease + SkinCancer,
             data = d, family = "binomial")
  return(coef(fit))
}
boot_fit <- boot(
  data = heart_2020,
  statistic = logit_test,
  R = 100 # number of bootstrap resamples
)
```
```{r}
boot_fit
plot(boot_fit, index = 2) # bootstrap distribution of the second coefficient (first after the intercept)
```
Based on this model we can also gain a quantitative understanding of the effects. Consider *AgeCategory*, the most dominant factor in our decision tree; its coefficients are shown below.
```{r}
stepwise_model$coefficients
```
```{r}
library(ggplot2)
# AgeCategory coefficients and standard errors, transcribed from the
# stepwise model output (reference group: 18-24)
coef <- c(0.1270495, 0.4866544, 0.5948967, 0.9958578, 1.3184067, 1.7268540,
          1.9641295, 2.2261981, 2.4682742, 2.7536005, 2.9551423, 3.2117057)
coef_se <- c(0.1241807, 0.1110910, 0.1063721, 0.1000683, 0.0964936, 0.0931507,
             0.0916889, 0.0908493, 0.0905748, 0.0905089, 0.0910431, 0.0907736)
ageCategory <- c('25-29', '30-34', '35-39', '40-44', '45-49', '50-54',
                 '55-59', '60-64', '65-69', '70-74', '75-79', '80 or older')
coefframe <- data.frame(ageCategory, coef, coef_se)
ggplot(coefframe, aes(ageCategory, coef)) +
  geom_errorbar(aes(ymin = coef - coef_se, ymax = coef + coef_se), width = 0.2) +
  geom_point() +
  labs(title = "Age Category Coefficients with Standard Errors") +
  xlab("Age Category") +
  ylab("Coefficient Value") +
  theme_bw()
```
### Model Interpretation
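The age coefficients increase monotonically, from 0.13 for ages 25-29 to 3.21 for 80 or older (relative to the 18-24 reference group), so each successive age bracket carries higher log-odds of heart disease. This agrees with the decision tree's finding that age is the dominant risk factor.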
## GAM Model
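A generalized additive model (GAM) replaces the linear terms for the numerical predictors with smooth functions, letting the data determine the shape of each effect. With $f_1, \dots, f_4$ denoting penalized smooths, the additive model fit below is
$$\operatorname{logit}(p) = \beta_0 + f_1(\text{BMI}) + f_2(\text{MentalHealth}) + f_3(\text{SleepTime}) + f_4(\text{PhysicalHealth}) + \textstyle\sum_j \beta_j x_j,$$
where the $x_j$ are the categorical predictors. We compare it against the corresponding purely parametric fit by AIC.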
```{r}
library(mgcv)
heart_2020$AgeCategory <- factor(heart_2020$AgeCategory)
# Additive model: smooth terms for the four numerical predictors
# (note that AgeCategory is not included in this formula)
formula_am <- HeartDisease ~ s(BMI) + Smoking + AlcoholDrinking + Stroke +
  s(MentalHealth) + DiffWalking + Sex + Race + Diabetic + GenHealth +
  s(SleepTime) + Asthma + KidneyDisease + SkinCancer + PhysicalActivity +
  s(PhysicalHealth)
# Parametric counterpart (the stepwise-selected formula), fit with gam() for comparability
formula <- HeartDisease ~ BMI + Smoking + AlcoholDrinking + Stroke + MentalHealth +
  DiffWalking + Sex + AgeCategory + Race + Diabetic + GenHealth +
  SleepTime + Asthma + KidneyDisease + SkinCancer
heart_am <- gam(formula_am, data = heart_2020, family = binomial)
heart_glm <- gam(formula, data = heart_2020, family = binomial)
summary(heart_am)
summary(heart_glm)
AIC(heart_am)
AIC(heart_glm)
```
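The model with the lower AIC offers the better trade-off between goodness of fit and complexity, which tells us whether the smooth terms are worth the extra flexibility.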