---
title: "code-along"
format: html
editor: visual
---
## Example Dataset: A Brief Overview
We'll explore the relationship between physical activity and pulse in the National Health and Nutrition Examination Survey (NHANES) dataset, which has been modified for this course with simulated variables.
To try this out in practice, we load the NHANES dataset and keep only the variables we need for the example.

The variables used in the analysis:

- exposure of interest (a): physical activity (binary variable, 'Yes' or 'No')
- outcome (y): pulse (continuous variable)
- outcome (y2): diabetes (binary variable)
- mediator (m): BMI (continuous variable)

Note that the example is created to demonstrate the methods, and the results have no clinical relevance. For the convenience of analysis, we have transformed the mediator (m) and outcome variable (y) so that they follow a normal distribution.

Four demographic variables are considered in the analysis: age, gender, education, and smoking. We assume that adjusting for these four confounders is sufficient to block the backdoor paths.
##
```{r}
# load the libraries
library(readr)
library(dplyr)
# load the data
nhanes <- read_csv(here::here("data/NHANES.csv"))
# keep the variables used in the analyses
data <- nhanes %>%
  dplyr::select(
    ID, w1 = Age, w2 = Gender, w3 = Education, w4 = Smoke100,
    a = PhysActive, m = BMI, y = Pulse, y2 = Diabetes
  ) %>%
  na.omit()
```
## Continuous outcome
```{r}
# Physical activity is a binary variable, dichotomized as 'Yes' or 'No'.
table(data$a)
# check the distribution of outcome
summary(data$y)
hist(data$y)
# check the distribution of the mediator
summary(data$m)
hist(data$m)
# check the education categories and relevel
table(data$w3)
data$w3 <- relevel(factor(data$w3), ref = "College Grad")
```
i\) run a model for the mediator, adjusting only for confounding factors.
```{r}
lm_m <- lm(m ~ a + w1 + w2 + w3 + w4, data = data)
summary(lm_m)
```
We see that physical activity (a) is associated with lower BMI.
ii\) run the model for the outcome, including an interaction term between exposure and mediator.
```{r}
lm_y <- lm(y ~ a + m + a:m + w1 + w2 + w3 + w4, data = data)
summary(lm_y)
```
In this model we can see:

- physical activity (a) reduces pulse rate (y), independent of BMI (m);
- BMI (m) is positively associated with pulse rate (y);
- there is an interaction between physical activity (a) and BMI (m).
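In the notation commonly used for regression-based mediation analysis, the two models above can be written as follows. The Greek letters are introduced here only for illustration and do not appear in the code; $w$ stands for the confounders w1 to w4.

$$
\begin{aligned}
E[M \mid a, w] &= \beta_0 + \beta_1 a + \beta_2^\top w \\
E[Y \mid a, m, w] &= \theta_0 + \theta_1 a + \theta_2 m + \theta_3 \, a m + \theta_4^\top w
\end{aligned}
$$

Under this parameterization, the difference in expected pulse between active and inactive participants at a fixed BMI value $m$ is $\theta_1 + \theta_3 m$. The coefficient $\theta_1$ on its own is therefore the difference at $m = 0$, which is worth keeping in mind when reading the `a` row of the `summary(lm_y)` output.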
### We will use the CMAverse R package to conduct the mediation analyses.
First, we load the package and then we set up the model object. We have to specify:

- model: the type of model, or the approach for the causal mediation analysis. We used the regression-based approach above, so we will also do that here. However, one can also use the weighting-based approach, marginal structural models, or the g-formula.
- outcome: your outcome variable
- exposure: your exposure/intervention/treatment
- mediator: your mediator
- basec: the baseline confounders to adjust for
- mreg: the type of model for the mediator. It is a list because you can have multiple mediators.
- yreg: the type of regression for the outcome, e.g., linear, logistic, Cox.
- astar: control/comparison intervention (value from)
- a: intervention (value to)
- mval: the value for the mediator when estimating the controlled direct effect (CDE)
- EMint: indicator of whether there should be an exposure-mediator interaction in the models.
- estimation: method for estimating causal effects. Can use parametric functions or imputation.
- inference: how to calculate confidence intervals. For the regression-based approach we use the delta method; for the other approaches we use a bootstrap.
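To give a feel for what bootstrap inference does, here is a minimal base-R sketch on simulated data. The object names (`sim`, `indirect_effect`, `boot_est`) and all coefficient values are made up for this illustration and are not taken from NHANES. It is not how CMAverse implements the bootstrap internally; it only shows the idea of resampling rows, re-estimating the effect, and taking percentile limits.

```{r}
set.seed(2024)

# Simulated data (hypothetical values, for illustration only)
n <- 1000
sim <- data.frame(a = rbinom(n, 1, 0.5))
sim$m <- 27 - 1.5 * sim$a + rnorm(n, sd = 3)
sim$y <- 70 - 2 * sim$a + 0.4 * sim$m + rnorm(n, sd = 8)

# Indirect effect without exposure-mediator interaction:
# (effect of a on m) x (effect of m on y)
indirect_effect <- function(d) {
  beta1  <- coef(lm(m ~ a, data = d))[["a"]]
  theta2 <- coef(lm(y ~ a + m, data = d))[["m"]]
  beta1 * theta2
}

# Nonparametric bootstrap: resample rows with replacement and re-estimate
boot_est <- replicate(500, {
  idx <- sample(nrow(sim), replace = TRUE)
  indirect_effect(sim[idx, ])
})

point_est <- indirect_effect(sim)
boot_ci <- quantile(boot_est, c(0.025, 0.975))  # percentile interval
point_est
boot_ci
```

The delta method reaches a similar interval analytically from the coefficient covariance matrices, which is why it is much faster when closed-form effect formulas are available.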
Load the library:
```{r}
## devtools::install_github("BS1125/CMAverse") # install the package from GitHub if it is not already installed
library(CMAverse)
```
Build the models:

- Since both the mediator (BMI) and the outcome (pulse rate) are continuous, we use linear regression models.
- We set the mediator (BMI) value to 25 kg/m², supposing the mediator could be intervened on and set to normal weight.
- I will first pretend there is no interaction and set EMint to FALSE.
```{r}
res_rb <- cmest(
  data = data, model = "rb", outcome = "y", exposure = "a",
  mediator = "m", basec = c("w1","w2","w3","w4"), EMint = FALSE,
  mreg = list("linear"), yreg = "linear", # specify model types
  astar = 0, a = 1, mval = list(25),
  estimation = "paramfunc", inference = "delta"
)
summary(res_rb)
```
Now change EMint to TRUE to allow for interaction.
```{r}
res_rb <- cmest(
  data = data, model = "rb", outcome = "y", exposure = "a",
  mediator = "m", basec = c("w1","w2","w3","w4"), EMint = TRUE,
  mreg = list("linear"), yreg = "linear", # specify model types
  astar = 0, a = 1, mval = list(25),
  estimation = "paramfunc", inference = "delta"
)
summary(res_rb)
```
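To see where the regression-based effects come from, the sketch below computes them by hand from two linear models fitted to simulated data. Everything in it is an assumption made for illustration: the data frame `sim`, the single confounder `w`, and the coefficient values are invented, and the natural effects are evaluated at the mean of `w`. It uses the standard closed-form expressions for linear models with an exposure-mediator interaction and is not CMAverse's own code.

```{r}
set.seed(1)

# Simulated data (hypothetical values, not NHANES)
n <- 2000
sim <- data.frame(a = rbinom(n, 1, 0.5), w = rnorm(n))
sim$m <- 27 - 1.5 * sim$a + 0.5 * sim$w + rnorm(n, sd = 3)
sim$y <- 60 - 4 * sim$a + 0.4 * sim$m + 0.1 * sim$a * sim$m +
  1.0 * sim$w + rnorm(n, sd = 8)

fit_m <- lm(m ~ a + w, data = sim)            # mediator model
fit_y <- lm(y ~ a + m + a:m + w, data = sim)  # outcome model with interaction
b  <- coef(fit_m)
th <- coef(fit_y)

a1 <- 1        # exposure value "to"
a0 <- 0        # exposure value "from"
m_fixed <- 25  # mediator value used for the controlled direct effect
w_bar <- mean(sim$w)

# Expected mediator under a0 at the mean confounder value
m_a0 <- b[["(Intercept)"]] + b[["a"]] * a0 + b[["w"]] * w_bar

# Controlled direct effect: mediator held at m_fixed
cde  <- (th[["a"]] + th[["a:m"]] * m_fixed) * (a1 - a0)
# Pure natural direct effect: mediator at its expected value under a0
pnde <- (th[["a"]] + th[["a:m"]] * m_a0) * (a1 - a0)
# Total natural indirect effect: exposure fixed at a1, mediator shifts from a0 to a1
tnie <- (th[["m"]] + th[["a:m"]] * a1) * b[["a"]] * (a1 - a0)
# Total effect
te   <- pnde + tnie

c(cde = cde, pnde = pnde, tnie = tnie, te = te)
```

When the interaction coefficient is zero, the controlled and natural direct effects coincide and the indirect effect reduces to the product of the two coefficients, which matches the simpler EMint = FALSE setting.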
## Binary outcome
Suppose now I am interested in how physical activity influences the risk of diabetes. I will first check the proportion of diabetes in the sample.
`table(data$y2)  # y2 is diabetes`
```{r}
# recode diabetes to 1(Yes) and 0(No)
data$y2 <- case_when(
  data$y2 == "Yes" ~ 1,
  data$y2 == "No" ~ 0
)
```
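One thing to keep in mind with `case_when()` is that any value matching none of the conditions becomes `NA` without a warning. The sketch below uses a small made-up vector (`status` is hypothetical, not a column of the course data) to show a quick check that can be run after this kind of recoding.

```{r}
library(dplyr)

# Hypothetical vector standing in for a Yes/No column with a stray value
status <- c("Yes", "No", "No", "yes", NA)

recoded <- case_when(
  status == "Yes" ~ 1,
  status == "No"  ~ 0
)

# Cross-tabulate original vs recoded values, keeping NAs visible
table(original = status, recoded = recoded, useNA = "ifany")

# Count values that were not missing before but are missing after recoding
n_unmatched <- sum(is.na(recoded) & !is.na(status))
n_unmatched
```

Here the lower-case "yes" is not matched, so `n_unmatched` is 1. A count of 0 on the real data would indicate that every non-missing value was recoded.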
```{r}
res_rb <- cmest(
  data = data, model = "rb", outcome = "y2", exposure = "a",
  mediator = "m", basec = c("w1","w2","w3","w4"), EMint = TRUE,
  mreg = list("linear"), yreg = "logistic", # specify model types
  astar = 0, a = 1, mval = list(25), # same exposure contrast as for the continuous outcome
  estimation = "paramfunc", inference = "delta"
)
summary(res_rb)
```
We can see that a large proportion of the effect of physical activity on diabetes risk is mediated through BMI (65%). The remaining \~35% would represent physical activity's direct effect on diabetes risk through other pathways.
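The 65% figure is a proportion mediated. For a binary outcome modeled with logistic regression, this quantity is commonly computed on the odds-ratio scale from the natural direct and indirect effects. The sketch below shows that arithmetic with made-up odds ratios; `or_pnde` and `or_tnie` are hypothetical values chosen for illustration and are not the estimates from the model above.

```{r}
# Hypothetical odds ratios, chosen only to illustrate the arithmetic
or_pnde <- 0.90  # pure natural direct effect
or_tnie <- 0.80  # total natural indirect effect

# Total effect on the odds-ratio scale is the product of the two
or_te <- or_pnde * or_tnie

# Proportion mediated on the ratio scale
pm <- or_pnde * (or_tnie - 1) / (or_te - 1)
pm  # about 0.64
```

Because the effects multiply on the ratio scale, the proportion mediated is not simply one log odds ratio divided by another, and it can behave erratically when the total effect is close to an odds ratio of 1.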