---
title: "Generalized Linear Models 1"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html:
    toc: true
    number_sections: true
---
```{r, include=FALSE, echo=FALSE}
library(tufte)
```
# Setup
You can download the `R` markdown file for this content here:
[https://jwiley.github.io/MonashHonoursStatistics/GLM1.rmd](https://jwiley.github.io/MonashHonoursStatistics/GLM1.rmd).
We will be using a few packages for this content, along with only the
baseline data from the data collection exercise.
```{r setup}
library(haven)
library(data.table)
library(JWileymisc)
library(ggplot2)
library(ggpubr)
library(visreg)
## read in data
db <- as.data.table(read_sav("B 19032020.sav")) # baseline
## create a categorical age variable
db[age < 22, AgeCat := "< 22y"]
db[age >= 22, AgeCat := ">= 22y"]
db[, AgeCat := factor(AgeCat, levels = c("< 22y", ">= 22y"))]
```
# Simple Linear Regression
Before we talk about generalized linear models, let's review linear
regression. A simple linear regression is based on the equation:
$$ y_i = b_0 + b_1 * x_i + \varepsilon_i $$
where:
- $y$ is the outcome variable
- $x$ is the predictor/explanatory variable
- $\varepsilon$ is the residual/error term
- $b_0$ is the intercept, the expected (model predicted) value of $y$
when $x = 0$, written, $E(Y | x = 0)$
- $b_1$ is the slope of the line, and captures how much $y$ is
expected to change for a one unit change in $x$.
The subscript $i$ indicates that while the model *parameters*, $b_0$
and $b_1$ (the regression coefficients), are the same for all
participants, each person (or observation) has its own value of $y$
and its own value of $x$. Because the model is unlikely to be perfect,
there will be some unexplained residual, $\varepsilon_i$.
If we want to talk about only what is predicted based on the
regression coefficients (the model parameters), we often write it:
$$ \hat{y_i} = b_0 + b_1 * x_i $$
leaving off the residual error term, $\varepsilon_i$.
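To make the equation concrete, here is a small simulation (not from the
course data; the values of $b_0$, $b_1$, and the residual standard
deviation are made up purely for illustration) showing that `lm()`
recovers the parameters used to generate the data.
```{r}
## simulate data from y = b0 + b1 * x + e, with e ~ N(0, 1)
## (illustrative values only)
set.seed(1)
n <- 100
x <- rnorm(n)
b0 <- 1
b1 <- 0.5
y <- b0 + b1 * x + rnorm(n, mean = 0, sd = 1)
## the estimated coefficients should be close to b0 = 1 and b1 = 0.5
coef(lm(y ~ x))
```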
`r margin_note("If you are looking at the HTML file, you may wonder where the R code to make the graphs has gone. It is still there in the Rmarkdown file, but as this part of the content is focused on conceptual understanding, not how to create graphs in R, I have hidden it by asking R not to echo [show] the source code in the final HTML file being created from the Rmarkdown file.")`
Visually, the intercept, $b_0$, and slope, $b_1$,
coefficients/parameters look like the following figure.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Example image of a simple, linear regression model."}
ggplot() +
geom_hline(yintercept = 0, linetype = 3) +
geom_vline(xintercept = 0, linetype = 3) +
geom_abline(intercept = 0, slope = 0.5) +
scale_x_continuous("x (predictor/explanatory variable)",
breaks = c(-3, -2, -1, 0, 1, 2, 3)) +
scale_y_continuous("y (outcome variable)",
breaks = c(-1.5, -1, -0.5, 0, 0.5, 1, 1.5)) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-1.5, 1.5)) +
theme_pubr() +
annotate("point", x = 0, y = 0, colour = "black") +
annotate("text", x = -.1, y = .1, label = "b[0]", parse = TRUE) +
annotate("segment", x = 0, xend = 1, y = 0, yend = 0, colour = "blue") +
annotate("segment", x = 1, xend = 1, y = 0, yend = 0.5, colour = "blue") +
annotate("text", x = 1.1, y = 0.25, label = "b[1]", parse = TRUE, colour = "blue")
```
Changes to the intercept indicate that the height, sometimes called
the level, of the line is shifted. Positive values of the intercept,
that is $b_0 > 0$, shift the line up, as shown in the following
figure. Negative values of the intercept, that is $b_0 < 0$, shift the
line down. Just because the intercept changes does not mean the slope
has to change.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Example image of a simple, linear regression model with a different intercept value."}
ggplot() +
geom_hline(yintercept = 0, linetype = 3) +
geom_vline(xintercept = 0, linetype = 3) +
geom_abline(intercept = 1, slope = 0.5) +
scale_x_continuous("x (predictor/explanatory variable)",
breaks = c(-3, -2, -1, 0, 1, 2, 3)) +
scale_y_continuous("y (outcome variable)",
breaks = c(-1.5, -1, -0.5, 0, 0.5, 1, 1.5)) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-1.5, 1.5)) +
theme_pubr() +
annotate("point", x = 0, y = 1, colour = "black") +
annotate("text", x = -.1, y = 1.1, label = "b[0]", parse = TRUE) +
annotate("segment", x = 0, xend = 1, y = 1, yend = 1, colour = "blue") +
annotate("segment", x = 1, xend = 1, y = 1, yend = 1.5, colour = "blue") +
annotate("text", x = 1.1, y = 1.25, label = "b[1]", parse = TRUE, colour = "blue")
```
Changes to the slope indicate that the steepness of the line is
shifted. Positive values indicate a more positive (steeper) slope,
such that as $x$ increases, $y$ also is expected to increase. Negative
values indicate a more negative (more steeply declining) slope,
such that as $x$ increases, $y$ is expected to decrease.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Example image of a simple, linear regression model with a different slope."}
ggplot() +
geom_hline(yintercept = 0, linetype = 3) +
geom_vline(xintercept = 0, linetype = 3) +
geom_abline(intercept = 0, slope = -0.5) +
scale_x_continuous("x (predictor/explanatory variable)",
breaks = c(-3, -2, -1, 0, 1, 2, 3)) +
scale_y_continuous("y (outcome variable)",
breaks = c(-1.5, -1, -0.5, 0, 0.5, 1, 1.5)) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-1.5, 1.5)) +
theme_pubr() +
annotate("point", x = 0, y = 0, colour = "black") +
annotate("text", x = -.1, y = .1, label = "b[0]", parse = TRUE) +
annotate("segment", x = 0, xend = 1, y = 0, yend = 0, colour = "blue") +
annotate("segment", x = 1, xend = 1, y = 0, yend = -0.5, colour = "blue") +
annotate("text", x = 1.1, y = -0.25, label = "b[1]", parse = TRUE, colour = "blue")
```
In simple linear regression, we are not creating an arbitrary line, we
are creating the line of best fit. Here, "best fit" means the line
that minimizes the sum of squared residuals.
To better understand residuals, let's take a visual look at the
residuals in a simple linear regression. To keep this example small,
we will use a little dataset built into `R`, the `mtcars` dataset,
which has data on just 32 cars. The following graph shows the scatter
plot of two variables, the weight of cars (`wt`) and their horse power
(`hp`). The solid black line is the regression line of best fit, and
the blue lines between each observed value and the line are the
residuals. The residuals capture the difference between what the model
predicted and what was actually observed. That leads to another way we
can think about and write the equation for residuals: the difference
between the observed and predicted outcome values.
$$ \varepsilon_i = y_i - \hat{y_i} $$
```{r, echo = FALSE}
m <- lm(hp ~ wt, data = mtcars)
mtcars$yhat <- predict(m)
ggplot(mtcars, aes(x = wt, y = hp)) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") +
geom_point() +
geom_segment(aes(xend = wt, yend = yhat), colour = "blue") +
theme_pubr()
```
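As a quick check of that definition (added here for illustration), we
can compute the residuals by hand as observed minus predicted values
and compare them to what `residuals()` returns.
```{r}
## residuals are observed minus model-predicted values
m_cars <- lm(hp ~ wt, data = mtcars)
head(mtcars$hp - predict(m_cars))  # computed by hand
head(residuals(m_cars))            # extracted from the fitted model
```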
When we fit a linear regression, equations are solved to give
*estimates* of $b_0$ and $b_1$ that minimize the sum of squared
residuals, and those parameter estimates (or regression coefficients)
produce the line of best fit.
## Multiple Linear Regression
Multiple linear regression works, in principle, the same way as
simple linear regression. The difference is that more than one
predictor (explanatory) variable is allowed in a single model.
As before, we have an outcome, $y$, predicted by a combination of
model parameters, the $b$s, but instead of only one predictor, $x$, we
can have any number of predictors, $k$. It could be two predictors,
three, or twenty; the number does not change the underlying model.
$$ y_i = b_0 + b_1 * x_{1i} + ... + b_k * x_{ki} + \varepsilon_i $$
The regression coefficients (more generally referred to as model
parameters, where in this case the model is a regression) are
interpreted fairly similarly to those in simple linear regression,
but with some extra requirements.
- $b_0$ is the intercept, the expected (model predicted) value of $y$
when all predictors are 0, written $E(Y | x_1 = 0, ..., x_k = 0)$
- $b_1$ is the slope of the line, and captures how much $y$ is
expected to change for a one unit change in $x_1$, holding all other
predictors constant (or said equivalently, "controlling for all
other predictors").
- $b_k$ is the slope of the line, and captures how much $y$ is
expected to change for a one unit change in $x_k$, holding all other
predictors constant (or said equivalently, "controlling for all
other predictors").
- $\varepsilon$ is the residual/error term, the difference between the
model predicted value and the observed value.
It becomes more difficult to visualize multiple regression, but with
two predictors it remains possible with a 3D graph. The following
figure shows the regression surface predicted by a multiple regression
of neuroticism on both stress and selfesteem. Spend some time moving
it around to view it from different angles and see the unique
association between stress and neuroticism as well as the unique
association between selfesteem and neuroticism.
The word "unique" here is important as it conveys that this is not
two simple linear regressions: in a multiple regression, the
regression coefficients give the unique, or independent, association
of each predictor with the outcome, controlling for the other
predictors.
```{r, echo = FALSE}
## this package only used for demo purposes
library(plotly)
m <- lm(neuroticism ~ stress + selfesteem, data = db)
stress <- seq(from = min(db$stress, na.rm=TRUE), to = max(db$stress, na.rm=TRUE),
length.out = 100)
selfesteem <- seq(from = min(db$selfesteem, na.rm=TRUE), to = max(db$selfesteem, na.rm=TRUE),
length.out = 100)
neuroticism <- expand.grid(stress = stress, selfesteem = selfesteem)
neuroticism$neuroticism <- predict(m, newdata = neuroticism)
neuroticism <- as.matrix(
reshape(neuroticism, v.names = "neuroticism", timevar = "selfesteem",
idvar = "stress", direction = "wide")[, -1])
plot_ly(x = ~ stress, y = ~ selfesteem, z = ~ neuroticism) %>% add_surface()
```
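To see the difference between a simple and a unique (adjusted)
association numerically, the sketch below compares the stress
coefficient from a simple regression with the stress coefficient from
the multiple regression. This is only an illustration using the same
variables as the figure above.
```{r}
## simple association of stress with neuroticism
coef(lm(neuroticism ~ stress, data = db))["stress"]
## unique association of stress with neuroticism, controlling for selfesteem
coef(lm(neuroticism ~ stress + selfesteem, data = db))["stress"]
```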
# Generalized Linear Models
The linear regression models we saw fit more broadly into what are
often called "linear models", which is a class of models that are all
fundamentally equivalent in that they are linear models and include
t-tests, analysis of variance, pearson correlations, and linear
regressions as specific types of linear models.
All these cases share in common that a linear association between
predictors or explanatory variables and the outcome is assumed, and
the outcome must be continuous and is assumed to be normally
distributed.
The linear model is itself a *special case* of the
**generalized linear model** or **GLM**. In particular, the GLM
extends the linear model to different outcomes. A few examples are:
- continuous, normally distributed variables (linear regression)
- binary 0/1 variables (logistic regression, probit regression)
- count variables (Poisson regression, negative binomial regression)
This is beneficial because in the real world it is fairly common to
come across a variable that is not normally distributed and yet may be
important to study. In this content, we are going to focus on the GLM
for normal outcomes, i.e., linear regression. However, we will move
towards a GLM framework, as seeing and understanding linear regression
through the GLM framework will allow us to use other types of GLMs for
non-normal data more easily.
## Notation and some Theory
We already learned about linear regression
$$ y_i = b_0 + b_1 * x_{1i} + ... + b_k * x_{ki} + \varepsilon_i $$
First, we separate the model into pieces. For simplicity, I am going
to drop the "i" subscript and use a boldface font to indicate when a
variable is a vector that would have different observed values for
each person. The parameters (e.g., $b_0$) are not bolded because they
are not vectors at this point.
$$ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} $$
$$ \boldsymbol{y} = \boldsymbol{\eta} + \boldsymbol{\varepsilon} $$
You have probably heard many times that in linear regression, we
assume a *normal* distribution. Before we jump into the assumption for
linear regression, we need to talk a bit more about **probability
distributions**. A probability distribution is a function that
maps a value to a particular probability of occuring. Most
probability distributions we deal with, such as the normal
distribution, is not a single distribution, but a family of
distributions. This is because the specific normal distribution we get
depends on some parameters.
In particular, the normal distribution has two parameters.
- mean ($\mu$), also sometimes called the "location" because it
controls the location of the center of the distribution.
- standard deviation ($\sigma$), also sometimes called the "scale"
because it controls the scale or spread of the distribution.
Formally, the normal distribution can be written:
$$ \mathcal{N}(\mu, \sigma) $$
When we talk about a normal distribution, we may think of the standard
normal distribution, which has mean zero and unit variance/standard
deviation:
$$ \mathcal{N}(0, 1) $$
However, there are many normal distributions. The following figure
shows a few different normal distributions, some with the same mean
and different standard deviations, some with different means.
What this shows is that, to work out probabilities, it is not enough
to say you are assuming a normal distribution; you must also specify
which specific normal distribution by giving its parameters, the mean
and standard deviation.
```{r, echo = FALSE, fig.width = 6, fig.height = 4}
ggplot(data.frame(x = seq(-4, 4, by = .05)), aes(x)) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 2),
colour = "blue", size = 1) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
colour = "black", size = 1) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 0.5),
colour = "red", size = 1) +
stat_function(fun = dnorm, args = list(mean = 1, sd = 0.5),
colour = "green", size = 1) +
theme_pubr() +
xlab("x") + ylab("Density of the normal distribution")
```
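As a small numeric complement to the figure (an illustration, not part
of the original analysis), the density assigned to the same value of
$x$ differs depending on which normal distribution, that is, which
mean and standard deviation, we specify.
```{r}
## density at x = 1 under three different normal distributions
dnorm(1, mean = 0, sd = 1)
dnorm(1, mean = 0, sd = 2)
dnorm(1, mean = 1, sd = 0.5)
```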
Getting back to linear regression, what does it really mean when we
say that normality is assumed? You may have heard that the outcome has
to be normally distributed, but that is not quite true. In fact, what
we assume is that the outcome is conditionally normal, conditional on
the model. In practice, this tends to be written in one of two ways,
which are effectively equivalent.
$$ \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\eta}, \sigma_\varepsilon) $$
This is read, "y is distributed as a normal distribution with mean
equal to $\boldsymbol{\eta}$ and standard deviation $\sigma_\varepsilon$".
$$ \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma_\varepsilon) $$
The only difference is whether we use the residuals and assume a mean
of 0 (because residuals are always constructed to have mean 0) or the
raw outcome variable, but assume mean equal to $\boldsymbol{\eta}$. In
effect, we are assuming that the outcome is distributed normally
around the regression line, with standard deviation based on the
standard deviation of the residuals.
------------------------------------------
Any GLM has to be a linear model at some level, that is, this is
always true:
$$ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} $$
indeed, even t-tests and ANOVAs can be fit into this framework.
The GLM generalizes linear regression by defining a *function* that
links or transforms $\boldsymbol{\eta}$ from a linear space to the
outcome space.
The link function is conventionally called $g()$. Likewise, an inverse
link function exists that simply "undoes" whatever the link function
did. This is called $g^{-1}()$.
In normal linear regression, the link function and inverse link
function are the identity function, by default, so *there is no
transformation*. For other GLMs, the link and inverse link functions
do some transformation.
In terms of formulae, we define the expected value of
$\boldsymbol{y}$, its mean, as $\mu$
- $E(y) = \mu = mean~~of~~y$
Then
- $\eta = g(\mu)$
- $\mu = g^{-1}(\eta)$
For normal linear regression because the link and inverse link
functions are the identity function this works out to be
- $\eta = g(\mu) = \mu$
- $\mu = g^{-1}(\eta) = \eta$
The powerful concept of the GLM is that by using link functions, the
linear model
$$ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} $$
always stays the same whether the outcome is continuous and normally
distributed or binary and not normally distributed or count or any
other type of variable and assumed distribution. This means that, in
some sense, all GLMs are interpreted like linear regression, which
facilitates modelling and interpretation.
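Although we focus on linear regression here, the sketch below shows
(as an illustration) how the link function becomes explicit in `R`'s
`glm()` function: a gaussian family with the identity link reproduces
`lm()`, while other families and links would be specified in the same
way for non-normal outcomes.
```{r}
## gaussian family with identity link: the same model as lm()
coef(glm(neuroticism ~ stress, data = db, family = gaussian(link = "identity")))
coef(lm(neuroticism ~ stress, data = db))
## for a binary outcome one would instead use, e.g.,
## family = binomial(link = "logit"); no binary outcome is modelled here
```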
Now, let's take a look at linear regression in `R` and work through
examples running and interpretting the results.
# Linear Regression in `R`
Linear regression in `R` is generally conducted using the `lm()`
function, which fits a **l**inear **m**odel. It uses a formula
interface to specify the desired model, with the format:
`outcome ~ predictor`. We can store the results of the linear
regression in an object, `m`, and then call various functions on this
to get further information, for example `summary()` to get a quick
summary or `confint()` to get confidence intervals.
When we use `summary()`, we get quite a bit of output. The headings
are:
- Call: this is the code we used to call `R` to produce the model; it is
a handy reminder of the predictors and outcome, especially if you are
running and saving many models.
- Residuals: this section provides some descriptive statistics on the
residuals.
- Coefficients: this is really the main table normally reported for
regressions. The column "Estimate" gives the model parameter
estimates or regression coefficients (the $b$s). The column "Std. Error" gives
the standard error (SE) for each coefficient, a measure of
uncertainty. The column "t value" gives the t-value for each
parameter, used to calculate the p-values, and defined as
$\frac{b}{SE}$. Finally the column "Pr(>|t|)" gives the probability
value, the p-value for each coefficient. The significance codes
below are used to indicate what the asterisks mean, but really you
can just interpret the exact p-values yourself.
- Next is the residual standard error,
$\sigma_\varepsilon$ and the residual degrees of freedom (used with
the t-value to calculate p-values).
- Next are the overall model $R^2$, the proportion
of variance in the outcome explained by the model in this specific
sample data and the adjusted $R^2$, a better estimate of the true
population $R^2$.
- Finally the F-statistic and degrees of freedom and p-value are
provided. The F-test is an overall test of whether the model is
statistically significant overall or not. It will test all the
predictor(s) simultaneously. Because we only have one predictor
right now, the p-value for the F-test is identical to the p-value
for the t-test of the `stress` coefficient. In multiple regression
models, they will not be identical.
Using `confint()` gives us 95% confidence intervals for the parameter
estimates, the regression coefficients.
```{r}
m <- lm(neuroticism ~ stress, data = db)
summary(m)
confint(m)
```
`r margin_note("If you look at this in the Rmarkdown file, its quite complex as I am having R input the relevant numbers directly into the text. You don't need to do this. I hate copying and pasting and want to make sure I don't make errors that may confuse interpretation.")`
We could interpret the results of the regression as follows.
There
`r ifelse(coef(summary(m))["stress", "Pr(>|t|)"] < .05, "was", "was not")`
a statistically significant association between stress and
neuroticism. A one unit higher stress score was associated with a
`r sprintf("%0.2f", round(coef(m)[["stress"]], 2))`
[95% CI `r sprintf("%0.2f - %0.2f", round(confint(m)["stress", 1], 2), round(confint(m)["stress", 2], 2))`]
higher neuroticism score,
`r formatPval(coef(summary(m))["stress", "Pr(>|t|)"], includeP = TRUE)`.
Stress explained
`r sprintf("%0.1f", summary(m)$adj.r.squared * 100)`% of the variance
in neuroticism.
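If you want to pull those numbers out of the model object yourself
(for example, to report them in text), the following sketch shows
where each piece used above comes from.
```{r}
coef(m)[["stress"]]                     # regression coefficient for stress
confint(m)["stress", ]                  # 95% confidence interval
coef(summary(m))["stress", "Pr(>|t|)"]  # p-value
summary(m)$adj.r.squared                # adjusted R^2
```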
We can also use the `visreg` package to help us visualize the results
from the regression to better understand what it means. The defaults
work fairly well in this case, giving us the regression line (in blue),
with the shaded grey region showing the 95% confidence interval around
the regression line. Partial residuals are plotted as well by
default. We use the argument, `gg = TRUE` to ask `visreg()` to make it
a ggplot2 graph which lets us do things like customize the theme and
work with the usual ggplot2 graphing framework we are accustomed to
now.
```{r, fig.width = 6, fig.height = 4, fig.cap = "simple regression plot from visreg"}
visreg(m, xvar = "stress", gg = TRUE) +
theme_pubr()
```
If you want just the regression line, you can set `partial = FALSE`
and `rug = FALSE` to stop plotting partial residuals. As with other
graphs in ggplot2, we can use `+` to add additional elements, beyond
the theme, like controlling the title.
```{r, fig.width = 6, fig.height = 4, fig.cap = "simple regression plot from visreg with customization"}
visreg(m, xvar = "stress", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
ggtitle("Linear regression of neuroticism on stress",
subtitle = "Shaded region shows 95% confidence interval.")
```
You can use the `annotate()` function to add annotations to the graph,
such as a label indicating the coefficient and p-value. Figuring out
the best position and size to be attractive is typically a process of
some trial and error in any graph. Play around with the x and y
coordinates and the size until you are happy with it.
```{r, fig.width = 6, fig.height = 4, fig.cap = "simple regression plot from visreg with even more customization"}
visreg(m, xvar = "stress", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
annotate("text", x = 22, y = 8, label = "b = 0.14, p = .003", size = 5) +
ggtitle("Linear regression of neuroticism on stress",
subtitle = "Shaded region shows 95% confidence interval.")
```
## Categorical Predictors in Linear Regression
`r margin_note("You might find this page helpful for further explanation of dummy coding: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-dummy-coding/")`
In addition to continuous variables, categorical variables can be
included as predictors / explanatory variables in linear
regression. The most common approach to including a categorical
predictor is to use dummy coding. The basic idea behind dummy coding
is to turn a categorical predictor with $k$ levels into $k - 1$
separate dummy coded variables (one level serves as the reference
group), where each dummy coded variable is coded as 1
for a particular level of the variable and 0 otherwise. The simplest
case of this is a binary variable, a categorical variable with only
two levels (e.g., younger adults, older adults) where you might create
a new dummy coded variable where 1 = older adults and 0 = younger
adults. This dummy coded variable is now a numeric predictor (it has
0s and 1s) and can be included in regression as usual.
The only real difference between interpreting a dummy coded variable
and a regular continuous predictor is that: (1) the 0 point is not
arbitrary, it represents one particular group (e.g., younger adults)
and (2) a one unit change in the dummy coded variable is not
arbitrary, it represents the difference between groups (because 0 to 1
**is** a one unit change).
`R` automatically dummy codes any factor variable you enter into a
linear regression as a predictor, so you do not have to do it
manually. If the variable is not a factor, you may want to convert it
to a factor first using `factor()`.
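For example, a manual dummy code could be created as below (a
hypothetical sketch using a new column name, `AgeDummyExample`; this
mirrors what `R` does behind the scenes for a two-level factor).
```{r}
## manual dummy coding: 1 = older adults (>= 22y), 0 = younger adults (< 22y)
db[, AgeDummyExample := ifelse(AgeCat == ">= 22y", 1L, 0L)]
table(db$AgeCat, db$AgeDummyExample)
```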
Let's take a look at a simple linear regression with a categorical
predictor, `AgeCat`. We will store these results in `m2`, for
model 2. We can also graph the results.
```{r, fig.width = 6, fig.height = 4, fig.cap = "simple regression plot from visreg with categorical predictor"}
m2 <- lm(neuroticism ~ AgeCat, data = db)
summary(m2)
visreg(m2, xvar = "AgeCat", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
ggtitle("A categorical predictor")
```
In this simple example, since there are no other predictors, the
intercept (the expected value of neuroticism when the predictors are 0)
is the mean neuroticism level in the younger adults, and the
regression coefficient for `AgeCat` is the **difference** in
neuroticism on average between younger and older adults, which we can
see
`r ifelse(coef(summary(m2))["AgeCat>= 22y", "Pr(>|t|)"] < .05, "is", "is not")`
statistically significant.
Incidentally, the results from this linear regression with dummy
coding will exactly match what we would get from a t-test (compare the
mean in the younger group with the intercept, and the mean difference
as well as the p-value for `AgeCat`).
```{r}
t.test(neuroticism ~ AgeCat, data = db, var.equal = TRUE)
```
It is also the same as we would get from an ANOVA (compare the
F-values and the p-values).
```{r}
summary(aov(neuroticism ~ AgeCat, data = db))
```
## Linear Regression Diagnostics in `R`
Linear regressions have a few assumptions. They assume that:
- $\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\eta}, \sigma_\varepsilon)$ (normality)
- observations are independent of each other (independence)
- the residual variance, $\sigma_\varepsilon$, is homogeneous
(homogeneity of variance).
The normality assumption is typically tested by comparing the
residuals from the model against a normal distribution.
The independence assumption cannot readily be tested, but we normally
assume it is true if participants are independent of each other (e.g.,
are not siblings, we don't have repeated measures on the same person,
etc.).
The homogeneity of variance assumption is typically assessed by
plotting the residuals against predicted values. The expectation is
that across the range of predicted values, the spread (variance) of
the residuals is about equal and that there should not be any
systematic trends in the residuals.
Although not an assumption, we often want to identify outliers or
extreme values that may unduly influence our findings. This is often
done by examining the residuals from the model.
Most of these diagnostics and assumptions can be checked
graphically. We use the `modelDiagnostics()` function to calculate
diagnostics and the `plot()` function to create a visual display.
This should be very familiar from the Data Visualization 1 topic, where
we used `testDistribution()` and `plot()` to test a specific variable.
It is a similar idea, but now operating on the residuals.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Regression model diagnostics. Left panel shows a plot of the residuals to assess normality and they will be highlighted black if an extreme value. The right panel shows a plot of the predicted versus residual values to help assess the homogeneity of variance assumption."}
m <- lm(neuroticism ~ stress, data = db)
## assess model diagnostics
md <- modelDiagnostics(m, ev.perc = .005)
plot(md, ncol = 2, ask = FALSE)
```
The **left panel** shows a density plot of the residuals (solid black
line) and for reference the density of what a perfectly normal
distribution would look like with the same mean and residual variance
(dashed blue line). The rug plot under the density plot (small,
vertical lines) shows where raw data points fall. The x axis labels
(for the Residuals) shows the minimum, 25th percentile, median, 75th
percentile, and maximum, providing a quick quantitative summary. The
bottom of the left panel shows a QQ deviates plot, this is the same as
a regular QQ plot but it has been rotated so that the line is
horizontal instead of diagonal to take up less space. If the points
fall exactly on the line, that indicates they fall exactly where they
would be expected under a normal distribution. Both the rug plot and
the dots plotted in the QQ deviates plot are light grey if they are
not extreme and are colored solid black if they meet the criteria for
an extreme value, as specified in the call to `modelDiagnostics()`.
The **right panel** shows a scatter plot of the model predicted values
against the residuals. This can help assess the homogeneity of variance
assumption. If the residual variance is homogeneous, we would expect
the spread of the residuals (the y axis) to be about the same at all
levels of the predicted values. Sometimes this is very easy to see
visually. Sometimes it is not so easy to see. To aid in the visual
interpretation, quantile regression lines are added. Quantile
regression lines allow a regression model to predict the 10th and 90th
percentiles. These predicted quantiles are added as dashed blue lines.
Under the assumption of homogeneous variance, one would expect these
lines to be approximately horizontal and parallel to each other. If
they are not parallel, that indicates that the variance is larger or
smaller for some predicted values than for others, known as
heterogeneous variance. The solid blue line in the middle is a loess
smooth line, which also would ideally be about flat and about zero, to
indicate that there is no systematic bias in the residuals. Systematic
bias can occur, for example, when the residuals are always positive
for low predicted values and always negative for high predicted
values; this typically indicates ceiling or floor effects in the data
and/or issues that would ideally be addressed by revising the
model. As with the density plot, the axes in the scatter plot show the
minimum, 25th percentile, median, 75th percentile, and maximum,
providing a nice quantitative summary of the residuals and predicted
values.
These graphs look fairly good. They are not perfect, but
real data never are, and the degree of variation in the spread of
residuals is not extreme. We might be reasonably satisfied that the
assumptions of normality and homogeneity of variance are met, or at
least not terribly violated. There also do not appear to be extreme
values in the residuals.
For the sake of example, let's look at another outcome measure,
`Duration__in_seconds_`, which is how many seconds it took for each
survey to be completed. Again, we will use `stress` as a predictor.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Regression model diagnostics. Left panel shows a plot of the residuals to assess normality and they will be highlighted black if an extreme value. The right panel shows a plot of the predicted versus residual values to help assess the homogeneity of variance assumption."}
malt <- lm(Duration__in_seconds_ ~ stress, data = db)
## assess model diagnostics
maltd <- modelDiagnostics(malt, ev.perc = .005)
plot(maltd, ncol = 2, ask = FALSE)
```
These graphs show clear problems in the model. The residuals are not
at all normally distributed. There are two extreme values, based on
the current criteria. The homogeneity of variance plot does not appear
as bad, but the extreme values on the residuals are apparent. At a
minimum, we would want to remove these extreme values. However, the
marked non-normality also is a concern. We could try a log
transformation of the outcome.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Regression model diagnostics after log transforming the outcome."}
malt2 <- lm(log(Duration__in_seconds_) ~ stress, data = db)
## assess model diagnostics
malt2d <- modelDiagnostics(malt2, ev.perc = .005)
plot(malt2d, ncol = 2, ask = FALSE)
```
Even after the log transformation, the extreme values remain. Let's
remove those. Extreme values are stored and accessible within the
diagnostics object output, `malt2d`. It shows the scores on the
outcome, the index (the row in the data the extreme values come from)
and the effect type where they are an extreme value, in this case,
extreme on the residuals. We use the `Index` values to remove them
from the dataset and then refit our model. We can use `-` in front of
the indices to tell `data.table` that, instead of selecting those rows,
we want to **exclude** those rows.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Regression model diagnostics after log transforming the outcome and excluding extreme values."}
malt2d$extremeValues
malt3 <- lm(log(Duration__in_seconds_) ~ stress,
data = db[-malt2d$extremeValues$Index])
## assess model diagnostics
malt3d <- modelDiagnostics(malt3, ev.perc = .005)
plot(malt3d, ncol = 2, ask = FALSE)
```
`r margin_note("If you would like some more examples or additional explanation, have a look at: http://joshuawiley.com/JWileymisc/articles/diagnostics-vignette.html")`
After excluding the initial extreme values, one new extreme value
emerges. However, it is relatively less extreme and overall the model
diagnostics look much better now.
## Multiple Linear Regression in `R`
You can conduct multiple linear regressions in `R` almost the same way
as you would a simple linear regression. Additional predictors can be
added by simply using `+` between them. The following code shows an
example adding both a continuous and categorical predictor
simultaneously. Next, we run diagnostics, and only after diagnostics
take a look at the summary. We leave the `summary()` until after
diagnostics on purpose, as there is not much point interpreting the
coefficients if the model assumptions are badly violated. We would
want to first fix the model and then interpret.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Multiple regression model diagnostics."}
m3 <- lm(neuroticism ~ stress + AgeCat, data = db)
m3d <- modelDiagnostics(m3, ev.perc = .005)
plot(m3d, ncol = 2, ask = FALSE)
summary(m3)
```
We can also get a visual summary of the model using `visreg()`. We
separate the plot by `AgeCat`; since it is a categorical predictor, it
is easy to make separate lines for each age group. So that the lines
are plotted in the same graph and not in separate graphs, we specify
`overlay = TRUE`.
```{r, fig.width = 7, fig.height = 5, fig.cap = "multiple regression plot from visreg"}
visreg(m3, xvar = "stress", by = "AgeCat", partial = FALSE, rug = FALSE,
overlay = TRUE, gg = TRUE) +
theme_pubr() +
ggtitle("A multiple regression - separate lines for each age group")
```
In this case, `AgeCat` has such a minor effect on the outcome that the
two lines are virtually on top of each other. If `AgeCat` had a larger
effect, they would be more separated.
We could interpret the results of the regression as follows.
There
`r ifelse(coef(summary(m3))["stress", "Pr(>|t|)"] < .05, "was", "was not")`
a statistically significant association between stress and
neuroticism, controlling for age group. A one unit higher stress score was associated with a
`r sprintf("%0.2f", round(coef(m3)[["stress"]], 2))`
[95% CI `r sprintf("%0.2f - %0.2f", round(confint(m3)["stress", 1], 2), round(confint(m3)["stress", 2], 2))`]
higher neuroticism score,
`r formatPval(coef(summary(m3))["stress", "Pr(>|t|)"], includeP = TRUE)`.
There
`r ifelse(coef(summary(m3))["AgeCat>= 22y", "Pr(>|t|)"] < .05, "was", "was not")`
a statistically significant association between age group and
neuroticism, controlling for stress. Compared to younger adults, older
adults had
`r sprintf("%0.2f", round(coef(m3)[["AgeCat>= 22y"]], 2))`
[95% CI `r sprintf("%0.2f - %0.2f", round(confint(m3)["AgeCat>= 22y", 1], 2), round(confint(m3)["AgeCat>= 22y", 2], 2))`]
higher neuroticism scores,
`r formatPval(coef(summary(m3))["AgeCat>= 22y", "Pr(>|t|)"], d = 2, includeP = TRUE)`.
Overall, stress and age group explained
`r sprintf("%0.1f", summary(m3)$adj.r.squared * 100)`% of the variance
in neuroticism.
When running a simple linear regression, the overall model variance
explained, $R^2$, serves as an effect size estimate. However, in
multiple regression, that only serves as an effect size for the
overall model. In this case, even though we can tell that most of the
effect is clearly attributable to stress, not age group, we do not
have a specific quantitative effect size estimate for each predictor.
We can get a nice set of tests for a model using the `modelTest()`
function and have a nicely formatted set of output using the
`APAStyler()` function. We wrap it in `knitr::kable()` to get nice
output in our HTML file.
```{r, results = "asis"}
m3test <- modelTest(m3)
knitr::kable(APAStyler(m3test))
```
The table has several parts. The "Term" column tells us what we are looking at,
the "Est" column gives us the values, and the "Type" column is the
broad category. First up, we have the fixed effects; they are called
fixed effects because they do not vary. These are our model
parameters, the regression coefficients. What is reported is the
estimates, asterisks to indicate p-values, and the 95% confidence
intervals in brackets. Rounding to two decimal places is the default,
consistent with standard APA style.
`r margin_note("According to Cohen's (1988) guidelines, f2 >= 0.02, f2
>= 0.15, and f2 >= 0.35 represent small, medium, and large effect
sizes, respectively.")`
The next section presents information about the overall model, like
how many observations were included, the log likelihood (logLik), the
degrees of freedom for the overall log likelihood, information
criterion (AIC, BIC), which we will learn about later for model
comparisons, and several overall model effect sizes: Cohen's $f^2$
effect size, the $R^2$, and the adjusted $R^2$. Cohen's $f^2$ effect size
for the overall model is defined as
$$ \frac{R^{2}}{1 - R^{2}} $$
it is based on the model $R^2$ but divided by the variance not
explained, $1 - R^{2}$.
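For example (an illustration using the model `m3` from above), the
overall $f^2$ can be computed directly from the model $R^2$.
```{r}
## Cohen's f2 for the overall model, computed from R^2
r2 <- summary(m3)$r.squared
r2 / (1 - r2)
```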
Finally, at the bottom are Cohen's $f^2$ effect sizes for each
individual predictor, along with p-values for each predictor
overall. Cohen's $f^2$ effect size for individual coefficients is
defined as:
$$ \frac{R_{AB}^{2} - R_{A}^{2}}{1 - R_{AB}^{2}} $$
`r margin_note("If you would like some more examples or additional
explanation, have a look at:
http://joshuawiley.com/JWileymisc/articles/model-test-vignette.html")`
where what is compared is the variance explained by the overall
model minus the variance explained by a model excluding that specific
predictor. In essence, this provides an estimate of the unique
variance in the outcome explained by each predictor, relative to the
total variance not explained. Cohen advocated this approach because,
for example, if a model already explains 90% of the variance in an
outcome, it is very difficult to explain an additional 9% (that would
bring the model to nearly 100% variance explained). In contrast, if a
model explains 0% of the variance, adding one more predictor may have
an easier time explaining 9% of the variance. Thus Cohen's $f^2$
effect size accounts for this. The square root of this often is used
in power analyses for regression as well (e.g., in G*Power).
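As an illustration of that formula (note that `modelTest()` computes
these for you, and with missing data the full and reduced models below
could use slightly different observations, so this is only a sketch of
the idea), Cohen's $f^2$ for the stress coefficient compares the model
with and without stress.
```{r}
## f2 for stress: variance uniquely explained by stress, relative to
## the variance the full model leaves unexplained
r2_full    <- summary(lm(neuroticism ~ stress + AgeCat, data = db))$r.squared
r2_reduced <- summary(lm(neuroticism ~ AgeCat, data = db))$r.squared
(r2_full - r2_reduced) / (1 - r2_full)
```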
## Moderation in `R`
Sometimes the association between two variables depends on a third
variable. This is called moderation. The way that we formally test for
moderation in regression is by adding an interaction term. An
interaction term is the product of two (or more) variables. In an
equation:
$$ \boldsymbol{y} = b_0 + b_1 * \boldsymbol{x} + b_2 * \boldsymbol{w} + b_3 * \boldsymbol{x} * \boldsymbol{w} + \boldsymbol{\varepsilon} $$
The interaction term gets an additional regression coefficient
(another model parameter), $b_3$ in this example, that captures how
different the association of each individual variable ($b_1$ or $b_2$)
with the outcome is depending on the level of the other variable.
`r margin_note("As we are focused on moderation here, we are not
evaluating model diagnostics. In research or applied settings, you
would definitely want to conduct diagnostics before being confident
the model results are trustworthy.")`
It is very easy to include interactions in `R`. Inside of a
regression formula, we use the `*` operator, which expands to mean:
give me the main effects and the interactions between these
variables. Here is an example evaluating an interaction between stress
and age group on neuroticism. Note that when there are interactions,
`modelTest()` does not deal well with categorical predictors that are
automatically dummy coded. The solution is to manually dummy code.
The following code estimates the model and makes a nice summary with
effect sizes.
```{r, results = "asis"}
db[, AgeDummy := ifelse(AgeCat == ">= 22y", 1L, 0L)]
mint <- lm(neuroticism ~ stress * AgeDummy, data = db)
knitr::kable(APAStyler(modelTest(mint)))
```
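As a side note on the formula syntax (an illustrative check, not part
of the original analysis), `stress * AgeDummy` is shorthand for the
main effects plus the interaction term, so the two formulas below fit
the same model.
```{r}
all.equal(
  coef(lm(neuroticism ~ stress * AgeDummy, data = db)),
  coef(lm(neuroticism ~ stress + AgeDummy + stress:AgeDummy, data = db))
)
```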
We can also use `visreg()` to make a nice interaction plot.
```{r, fig.width = 7, fig.height = 5, fig.cap = "moderated regression plot from visreg"}
visreg(mint, xvar = "stress", by = "AgeDummy", partial = FALSE, rug = FALSE,
overlay = TRUE, gg = TRUE) +
theme_pubr() +
ggtitle("A moderated multiple regression")
```
Now that we have the output and the interaction plot, let's interpret
what we found. In the presence of an interaction, the variables
involved in the interaction on their own become simple effects.
There was not a statistically significant interaction between stress
and age group (b = -0.16, p = .099, $f^2 = 0.05$). In practice, if
this interaction was not central to your research question, you would
likely drop the interaction, re-run the model, and interpret the
results. However, to show what could be interpreted if there were a
significant interaction, we will proceed.
There was a significant simple effect of stress
on neuroticism, such that for younger adults (i.e., when AgeDummy =
0), each one unit higher stress was associated with 0.25 higher
neuroticism [95% CI .09 - .40], p = .002, and this was a moderate
effect, $f^2 = 0.20$. There was no simple effect
of age group on neuroticism when stress was held at 0 (b = 4.38 [95%
CI -1.00, 9.77], p = .109), and the effect size was small, $f^2 =
.05$. Although the interaction was not significant, a graph of the
interaction showed that the pattern was that the association between
stress and neuroticism was stronger in younger adults than in older
adults. The overall model explained 20% of the variance in neuroticism
in the sample.
It also is possible to have an interaction between two continuous
variables. The `R` code is virtually identical, but we can no longer
talk about "groups". We will use an almost identical model, but with
continuous age in years.
```{r, results = "asis"}
mint2 <- lm(neuroticism ~ stress * age, data = db)
knitr::kable(APAStyler(modelTest(mint2)))
```
To better understand what the interaction of two continuous variables
means, we can look at it in 3D. What we can see in the following
figure is that there is a continuous "bending" in the association
between either individual variable and neuroticism depending on the
level of the other. What used to be a flat plane is now "warped".
```{r, echo = FALSE}
## this package only used for demo purposes
library(plotly)
stress <- seq(from = min(db$stress, na.rm=TRUE), to = max(db$stress, na.rm=TRUE),
length.out = 100)