---
title: "Generalized Linear Models 1"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html:
    toc: true
    number_sections: true
---
```{r, include=FALSE, echo=FALSE}
library(tufte)
```
# Setup
You can download the `R` markdown file for this content here:
[https://jwiley.github.io/MonashHonoursStatistics/GLM1.rmd](https://jwiley.github.io/MonashHonoursStatistics/GLM1.rmd).
We will be using a few packages for this content, along with only the
baseline data from the data collection exercise.
```{r setup}
library(haven)
library(data.table)
library(JWileymisc)
library(ggplot2)
library(ggpubr)
library(visreg)
## read in data
db <- as.data.table(read_sav("B 19032020.sav")) # baseline
## create a categorical age variable
db[age < 22, AgeCat := "< 22y"]
db[age >= 22, AgeCat := ">= 22y"]
db[, AgeCat := factor(AgeCat, levels = c("< 22y", ">= 22y"))]
```
# Simple Linear Regression
Before we talk about generalized linear models, let's review linear
regression. A simple linear regression is based on the equation:
$$ y_i = b_0 + b_1 * x_i + \varepsilon_i $$
where:
- $y$ is the outcome variable
- $x$ is the predictor/explanatory variable
- $\varepsilon$ is the residual/error term
- $b_0$ is the intercept, the expected (model predicted) value of $y$
when $x = 0$, written, $E(Y | x = 0)$
- $b_1$ is the slope of the line, and captures how much $y$ is
expected to change for a one unit change in $x$.
The subscript $i$ indicates that while the model *parameters*, $b_0$
and $b_1$ (the regression coefficients), are the same for all
participants, each person (or observation) has its own value of $y$
and its own value of $x$. Because the model is unlikely to be perfect,
there will be some unexplained residual, $\varepsilon_i$.
If we want to talk about only what is predicted based on the
regression coefficients (the model parameters), we often write it:
$$ \hat{y_i} = b_0 + b_1 * x_i $$
leaving off the residual error term, $\varepsilon_i$.
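To make the equation concrete, here is a small simulation (not from the
course data; the values of $b_0$, $b_1$, and the residual standard
deviation are made up purely for illustration) showing that `lm()`
recovers the parameters used to generate the data.
```{r}
## simulate data from y = b0 + b1 * x + e, with e ~ N(0, 1)
## (illustrative values only)
set.seed(1)
n <- 100
x <- rnorm(n)
b0 <- 1
b1 <- 0.5
y <- b0 + b1 * x + rnorm(n, mean = 0, sd = 1)
## the estimated coefficients should be close to b0 = 1 and b1 = 0.5
coef(lm(y ~ x))
```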
`r margin_note("If you are looking at the HTML file, you may wonder where the R code to make the graphs has gone. It is still there in the Rmarkdown file, but as this part of the content is focused on conceptual understanding, not how to create graphs in R, I have hidden it by asking R not to echo [show] the source code in the final HTML file being created from the Rmarkdown file.")`
Visually, the intercept, $b_0$, and slope, $b_1$,
coefficients/parameters look like the following figure.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Example image of a simple, linear regression model."}
ggplot() +
geom_hline(yintercept = 0, linetype = 3) +
geom_vline(xintercept = 0, linetype = 3) +
geom_abline(intercept = 0, slope = 0.5) +
scale_x_continuous("x (predictor/explanatory variable)",
breaks = c(-3, -2, -1, 0, 1, 2, 3)) +
scale_y_continuous("y (outcome variable)",
breaks = c(-1.5, -1, -0.5, 0, 0.5, 1, 1.5)) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-1.5, 1.5)) +
theme_pubr() +
annotate("point", x = 0, y = 0, colour = "black") +
annotate("text", x = -.1, y = .1, label = "b[0]", parse = TRUE) +
annotate("segment", x = 0, xend = 1, y = 0, yend = 0, colour = "blue") +
annotate("segment", x = 1, xend = 1, y = 0, yend = 0.5, colour = "blue") +
annotate("text", x = 1.1, y = 0.25, label = "b[1]", parse = TRUE, colour = "blue")
```
Changes to the intercept indicate that the height, sometimes called
the level, of the line is shifted. Positive values of the intercept,
that is $b_0 > 0$, shift the line up, as shown in the following
figure. Negative values of the intercept, that is $b_0 < 0$, shift the
line down. Just because the intercept changes does not mean the slope
has to change.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Example image of a simple, linear regression model with a different intercept value."}
ggplot() +
geom_hline(yintercept = 0, linetype = 3) +
geom_vline(xintercept = 0, linetype = 3) +
geom_abline(intercept = 1, slope = 0.5) +
scale_x_continuous("x (predictor/explanatory variable)",
breaks = c(-3, -2, -1, 0, 1, 2, 3)) +
scale_y_continuous("y (outcome variable)",
breaks = c(-1.5, -1, -0.5, 0, 0.5, 1, 1.5)) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-1.5, 1.5)) +
theme_pubr() +
annotate("point", x = 0, y = 1, colour = "black") +
annotate("text", x = -.1, y = 1.1, label = "b[0]", parse = TRUE) +
annotate("segment", x = 0, xend = 1, y = 1, yend = 1, colour = "blue") +
annotate("segment", x = 1, xend = 1, y = 1, yend = 1.5, colour = "blue") +
annotate("text", x = 1.1, y = 1.25, label = "b[1]", parse = TRUE, colour = "blue")
```
Changes to the slope indicate that the steepness of the line is
shifted. Positive values indicate a more positive (steeper) slope,
such that as $x$ increases, $y$ also is expected to increase. Negative
values indicate a more negative (more steeply declining) slope,
such that as $x$ increases, $y$ is expected to decrease.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Example image of a simple, linear regression model with a different slope."}
ggplot() +
geom_hline(yintercept = 0, linetype = 3) +
geom_vline(xintercept = 0, linetype = 3) +
geom_abline(intercept = 0, slope = -0.5) +
scale_x_continuous("x (predictor/explanatory variable)",
breaks = c(-3, -2, -1, 0, 1, 2, 3)) +
scale_y_continuous("y (outcome variable)",
breaks = c(-1.5, -1, -0.5, 0, 0.5, 1, 1.5)) +
coord_cartesian(xlim = c(-3, 3),
ylim = c(-1.5, 1.5)) +
theme_pubr() +
annotate("point", x = 0, y = 0, colour = "black") +
annotate("text", x = -.1, y = .1, label = "b[0]", parse = TRUE) +
annotate("segment", x = 0, xend = 1, y = 0, yend = 0, colour = "blue") +
annotate("segment", x = 1, xend = 1, y = 0, yend = -0.5, colour = "blue") +
annotate("text", x = 1.1, y = -0.25, label = "b[1]", parse = TRUE, colour = "blue")
```
In simple linear regression, we are not creating an arbitrary line, we
are creating the line of best fit. Here, "best fit" means the line
that minimizes the sum of squared residuals.
To better understand residuals, let's take a visual look at the
residuals in a simple linear regression. To keep this example small,
we will use a little dataset built into `R`, the `mtcars` dataset,
which has data on just 32 cars. The following graph shows the scatter
plot of two variables, the weight of cars (`wt`) and their horse power
(`hp`). The solid black line is the regression line of best fit, and
the blue lines between each observed value and the line are the
residuals. The residuals capture the difference between what the model
predicted and what was actually observed. That leads to another way we
can think about and write the equation for residuals: the difference
between the observed and predicted outcome values.
$$ \varepsilon_i = y_i - \hat{y_i} $$
```{r, echo = FALSE}
m <- lm(hp ~ wt, data = mtcars)
mtcars$yhat <- predict(m)
ggplot(mtcars, aes(x = wt, y = hp)) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") +
geom_point() +
geom_segment(aes(xend = wt, yend = yhat), colour = "blue") +
theme_pubr()
```
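As a quick check of that definition (added here for illustration), we
can compute the residuals by hand as observed minus predicted values
and compare them to what `residuals()` returns.
```{r}
## residuals are observed minus model-predicted values
m_cars <- lm(hp ~ wt, data = mtcars)
head(mtcars$hp - predict(m_cars))  # computed by hand
head(residuals(m_cars))            # extracted from the fitted model
```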
When we fit a linear regression, equations are solved to give
*estimates* of $b_0$ and $b_1$ that minimize the sum of squared
residuals, and those parameter estimates (or regression coefficients)
produce the line of best fit.
## Multiple Linear Regression
Multiple linear regression works, in principle, the same way as
simple linear regression. The difference is that more than one
predictor (explanatory) variable is allowed in a single model.
As before, we have an outcome, $y$, predicted by a combination of
model parameters, the $b$s, but instead of only one predictor, $x$, we
can have any number of predictors, $k$. It could be two predictors,
three, or twenty; the number does not change the underlying model.
$$ y_i = b_0 + b_1 * x_{1i} + ... + b_k * x_{ki} + \varepsilon_i $$
The regression coefficients (more generally referred to as model
parameters, where in this case the model is a regression) are
interpreted fairly similarly to those in simple linear regression,
but with some extra requirements.
- $b_0$ is the intercept, the expected (model predicted) value of $y$
when all predictors are 0, written $E(Y | x_1 = 0, ..., x_k = 0)$
- $b_1$ is the slope of the line, and captures how much $y$ is
expected to change for a one unit change in $x_1$, holding all other
predictors constant (or said equivalently, "controlling for all
other predictors").
- $b_k$ is the slope of the line, and captures how much $y$ is
expected to change for a one unit change in $x_k$, holding all other
predictors constant (or said equivalently, "controlling for all
other predictors").
- $\varepsilon$ is the residual/error term, the difference between the
model predicted value and the observed value.
It becomes more difficult to visualize multiple regression, but with
two predictors it remains possible with a 3D graph. The following
figure shows the regression surface predicted by a multiple regression
of neuroticism on both stress and selfesteem. Spend some time moving
it around to view it from different angles and see the unique
association between stress and neuroticism as well as the unique
association between selfesteem and neuroticism.
The word "unique" here is important as it conveys that this is not
two simple linear regressions: in a multiple regression, the
regression coefficients give the unique, or independent, association
of each predictor with the outcome, controlling for the other
predictors.
```{r, echo = FALSE}
## this package only used for demo purposes
library(plotly)
m <- lm(neuroticism ~ stress + selfesteem, data = db)
stress <- seq(from = min(db$stress, na.rm=TRUE), to = max(db$stress, na.rm=TRUE),
length.out = 100)
selfesteem <- seq(from = min(db$selfesteem, na.rm=TRUE), to = max(db$selfesteem, na.rm=TRUE),
length.out = 100)
neuroticism <- expand.grid(stress = stress, selfesteem = selfesteem)
neuroticism$neuroticism <- predict(m, newdata = neuroticism)
neuroticism <- as.matrix(
reshape(neuroticism, v.names = "neuroticism", timevar = "selfesteem",
idvar = "stress", direction = "wide")[, -1])
plot_ly(x = ~ stress, y = ~ selfesteem, z = ~ neuroticism) %>% add_surface()
```
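To see the difference between a simple and a unique (adjusted)
association numerically, the sketch below compares the stress
coefficient from a simple regression with the stress coefficient from
the multiple regression. This is only an illustration using the same
variables as the figure above.
```{r}
## simple association of stress with neuroticism
coef(lm(neuroticism ~ stress, data = db))["stress"]
## unique association of stress with neuroticism, controlling for selfesteem
coef(lm(neuroticism ~ stress + selfesteem, data = db))["stress"]
```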
# Generalized Linear Models
The linear regression models we saw fit more broadly into what are
often called "linear models", which is a class of models that are all
fundamentally equivalent in that they are linear models and include
t-tests, analysis of variance, pearson correlations, and linear
regressions as specific types of linear models.
All these cases share in common that a linear association between
predictors or explanatory variables and the outcome is assumed, and
the outcome must be continuous and is assumed to be normally
distributed.
The linear model is itself a *special case* of the
**generalized linear model** or **GLM**. In particular, the GLM
extends the linear model to different outcomes. A few examples are:
- continuous, normally distributed variables (linear regression)
- binary 0/1 variables (logistic regression, probit regression)
- count variables (Poisson regression, negative binomial regression)
This is beneficial because in the real world it is fairly common to
come across a variable that is not normally distributed and yet may be
important to study. In this content, we are going to focus on the GLM
for normal outcomes, i.e., linear regression. However, we will move
towards a GLM framework, as seeing and understanding linear regression
through the GLM framework will allow us to use other types of GLMs for
non-normal data more easily.
## Notation and some Theory
We already learned about linear regression
$$ y_i = b_0 + b_1 * x_{1i} + ... + b_k * x_{ki} + \varepsilon_i $$
First, we separate the model into pieces. For simplicity, I am going
to drop the "i" subscript and use a boldface font to indicate when a
variable is a vector that would have different observed values for
each person. The parameters (e.g., $b_0$) are not bolded because they
are not vectors at this point.
$$ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} $$
$$ \boldsymbol{y} = \boldsymbol{\eta} + \boldsymbol{\varepsilon} $$
You have probably heard many times that in linear regression, we
assume a *normal* distribution. Before we jump into the assumption for
linear regression, we need to talk a bit more about **probability
distributions**. A probability distribution is a function that
maps a value to a particular probability of occuring. Most
probability distributions we deal with, such as the normal
distribution, is not a single distribution, but a family of
distributions. This is because the specific normal distribution we get
depends on some parameters.
In particular, the normal distribution has two parameters.
- mean ($\mu$), also sometimes called the "location" because it
controls the location of the center of the distribution.
- standard deviation ($\sigma$), also sometimes called the "scale"
because it controls the scale or spread of the distribution.
Formally, the normal distribution can be written:
$$ \mathcal{N}(\mu, \sigma) $$
When we talk about a normal distribution, we may think of the standard
normal distribution, which has mean zero and unit variance/standard
deviation:
$$ \mathcal{N}(0, 1) $$
However, there are many normal distributions. The following figure
shows a few different normal distributions, some with the same mean
and different standard deviations, some with different means.
What this shows is that, to work out probabilities, it is not enough
to say you are assuming a normal distribution; you must also specify
which specific normal distribution by giving its parameters, the mean
and standard deviation.
```{r, echo = FALSE, fig.width = 6, fig.height = 4}
ggplot(data.frame(x = seq(-4, 4, by = .05)), aes(x)) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 2),
colour = "blue", size = 1) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
colour = "black", size = 1) +
stat_function(fun = dnorm, args = list(mean = 0, sd = 0.5),
colour = "red", size = 1) +
stat_function(fun = dnorm, args = list(mean = 1, sd = 0.5),
colour = "green", size = 1) +
theme_pubr() +
xlab("x") + ylab("Density of the normal distribution")
```
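As a small numeric complement to the figure (an illustration, not part
of the original analysis), the density assigned to the same value of
$x$ differs depending on which normal distribution, that is, which
mean and standard deviation, we specify.
```{r}
## density at x = 1 under three different normal distributions
dnorm(1, mean = 0, sd = 1)
dnorm(1, mean = 0, sd = 2)
dnorm(1, mean = 1, sd = 0.5)
```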
Getting back to linear regression, what does it really mean when we
say that normality is assumed? You may have heard that the outcome has
to be normally distributed, but that is not quite true. In fact, what
we assume is that the outcome is conditionally normal, conditional on
the model. In practice, this tends to be written in one of two ways,
which are effectively equivalent.
$$ \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\eta}, \sigma_\varepsilon) $$
This is read, "y is distributed as a normal distribution with mean
equal to $\boldsymbol{\eta}$ and standard deviation $\sigma_\varepsilon$".
$$ \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma_\varepsilon) $$
The only difference is whether we use the residuals and assume a mean
of 0 (because residuals are always constructed to have mean 0) or the
raw outcome variable, but assume mean equal to $\boldsymbol{\eta}$. In
effect, we are assuming that the outcome is distributed normally
around the regression line, with standard deviation based on the
standard deviation of the residuals.
------------------------------------------
Any GLM has to be a linear model at some level, that is, this is
always true:
$$ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} $$
indeed, even t-tests and ANOVAs can be fit into this framework.
The GLM generalizes linear regression by defining a *function* that
links or transforms $\boldsymbol{\eta}$ from a linear space to the
outcome space.
The link function is conventionally called $g()$. Likewise, an inverse
link function exists that simply "undoes" whatever the link function
did. This is called $g^{-1}()$.
In normal linear regression, the link function and inverse link
function are the identity function, by default, so *there is no
transformation*. For other GLMs, the link and inverse link functions
do some transformation.
In terms of formulae, we define the expected value of
$\boldsymbol{y}$, its mean, as $\mu$
- $E(y) = \mu = mean~~of~~y$
Then
- $\eta = g(\mu)$
- $\mu = g^{-1}(\eta)$
For normal linear regression because the link and inverse link
functions are the identity function this works out to be
- $\eta = g(\mu) = \mu$
- $\mu = g^{-1}(\eta) = \eta$
The powerful concept of the GLM is that by using link functions, the
linear model
$$ \boldsymbol{\eta} = b_0 + b_1 * \boldsymbol{x_1} + ... + b_k * \boldsymbol{x_k} $$
always stays the same whether the outcome is continuous and normally
distributed or binary and not normally distributed or count or any
other type of variable and assumed distribution. This means that, in
some sense, all GLMs are interpreted like linear regression, which
facilitates modelling and interpretation.
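Although we focus on linear regression here, the sketch below shows
(as an illustration) how the link function becomes explicit in `R`'s
`glm()` function: a gaussian family with the identity link reproduces
`lm()`, while other families and links would be specified in the same
way for non-normal outcomes.
```{r}
## gaussian family with identity link: the same model as lm()
coef(glm(neuroticism ~ stress, data = db, family = gaussian(link = "identity")))
coef(lm(neuroticism ~ stress, data = db))
## for a binary outcome one would instead use, e.g.,
## family = binomial(link = "logit"); no binary outcome is modelled here
```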
Now, let's take a look at linear regression in `R` and work through
examples running and interpretting the results.
# Linear Regression in `R`
Linear regression in `R` is generally conducted using the `lm()`
function, which fits a **l**inear **m**odel. It uses a formula
interface to specify the desired model, with the format:
`outcome ~ predictor`. We can store the results of the linear
regression in an object, `m`, and then call various functions on this
to get further information, for example `summary()` to get a quick
summary or `confint()` to get confidence intervals.
When we use `summary()`, we get quite a bit of output. The headings
are:
- Call: this is the code we used to call `R` to produce the model; it is
a handy reminder of the predictors and outcome, especially if you are
running and saving many models.
- Residuals: this section provides some descriptive statistics on the
residuals.
- Coefficients: this is really the main table normally reported for
regressions. The column "Estimate" gives the model parameter
estimates or regression coefficients (the $b$s). The column "Std. Error" gives
the standard error (SE) for each coefficient, a measure of
uncertainty. The column "t value" gives the t-value for each
parameter, used to calculate the p-values, and defined as
$\frac{b}{SE}$. Finally the column "Pr(>|t|)" gives the probability
value, the p-value for each coefficient. The significance codes
below are used to indicate what the asterisks mean, but really you
can just interpret the exact p-values yourself.
- Next is the residual standard error,
$\sigma_\varepsilon$ and the residual degrees of freedom (used with
the t-value to calculate p-values).
- Next are the overall model $R^2$, the proportion
of variance in the outcome explained by the model in this specific
sample data and the adjusted $R^2$, a better estimate of the true
population $R^2$.
- Finally the F-statistic and degrees of freedom and p-value are
provided. The F-test is an overall test of whether the model is
statistically significant overall or not. It will test all the
predictor(s) simultaneously. Because we only have one predictor
right now, the p-value for the F-test is identical to the p-value
for the t-test of the `stress` coefficient. In multiple regression
models, they will not be identical.
Using `confint()` gives us 95% confidence intervals for the parameter
estimates, the regression coefficients.
```{r}
m <- lm(neuroticism ~ stress, data = db)
summary(m)
confint(m)
```
`r margin_note("If you look at this in the Rmarkdown file, its quite complex as I am having R input the relevant numbers directly into the text. You don't need to do this. I hate copying and pasting and want to make sure I don't make errors that may confuse interpretation.")`
We could interpret the results of the regression as follows.
There
`r ifelse(coef(summary(m))["stress", "Pr(>|t|)"] < .05, "was", "was not")`
a statistically significant association between stress and
neuroticism. A one unit higher stress score was associated with a
`r sprintf("%0.2f", round(coef(m)[["stress"]], 2))`
[95% CI `r sprintf("%0.2f - %0.2f", round(confint(m)["stress", 1], 2), round(confint(m)["stress", 2], 2))`]
higher neuroticism score,
`r formatPval(coef(summary(m))["stress", "Pr(>|t|)"], includeP = TRUE)`.
Stress explained
`r sprintf("%0.1f", summary(m)$adj.r.squared * 100)`% of the variance
in neuroticism.
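If you want to pull those numbers out of the model object yourself
(for example, to report them in text), the following sketch shows
where each piece used above comes from.
```{r}
coef(m)[["stress"]]                     # regression coefficient for stress
confint(m)["stress", ]                  # 95% confidence interval
coef(summary(m))["stress", "Pr(>|t|)"]  # p-value
summary(m)$adj.r.squared                # adjusted R^2
```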
We can also use the `visreg` package to help us visualize the results
from the regression to better understand what it means. The defaults
work fairly well in this case, giving us the regression line (in blue),
with the shaded grey region showing the 95% confidence interval around
the regression line. Partial residuals are plotted as well by
default. We use the argument, `gg = TRUE` to ask `visreg()` to make it
a ggplot2 graph which lets us do things like customize the theme and
work with the usual ggplot2 graphing framework we are accustomed to
now.
```{r, fig.width = 6, fig.height = 4, fig.cap = "simple regression plot from visreg"}
visreg(m, xvar = "stress", gg = TRUE) +
theme_pubr()
```
If you want just the regression line, you can set `partial = FALSE`
and `rug = FALSE` to stop plotting partial residuals. As with other
graphs in ggplot2, we can use `+` to add additional elements, beyond
the theme, like controlling the title.
```{r, fig.width = 6, fig.height = 4, fig.cap = "simple regression plot from visreg with customization"}
visreg(m, xvar = "stress", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
ggtitle("Linear regression of neuroticism on stress",
subtitle = "Shaded region shows 95% confidence interval.")
```
You can use the `annotate()` function to add annotations to the graph,
such as a label indicating the coefficient and p-value. Figuring out
the best position and size to be attractive is typically a process of
some trial and error in any graph. Play around with the x and y
coordinates and the size until you are happy with it.
```{r, fig.width = 6, fig.height = 4, fig.cap = "simple regression plot from visreg with even more customization"}
visreg(m, xvar = "stress", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
annotate("text", x = 22, y = 8, label = "b = 0.14, p = .003", size = 5) +
ggtitle("Linear regression of neuroticism on stress",
subtitle = "Shaded region shows 95% confidence interval.")
```
## Categorical Predictors in Linear Regression
`r margin_note("You might find this page helpful for further explanation of dummy coding: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-dummy-coding/")`
In addition to continuous variables, categorical variables can be
included as predictors / explanatory variables in linear
regression. The most common approach to including a categorical
predictor is to use dummy coding. The basic idea behind dummy coding
is to turn a categorical predictor with $k$ levels into $k - 1$
separate dummy coded variables (one level serves as the reference
group), where each dummy coded variable is coded as 1
for a particular level of the variable and 0 otherwise. The simplest
case of this is a binary variable, a categorical variable with only
two levels (e.g., younger adults, older adults) where you might create
a new dummy coded variable where 1 = older adults and 0 = younger
adults. This dummy coded variable is now a numeric predictor (it has
0s and 1s) and can be included in regression as usual.
The only real difference between interpreting a dummy coded variable
and a regular continuous predictor is that: (1) the 0 point is not
arbitrary, it represents one particular group (e.g., younger adults)
and (2) a one unit change in the dummy coded variable is not
arbitrary, it represents the difference between groups (because 0 to 1
**is** a one unit change).
`R` automatically dummy codes any factor variable you enter into a
linear regression as a predictor, so you do not have to do it
manually. If the variable is not a factor, you may want to convert it
to a factor first using `factor()`.
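For example, a manual dummy code could be created as below (a
hypothetical sketch using a new column name, `AgeDummyExample`; this
mirrors what `R` does behind the scenes for a two-level factor).
```{r}
## manual dummy coding: 1 = older adults (>= 22y), 0 = younger adults (< 22y)
db[, AgeDummyExample := ifelse(AgeCat == ">= 22y", 1L, 0L)]
table(db$AgeCat, db$AgeDummyExample)
```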
Let's take a look at a simple linear regression with a categorical
predictor, `AgeCat`. We will store these results in `m2`, for
model 2. We can also graph the results.
```{r, fig.width = 6, fig.height = 4, fig.cap = "simple regression plot from visreg with categorical predictor"}
m2 <- lm(neuroticism ~ AgeCat, data = db)
summary(m2)
visreg(m2, xvar = "AgeCat", partial = FALSE, rug = FALSE, gg = TRUE) +
theme_pubr() +
ggtitle("A categorical predictor")
```
In this simple example, since there are no other predictors, the
intercept (the expected value of neuroticism when the predictors are 0)
is the mean neuroticism level in the younger adults, and the
regression coefficient for `AgeCat` is the **difference** in
neuroticism on average between younger and older adults, which we can
see
`r ifelse(coef(summary(m2))["AgeCat>= 22y", "Pr(>|t|)"] < .05, "is", "is not")`
statistically significant.
Incidentally, the results from this linear regression with dummy
coding will exactly match what we would get from a t-test (compare the
mean in the younger group with the intercept, and the mean difference
as well as the p-value for `AgeCat`).
```{r}
t.test(neuroticism ~ AgeCat, data = db, var.equal = TRUE)
```
It is also the same as we would get from an ANOVA (compare the
F-values and the p-values).
```{r}
summary(aov(neuroticism ~ AgeCat, data = db))
```
## Linear Regression Diagnostics in `R`
Linear regressions have a few assumptions. They assume that:
- $\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\eta}, \sigma_\varepsilon)$ (normality)
- observations are independent of each other (independence)
- the residual variance, $\sigma_\varepsilon$, is homogeneous
(homogeneity of variance).
The normality assumption is typically tested by comparing the
residuals from the model against a normal distribution.
The independence assumption cannot readily be tested, but we normally
assume it is true if participants are independent of each other (e.g.,
are not siblings, we don't have repeated measures on the same person,
etc.).
The homogeneity of variance assumption is typically assessed by
plotting the residuals against predicted values. The expectation is
that across the range of predicted values, the spread (variance) of
the residuals is about equal and that there should not be any
systematic trends in the residuals.
Although not an assumption, we often want to identify outliers or
extreme values that may unduly influence our findings. This is often
done by examining the residuals from the model.
Most of these diagnostics and assumptions can be checked
graphically. We use the `modelDiagnostics()` function to calculate
diagnostics and the `plot()` function to create a visual display.
This should be very familiar from the Data Visualization 1 topic, where
we used `testDistribution()` and `plot()` to test a specific variable.
It is a similar idea, but now operating on the residuals.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Regression model diagnostics. Left panel shows a plot of the residuals to assess normality and they will be highlighted black if an extreme value. The right panel shows a plot of the predicted versus residual values to help assess the homogeneity of variance assumption."}
m <- lm(neuroticism ~ stress, data = db)
## assess model diagnostics
md <- modelDiagnostics(m, ev.perc = .005)
plot(md, ncol = 2, ask = FALSE)
```
The **left panel** shows a density plot of the residuals (solid black
line) and for reference the density of what a perfectly normal
distribution would look like with the same mean and residual variance
(dashed blue line). The rug plot under the density plot (small,
vertical lines) shows where raw data points fall. The x axis labels
(for the Residuals) shows the minimum, 25th percentile, median, 75th
percentile, and maximum, providing a quick quantitative summary. The
bottom of the left panel shows a QQ deviates plot, this is the same as
a regular QQ plot but it has been rotated so that the line is
horizontal instead of diagonal to take up less space. If the points
fall exactly on the line, that indicates they fall exactly where they
would be expected under a normal distribution. Both the rug plot and
the dots plotted in the QQ deviates plot are light grey if they are
not extreme and are colored solid black if they meet the criteria for
an extreme value, as specified in the call to `modelDiagnostics()`.
The **right panel** shows a scatter plot of the model predicted values
against the residuals. This can help assess the homogeneity of variance
assumption. If the residual variance is homogeneous, we would expect
the spread of the residuals (the y axis) to be about the same at all
levels of the predicted values. Sometimes this is very easy to see
visually. Sometimes it is not so easy to see. To aid in the visual
interpretation, quantile regression lines are added. Quantile
regression lines allow a regression model to predict the 10th and 90th
percentiles. These predicted quantiles are added as dashed blue lines.
Under the assumption of homogeneous variance, one would expect these
lines to be approximately horizontal and parallel to each other. If
they are not parallel, that indicates that the variance is larger or
smaller for some predicted values than for others, known as
heterogeneous variance. The solid blue line in the middle is a loess
smooth line, which also would ideally be about flat and about zero, to
indicate that there is no systematic bias in the residuals. Systematic
bias can occur, for example, when the residuals are always positive
for low predicted values and always negative for high predicted
values; this typically indicates ceiling or floor effects in the data
and/or issues that would ideally be addressed by revising the
model. As with the density plot, the axes in the scatter plot show the
minimum, 25th percentile, median, 75th percentile, and maximum,
providing a nice quantitative summary of the residuals and predicted
values.
These graphs look fairly good. They are not perfect, but
real data never are, and the degree of variation in the spread of
residuals is not extreme. We might be reasonably satisfied that the
assumptions of normality and homogeneity of variance are met, or at
least not terribly violated. There also do not appear to be extreme
values in the residuals.
For the sake of example, let's look at another outcome measure,
`Duration__in_seconds_`, which is how many seconds it took for each
survey to be completed. Again, we will use `stress` as a predictor.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Regression model diagnostics. Left panel shows a plot of the residuals to assess normality and they will be highlighted black if an extreme value. The right panel shows a plot of the predicted versus residual values to help assess the homogeneity of variance assumption."}
malt <- lm(Duration__in_seconds_ ~ stress, data = db)
## assess model diagnostics
maltd <- modelDiagnostics(malt, ev.perc = .005)
plot(maltd, ncol = 2, ask = FALSE)
```
These graphs show clear problems in the model. The residuals are not
at all normally distributed. There are two extreme values, based on
the current criteria. The homogeneity of variance plot does not appear
as bad, but the extreme values on the residuals are apparent. At a
minimum, we would want to remove these extreme values. However, the
marked non-normality also is a concern. We could try a log
transformation of the outcome.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Regression model diagnostics after log transforming the outcome."}
malt2 <- lm(log(Duration__in_seconds_) ~ stress, data = db)
## assess model diagnostics
malt2d <- modelDiagnostics(malt2, ev.perc = .005)
plot(malt2d, ncol = 2, ask = FALSE)
```
Even after the log transformation, the extreme values remain. Let's
remove those. Extreme values are stored and accessible within the
diagnostics object output, `malt2d`. It shows the scores on the
outcome, the index (the row in the data the extreme values come from)
and the effect type where they are an extreme value, in this case,
extreme on the residuals. We use the `Index` values to remove them
from the dataset and then refit our model. We can use `-` in front of
the indices to tell `data.table` that, instead of selecting those rows,
we want to **exclude** those rows.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Regression model diagnostics after log transforming the outcome and excluding extreme values."}
malt2d$extremeValues
malt3 <- lm(log(Duration__in_seconds_) ~ stress,
data = db[-malt2d$extremeValues$Index])
## assess model diagnostics
malt3d <- modelDiagnostics(malt3, ev.perc = .005)
plot(malt3d, ncol = 2, ask = FALSE)
```
`r margin_note("If you would like some more examples or additional explanation, have a look at: http://joshuawiley.com/JWileymisc/articles/diagnostics-vignette.html")`
After excluding the initial extreme values, one new extreme value
emerges. However, it is relatively less extreme and overall the model
diagnostics look much better now.
## Multiple Linear Regression in `R`
You can conduct multiple linear regressions in `R` almost the same way
as you would a simple linear regression. Additional predictors can be
added by simply using `+` between them. The following code shows an
example adding both a continuous and categorical predictor
simultaneously. Next, we run diagnostics, and only after diagnostics
take a look at the summary. We leave the `summary()` until after
diagnostics on purpose, as there is not much point interpreting the
coefficients if the model assumptions are badly violated. We would
want to first fix the model and then interpret.
```{r, fig.width = 9, fig.height = 5, fig.fullwidth = TRUE, fig.cap = "Multiple regression model diagnostics."}
m3 <- lm(neuroticism ~ stress + AgeCat, data = db)
m3d <- modelDiagnostics(m3, ev.perc = .005)
plot(m3d, ncol = 2, ask = FALSE)
summary(m3)
```
We can also get a visual summary of the model using `visreg()`. We
separate the plot by `AgeCat`; since it is a categorical predictor, it
is easy to make separate lines for each age group. So that the lines
are plotted in the same graph and not in separate graphs, we specify
`overlay = TRUE`.
```{r, fig.width = 7, fig.height = 5, fig.cap = "multiple regression plot from visreg"}
visreg(m3, xvar = "stress", by = "AgeCat", partial = FALSE, rug = FALSE,
overlay = TRUE, gg = TRUE) +
theme_pubr() +
ggtitle("A multiple regression - separate lines for each age group")
```
In this case, `AgeCat` has such a minor effect on the outcome that the
two lines are virtually on top of each other. If `AgeCat` had a larger
effect, they would be more separated.
We could interpret the results of the regression as follows.
There
`r ifelse(coef(summary(m3))["stress", "Pr(>|t|)"] < .05, "was", "was not")`
a statistically significant association between stress and
neuroticism, controlling for age group. A one unit higher stress score was associated with a
`r sprintf("%0.2f", round(coef(m3)[["stress"]], 2))`
[95% CI `r sprintf("%0.2f - %0.2f", round(confint(m3)["stress", 1], 2), round(confint(m3)["stress", 2], 2))`]
higher neuroticism score,
`r formatPval(coef(summary(m3))["stress", "Pr(>|t|)"], includeP = TRUE)`.
There
`r ifelse(coef(summary(m3))["AgeCat>= 22y", "Pr(>|t|)"] < .05, "was", "was not")`
a statistically significant association between age group and
neuroticism, controlling for stress. Compared to younger adults, older
adults had
`r sprintf("%0.2f", round(coef(m3)[["AgeCat>= 22y"]], 2))`
[95% CI `r sprintf("%0.2f - %0.2f", round(confint(m3)["AgeCat>= 22y", 1], 2), round(confint(m3)["AgeCat>= 22y", 2], 2))`]
higher neuroticism scores,
`r formatPval(coef(summary(m3))["AgeCat>= 22y", "Pr(>|t|)"], d = 2, includeP = TRUE)`.
Overall, stress and age group explained
`r sprintf("%0.1f", summary(m3)$adj.r.squared * 100)`% of the variance
in neuroticism.
When running a simple linear regression, the overall model variance
explained, $R^2$, serves as an effect size estimate. However, in
multiple regression, that only serves as an effect size for the
overall model. In this case, even though we can tell that most of the
effect is clearly attributable to stress, not age group, we do not
have a specific quantitative effect size estimate for each predictor.
We can get a nice set of tests for a model using the `modelTest()`
function and have a nicely formatted set of output using the
`APAStyler()` function. We wrap it in `knitr::kable()` to get nice
output in our HTML file.
```{r, results = "asis"}
m3test <- modelTest(m3)
knitr::kable(APAStyler(m3test))
```
The table has several parts. The "Term" column tells us what we are looking at,
the "Est" column gives us the values, and the "Type" column is the
broad category. First up, we have the fixed effects; they are called
fixed effects because they do not vary. These are our model
parameters, the regression coefficients. What is reported is the
estimates, asterisks to indicate p-values, and the 95% confidence
intervals in brackets. Rounding to two decimal places is the default,
consistent with standard APA style.
`r margin_note("According to Cohen's (1988) guidelines, f2 >= 0.02, f2
>= 0.15, and f2 >= 0.35 represent small, medium, and large effect
sizes, respectively.")`
The next section presents information about the overall model, like
how many observations were included, the log likelihood (logLik), the
degrees of freedom for the overall log likelihood, information
criterion (AIC, BIC), which we will learn about later for model
comparisons, and several overall model effect sizes: Cohen's $f^2$
effect size, the $R^2$, and the adjusted $R^2$. Cohen's $f^2$ effect size
for the overall model is defined as
$$ \frac{R^{2}}{1 - R^{2}} $$
it is based on the model $R^2$ but divided by the variance not
explained, $1 - R^{2}$.
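For example (an illustration using the model `m3` from above), the
overall $f^2$ can be computed directly from the model $R^2$.
```{r}
## Cohen's f2 for the overall model, computed from R^2
r2 <- summary(m3)$r.squared
r2 / (1 - r2)
```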
Finally, at the bottom are Cohen's $f^2$ effect sizes for each
individual predictor, along with p-values for each predictor
overall. Cohen's $f^2$ effect size for individual coefficients is
defined as:
$$ \frac{R_{AB}^{2} - R_{A}^{2}}{1 - R_{AB}^{2}} $$
`r margin_note("If you would like some more examples or additional
explanation, have a look at:
http://joshuawiley.com/JWileymisc/articles/model-test-vignette.html")`
where what is compared is the variance explained by the overall
model minus the variance explained by a model excluding that specific
predictor. In essence, this provides an estimate of the unique
variance in the outcome explained by each predictor, relative to the
total variance not explained. Cohen advocated this approach because,
for example, if a model already explains 90% of the variance in an
outcome, it is very difficult to explain an additional 9% (that would
bring the model to nearly 100% variance explained). In contrast, if a
model explains 0% of the variance, adding one more predictor may have
an easier time explaining 9% of the variance. Thus Cohen's $f^2$
effect size accounts for this. The square root of this often is used
in power analyses for regression as well (e.g., in G*Power).
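As an illustration of that formula (note that `modelTest()` computes
these for you, and with missing data the full and reduced models below
could use slightly different observations, so this is only a sketch of
the idea), Cohen's $f^2$ for the stress coefficient compares the model
with and without stress.
```{r}
## f2 for stress: variance uniquely explained by stress, relative to
## the variance the full model leaves unexplained
r2_full    <- summary(lm(neuroticism ~ stress + AgeCat, data = db))$r.squared
r2_reduced <- summary(lm(neuroticism ~ AgeCat, data = db))$r.squared
(r2_full - r2_reduced) / (1 - r2_full)
```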
## Moderation in `R`
Sometimes the association between two variables depends on a third
variable. This is called moderation. The way that we formally test for
moderation in regression is by adding an interaction term. An
interaction term is the product of two (or more) variables. In an
equation:
$$ \boldsymbol{y} = b_0 + b_1 * \boldsymbol{x} + b_2 * \boldsymbol{w} + b_3 * \boldsymbol{x} * \boldsymbol{w} + \boldsymbol{\varepsilon} $$
The interaction term gets an additional regression coefficient
(another model parameter), $b_3$ in this example, that captures how
different the association of each individual variable ($b_1$ or $b_2$)
with the outcome is depending on the level of the other variable.
`r margin_note("As we are focused on moderation here, we are not
evaluating model diagnostics. In research or applied settings, you
would definitely want to conduct diagnostics before being confident
the model results are trustworthy.")`
It is very easy to include interactions in `R`. Inside of a
regression formula, we use the `*` operator, which expands to mean:
give me the main effects and the interactions between these
variables. Here is an example evaluating an interaction between stress
and age group on neuroticism. Note that when there are interactions,
`modelTest()` does not deal well with categorical predictors that are
automatically dummy coded. The solution is to manually dummy code.
The following code estimates the model and makes a nice summary with
effect sizes.
```{r, results = "asis"}
db[, AgeDummy := ifelse(AgeCat == ">= 22y", 1L, 0L)]
mint <- lm(neuroticism ~ stress * AgeDummy, data = db)
knitr::kable(APAStyler(modelTest(mint)))
```
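As a side note on the formula syntax (an illustrative check, not part
of the original analysis), `stress * AgeDummy` is shorthand for the
main effects plus the interaction term, so the two formulas below fit
the same model.
```{r}
all.equal(
  coef(lm(neuroticism ~ stress * AgeDummy, data = db)),
  coef(lm(neuroticism ~ stress + AgeDummy + stress:AgeDummy, data = db))
)
```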
We can also use `visreg()` to make a nice interaction plot.
```{r, fig.width = 7, fig.height = 5, fig.cap = "moderated regression plot from visreg"}
visreg(mint, xvar = "stress", by = "AgeDummy", partial = FALSE, rug = FALSE,
overlay = TRUE, gg = TRUE) +
theme_pubr() +
ggtitle("A moderated multiple regression")
```
Now that we have the output and the interaction plot, let's interpret
what we found. In the presence of an interaction, the variables
involved in the interaction on their own become simple effects.
There was not a statistically significant interaction between stress
and age group (b = -0.16, p = .099, $f^2 = 0.05$). In practice, if
this interaction was not central to your research question, you would
likely drop the interaction, re-run the model, and interpret the
results. However, to show what could be interpreted if there were a
significant interaction, we will proceed.
There was a significant simple effect of stress
on neuroticism, such that for younger adults (i.e., when AgeDummy =
0), each one unit higher stress was associated with 0.25 higher
neuroticism [95% CI .09 - .40], p = .002, and this was a moderate
effect, $f^2 = 0.20$. There was no simple effect
of age group on neuroticism when stress was held at 0 (b = 4.38 [95%
CI -1.00, 9.77], p = .109), and the effect size was small, $f^2 =
.05$. Although the interaction was not significant, a graph of the
interaction showed that the pattern was that the association between
stress and neuroticism was stronger in younger adults than in older
adults. The overall model explained 20% of the variance in neuroticism
in the sample.
It also is possible to have an interaction between two continuous
variables. The `R` code is virtually identical, but we can no longer
talk about "groups". We will use an almost identical model, but with
continuous age in years.
```{r, results = "asis"}
mint2 <- lm(neuroticism ~ stress * age, data = db)
knitr::kable(APAStyler(modelTest(mint2)))
```
To better understand what the interaction of two continuous variables
means, we can look at it in 3D. What we can see in the following
figure is that there is a continuous "bending" in the association
between either individual variable and neuroticism depending on the
level of the other. What used to be a flat plane is now "warped".
```{r, echo = FALSE}
## this package only used for demo purposes
library(plotly)
stress <- seq(from = min(db$stress, na.rm=TRUE), to = max(db$stress, na.rm=TRUE),
length.out = 100)