-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathVB_RegDiagTrans_practical.qmd
149 lines (99 loc) · 4.32 KB
/
VB_RegDiagTrans_practical.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
title: "VectorByte Methods Training: Regression Diagnostics and Transformations (practical)"
author:
- name: Leah R. Johnson
url: https://lrjohnson0.github.io/QEDLab/leahJ.html
affiliation: Virginia Tech and VectorByte
citation: true
date: 2024-07-01
date-format: long
format:
html:
toc: true
toc-location: left
html-math-method: katex
css: styles.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(123)
```
<br>
# Overview and Instructions
The goals of this practical are to:
1. Practice building residual diagnostic plots for determining violations of SLR assumptions.
2. Practice matching violations with remedies/transformations evaluating resulting residuals for models fit to transformed data.
<br>
# Practicing diagnostics and transformations
The file **transforms.csv** on the course website contains 4 pairs of $X$s and $Y$s. The ${\sf R}$ code from lecture 5B will also be very helpful.
***For each pair:***
1. Fit the linear regression model $Y = \beta_0 + \beta_1 X + \varepsilon$, $\varepsilon \sim \mathrm{N}(0,\sigma^2)$. Plot the data and fitted line.
2. Provide a scatterplot, normal Q-Q plot, and histogram for the studentized regression residuals.
3. Using the residual scatterplots, state how the SLR model assumptions are violated.
4. Determine the data transformation to correct the problems in 3, fit the corresponding regression model, and plot the transformed data with new fitted line.
5. Provide plots to show that your transformations have (mostly) fixed the model violations.
<br> <br>
# Example: Data set 1
Here we take you through an example of analyzing the first of the 4 datasets. You will then use this to practice for the other three.
First we'll read all of the data in here. You will likely need to change the path to correspond to where your data are stored.
```{r}
attach(D <- read.csv("data/transforms.csv"))
```
<br>
## 1. Fit the linear regression model. Plot the data and fitted line.
```{r, fig.align='center', fig.height=4, fig.width=5}
## fit models
lm1 <- lm(Y1 ~ X1)
## plot points and fitted lines
plot(X1, Y1, col=1, main="I"); abline(lm1, col=2)
```
<br>
## 2. Provide a scatterplot, normal Q-Q plot, and histogram for the studentized regression residuals.
```{r, fig.align='center', fig.height=3, fig.width=8}
par(mfrow=c(1,3), mar=c(4,4,2,0.5))
## studentized residuals vs fitted
plot(lm1$fitted, rstudent(lm1), col=1,
xlab="Fitted Values",
ylab="Studentized Residuals",
pch=20, main="I")
## qq plot of studentized residuals
qqnorm(rstudent(lm1), pch=20, col=1, main="" )
abline(a=0,b=1,lty=2, col=2)
## histogram of studentized residuals
hist(rstudent(lm1), col=1,
xlab="Studentized Residuals",
main="", border=8)
```
<br>
## 3. Using the residual scatterplots, state how the SLR model assumptions are violated.
$X$s are clumpy AND the variance seems non-constant. It looks a lot like the GDP data from class. Since both $X$s and $Y$s are strictly positive, we can try a log-log transform.
<br>
## 4. Determine the data transformation to correct the problems in 3, fit the corresponding regression model, and plot the transformed data with new fitted line.
```{r, fig.align='center',fig.height=4, fig.width=5}
### the fix is as follows:
logX1<- log(X1)
logY1 <- log(Y1)
### re-run the regressions and residual plots to show this worked
lm1 <- lm(logY1 ~ logX1)
## plot points and lines
plot(logX1, logY1, col=1, main="I"); abline(lm1, col=2)
```
## 5. Provide plots to show that your transformations have (mostly) fixed the model violations.
```{r, fig.align='center', fig.height=3, fig.width=8}
## studentized residuals vs fitted
par(mfrow=c(1,3), mar=c(4,4,2,0.5))
plot(lm1$fitted, rstudent(lm1), col=1,
xlab="Fitted Values",
ylab="Studentized Residuals",
pch=20, main="I")
## Q-Q plots
qqnorm(rstudent(lm1), pch=20, col=1, main="" )
abline(a=0,b=1,lty=2, col=2)
## histograms of studentized residuals
hist(rstudent(lm1), col=1,
xlab="Studentized Residuals",
main="", border=8)
```
This is much better! The histogram still maybe looks a little funny, but given that the qq-plot looks pretty good, I think we've made a good transformation.
## Your Turn!
Repeat this process with the other 3 datasets, and see if you can figure out a appropriate transformations for each dataset.