---
title: "VectorByte Methods Training: Regression Review"
author:
  - name: Leah R. Johnson
    url: https://lrjohnson0.github.io/QEDLab/leahJ.html
    affiliation: Virginia Tech and VectorByte
citation: true
date: 2024-07-01
date-format: long
format:
  html:
    toc: true
    toc-location: left
    html-math-method: katex
    css: styles.css
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(cache = FALSE,
                      echo = FALSE,
                      message = FALSE,
                      warning = FALSE,
                      #fig.height = 6,
                      #fig.width = 1.777777*6,
                      tidy = FALSE,
                      highlight = TRUE,
                      prompt = FALSE,
                      crop = TRUE,
                      comment = "#>",
                      collapse = TRUE)
library(knitr)
library(kableExtra)
library(xtable)
library(viridis)
options(stringsAsFactors = FALSE)
knit_hooks$set(no.main = function(before, options, envir) {
  if (before) par(mar = c(4.1, 4.1, 1.1, 1.1)) # smaller margin on top
})
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_knit$set(width = 60)
source("my_knitter.R")
#library(tidyverse)
#library(reshape2)
#theme_set(theme_light(base_size = 16))
make_latex_decorator <- function(output, otherwise) {
  function() {
    if (knitr:::is_latex_output()) output else otherwise
  }
}
insert_pause <- make_latex_decorator(". . .", "\n")
insert_slide_break <- make_latex_decorator("----", "\n")
insert_inc_bullet <- make_latex_decorator("> *", "*")
insert_html_math <- make_latex_decorator("", "$$")
## classoption: aspectratio=169
```
[Main materials](materials.qmd)
# Learning Objectives
1. Review the idea behind regression
2. Review how to make predictions from a regression model
3. Practice simple regression procedures in `R` including diagnostic plots
# A simple population simulation
```{=tex}
\begin{align*}
N(t) & = s N(t-1) + b +W(t) \\
N_{\mathrm{obs}}(t) & = N(t) + V(t)
\end{align*}
```
Let's specify a few values/distributions:
- $s=0.75$, $b=10$, $N_0=10$, $T_{\mathrm{max}}=100$
- $W(t) \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, 5^2)$
- $V(t) \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, 3^2)$
Let's simulate from this system to generate some test data.
## MC simulation of simple system
```{r, echo=F}
mod1 <- function(N, s = 0.8, b = 10){
  s*N + b
}

### Population parameters
s <- 0.75
b <- 10
N0 <- 10
Tmax <- 100

### process stochasticity
mu.w <- 0
sig.w <- 5

### observational stochasticity
mu.v <- 0
sig.v <- 3

### structure to hold the output
N <- N.obs <- rep(NA, Tmax)

### Setting a random seed so that the results are reproducible for class.
set.seed(123)

## initializing the vectors
N[1] <- N0
N.obs[1] <- rnorm(1, N[1], sd = sig.v)

for(t in 2:Tmax){
  N[t] <- mod1(N[t-1], s = s, b = b) + rnorm(1, mu.w, sd = sig.w)
  ## The population is not allowed to drop below zero, nor are negative
  ## observations allowed. This is a modification of the model above.
  if(N[t] < 0) N[t] <- 0
  if(N[t] != 0){
    N.obs[t] <- N[t] + rnorm(1, mu.v, sd = sig.v)
  } else N.obs[t] <- 0
  if(N.obs[t] < 0) N.obs[t] <- 0
}
```
```{r, echo=FALSE, fig.align='center', fig.width=6, fig.height=5}
plot(N, ylim=c(0, 65), xlab="time",
ylab="population", type="l")
points(N, pch=20)
points(N.obs, col=2)
```
## Changes in population size
```{r, echo=FALSE, fig.align='center'}
par(mfrow=c(1,2))
plot(N.obs-N, xlab="time", type="l")
hist(N.obs-N, xlab="", freq=FALSE)
## add the line with the true observational distribution
x<-seq(-10, 10, length=100)
lines(x, dnorm(x, mean=mu.v, sd=sig.v), col=2)
```
## Sequential observations
::: columns
::: {.column width="50%"}
```{r, echo=FALSE, fig.align='center', fig.width=6, fig.height=5.5}
par(mfrow=c(1,1))
t<-seq(1, Tmax-1)
plot(N[t], N[t+1], ylim=c(0, 60), xlim=c(0, 60))
```
:::
::: {.column width="50%"}
`r sk1()`
If we had just observed these data, how might we try to estimate parameters?
:::
:::
# Estimating parameters
Our model (assuming no observational stochasticity for now) can be written as
```{=tex}
\begin{align*}
N(t) & = s N(t-1) + b + \varepsilon(t) \\
\varepsilon(t) & \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2) \\
\end{align*}
```
We want to know $s$, $b$, and $\sigma$. Notice that without the error, the process just looks like \begin{align*}
N(t) = s N(t-1) + b \rightarrow y = b_1 x + b_0
\end{align*} This is just a line!
------------------------------------------------------------------------
```{r, echo=FALSE, fig.align='center', fig.width=5, fig.height=4}
par(mfrow=c(1,1))
t<-seq(1, Tmax-1)
plot(N[t], N[t+1], ylim=c(0, 60), xlim=c(0, 60))
abline(5, 0.9, col=3, lwd=3)
```
<center>The line shown was fit by the "eyeball" method.</center>
# Review: What is a good line?
`r sk2()`
<center>`r mygrn("Can we do better than the eyeball method?")`</center>
`r sk1()`
We desire a strategy for estimating the slope and intercept parameters in the model $\hat{Y~} = b_0 + b_1 X$.
`r sk1()` This involves
- choosing a `r myred("criterion")`, i.e., quantifying how good a line is,
- and matching that with a `r myblue("solution")`, i.e., finding the best line subject to that criterion.
------------------------------------------------------------------------
Although there are lots of ways to choose a `r myred("criterion")`,
- only a small handful lead to `r myblue("solutions")` that are "easy" to compute
- and which have nice statistical properties
`r sk1()` Most reasonable `r myred("criteria")` involve measuring the amount by which the `r myblue("fitted value")` obtained from the line differs from the `r myred("observed value")` of the response in the data.
- This amount is called the `r mygrn("residual")`.
- Good lines produce small residuals.
------------------------------------------------------------------------
<center>*(figure: observed data points and the fitted line)*</center>
The dots are the `r myred("observed values")` and the line represents our `r myblue("fitted values")` given by $$
\hat{Y~}_i = b_0 + b_1 X_i.
$$
------------------------------------------------------------------------
<center>*(figure: residuals between observed and fitted values)*</center>
The `r mygrn("residual")` $e_i$ is the discrepancy between the `r myblue("fitted")` $\hat{Y~}_i$ and `r myred("observed")` $Y_i$ values.
- Note that we can write $Y_i = \hat{Y~}_i + (Y_i -\hat{Y~}_i) = \hat{Y~}_i + e_i$.
# Least Squares
A reasonable goal is to minimize the size of *all* residuals:
- If they were all zero we would have a perfect line.
- We trade off moving closer to some points against moving away from others.
Since some residuals are positive and some are negative, we need one more ingredient.
- $|e_i|$ treats positives and negatives equally.
- So does $e_i^2$, which is easier to work with mathematically.
------------------------------------------------------------------------
***`r myblue("Least squares")`*** chooses $b_0$ and $b_1$ to minimize $$
SSE = \sum_{i=1}^n e^2_i = \sum_{i=1}^n (Y_i - \hat{Y~}_i)^2= \sum_{i=1}^n (Y_i - [b_0 + b_1 X_i])^2.
$$
------------------------------------------------------------------------
**`r mygrn("Optional Practice Exercise")`**
`r sk1()`
Show that the least squares estimates for $b_0$ and $b_1$ are: $$
b_0 = \bar{Y} - b_1 \bar{X}~~~\mathrm{and}~~~b_1 = \frac{ \sum_{i=1}^n X_i Y_i - n\bar{Y}\bar{X}}{\sum_{i=1}^n X_i^2 - n\bar{X}^2}
$$ where $\bar{X}$ denotes the arithmetic mean of $X$.
`r sk1()`
Hint: Take the (partial) derivatives of $SSE$ with respect to $b_0$ and $b_1$, set them equal to 0, and solve.
# Estimating parameters
We can use these formulas to estimate our parameters and draw a new line $\Rightarrow$ `r mygrn("make a prediction")`
::: columns
::: {.column width="50%"}
```{=tex}
\begin{align*}
\hat{b~} & = b_0 = 14.55 \\
\hat{s~} & = b_1 = 0.63
\end{align*}
```
- The least squares line is different from our eyeballed line;
- and we know it's the `r myred("best line")` in a certain sense (a quick check with `lm()` is sketched below the figure).
:::
::: {.column width="50%"}
```{r}
n<-length(N[t])
b1<-(sum(N[t]*N[t+1]) - n*mean(N[t])*mean(N[t+1]))/(sum(N[t]^2)-n*mean(N[t])^2)
b0<-mean(N[t+1])- b1*mean(N[t])
```
```{r, echo=FALSE, fig.align='center', fig.width=5, fig.height=5}
par(mfrow=c(1,1))
t<-seq(1, Tmax-1)
plot(N[t], N[t+1], ylim=c(0, 60), xlim=c(0, 60))
abline(5, 0.9, col=3, lwd=3)
abline(b0, b1, col=2, lwd=3)
abline(b, s, col="blue", lwd=3)
legend("bottomright", legend=c("eyeball", "LS", "true"), col=c(3,2,4), lwd=3)
```
:::
:::
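`r sk1()`

As a quick check (a minimal sketch; `t` is the index vector `1:(Tmax-1)` defined for the plots above), the same estimates can be recovered with `R`'s built-in `lm()`:

```{r, echo=TRUE, eval=FALSE}
## Minimal check of the hand-computed estimates against lm().
## `t` is the index vector 1:(Tmax - 1) used for the plots above.
fit.check <- lm(N[t + 1] ~ N[t])
coef(fit.check)   ## intercept and slope should match b0 and b1 above
```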
# Estimation of error variance
We also want to estimate $\sigma$. If we think about the definition of the variance in our system in terms of the residuals:
$$
\sigma^2 = \text{var}(\varepsilon_i)
= \mathbb{E}[(\varepsilon_i - \mathbb{E}[\varepsilon_i])^2] =
\mathbb{E}[\varepsilon_i^2].
$$
This *seems* to indicate that a sensible strategy would be to estimate the average squared error with the sample average of the squared residuals: $$
\hat{\sigma~}^2 = \frac{1}{n} \sum_{i=1}^n e_i^2
$$
------------------------------------------------------------------------
However, this is ***not*** an `r myred("unbiased")` estimator of $\sigma^2$.
- This means that $\mathbb{E}[\hat{\sigma~}^2] \ne \sigma^2$.
For that, we have to alter the denominator slightly: $$
s^2 = \frac{1}{n-p} \sum_{i=1}^n e_i^2 = \frac{SSE}{n-2}
$$ ($p$ is the number of other parameters we estimate, i.e., `r myred("2")` here, for $b_0$ and $b_1$).
------------------------------------------------------------------------
Note:
- We have $n-p$ degrees of freedom because $p$ ($=2$) have been "used up" in the estimation of $b_0$ and $b_1$.
- It is often convenient to report $s = \sqrt{SSE/(n-p)}$, which is in the same units as $Y$.
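`r sk1()`

As a short sketch (using the `b0`, `b1`, and `t` objects computed above), $s$ can be obtained directly from the residuals:

```{r, echo=TRUE, eval=FALSE}
## Sketch: residual standard error from the fitted line (b0, b1, t as above)
e.res <- N[t + 1] - (b0 + b1 * N[t])       ## residuals
sqrt(sum(e.res^2) / (length(e.res) - 2))   ## s = sqrt(SSE / (n - p)), p = 2
```

The same value is reported as the residual standard error by `summary()` on an `lm()` fit.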
# Estimating parameters
Now we can get (point) estimates of all of the parameters for our system.
::: columns
::: {.column width="50%"}
`r sk1()`
```{=tex}
\begin{align*}
\hat{b~} & = 14.55\mathrm{,}~~ b_{\mathrm{true}}=10\\
\hat{s~} & = 0.63\mathrm{,}~~ s_{\mathrm{true}}=0.75\\
\hat{\sigma~} & = 4.88\mathrm{,}~~ \sigma_{\mathrm{true}}=5
\end{align*}
```
:::
::: {.column width="50%"}
```{r}
n<-length(N[t])
b1<-(sum(N[t]*N[t+1]) - n*mean(N[t])*mean(N[t+1]))/(sum(N[t]^2)-n*mean(N[t])^2)
b0<-mean(N[t+1])- b1*mean(N[t])
```
```{r, echo=FALSE, fig.align='center', fig.width=5, fig.height=5}
par(mfrow=c(1,1))
t<-seq(1, Tmax-1)
plot(N[t], N[t+1], ylim=c(0, 60), xlim=c(0, 60))
abline(5, 0.9, col=3, lwd=3)
abline(b0, b1, col=2, lwd=3)
abline(b, s, col="blue", lwd=3)
legend("bottomright", legend=c("eyeball", "LS", "true"), col=c(3,2,4), lwd=3)
```
:::
:::
# Prediction and the modeling goal
A prediction rule is any function where you input $X$ and it outputs $\hat{Y~}$ as a predicted response at $X$.
`r sk1()` The least squares line is a prediction rule: $$
\hat{Y~} = f(X) = b_0 + b_1 X
$$
This rule tells us what to do when a **new** $X$ comes along \dots just run it through the formula above and obtain a guess $\hat{Y~}$!
However, **`r mygrn("it is not going to be a perfect prediction")`**.
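For instance, here is a minimal sketch of applying the rule to a new $X$ in `R` (the data frame and the names `dd`, `Nprev`, and `Nnext` are illustrative, not part of the original script):

```{r, echo=TRUE, eval=FALSE}
## Sketch: the LS line as a prediction rule, applied to a new X value
## (dd, Nprev, and Nnext are illustrative names)
dd  <- data.frame(Nprev = N[t], Nnext = N[t + 1])
fit <- lm(Nnext ~ Nprev, data = dd)
predict(fit, newdata = data.frame(Nprev = 90))   ## point prediction at X = 90
```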
------------------------------------------------------------------------
We need to devise a notion of `r mygrn("forecast accuracy")`.
- How sure are we about our forecast? Or
- How different could $Y$ be from what we expect?
`r sk1()` `r mygrn("Forecasts are useless without some kind of uncertainty qualification")`/`r myblue("quantification")`.
`r sk1()`
One method is to specify a range of $Y$ values that are likely, given an $X$ value.
- a `r myred("prediction interval")`: a probable range of $Y$ values for a given $X$.
------------------------------------------------------------------------
`r mygrn("Key insight")`: to construct a prediction interval, we will have to assess the likely range of residual values corresponding to a Y value that has `r myred("not")` yet been observed!
`r sk1()` We must "invest" in a `r myblue("probability model")`. For our population model, for instance, we chose a `r mygrn("normal distribution")` for the observation error.
`r sk1()` We must also acknowledge that the "fitted" line may be fooled by particular realizations of the residuals.
- i.e., that our estimated coefficients $b_0$ & $b_1$ are random
------------------------------------------------------------------------
`r sk2()`
Luckily for us, the case of a `r myred("line with normally distributed noise")` around it is well studied theoretically, and you've probably seen it before\dots
# Simple linear regression (SLR) model
$$
Y = \beta_0 + \beta_1 X + \varepsilon, \;\;\;\;\;
\varepsilon \sim \mathcal{N}(0, \sigma^2)
$$
- It is a `r myred("model")`, so we are ***assuming*** this relationship holds for some `r myred("true but unknown")` values of $\beta_0$, $\beta_1$.
- Greek letters remind us they are not the same as the LS estimates $b_0$ and $b_1$.
The error $\varepsilon$ is independent, additive, "idiosyncratic noise".
- Its distribution is *known* up to its spread $\sigma^2$.
- Greek letters remind us that $\varepsilon$ is not the same as $e$.
------------------------------------------------------------------------
`r myred("Before looking at any data")`, the model specifies
- how $Y$ varies with $X$ `r myred("on average")`: $\mathbb{E}[Y|X] = \beta_0 + \beta_1 X$;
- and the influence of factors other than $X$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ independently of $X$.
# Context from the population example
Think of $\mathbb{E}[Y|X]$ as the average size of the population at time $t+1$ if the population had size $X$ at time $t$, and $\sigma^2$ is the spread around that average. When we specify the SLR `r myred("model")` we say that
- the average population depends linearly on its previous size, but we don't know the coefficients.
- Observations could be higher or lower than the average; how much they differ is unknown, and the deviation
  - is independent of the previous population size
  - is normally distributed
------------------------------------------------------------------------
We think about the data as being `r myred("one possible realization")` of data that *could* have been `r myblue("generated from the model")` $$
Y|X \sim \mathcal{N}(\beta_0 + \beta_1 X, \sigma^2)
$$
$\Rightarrow$ $\sigma^2$ controls the `r myred("dispersion")` of $Y$ around $\beta_0 + \beta_1 X$
# Prediction intervals in the ***true model***
You are told (without looking at the data) that $\beta_0 = 13$, $\beta_1 = 0.9$, and $\sigma = 3$. You are asked to predict the population next year if the current population has 90 individuals. What do you know about $Y$ from the model? \begin{align*}
Y &= 13 + 0.9(90) + \varepsilon\\
&= 94 + \varepsilon
\end{align*}
Thus `r myblue("our prediction for the population at the next time point is")` $$
Y \sim \mathcal{N}(94, 3^2).
$$
------------------------------------------------------------------------
Thus, the model says that the mean number of individuals next year, if there are 90 this year, is 94, and that the deviation from the mean is within about $\pm 6$ (at the 95% level).
`r sk1()` We are 95% sure that
- $-6< \varepsilon < 6$
- $88 < Y < 100$
In general, the 95% ***P***rediction ***I***nterval is: PI $= \beta_0 +\beta_1 X \pm 2\sigma$.
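`r sk1()`

A quick numerical check of this interval (a sketch using the values given above):

```{r, echo=TRUE, eval=FALSE}
## Sketch: 95% prediction interval when the parameters are known exactly
beta0 <- 13; beta1 <- 0.9; sig <- 3; Xf <- 90
beta0 + beta1 * Xf + c(-1, 1) * qnorm(0.975) * sig   ## roughly 88 to 100
```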
------------------------------------------------------------------------
This PI only incorporates uncertainty due to the inherent stochasticity.
`r sk1()`
We also must incorporate another source of risk: `r myblue("uncertainty in the parameters")` $\beta_0$ and $\beta_1$
# Sampling distribution of LS estimates
How much do our estimates depend on the particular random sample that we happen to observe?
We can run the following experiment:
- Randomly draw different samples of the same size.
- For each sample, compute the estimates $b_0$, $b_1$, and $s$.
If the estimates do vary a lot, then it matters which sample you happen to observe.
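A minimal Monte Carlo sketch of this experiment (illustrative only; the full demonstration is in `linear.R`, referenced below):

```{r, echo=TRUE, eval=FALSE}
## Sketch: repeatedly simulate data from a known line and record the LS estimates
set.seed(1)
beta0 <- 13; beta1 <- 0.9; sig <- 3; n.samp <- 50
coefs <- replicate(500, {
  x <- runif(n.samp, 0, 60)
  y <- beta0 + beta1 * x + rnorm(n.samp, 0, sig)
  coef(lm(y ~ x))
})
apply(coefs, 1, sd)   ## how much b0 and b1 vary across samples
```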
------------------------------------------------------------------------
<center>*(figures: LS lines fit to repeated samples of size $n=5$ and $n=50$, with the true line)*</center>
------------------------------------------------------------------------
The LS lines are much closer to the true line when $n=50$.
When $n=5$, some lines are close, others aren't: we need to get lucky.
Also notice that the LS lines are, more often than not,
- closer to the true line near the middle of the cloud of data.
- They get farther apart away from the middle of the cloud.
- This is as true for small samples $(n=5)$ as it is for large $(n=50)$.
------------------------------------------------------------------------
\[See a Monte Carlo demonstration in `linear.R`\]
## Sampling distribution of LS line
What did we just do?
- We "imagined" through simulation the sampling distribution of a LS line.
In real life we get just one data set, and `r myred("we don't know the true generating model")`. But we can still imagine.
`r sk1()`
We first find the sampling distribution of our LS coefficients, $b_0$ and $b_1$...\
... which requires some review.
------------------------------------------------------------------------
In the online reading and review materials you should have come across some useful probability/stats facts, including:
- $\mathbb{E}(X_1+X_2) = \mathbb{E}(X_1)+ \mathbb{E}(X_2)$
- $\mathbb{E}(cX_1) = c \mathbb{E}(X_1)$
- $\text{var}(c X_1) = c^2\text{var}(X_1)$
- $\text{var}(X_1+X_2) = \text{var}(X_1)+\text{var}(X_2) + 2\text{cov}(X_1, X_2)$.
------------------------------------------------------------------------
`r mygrn("Recall: distribution of the sample mean")`
`r sk1()`
Step back for a moment and consider the mean for an $iid$ sample of $n$ observations of a random variable $\{X_1,\ldots,X_n\}$.
`r sk1()`
Suppose that $\mathbb{E}(X_i) = \mu$ and $\text{var}(X_i) = \sigma^2$, then
- $\mathbb{E}(\bar{X}) = \frac{1}{n} \sum\mathbb{E}(X_i) = \mu$
- $\text{var}(\bar{X}) = \text{var}\left( \frac{1}{n} \sum X_i \right) = \frac{1}{n^2} \sum \text{var}\left( X_i \right) = \displaystyle \frac{\sigma^2}{n}$.
## Central Limit Theorem
The CLT states that for $iid$ random variables, $X$, with mean $\mu$ and variance $\sigma^2$, the distribution of the sample mean becomes normal as the number of observations, $n$, gets large.
`r sk1()`
That is, $\displaystyle \bar{X} \rightarrow_{n} \mathcal{N}(\mu, \sigma^2/n)$ , and sample averages tend to be normally distributed in large samples.
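`r sk1()`

A quick simulation (an illustrative sketch, not part of the original materials) using a skewed distribution:

```{r, echo=TRUE, eval=FALSE}
## Sketch: sample means of (skewed) exponential draws look approximately normal
set.seed(2)
xbar <- replicate(2000, mean(rexp(50, rate = 1)))   ## Exp(1): mean 1, sd 1
hist(xbar, freq = FALSE, main = "", xlab = "sample mean")
curve(dnorm(x, mean = 1, sd = 1/sqrt(50)), add = TRUE, col = 2)
```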
------------------------------------------------------------------------
We are now ready to describe the sampling distribution of the least squares line ...
... in terms of its effect on the sampling distributions of the coefficients
- $b_1 = \hat{\beta~}_1$, the slope of the line
- $b_0 = \hat{\beta~}_0$, the intercept,
- and how they covary together,
given a particular (fixed) set of $X$-values.
## Sampling distribution of $b_1$
It turns out that $b_1$ is normally distributed: $b_1 \sim \mathcal{N}(\beta_1, \sigma^2_{b_1})$.
- $b_1$ is unbiased: $\mathbb{E}[b_1] = \beta_1$.
- The sampling sd $\sigma_{b_1}$ determines the precision of $b_1$: $$
\sigma_{b_1}^2
= \text{var}(b_1) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} = \frac{\sigma^2}{(n-1)s_x^2}.
$$ It depends on three factors: 1) the sample size ($n$); 2) the error variance ($\sigma^2 = \sigma_\varepsilon^2$); and 3) the $X$-spread ($s_x$).
## Sampling Distribution of $b_0$
The intercept is also normal and unbiased: $b_0 \sim \mathcal{N}(\beta_0, \sigma^2_{b_0})$, where $$
\sigma^2_{b_0} = \text{var}(b_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{(n-1)
s_x^2} \right).
$$
What is the intuition here? $$
\text{var}(\bar{Y} - \bar{X} b_1)
= \text{var}(\bar{Y}) + \bar{X}^2\text{var}(b_1) {~-~ 2\bar{X}\,\mathrm{cov}(\bar{Y},b_1) }
$$
- $\bar{Y}$ and $b_1$ are uncorrelated because the slope ($b_1$) is invariant if you shift the data up or down ($\bar{Y}$).
------------------------------------------------------------------------
**`r mygrn("Optional Practice Exercise")`**
1. Show that:
1. $\mathbb{E}[b_1] = \beta_1$
2. $\mathbb{E}[b_0] = \beta_0$
3. $\text{var}(b_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{(n-1) s_x^2} \right)$
2. Why is it that $b_0$ and $b_1$ are normally distributed?
## Joint Distribution of $b_0$ and $b_1$
We know that $b_0$ and $b_1$ *can be* dependent, i.e., $$
\mathbb{E}[(b_0 -\beta_0)(b_1 - \beta_1)] \ne 0.
$$ This means that estimation error in the slope is correlated with the estimation error in the intercept. $$
\mathrm{cov}(b_0,b_1) = -\sigma^2 \left(\frac{\bar{X}}{(n-1)s_x^2}\right).
$$
------------------------------------------------------------------------
**`r mygrn("Interpretation")`:**
- Usually, if the slope estimate is too high, the intercept estimate is too low (a negative correlation whenever $\bar{X} > 0$).
- The correlation decreases with more $X$ spread ($s^2_x$).
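In `R`, the full (estimated) covariance matrix of the coefficients is available directly from a fitted model; a sketch, reusing the illustrative `fit` object from the earlier prediction sketch:

```{r, echo=TRUE, eval=FALSE}
## Sketch: estimated variance-covariance matrix of (b0, b1) from an lm fit
vcov(fit)   ## the off-diagonal entry estimates cov(b0, b1)
```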
## Estimated variance
However, these formulas aren't especially practical since they involve an unknown quantity: $\sigma$.
- `r myblue("Solution")`: use $s$, the residual sample standard deviation estimator for $\sigma = \sigma_\varepsilon$. $$
s_{b_1} = \sqrt{\frac{s^2}{(n-1)s_x^2}} ~~~
s_{b_0} = \sqrt{s^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{(n-1)s^2_x}\right)}
$$
- $s_{b_1} = \hat{\sigma~}_{b_1}$ and $s_{b_0} = \hat{\sigma~}_{b_0}$ are estimated coefficient sd's.
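A short sketch checking these formulas against the standard errors reported by `summary()` (again with the illustrative `dd` and `fit` objects from the earlier sketch):

```{r, echo=TRUE, eval=FALSE}
## Sketch: coefficient standard errors by formula vs. summary(lm)
n.obs <- nrow(dd); s.hat <- sigma(fit); sx2 <- var(dd$Nprev)
s.b1 <- sqrt(s.hat^2 / ((n.obs - 1) * sx2))
s.b0 <- sqrt(s.hat^2 * (1/n.obs + mean(dd$Nprev)^2 / ((n.obs - 1) * sx2)))
c(s.b0, s.b1)
summary(fit)$coefficients[, "Std. Error"]
```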
------------------------------------------------------------------------
**`r mygrn("Interpretation")`:**
`r sk1()`
We now have a notion of standard error for the LS estimates of the slope and intercept.
`r sk1()`
- `r myred("Small $s_b$ values mean high info/precision/accuracy")`.
## Normal and Student-$t$
Again, recall what *Student* discovered:
If $\theta \sim \mathcal{N}(\mu,\sigma^2)$, but you estimate $\sigma^2 \approx s^2$ based on $n-p$ degrees of freedom, then $\theta \sim t_{n-p}(\mu, s^2)$.
`r sk1()`
For SLR, for example:
- $\bar{Y} \sim t_{n-1}(\mu, s_y^2/n)$.
- $b_0 \sim t_{n-2}\left(\beta_0, s^2_{b_0}\right)$ and $b_1 \sim t_{n-2}\left(\beta_1, s^2_{b_1}\right)$
------------------------------------------------------------------------
We can use these distributions for drawing conclusions about the parameters via:
- Confidence intervals
- Hypothesis tests
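For example (a sketch, assuming the illustrative `fit` object from the earlier sketches):

```{r, echo=TRUE, eval=FALSE}
## Sketch: t-based inference for the SLR coefficients
confint(fit, level = 0.95)   ## confidence intervals for the intercept and slope
summary(fit)$coefficients    ## estimates, std. errors, t statistics, p-values
```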
# Forecasting
`r myblue("The conditional forecasting problem")`:
- Given covariate(s) $X_f$ and sample data $\{X_i, Y_i\}_{i=1}^n$, predict the "future" observation $Y_f$.
`r myred("The solution is to use our LS fitted value")`: $\hat{Y~}_f = b_0 + b_1X_f$.
- That's the easy bit.
The hard (`r myblue("and very important!")`) part of forecasting is assessing uncertainty about our predictions.
------------------------------------------------------------------------
If we use $\hat{Y~}_f$, our prediction error is
$$e_f = Y_f - \hat{Y~}_f = Y_f - b_0 - b_1X_f$$
------------------------------------------------------------------------
We can decompose $e_f$ into two sources of error
- Inherent idiosyncratic randomness (due to $\varepsilon$).
- Estimation error in the intercept and slope\
(i.e., discrepancy between our line and "the truth").
```{=tex}
\begin{align*}
e_f &= Y_f - \hat{Y~}_f = (Y_f - \mathbb{E}[Y_f | X_f]) + \mathbb{E}[Y_f | X_f]-\hat{Y~}_f \\
&= \varepsilon_f + (\mathbb{E}[Y_f | X_f] -\hat{Y~}_f )\\
&= \varepsilon_f + (\beta_0 - b_0) + (\beta_1 - b_1)X_f.
```
------------------------------------------------------------------------
The variance of our prediction error is thus $$
\text{var}(e_f) = \text{var}(\varepsilon_f) + \text{var}(\mathbb{E}[Y_f | X_f] -\hat{Y~}_f )
= \sigma^2 + \text{var}(\hat{Y~}_f)
$$ We know $\text{var}(\varepsilon_f) = \sigma^2 \approx s^2$, but what about the fit error?
------------------------------------------------------------------------
From the sampling distributions derived earlier, $\text{var}(\hat{Y~}_f)$ is \begin{align*}
\text{var}(b_0 + b_1 X_f) &= \text{var}(b_0) + X_f^2 \text{var}(b_1) + 2X_f\mathrm{cov}(b_0,b_1) \\
&= {\sigma^2 \left[\frac{1}{n} + \frac{(X_f - \bar{X})^2}{(n-1)
s_x^2} \right]}.
\end{align*}
Replacing $\sigma^2$ with $s^2$ gives $s_{\mathrm{fit}}^2$, the squared standard error of the fitted value $\hat{Y~}_f$.
And hence the variance of our predictive error is $$
\text{var}(e_f) = s^2 \left[ 1 + \frac{1}{n} +
\frac{(X_f-\bar{X})^2}{(n-1) s_x^2} \right].
$$
------------------------------------------------------------------------
Putting it all together, we have that $$
Y_f \sim \mathcal{N}\left(\hat{Y}_f, \sigma^2 \left[ 1 +
\frac{1}{n} + \frac{(X_f - \bar{X})^2}{(n-1)
s_x^2} \right] \right)
$$ (sums of normals are normal) or, with estimated variance, $$
Y_f \sim t_{n-p}\left(\hat{Y}_f, {\color{blue}s^2} \left[ 1 +
\frac{1}{n} + \frac{(X_f - \bar{X})^2}{(n-1)
s_x^2} \right] \right).
$$
The corresponding $(1-\alpha)\,100\%$ prediction interval is $$
b_0 + b_1 X_f \pm t_{n-2,\alpha/2} \left(s~ \sqrt{1+ \frac{1}{n} + \frac{(X_f -
\bar{X})^2}{(n-1) s_x^2}}\right).
$$
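In `R`, `predict()` returns this t-based interval directly (a sketch, reusing the illustrative `dd` and `fit` objects from earlier):

```{r, echo=TRUE, eval=FALSE}
## Sketch: t-based 95% prediction interval at a new X value
predict(fit, newdata = data.frame(Nprev = 90),
        interval = "prediction", level = 0.95)
## interval = "confidence" instead gives the narrower interval for E[Y_f | X_f]
```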
------------------------------------------------------------------------
Looking more closely at what we'll call the predictive standard error,
$$
\color{blue} s_{pred} = s~ \sqrt{1+ \frac{1}{n} + \frac{(X_f -
\bar{X})^2}{(n-1) s_x^2}} = \sqrt{s^2 + s_{\mathrm{fit}}^2}.
$$
A large predictive error variance (high uncertainty) comes from
- Large $s$ (i.e., large $\varepsilon$'s).
- Small $n$ (not enough data).
- Small $s_x$ (not enough observed spread in covariates).
- Large $(X_f - \bar{X})$.
The first three are familiar... what about the last one?
------------------------------------------------------------------------
For $X_f$ far from our $\bar{X}$, the space between lines is magnified ...
`r sk1()`
------------------------------------------------------------------------
$\Rightarrow$ The prediction (conf.) interval needs to widen away from $\bar{X}$
`r sk1()`
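A sketch visualizing this widening (again with the illustrative `dd` and `fit` objects from the earlier sketches):

```{r, echo=TRUE, eval=FALSE}
## Sketch: prediction bands widen away from the mean of the observed X's
xs <- seq(0, 60, length.out = 100)
pb <- predict(fit, newdata = data.frame(Nprev = xs), interval = "prediction")
plot(dd$Nprev, dd$Nnext, xlab = "N(t)", ylab = "N(t+1)")
lines(xs, pb[, "fit"], lwd = 2)
lines(xs, pb[, "lwr"], lty = 2)
lines(xs, pb[, "upr"], lty = 2)
```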
------------------------------------------------------------------------
**`r mygrn("NOTE")`**: For SLR (and MLR) we can write down equations to describe these predictive intervals, sampling distributions.
`r sk1()`
However, for more complex problems this is not the case.
`r sk1()`
Sometimes we can do analytic/asymptotic approximations. The rest of the time we use simulation.