---
title: "Global Estimation"
author: "Jon Lefcheck"
date: "January 11, 2021"
output: html_document
---
# Global Estimation
## What is (Co)variance?
The building block of the global estimation procedure for SEM is variance, specifically the covariance between variables. Before we delve into the specifics of this procedure, it's worth reviewing the basics of variance and covariance.
Variance is the degree of spread in a set of data. Formally, it captures the deviation of each point from the mean value across all points. Consider the variable $x$. The variance of $x$ is calculated as:
$$VAR_{x} = \frac{\sum(x_{i} - \overline{x})^2}{n-1}$$
where $x_{i}$ is each sample value, $\overline{x}$ is the sample mean, and $n$ is the sample size.
Similarly, for the response $y$:
$$VAR_{y} = \frac{\sum(y_{i} - \overline{y})^2}{n-1}$$
Note that, regardless of the actual scale of the variables, variances are always positive (due to the squared term). The larger the variance, the more spread out the data are around the mean.
Covariance is a measure of the dependency between two variables. Covariance can be formalized as:
$$COV_{xy} = \frac{\sum(x_{i} - \overline{x}) (y_{i} - \overline{y})}{n - 1}$$
If variation in $x$ tends to track variation in $y$, then the numerator is large and the covariance is high; the two variables are then said to co-vary.
Consider a simple example. In `R`, the function `var` computes variance, and `cov` the covariance, which we can compare to coding the equations above by hand:
```{r}
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)
# variance in x
sum((x - mean(x))^2)/(length(x) - 1) == var(x)
# variance in y
sum((y - mean(y))^2)/(length(y) - 1) == var(y)
# covariance
sum((x - mean(x)) * (y - mean(y)))/(length(x) - 1) == cov(x, y)
```
The variance and covariance depend on the magnitude of the units. If the units of $x$ are much larger than $y$, then the covariance will also be large:
```{r}
cov(x, y)
# increase units of x
x1 <- x * 1000
cov(x1, y)
```
This property can make the interpretation and comparison of (co)variances among sets of variables misleading if the units differ greatly between variables. To solve this issue, we can standardize the variables to have a mean of 0 and a variance of 1. This standardization is achieved by subtracting the mean from each observation and dividing by the standard deviation (the square-root of the variance). This procedure is also known as the *Z*-transformation.
```{r}
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
# can also be obtained using the function `?scale`
```
Replacing the values of $x$ and $y$ with their standardized versions in our calculation of covariance yields the Pearson product-moment correlation, $r$. Correlations are expressed in units of standard deviations and thus can be fairly compared regardless of the magnitude of the original variables. The function to compute the correlation is `cor`:
```{r}
sum(zx * zy)/(length(zx) - 1) == cor(x, y)
```
In our example, the two variables are perfectly correlated, so $r = 1$.
Incidentally, this is the same as dividing the covariance of $x$ and $y$ by the product of their standard deviations, which omits the need for the prior *Z*-transformation step but achieves the same outcome:
```{r}
(cov(x, y) / (sd(x) * sd(y))) == cor(x, y)
```
Now that we have reviewed these basic concepts, we can begin to consider them within the context of SEM.
## Regression Coefficients
The inferential heart of structural equation modeling is the set of regression (or path) coefficients. These values mathematically quantify the (mostly linear) dependence of one variable on another. This should sound familiar, because quantifying dependence is exactly what we have already established as the goal of covariance and correlation.
In this section, we will demonstrate how path coefficients can be derived from correlation coefficients and explore Grace's "8 rules of path coefficients."
First, we must define the important distinction between a regression (path) coefficient and a correlation coefficient.
In a simple linear regression, one variable $y$ is the response and another $x$ is the predictor. The association between the two variables is used to generate the predicted value $\hat{y}$:
$$\hat{y} = bx + a$$
where $b$ is the regression coefficient and $a$ is the intercept. It's important to note that $b$ implies a linear relationship, i.e., the relationship between $x$ and $y$ can be captured by a straight line (for now, as we will see later there are ways to relax this restriction).
The regression coefficient between $x$ and $y$ can be related to the correlation coefficient through the following equation:
$$b_{xy} = r_{xy} (SD_{y}/SD_{x})$$
If the variables have been *Z*-transformed, then the $SD_{x} = SD_{y} = 1$ and $b_{xy} = r_{xy}$.
This brings us to our first key point: when the variables have been scaled to mean = 0 and variance = 1, the regression coefficient *is* the correlation coefficient. For multiple regression (more than one predictor $x1$, $x2$, and so on), these values are the partial correlation coefficients. We refer to coefficients estimated from scaled variables as *standardized coefficients*.
*Unstandardized coefficients*, in contrast, are reported in their raw units. As with variance, their values depend on the unit of measure. In fact, the unstandardized coefficient can be related to the variance through the following equation, which may appear familiar, as this is how we achieved a correlation coefficient from covariances earlier:
$$b_{xy} = \frac{COV_{xy}}{VAR_{x}}$$
In mathematical terms, the unstandardized coefficient is the covariance scaled by the variance of the predictor, while the standardized coefficient is further scaled by the ratio of the standard deviations of $x$ and $y$.
We can demonstrate these principles using a simple example:
```{r}
set.seed(111)
data <- data.frame(y1 = runif(100))
data$x1 <- data$y1 + runif(100)
unstd.model <- lm(y1 ~ x1, data)
# get unstandardized coefficient
summary(unstd.model)$coefficients[2, 1]
# now using covariance
(cov(data$y1, data$x1) / var(data$x1))
# repeat with scaled data
std.model <- lm(y1 ~ x1, as.data.frame(apply(data, 2, scale)))
# get standardized coefficient
summary(std.model)$coefficients[2, 1]
# now using correlation
cor(data$y1, data$x1)
```
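We can also confirm the relationship $b_{xy} = r_{xy} (SD_{y}/SD_{x})$ directly from this example:
```{r}
# unstandardized slope recovered from the correlation and the ratio of SDs
cor(data$y1, data$x1) * (sd(data$y1) / sd(data$x1))
# matches the coefficient from the unstandardized model
summary(unstd.model)$coefficients[2, 1]
```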
The concepts of variance, covariance, and correlation therefore directly inform the calculation of unstandardized and standardized regression coefficients and lend them their unique properties that we will now cover in the "8 rules of path coefficients."
### Rule 1: Unspecified relationships among exogenous variables are simply their bivariate correlations.
Variables that only have paths emanating from them (i.e., do not have arrows going into them) are called *exogenous* variables. If there is not a directed path between two exogenous variables, then their relationship can be expressed by the simple correlation between them. This is sometimes, but not always, indicated by a double-headed arrow. So $x1 <-> x2 == cor(x1, x2)$.
In contrast, variables that have arrows directed into them are called *endogenous*. An endogenous variable can also have arrows directed *out* of it; the sole condition is that it must be predicted by at least one other variable (rather than being solely a predictor).
### Rule 2: When two variables are connected by a single path, the coefficient of that path is the regression coefficient.
For this rule, we will expand upon our earlier example to construct a simple path diagram:

```{r}
data$y2 <- data$y1 + runif(100)
```
In this case, the path coefficient connecting $x1 -> y1$ is the regression coefficient $\gamma_{y1,x1}$. Similarly, the path coefficient connecting $y1 -> y2$ is the regression coefficient $\beta_{y2,y1}$. If the data are standardized, then the regression coefficient equals the correlation between the two:
```{r}
(pathx1_y1 <- summary(lm(y1 ~ x1, as.data.frame(apply(data, 2, scale))))$coefficients[2, 1])
cor(data$y1, data$x1)
(pathy1_y2 <- summary(lm(y2 ~ y1, as.data.frame(apply(data, 2, scale))))$coefficients[2, 1])
cor(data$y2, data$y1)
```
Note the notational distinction between paths from exogenous to endogenous variables, which are represented with $\gamma$, and paths between two endogenous variables, which are represented with $\beta$.
### Rule 3: The strength of a compound path (one that includes multiple links) is the product of the individual coefficients.
One of the strengths of SEM is being able to quantify indirect or cascading linkages. This is accomplished by simply multiplying the path coefficients. So the effect of $x1$ on $y2$ is the product of the coefficient of the path $x1 -> y1$ and $y1 -> y2$:
```{r}
pathx1_y1 * pathy1_y2
```
By our earlier logic, this value should equal the correlation between $x1$ and $y2$:
```{r}
cor(data$y2, data$x1)
```
Ahh, but wait! The two values are not the same. This result implies that the relationship between $x1$ and $y2$ cannot be fully explained by the indirect path through $y1$. Rather, we require additional information to solve this problem, and it comes in the form of the missing link $x1 -> y2$, which we can add to the model:

Introducing this path raises a new issue: the relationship between $y1$ and $y2$ now arises from two sources. The first is their direct link, the second is the indirect effect of $x1$ on $y2$ *through* $y1$. We require a new approach to be able to compute the independent effects of each variable on the others, which comes in the form of the 'partial' regression coefficient.
### Rule 4. When variables are connected by more than one pathway, each pathway is the 'partial' regression coefficient.
A partial regression coefficient accounts for the joint influence of more than one variable on the response. In other words, the coefficient for one predictor controls for the influence of other predictors in the model. In this new model, $y2$ is affected by two variables: $x1$ and $y1$.
Procedurally, this involves removing the shared variance between $x1$ and $y1$ so that their associations with $y2$ can be independently derived.
We can calculate the partial effect of $x1$ through the following equation:
$$b_{y2x1} = \frac{r_{x1y2} - (r_{x1y1} \times r_{y1y2})}{1 - r_{x1y1}^2}$$
which takes the bivariate correlation between $x1$ and $y2$, removes the portion of that correlation arising through their joint association with $y1$, then scales the result by the variance in $x1$ that is *not* shared with $y1$. The result is the partial effect of $x1$ on $y2$.
```{r}
(partialx1 <- (cor(data$x1, data$y2) - (cor(data$x1, data$y1) * cor(data$y1, data$y2))) / (1 - cor(data$x1, data$y1)^2))
```
It is important to note that partial coefficients implement a statistical (rather than experimental) control. In other words, the partial effect of $x1$ accounts for the statistical contributions of $y1$. Thus, partial effects are useful in situations where experimental controls are difficult or impossible to apply.
Notably, we can back out the unstandardized coefficient by multiplying by the ratio of standard deviations of the response over the predictor:
```{r}
((cor(data$x1, data$y2) - (cor(data$x1, data$y1) * cor(data$y1, data$y2))) / (1 - cor(data$x1, data$y1)^2))* (sd(data$y2)/sd(data$x1))
summary(lm(y2 ~ x1 + y1, data))$coefficients[2, 1]
```
Similarly, the partial effect of $y1$ on $y2$ is given by:
$$b_{y2y1} = \frac{r_{y2y1} - (r_{y2x1} \times r_{y1x1})}{1 - r_{x1y1}^2}$$
```{r}
(partialy1 <- (cor(data$y2, data$y1) - (cor(data$y2, data$x1) * cor(data$y1, data$x1))) / (1 - cor(data$x1, data$y1)^2))
```
As before, we can arrive at the same answer by looking at the (standardized) coefficients obtained through multiple regression:
```{r}
summary(lm(y2 ~ x1 + y1, as.data.frame(apply(data, 2, scale))))$coefficients[2:3, 1]
partialx1; partialy1
```
The regression procedure therefore produces partial coefficients in the case of multiple predictors.
Another way of thinking about this is that the effect of $x1$ on $y2$ is computed AFTER having removed the effect of $y1$ on $y2$. Procedurally, this can be done by first removing the variance in $x1$ explained by $y1$, then regressing $y2$ against the residual values (i.e., the variance in $x1$ that is *unexplained* by $y1$).
```{r}
residsx1 <- residuals(lm(x1 ~ y1, as.data.frame(apply(data, 2, scale))))
summary(lm(scale(data$y2) ~ residsx1))$coefficients[2, 1]
partialx1
```
Indeed, this procedure gives us the same value as the earlier equation based on correlations and the output from the multiple regression.
However, this raises the interesting notion of residual error. The model still leaves variance in $y2$ that is unassociated with variation in either $x1$ or $y1$. In other words, the model does not perfectly predict $y2$. The idea of residual (unexplained) variance leads us to the fifth rule of path coefficients.
### Rule 5: Errors on endogenous variables reflect the unexplained correlations or variances arising from unmeasured variables.
Most researchers are familiar with the $R^2$ statistic, or the ratio of explained variation to the total variation in $y2$. It follows then that the unexplained or residual variance is $1 - R^2$.
For example, the error variance on $y2$ is:
```{r}
1 - summary(lm(y2 ~ y1 + x1, as.data.frame(apply(data, 2, scale))))$r.squared
```
This value captures the other (unknown) sources that cause the correlation between $y2$ and the other variables to deviate from 1. In other words, if $x1$ is the *only* variable controlling $y2$, then their correlation would be 1 and their prediction error would be 0 because we would have accounted for everything that affects $y2$.
This idea of residual or unexplained variance is easily illustrated with the relationship between $x1$ and $y1$ (because there are no partial correlations to deal with), where the square-root of variance explained is simply the correlation coefficient, and 1 - the correlation is the unexplained correlation arising from other sources:
```{r}
1 - sqrt(summary(lm(y1 ~ x1, as.data.frame(apply(data, 2, scale))))$r.squared)
1 - cor(data$y1, data$x1)
```
In a path diagram, error variances are often represented as $\zeta$ with an arrow leading into the endogenous variable. The path coefficient representing the effect of $\zeta$ is often expressed as the error correlation, $\sqrt{1 - R^2}$, in keeping with the presentation of the other (standardized) coefficients.
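To make this concrete, here is the standardized error path on $y2$ from our running example, using the same multiple regression as above:
```{r}
# standardized 'path' from the error to y2: sqrt(1 - R^2)
sqrt(1 - summary(lm(y2 ~ y1 + x1, as.data.frame(apply(data, 2, scale))))$r.squared)
```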
### Rule 6: Unanalyzed (residual) correlations between two endogenous variables are their partial correlations.
Imagine we remove the path from $y1 -> y2$:

Both variables are endogenous and their relationship can still be quantified, just not in a directed way. If they were exogenous variables, the relationship would be their bivariate correlation (Rule #1), but in this case, we have to remove the effects of $x1$ on both variables.
$$r_{y1y2 \bullet x1} = \frac{r_{y1y2} - (r_{x1y1} \times r_{x1y2})}{\sqrt{(1 - r_{x1y1}^2)(1 - r_{x1y2}^2)}}$$
This equation removes the effect of $x1$ on both endogenous variables and scales by the variance in each that is not shared with $x1$.
```{r}
(errory1y2 <- (cor(data$y1, data$y2) - (cor(data$x1, data$y1) * cor(data$x1, data$y2))) / sqrt((1 - cor(data$x1, data$y1)^2) * (1 - cor(data$x1, data$y2)^2)))
```
As with the partial correlation, this value is the same as the correlation between the residuals of the two models:
```{r}
(cor(
resid(lm(y1 ~ x1, as.data.frame(apply(data, 2, scale)))),
resid(lm(y2 ~ x1, as.data.frame(apply(data, 2, scale))))
))
errory1y2
```
Hence these are known as *correlated errors* and are represented by double-headed arrows between the errors of two endogenous variables. (Often the errors or $\zeta$'s are omitted in the diagram, and the graph simply depicts a double-headed arrow between the variables themselves, but the correlation is truly among their errors.)
If the presence of $x1$ explains all of the variation in $y1$ and $y2$, then their partial correlation will be 0. In this case, the two endogenous variables are said to be *conditionally independent*, or that they are unrelated conditional on the joint influence of $x1$. If the two are conditionally independent, then the correlation between $y1$ and $y2$ is the product of the correlations between $y1$ and $x1$, and $y2$ and $x1$.
$$r_{y1y2} = r_{y1x1} \times r_{y2x1}$$
(If we replace this term in the previous equation, the numerator becomes 0 and so does the partial correlation.)
Let's calculate this value:
```{r}
cor(data$y1, data$x1) * cor(data$y2, data$x1)
```
Ah, the two are very different, implying that $y1$ and $y2$ are *not* conditionally independent given the joint influence of $x1$. In other words, there are other, unmeasured sources of variance that are influencing the relationship between these two variables. This interpretation is a good way to think about the notion of correlated errors.
The concept of conditional independence is critical in the implementation of local estimation, principally in the tests of directed separation that form the basis of one goodness-of-fit statistic, which we will revisit in Chapter 3.
Now that we have derived all the quantities related to direct, indirect, and error variances or correlations, we have all the information necessary to calculate total effects.
### Rule 7: The total effect one variable has on another is the sum of its direct and indirect effects.
If we return to our previous path model, which reinstates the path between $y1 -> y2$, the total effect of $x1$ on $y2$ includes the direct effect as well as the indirect effect mediated by $y1$.
```{r}
(totalx1 <- partialx1 + cor(data$y1, data$x1) * partialy1)
```
This value can be used to demonstrate the final rule.
### Rule 8: The total effect (including undirected paths) is equivalent to the total correlation.
We can test this rule easily:
```{r}
totalx1 == cor(data$y2, data$x1)
```
Indeed, the total effect equals the correlation!
If we consider the path model without the directed link between $y1$ and $y2$, the correlation between $y1$ and $y2$ considers the total effect *and* undirected effects (i.e., correlated errors between the two endogenous variables):
```{r}
(totaly1y2 <- cor(data$y1, data$x1) * cor(data$y2, data$x1) + errory1y2 * sqrt(1 - summary(lm(y1 ~ x1, data))$r.squared) * sqrt(1 - summary(lm(y2 ~ x1, data))$r.squared))
totaly1y2 == cor(data$y1, data$y2)
```
This example closes our discussion of path coefficients and their 8 rules.
The major points to remember are:
- standardized coefficients reflect (partial) correlations;
- the indirect effect of one variable on another is obtained by multiplying the individual path coefficients (standardized or unstandardized);
- the total effect is the sum of direct and indirect paths;
- the bivariate correlation is the sum of the total effect plus any undirected paths.
An understanding of covariances and correlations is essential to understanding the solutions provided by a global estimation approach to SEM.
## Variance-based Structural Equation Modeling
The classical approach to SEM is based on the idea of variances and covariances. With two or more variables, we can construct a variance-covariance matrix, where the diagonal contains the variances of each variable and the off-diagonals are the covariances between each pair. Consider our last example:
```{r}
cov(data)
```
This code returns the variance-covariance matrix for the three variables $x1$, $y1$, and $y2$. We would call this the *observed global* variance-covariance matrix.
The entire machinery behind covariance-based SEM is geared toward reproducing that global variance-covariance matrix. In fact, all of covariance-based SEM can be boiled down to this simple equation:
$$\Sigma = \Sigma(\Phi)$$
where $\Sigma$ is the observed variance-covariance matrix, and $\Sigma(\Phi)$ is the model-estimated covariance matrix expressed in terms of $\Phi$, the matrix of model-estimated parameters (i.e., coefficients).
In other words, this equation shows that the observed covariances can be understood in terms of statistical parameters that can be used to predict these same covariances. We have already shown that linear regression can be expressed using variances and covariances, and so it follows that covariance-based SEM is simply a multivariate approach to regression.
The question is: how do we arrive at $\Sigma(\Phi)$, or more relevantly, how do we estimate the matrix of model parameters $\Phi$ that lead to the estimated variance-covariance matrix?
The most common tool is *maximum-likelihood estimation*, which iteratively searches parameter space and continually refines estimates of parameter values such that the differences between the observed and expected variance-covariance matrices are minimized.
The maximum-likelihood fitting function can be expressed as:
$$F_{ML} = \log|\hat{\Sigma}| + \mathrm{tr}(S\hat{\Sigma}^{-1}) - \log|S| - (p + q)$$
where $\hat{\Sigma}$ is the model-implied covariance matrix, $S$ is the observed covariance matrix, $p$ is the number of endogenous variables, and $q$ is the number of exogenous variables. $\mathrm{tr}$ is the trace of the matrix (the sum of the diagonal) and $^{-1}$ denotes the matrix inverse.
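To make the pieces of this expression concrete, here is a minimal illustrative sketch of the fitting function in base R, assuming `S` and `Sigma` are covariance matrices over the same observed variables (in practice, the software searches for the $\hat{\Sigma}$ that minimizes $F_{ML}$):
```{r}
# minimal illustrative sketch of the ML fitting function
# S: observed covariance matrix; Sigma: model-implied covariance matrix
F_ML <- function(S, Sigma) {
  log(det(Sigma)) + sum(diag(S %*% solve(Sigma))) - log(det(S)) - nrow(S)
}

# if the model-implied matrix exactly reproduces the observed matrix, F_ML = 0
S <- cov(data)
F_ML(S, S)
```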
Maximum-likelihood estimators have a few desirable properties, principally that they are invariant to the scales of the variables and provide unbiased estimates, based on a few assumptions:
- variables must exhibit multivariate normality. Often this is not the case: dummy variables, interactions, and other product terms have non-normal distributions. However, $F_{ML}$ is fairly robust to violations of multivariate normality, especially as the sample size grows large.
- the observed matrix $S$ must be positive-definite. This means there are no negative variances, no implied correlations > 1.0, and no redundant variables (one row being a linear function of another).
- finally, $F_{ML}$ assumes a sufficiently large sample size.
This last point about sample size is an important one: as models become increasingly complex, they require more data to fit. The issues of model 'identifiability' and sample size are dealt with in the next section.
## Model Identifiability
Like any statistical technique, having sufficient power to test your hypotheses is key to arriving at robust unbiased inferences about your data. This requirement is particularly relevant to SEM, which often evaluates multiple hypotheses simultaneously, and therefore requires more data than other approaches. In this section, we will briefly review the idea of model 'identifiability' and sample size requirements.
A model is 'identified' if we can uniquely estimate each of its parameters. This includes not only the parameter estimates themselves but also their errors. In other words, we need *at minimum* as many 'known' pieces of information as 'unknowns' to be able to fit a model.
Consider the following equation:
$$a + b = 8$$
We have 1 piece of known information, $8$, and two unknowns, $a$ and $b$. There are an infinite number of solutions for $a$ and $b$ (e.g., 1 and 7, 2 and 6, etc.), and thus the equation cannot be solved uniquely. In this case, the model would be *underidentified* because we lack sufficient information to arrive at a unique solution.
Now consider another equation, in addition to the first:
$$a = 3b$$
With this equation, we now have enough information to uniquely solve for $a$ and $b$:
$$(3b) + b = 8$$
$$ 4b = 8$$
$$b = 8 / 4 = 2$$
$$a + 2 = 8$$
$$a = 8 - 2 = 6$$
Thus we have arrived at a single solution for $a$ and $b$. We call this system of equations *just identified* since we have just enough information to solve for the unknowns.
Finally, consider a third equation:
$$2a - 4 = 4b$$
We now have more pieces of known information than unknowns, since we have already arrived at a solution for both $a$ and $b$ based on the previous two equations. In this case, we call the system of equations *overidentified* because we have more information than is necessary to arrive at unique solutions for our unknown variables. This is the desirable state, because that extra information can be used to provide additional insight, which we will elaborate on shortly.
You may have also heard models referred to as *saturated*. Such a model would be *just identified*; an *unsaturated* model would be *overidentified*, and an *oversaturated* model would be *underidentified*.
There is a handy rule that can be used to quickly gauge whether a model is under-, just, or overidentified: the "t-rule." The t-rule takes the following form:
$$t \leq \frac{n(n+1)}{2}$$
where $t$ is the number of unknowns (parameters to be estimated) and $n$ is the number of knowns (observed variables). The left hand side is how many pieces of information we want to know. The right hand side reflects the information we have to work with and is equal to the number of unique cells in the observed variance-covariance matrix (diagonal = variance, and lower triangle = covariances).
Consider the simple mediation model from earlier:

In this model, we have several pieces of known information: $x1$, $y1$, and $y2$. So $n = 3$ for this model.
We need to estimate the parameters related to each set of relationships ($\gamma_{x1y1}$ and $\beta_{y1y2}$). These amount to the two covariances, but recall from the first section of this chapter that we need three variances to derive those estimates ($var_{x1}$, $var_{y1}$, $var_{y2}$). So the total number of unknowns to be estimated is $t = 5$ (the three variances and the two regression parameters, or covariances).
We can plug in these values to see if we meet the t-rule:
$$5 \leq \frac{3(3+1)}{2} = 6$$
In this case $5 \leq 6$ holds true and we have enough information to arrive at a unique solution. Note that the right hand side of the equation is the number of entries in the variance-covariance matrix: 3 variances (diagonal) plus 3 covariances (off-diagonal).
Let's consider our second model, which adds another path:

Now we must additionally estimate the path from $x1$ to $y2$, so $t = 5 + 1 = 6$. Since $6 \leq 6$, the t-rule is still satisfied; in this case, however, the model is just identified.
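As a quick sanity check, the t-rule is easy to encode as a one-line helper (a hypothetical convenience function, not part of any package):
```{r}
# hypothetical helper: do t estimated parameters satisfy the t-rule
# given n observed variables?
t_rule <- function(t, n) t <= n * (n + 1) / 2

t_rule(t = 5, n = 3) # full mediation model: overidentified
t_rule(t = 6, n = 3) # adding the x1 -> y2 path: just identified
```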
Tallying the number of parameters to be estimated can sometimes be tricky because path diagrams are not always drawn with error variances on the endogenous variables. Additionally, multiple exogenous variables also have covariances that must be estimated: these are depicted as non-directional or double-headed arrows between every pair of exogenous variables. However, these double-headed arrows are rarely drawn, even though they exist. Thus it can be tricky to tally the unknowns in the above equation. As for the known information, it is always valid to simply count the number of unique cells in the variance-covariance matrix (the diagonal and the lower off-diagonal).
If we were to consider a more complex model, such as one with a feedback from $y2$ to $y1$ (in addition to the path from $y1 -> y2$) then we would not have enough information to solve the model, which would be *underidentified*.
Models with bidirectional feedbacks (with separate arrows going in each direction, as opposed to a single double-headed arrow) are referred to as *non-recursive*. These feedbacks can also occur among variables, for instance: $x1 -> y1 -> y2 -> x1$ would also be a non-recursive model. *Recursive* models, then, lack such feedbacks and all causal effects are unidirectional.
Identifiability of non-recursive models is tricky. Such models must satisfy the *order condition*, which tests whether the variables involved in the feedback have unique information. In our example above where $y2$ also affects $y1$, $y1$ has unique information in the form of $x1$, but $y2$ has no unique information, so it fails the order condition.
The order condition can be evaluated using the following equation:
$$G \leq H$$
where $G$ is the number of incoming paths and $H$ is the number of exogenous variables plus the number of indirectly-connected endogenous variables. In the previous example, $G = 2$ while $H = 1$, so the model fails the order condition, as noted.
Model identification is only the first step in determining whether a model can provide unique solutions: sample size can also restrict model fitting by not providing enough replication for the $F_{ML}$ function to arrive at a stable set of estimates for the path coefficients.
The basic rule-of-thumb is that the level of replication should be *at least* 5 times the number of estimated coefficients (not error variances or other correlations). In our previous path model, we are estimating two relationships, so we require at least $n = 10$ to fit that model.
However, this value is a lower limit: ideally, replication is *5-20x* the number of estimated parameters. The larger the sample size, the more precise (unbiased) the estimates will be. This is true for all linear regressions, not just SEM, and so we adopt these guidelines here.
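For our simple mediation model with two estimated path coefficients, the guideline works out to:
```{r}
# rule-of-thumb replication: 5-20x the number of estimated path coefficients
n_coefs <- 2
c(minimum = 5 * n_coefs, ideal_upper = 20 * n_coefs)
```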
Identifiability and replication are key in not only providing an actual solution, but also in providing extra information with which to evaluate model fit, the topic of the next section.
## Goodness-of-fit Measures
As we have established, the purpose of covariance-based SEM is to reproduce the global observed variance-covariance matrix. However, given that our hypothesized relationships may not actually match the data, we must be prepared to evaluate how well the model-estimated variance-covariance matrix matches the observed variance-covariance matrix.
Recall that in the previous section on path coefficients, we interpreted the error variance or correlation as reflecting outside sources of variation uncaptured by our measured variables. High error variances lead to less accurate estimates of the relationships among variables, and thus a high level of disagreement between the observed and model-implied variance-covariance matrices.
Also recall our formula for the maximum-likelihood fitting function:
$$F_{ML} = \log|\hat{\Sigma}| + \mathrm{tr}(S\hat{\Sigma}^{-1}) - \log|S| - (p + q)$$
where $\hat{\Sigma}$ is the model-implied covariance matrix, $S$ is the observed covariance matrix, $p$ is the number of endogenous variables, and $q$ is the number of exogenous variables. $\mathrm{tr}$ is the trace of the matrix (the sum of the diagonal) and $^{-1}$ denotes the matrix inverse.
In the event that $\hat{\Sigma} = S$, the first and third terms cancel, and the trace term equals $(p + q)$, so the remaining terms cancel as well. Thus $F_{ML} = 0$ implies a perfect fit because the observed covariance matrix has been *exactly* reproduced.
Conversely, a large value of $F_{ML}$ implies a large discrepancy between the observed and model-implied variance-covariance matrices, which can be interpreted as a 'poor fit' of the model.
In fact, $F_{ML}$ is $\chi^2$-distributed such that:
$$\chi^2 = (n - 1)F_{ML}$$
which allows us to actually quantify model fit. (In this case $n$ refers to the sample size.)
We can then formally compare the $\chi^2$ statistic to the $\chi^2$-distribution with some degrees of freedom to obtain a confidence level in the fit. The degrees of freedom are determined by model identification: just identified models have 0 degrees of freedom (and thus no goodness-of-fit test is possible), while overidentified models have $n(n + 1)/2 - t$ degrees of freedom (where $n$ is the number of knowns and $t$ the number of unknowns, from the t-rule).
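Continuing the illustrative `F_ML()` sketch from earlier, the conversion to a $\chi^2$ test might look like the following (the statistic is 0 here only because `S` trivially reproduces itself):
```{r}
# illustrative chi-squared test based on the earlier F_ML() sketch
n_obs <- nrow(data)
chi2 <- (n_obs - 1) * F_ML(S, S)           # ~0: 'perfect' fit by construction
df_model <- 3 * (3 + 1) / 2 - 5            # knowns minus unknowns (t-rule)
pchisq(chi2, df_model, lower.tail = FALSE) # ~1, i.e., no detectable lack of fit
```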
Failing to reject the null hypothesis (that the $\chi^2$ statistic is not different from 0, i.e., the ideal fit) implies a generally good representation of the data (*P* > 0.05). Alternatively, rejecting the null implies that the $\chi^2$ statistic is large, as is the discrepancy between the observed and modeled variance-covariance matrices, thus implying a poor fit to the data (*P* < 0.05). Interpreting the outcome of this significance test is often tricky: unlike testing whether regression parameters are significantly different from zero (something we are all familiar with), here a significant *P*-value indicates *poor* fit, so be careful.
The $\chi^2$ index also provides a way to gauge the relative fit of two models, one of which is nested within the other. The *$\chi^2$ difference test* is simply the difference in $\chi^2$ values between the two models, with the degrees of freedom being the difference in degrees of freedom between the two models. The resulting statistic can then be compared to a $\chi^2$ table to yield a significance value. Again, this test is for *nested* models. For non-nested models, other statistics allow for model comparisons, including AIC and BIC. A difference in AIC or BIC of $\geq 2$ is generally taken to indicate a meaningful difference between models, with smaller differences indicating that the two models are roughly equivalent.
$\chi^2$ tests tend to be affected by sample size, with larger samples more likely to generate poor fit due to small absolute deviations (note the scaling of $F_{ML}$ by $n-1$ in the above equation). As a result, there are several other fit indices for covariance-based SEM that attempt to correct for this problem:
- *Root-mean squared error of approximation* (RMSEA): this statistic penalizes models based on sample size. An 'acceptable' value is generally <0.10 and a 'good' value is anything <0.08.
- *Comparative fit index* (CFI): this statistic considers the deviation from a 'null' model. In most cases, the null estimates all variances but sets the covariances to 0. A value >0.9 is considered good.
- *Standardized root-mean squared residual* (SRMR): the standardized difference between the observed and predicted correlations. A value <0.08 is considered good.
There are a vast and probably unnecessary number of other fit statistics that have been developed which you may run across. This [website](http://davidakenny.net/cm/fit.htm) has a fairly comprehensive overview, in the event you encounter some of the more uncommon ones in the literature.
What happens if the model doesn't fit? Depending on the goals of your analysis (e.g., exploratory) you may wish to see which parts of your model have failed to be reproduced by the model-implied variance-covariance matrix. This can be achieved in two ways:
- examination of the residual correlations: variable pairs with large residual correlations (differences between the observed and expected correlations) could suggest missing information or linkages.
- *modification indices*, or the expected decrease in the $\chi^2$ if a missing path were to be included in the model. A high value of a modification index would suggest the missing path should be included. (Tests of directed separation, which we cover in the chapter on Local Estimation, provide similar insight and are returned automatically by *piecewiseSEM*.)
Users should take caution when exploring these techniques so as to avoid dredging the model. SEM is a technique that relies heavily on informed model specification. Adding paths that are suggested by the data but not anticipated by the user, solely in an effort to achieve adequate fit, might be appropriate in other applications, but it ignores the basic philosophy behind SEM: that relationships are based on *a priori* knowledge and intuition.
## Model Fitting Using *lavaan*
We now have all the pieces necessary to fit an SEM using a covariance-based approach. The package to do so is called *lavaan*:
```{r, results = "hide", message = FALSE, warning = FALSE}
library(lavaan)
```
To demonstrate the functionality of this package, let's use the data from Grace & Keeley (2006), which is included in the *piecewiseSEM* package:
```{r, results = "hide", message = FALSE, warning = FALSE}
library(piecewiseSEM)
data(keeley)
```
In their study, Grace & Keeley wanted to understand patterns in plant diversity following disturbance, in this case wildfires in California.
### *lavaan* vs `lm`
For purposes of exploration, let's first consider the relationship between fire severity and stand age (with older stands having more combustible materials).

This is both a linear regression and a simple SEM, and thus it can be fit using either approach in R.
The package to fit the SEM using covariance-based methods is called *lavaan* (for LAtent VAriable ANalysis, which we will delve into in a later chapter). In *lavaan*, the syntax is the same as in other modeling functions in R with one key distinction: formulae are passed as character strings. To fit a model in *lavaan*, it's first necessary to break down the component models by the endogenous (response) variables and code them as characters. For example:
```{r}
keeley_formula1 <- 'firesev ~ age'
class(keeley_formula1)
```
The function used to fit the model is called (unsurprisingly) `sem` and accepts the formula string and the dataset:
```{r}
keeley_sem1 <- sem(keeley_formula1, data = keeley)
```
As with most other models, the function to retrieve the output is `summary`:
```{r}
summary(keeley_sem1)
```
The output is organized into a few sections. First is the likelihood optimization method, the number of parameters, the total sample size for the model, the estimator ($F_{ML}$ is the default), and the fit statistic. The model has $\chi^2 = 0$ with 0 degrees of freedom: this is because we have as many knowns as unknowns, and thus the model is just identified, or saturated. To show this, we can apply the t-rule: we must estimate the two variances of the variables plus their path coefficient ($t = 3$) and we know the values of the two variables ($n = 2$). Recall the t-rule $t \leq n(n + 1)/2$: here $3 \leq 2(2+1)/2 = 3$, and therefore the model is saturated.
Next up are the actual parameter estimates: the relationship between fire severity and stand age is $\beta = 0.06$ with $P < 0.001$. The model also reports the estimated error variance on the endogenous variable.
We can dissect this output a little more. First, let's fit the corresponding linear model using `lm`:
```{r}
keeley_mod <- lm(firesev ~ age, data = keeley)
summary(keeley_mod)$coefficients
```
You'll notice that we get the same effect of age on fire severity, $\beta = 0.0596$, but we also get an intercept, which is missing from the previous output. We can force *lavaan* to return the intercept by passing the argument `meanstructure = T` to the `sem` function:
```{r}
summary(sem(keeley_formula1, keeley, meanstructure = T))
```
which now returns the estimate of the intercept for fire severity. This value is the same as returned by `lm`.
Returning to our exploration of how path coefficients are calculated, the slope of a simple linear regression is $b_{xy} = COV_{xy}/VAR_{x}$, which we can recover from the raw data:
```{r}
cov(keeley[, c("firesev", "age")])[2, 1]/var(keeley$age)
```
Note that this value is the same returned from both `sem` and `lm`. Recall also for simple linear regression that the standardized coefficient is equal to the correlation:
```{r}
cor(keeley$firesev, keeley$age)
```
We can obtain the standardized coefficient from `lm` using the `coefs` function from *piecewiseSEM*:
```{r}
coefs(keeley_mod)
```
and indeed, it equals the bivariate correlation!
To return the standardized coefficients using *lavaan* requires a separate function or another argument. The function is `standardizedsolution` and returns a table of the standardized coefficients:
```{r}
standardizedsolution(keeley_sem1)
```
This output does not return the raw coefficients, however, or any other information about the model that is useful in interpretation. To obtain a single output, you can pass the argument `standardize = T` to `summary`:
```{r}
summary(keeley_sem1, standardize = T)
```
Now, a few columns are added at the end to report the standardized coefficients.
`standardizedsolution` also returned the error variance on fire severity, which is $1 - R^2$. However, *lavaan* does not return the $R^2$ value by default; it can be retrieved by passing one more argument to `summary`, `rsq = T`:
```{r}
summary(keeley_sem1, standardize = T, rsq = T)
```
You'll note that, per Rule 5 of path coefficients, the error variance is $1 - R^2$.
### SEM using *lavaan*
Now that we have covered the basics of *lavaan*, let's fit a slightly more complicated SEM. This model is a simplified subset of the full model presented by Grace & Keeley:

Here, we test the hypothesis that total cover of plants is a function of fire severity, which in turn is informed by how old the plants are in a particular plot (which we have already investigated using linear regression). This structure is known as *full mediation*: the effect of age is fully mediated by fire severity (we will test another scenario shortly).
Again, we must provide the formulae as a character string. This model can be broken down into two equations (placed on separate lines) representing the two endogenous variables:
```{r}
keeley_formula2 <- '
firesev ~ age
cover ~ firesev
'
```
NOTE: Multiple equations will go on separate lines so that *lavaan* can properly parse the model.
Now let's fit the SEM and examine the output:
```{r}
keeley_sem2 <- sem(keeley_formula2, data = keeley)
summary(keeley_sem2, standardize = T, rsq = T)
```
The key difference from our previous application of *lavaan* is we now have extra information with which to compute the $\chi^2$ goodness-of-fit statistic. A quick application of the t-rule: unknowns = 3 variances + 2 path coefficients = 5, knowns = 3 variables, so $5 < 3(3+1)/2 = 6$ leaving us 1 extra degree of freedom.
Moreover, we fail to reject the null hypothesis that the observed and model-implied variance-covariance matrices are the same ($P = 0.069$). Thus, we have achieved adequate fit with this model. A note here: the magnitude of the *P*-value does not reflect increasing fit, only increasing support for the notion that the observed and model-estimated covariances are not different.
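We can also inspect the two matrices being compared. On a *lavaan* object, `fitted` should return the model-implied moments, which we can set beside the observed covariances (the variable ordering may differ):
```{r}
# model-implied covariance matrix
fitted(keeley_sem2)$cov
# observed covariance matrix for the same variables
cov(keeley[, c("firesev", "age", "cover")])
```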
Incidentally, we can obtain other fit statistics using the `fitMeasures` function:
```{r}
fitMeasures(keeley_sem2)
```
Woah! We're certainly not lacking in statistics.
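To pull out just the indices discussed in the goodness-of-fit section, pass their names to `fitMeasures`; modification indices are also available through `modindices`:
```{r}
# a more digestible subset of fit indices
fitMeasures(keeley_sem2, c("rmsea", "cfi", "srmr"))
# modification indices: expected drop in chi-squared if a missing path were added
modindices(keeley_sem2)
```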
Returning to the summary output, we see the same coefficient for `firesev ~ age` and a new estimate for `cover ~ firesev`. In this case, more severe fires reduce cover (not unexpectedly).
Now that we have multiple linkages, we can also compute the indirect effect of age on cover. Recall from Rule 3 of path coefficients that the indirect effects along a compound path are the product of the individual path coefficients: $0.454 * -0.437 = -0.198$.
We can obtain this value by modifying the model formula to include these calculations directly. This involves giving a name to the coefficients in the model strings, then adding a new line indicating their product using the operator `:=`:
```{r}
keeley_formula2.1 <- '
firesev ~ B1 * age
cover ~ B2 * firesev
indirect := B1 * B2
'
keeley_sem2.1 <- sem(keeley_formula2.1, keeley)
summary(keeley_sem2.1, standardize = T)
```
Indeed, the indirect path coefficient is the same as computed above.
Naming coefficients can come in handy when specifying, for example, fixed values for specific purposes (e.g., multigroup modeling).
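For example, a coefficient can be fixed to a specific value (rather than estimated) by pre-multiplying the predictor with a numeric constant; the value below is purely illustrative:
```{r}
# hypothetical illustration: fix the firesev ~ age coefficient at 0.06
keeley_formula_fixed <- '
firesev ~ 0.06 * age
cover ~ firesev
'
keeley_sem_fixed <- sem(keeley_formula_fixed, data = keeley)
```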
### Testing Alternative Structures using *lavaan*
There is another possible configuration of these variables which includes a directed path between age and cover:

This type of model tests *partial mediation*, or the idea that the effect of age is partially mediated by fire severity, but there is still a direct effect between age and cover.
Let's fit the partial mediation model:
```{r}
keeley_formula3 <- '
firesev ~ age
cover ~ firesev + age
'
keeley_sem3 <- sem(keeley_formula3, data = keeley)
summary(keeley_sem3, standardize = T)
```
Ah, we have a problem: the model is saturated, so there are no degrees of freedom leftover with which to test model fit.
We can, however, use the $\chi^2$ difference test to determine whether the partial mediation model is more or less supported than the full mediation model, which is called using the function `anova`:
```{r}
anova(keeley_sem2, keeley_sem3)
```
(Note that the $\chi^2$ statistic and associated degrees of freedom are still 0 for the model testing partial mediation.)
We can see from this output that the two models are not significantly different ($P = 0.069$). Because these are nested models, it is also fair to compare them using AIC/BIC. In both cases, the models are deemed equivalent, and the more parsimonious model (full mediation) would be preferred. Moreover, examining the output of the partial mediation model reveals a non-significant effect of age on cover ($P = 0.067$). Together, these pieces of information suggest that plant cover is not directly affected by stand age; rather, the effect of age is entirely mediated through the severity of burns.
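The AIC and BIC values can also be pulled out for each model individually via `fitMeasures`:
```{r}
# information criteria for the full vs. partial mediation models
fitMeasures(keeley_sem2, c("aic", "bic"))
fitMeasures(keeley_sem3, c("aic", "bic"))
```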
## References
Grace, J. B., & Keeley, J. E. (2006). A structural equation model analysis of postfire plant diversity in California shrublands. Ecological Applications, 16(2), 503-514.