-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy path08-Composite_variables.Rmd
247 lines (148 loc) · 13.5 KB
/
08-Composite_variables.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
---
title: "Composite Variables"
author: "Jon Lefcheck"
date: "January 16, 2021"
output: html_document
---
# Composite Variables
## What is a Composite Variable?
Composite variables are another way besides latent variables to represent complex multivariate concepts in structural equation modeling. The most important distinction between the two is that, while latent variables *give rise to* measurable manifestations of an unobservable concept, composite variables *arise from* the total combined influence of measured variables.
Consider the following composite variable $\eta$:

Here, the arrows are leading *into*, not out of, $\eta$, indicating that the composite variable is made up of the influences of the three observed variables. Note: in this and other presentations, the composite is denoted by a hexagon, but can sometimes be an oval as it can technically be a form of a latent variable.
If the composite is *entirely* made up of the three influences, it can be said to have no error. An example might be the two levels of a treatment that lead into a single composite variable called 'Treatment.' In this case, there are no other levels of treatment because you, as the investigator, did not apply any. Thus the composite 'treatment' captures the full universe of treatment possibilities given the data.
In other cases, the property might arise from the collective influence of variables but is not without error. For example, the idea of soil condition arises from different aspects of the soil: its pH, moisture, grain size, and so on. However, one might measure only some of these, and thus there remain other factors (nutrient content, etc.) that might contribute to the notion of soil condition. In this case, the composite *would* have error and is therefore known as a *latent composite*.
The benefit of such an approach is that complicated constructs can be distilled into discrete *blocks* that are easier to present and discuss. For example, it is easier to talk about the effects of the experimental treatment on soil condition, rather than the effect of treatment 1 on soil moisture, the effect of treatment 1 on soil pH, the effect of treatment 1 on soil grain size, and so on.
In this way, the composite harkens back to the early meta-model, or broad conceptual relationships that inform the parameterization of the structural path model. In fact, in populating the meta-model, you may wish to consider those broad concepts as composites (or latents) when fitting the model, rather than modeling all relationships among all observed variables.
Selecting between latent and composite variables comes down to the concept in question, the presumed direction of causality, and the nature of the indicators.
For the soil example, consider: is it that there is a common difference among soils driving variation in pH, moisture, etc.? Or is it that pH, moisture, etc. are all independent properties that combine to inform soil condition? If the goal is measure plant growth in potting soils from different manufacturers, then manufacturer might be the common source of variation and a latent variable more appropriate. If the observer is visiting different sites and measuring conditions that describe the soil in each place, then perhaps a composite variable is warranted.
Another way of thinking about this is whether the indicators are interchangeable. In other words, does soil pH tell us the same information as soil moisture? If so, then they might be indicators of the same latent phenomenon. If not, and they contain unique information, then they likely combine to form a composite variable.
Finally, do the indicators co-vary? If they are under common control of a latent variable, then changing one should alter all the others. If they are relatively independent--for example, one could change grain size without changing nutrient content--then causation likely flows into a composite (rather than out of a latent) variable.
Now that we have defined a composite variable, let's see how to make one.
## Constructing a Composite Variable
Compared to latent variables, a composite variable is actually very easy to estimate: it is simply the sum of its indicators, hence the term *composite*.
The way in which the indicators are summed depends on whether they are expected to have the same weight (a *fixed composite*) or different weights (a *statistical composite*). The former might be something like species relative abundances. The latter is what we will focus on here because it has the most practical applications in ecology.
The weights for the composite are easily acquired as they are the values that maximize the explained variance in some response. We have done this before many times using maximum-likelihood fitting.
In fact, statistical composites can be boiled down to the coefficients from a multiple regression:
(1) The maximum-likelihood fitting function chooses parameter estimates for each predictor that maximize the likelihood of observing the response;
(2) Those parameter values serve as the loadings for the indicators of the composite variable;
(3) The data for each indicator are multiplied by their loading and summed to generate the *factor scores* for the composite variable, which is then used in the structural model.
Let's demonstrate using some random data:
```{r}
set.seed(8)
y <- rnorm(50)
x1 <- rnorm(50)
x2 <- x1 + runif(50)
# run multiple regression
model <- lm(y ~ x1 + x2)
# get loadings
beta_x1 <- summary(model)$coefficients[2, 1]
beta_x2 <- summary(model)$coefficients[3, 1]
# compute factor scores
composite <- beta_x1 * x1 + beta_x2 * x2
```
These summed values can then used to predict the response:
```{r}
summary(lm(y ~ composite))
```
Note how the unstandardized coefficient is 1. This is because the composite is in units of the predicted values of the response. Thus, the coefficient is really only interpretable in standardized units.
Let's alternately fit this composite model with *lavaan* and fix the loadings of $x1$ and $x2$ to the values from the multiple regression:
```{r, message = FALSE, warning = FALSE}
library(lavaan)
comp_formula1 <- '
composite <~ -0.498 * x1 + 0.579 * x2
y ~ composite
'
comp_model1 <- sem(comp_formula1, data.frame(y, x1, x2))
summary(comp_model1, standardize = T)
```
We see from the output that the estimated loadings for our two indicators are the same values we provided, and consequently the understandardized coefficient is 1. However, the standardized coefficient is 0.177 and it is this value that we would present (although its non-significant, given that these are fake data).
Let's suppose we didn't know the loadings from the multiple regression. We run into the same issue of identifiability as when constructing latent variables, so we must fix the first loading to 1. This will also define the scale of the composite. NOTE: *lavaan* does not do this automatically (as it does for latents), so we will have to implement it manually.
```{r}
comp_formula2 <- '
composite <~ 1 * x1 + x2
y ~ composite
'
comp_model2 <- sem(comp_formula2, data.frame(y, x1, x2))
```
It seems that, because the true loading of $x1$ on the composite is far from 1 (we know it is actually -0.498), we have received a non-convergence error!
One solution is to set the other loading to 1:
```{r}
comp_formula3 <- '
composite <~ x1 + 1 * x2
y ~ composite
'
comp_model3 <- sem(comp_formula3, data.frame(y, x1, x2))
summary(comp_model3, standardize = T)
```
Here the model converges because the true loading (0.579) is close enough to 1 for the maximum-likelihood fitting function to find it within a certain number of iterations.
Note that the unstandardized coefficient is no longer 1: this is because the scale of the composite has been set to that of the second indicator.
For reasons of model convergence, it is generally recommended that one compute the loadings by hand and fix them in the model. This has an added benefit we will get to in a later section. But first let's explore a real-world example.
## Grace & Keeley Revisited: A Worked Example
Recall from the chapters on global and local estimation that Grace & Keeley (2006) were interested in the factors that mediated recovery of shrublands post-fire disturbance. In those chapters, we fit different sub-models of their larger model, and we'll fit a different sub-model yet again in this chapter for simplicity.
In their model, they used plant cover to predict plant species richness. Let's assume for a moment that the relationship between cover and richness may be non-linear: its not until a certain amount of cover that rarer species begin to appear, for example. In this case, we might suppose there are both linear $cover$ and non-linear $cover^2$ components to the model. Composite variables are a nice way to summarize both the linear and non-linear effects.
Let's fit the following composite variable:

Here we have a composite that summarizes the unsquared and squared values of cover, which then goes on to predict richness.
Let's adopt the two-step approach and first fit a linear model.
```{r, message = FALSE, warning = FALSE}
library(piecewiseSEM)
data(keeley)
cover_model <- lm(rich ~ cover + I(cover^2), keeley)
summary(cover_model)
```
It seems both the unsquared and squared values of cover significantly predict richness, so we are justified in including both as indicators to our composite variable. Now we extract the coefficients, use them to generate the factor scores, and finally use those scores to predict richness.
```{r}
beta_cover <- summary(cover_model)$coefficients[2, 1]
beta_cover2 <- summary(cover_model)$coefficients[3, 1]
composite <- beta_cover * keeley$cover + beta_cover2 * (keeley$cover)^2
summary(lm(rich ~ composite, data = data.frame(keeley, composite)))
```
As would be expected from the multiple regression, the composite term significantly predicts richness (*P* < 0.001). Let's use the `coefs` function from *piecewiseSEM* to obtain the standardized coefficient:
```{r}
coefs(lm(rich ~ composite, data = data.frame(keeley, composite)))
```
So a 1 standard deviation change in the total cover effect would result in a 0.40 standard deviation change in plant richness.
We can alternately fit the model with *lavaan* using the same coefficients from the multiple regression:
```{r, message = FALSE, warning = FALSE}
# create a new non-linear variable for cover^2
keeley$coversq <- keeley$cover^2
keeley_formula1 <- '
composite <~ 58 * cover + -28.578 * coversq
rich ~ composite
'
keeley_model1 <- sem(keeley_formula1, keeley, fixed.x = F)
summary(keeley_model1, standardize = T)
```
Which leads us to the same standardized coefficient (0.40) as through the manual calculation.
Finally, let's incorporate the effect of fire severity on cover and richness. Now the composite is endogenous because it is affected by fire severity, and goes on to predict richness.

First, we must fit the model without the composite to get the loadings of $cover$ and $cover^2$.
```{r, message = FALSE, warning = FALSE}
keeley_formula2 <- '
composite <~ 58 * cover + -28.578 * coversq
rich ~ composite + firesev
cover ~ firesev
cover ~~ coversq
firesev ~~ coversq
'
keeley_model2 <- sem(keeley_formula2, keeley)
summary(keeley_model2, standardize = T, rsq = T)
```
Note that the squared and unsquared terms for $cover$ have correlated errors, because they are both driven by the underlying values of cover. Also, because the squared term is not a 'true' variable in the model but a convenience for us to explore this non-linearity, we must treat $cover$ and $cover^2$ as exogenous even though they are part of an endogenous composite. *lavaan* automatically models correlations among exogenous variables, but will not do so for the composite indicators unless told explicitly. In this case, then, we must manually control for the correlation between $cover^2$ and $cover$ and $firesev$.
Here, we find a good-fitting model (*P* = 0.80). Moreover, we obtain the standardized coefficient for the effect of fire severity on cover $\gamma = -0.437$ and of the composite on richness $\beta = 0.292$ controlling for fire severity, which is the total non-linear effect of cover.
To otain the indirect effect then, we multiply these paths plus the standardized loading $cover$ on the composite: $-0.437 * 3.048 * 0.292 = -0.389$.
## Composites in *piecewiseSEM*
For the moment, composites are not directly implemented in *piecewiseSEM* with special syntax like in *lavaan*, but we hope to introduce that functionality soon. In the interim, they are easy to compute them by hand, as we have shown above, extract the predicted scores, and use them as any other predictor.
Let's examine this with the Keeley model as above:
```{r, message = FALSE, warning = FALSE}
keeley$composite <- composite
keeley_psem <- psem(
lm(cover ~ firesev, keeley),
lm(rich ~ composite + firesev, keeley)
)
summary(keeley_psem, .progressBar = FALSE)
```
Note that we get the same standardized coefficients as in *lavaan*! There is, however, deviation in the goodness-of-fit that are accounted for by differences in how the composite is constructed and the correlated errors among the indicator and $firesev$ which are not possible to model in *piecewiseSEM* yet.
## References
Grace, J. B., & Keeley, J. E. (2006). A structural equation model analysis of postfire plant diversity in California shrublands. Ecological Applications, 16(2), 503-514.