forked from frec-5174/eco4cast-in-R-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathparameter-calibration1.qmd
548 lines (403 loc) · 22 KB
/
parameter-calibration1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
# Parameter calibration: likelihood methods {#sec-likelihood}
## Intro to probability and likelihood
```{r}
#| echo: FALSE
#| message: FALSE
library(tidyverse)
```
A random variable is a quantity that can take on values due to chance - it does not have a single value but instead can take on a range of values. The chance of each value is governed by a probability distribution
For example, X is a random variable. A particular value for X is y_i
```{r}
y_i <- -0.2513233
```
A random number is associated with a probability of it taking that value
P(X = y_i) = p_i
`p_i` is defined from a probability distribution (P). We may have an idea of the shape (i.e., normally distributed), but not the specific distribution (i.e. a normal with a mean = 0, and sd = 1). For example, @fig-simple-rnorm is a normal distribution with a mean = 0 and sd = 1. (I created it by randomly drawing 10000 samples from a normal distribution using the `rnorm` function)
```{r}
#| fig-cap: Normal distribution with mean = 0 and sd = 1
#| label: fig-simple-rnorm
plot(density(rnorm(10000, mean = 0, sd = 1)), xlab = "y", main = "mean = 0, sd = 1")
```
For our random variable, if is is governed by the distribution above, a single sample (i.e., a measurement of X) will likely be close to 0 but sometimes it will be quite far away.
Imagine that we are measuring the same thing with the same sensor but the sensor has error that is normally distributed. We think that the measurement should be zero (mean = 0) but with the sensor uncertainty (sd = 1).
The question is "What is the probability that `X = y_i` given a normal distribution with mean = 0 and sd = 1?". In @fig-simple-rnorm-obs, the red line is a hypothetical observation (y_i).
```{r}
#| fig-cap: Normal distribution with mean = 0 and sd = 1. The red line is an 'observation'
#| label: fig-simple-rnorm-obs
plot(density(rnorm(10000, mean = 0, sd = 1)), xlab = "y", main = "mean = 0, sd = 1")
abline(v = y_i, col = "red")
```
We could also ask "What is the probability that `X = y_i` given a normal distribution with mean = 2 and sd = 1?" In @fig-simple-rnorm-obs2 you can see that the observation is at a lower probability in the distribution.
```{r}
#| fig-cap: Normal distribution with mean = 2 and sd = 1. The red line is an 'observation'
#| label: fig-simple-rnorm-obs2
plot(density(rnorm(10000, mean = 2, sd = 1)), xlab = "y", main = "mean = 2, sd = 1")
abline(v = y_i, col = "red")
```
We can numerically compare the answer to the two questions using the "Probability Density Function". In R the PDF is represented by the `dnorm` function (d representing "density"). The dnorm function allows us to answer the question: Which mean is more likely given the data point?
```{r}
message("mean of 0")
dnorm(x = y_i, mean = 0, sd = 1)
message("mean of 2")
dnorm(x = y_i, mean = 2, sd = 1)
```
Now, we can collect more data that represent more samples from the random variable with an unknown mean. Here are two data points.
```{r}
y <- c(-0.2513233, -2.662035)
```
What is the probability of both events occurring given the mean of 0? They are independent so we can multiply them.
**IMPORTANT:** If two random events are independent, then the probability that they both occur is the multiplication of the two individual probabilities. For example the probability of getting two heads in two coin flips is 0.5 \* 0.5 = 0.25. If the events are not independent then we can't just multiply (we will get to this later).
The two calls to the `dnorm` function are the evaluation of the two data points.
```{r}
dnorm(x = y[1], mean = 0, sd = 1, log = FALSE) * dnorm(x = y[2], mean = 0, sd = 1, log = FALSE)
```
We can repeate the calculate but with a different mean to ask: What about with a mean of 2?
```{r}
dnorm(x = y[1], mean = 2, sd = 1, log = FALSE) * dnorm(x = y[2], mean = 2, sd = 1, log = FALSE)
```
Using the values above, which mean (mean = 0 or mean = 2) is more likely given the two data points?
IMPORTANT: Since multiplying a lot of small numbers can result in very small numbers, we add the log of the densities. This is shown below by using `log = TRUE`.
```{r}
message("log likelihood: mean of 0")
dnorm(x = y[1], mean = 0, sd = 1, log = TRUE) + dnorm(x = y[2], mean = 0, sd = 1, log = TRUE)
message("log likelihood: mean of 2")
dnorm(x = y[1], mean = 2, sd = 1, log = TRUE) + dnorm(x = y[2], mean = 2, sd = 1, log = TRUE)
```
This logically extends to three data points or more. We will get two log likelihood values because we are examining two "models" with three data points. In the code below, what are the two "models"?
```{r}
y <- c(-0.2513233, -2.662035, 4.060981)
LL1 <- dnorm(x = y[1], mean = 0, sd = 1, log = TRUE) +
dnorm(x = y[2], mean = 0, sd = 1, log = TRUE) +
dnorm(x = y[3], mean = 0, sd = 1, log = TRUE)
LL2 <- dnorm(x = y[1], mean = 2, sd = 1, log = TRUE) +
dnorm(x = y[2], mean = 2, sd = 1, log = TRUE) +
dnorm(x = y[3], mean = 2, sd = 1, log = TRUE)
message("log likelihood: mean = 0")
LL1
message("log likelihood: mean = 2")
LL2
```
The separate calls to `dnorm` can be combined into one call by using the vectoration capacities of R. If a vector of obsevation is passed to the function, it will return a vector that was evaluated as each value of the vector. Then the vector can be used in a call to the `sum` function to get a single value.
```{r}
message("log likelihood: mean = 0")
sum(dnorm(x = y, mean = 0, sd = 1, log = TRUE))
message("log likelihood: mean = 2")
sum(dnorm(x = y, mean = 2, sd = 1, log = TRUE))
```
The data (i.e., the three values in y) are more likely for a PDF with a mean = 0 than a mean = 2. Clearly we can compare two different guesses for the mean (0 vs. 2) but how do we find the most likely value for the mean?
First, lets create a vector of different mean values we want to test
```{r}
#Values for mean that we are going to explore
mean_values <- seq(from = -4, to = 7, by = 0.1)
mean_values
```
Now we will calculate the log likelihood of the data for each potential value of the mean. The code below loops over the different values of the mean that we want to test.
```{r}
LL <- rep(NA, length(mean_values))
for(i in 1:length(mean_values)){
LL[i] <- sum(dnorm(y, mean = mean_values[i], sd = 1, log = TRUE))
}
LL
```
The relationship between the different mean values and the likelihood is a curve (@fig-ll1)
IMPORTANT: This curve may look like a probability density function (PDF) but it is not! Why? Because the area under the curve does not sum to 1. As a result 1) we use the term likelihood rather than density and 2) while the curve gives us the most likelihood estimate for the parameter, it does **not** give us the PDF of the parameter (even if we really really want the PDF, Bayesian statistics are needed for that)
```{r}
#| fig-cap: The likelihood curve over different values of the mean
#| label: fig-ll1
plot(x = mean_values, y = LL, type = "o", xlab = "mean", ylab = "log(likelihood)")
```
The most likely value is the highest value on the curve. Which is:
```{r}
index_of_max <- which.max(LL)
mean_values[index_of_max]
```
that compares well to directly calculating the mean
```{r}
mean(y)
```
We just calculated the **likelihood of the data GIVEN our model** where our model was that "the data are normally distributed with a specific mean". We found the most likely value for a parameter in the model (the parameter being the mean of the PDF).
For use in R tools that find the minimum of a function, you can find the minimum of the negative log likelihood curve (@fig-ll2)
```{r}
#| fig-cap: The likelihood curve over different values of the mean with minimium shown
#| label: fig-ll2
plot(x = mean_values, y = -LL, type = "o", xlab = "mean", ylab = "-1 * log(likelihood)") #Notice negative sign
abline(v = mean(y), col = "red")
```
To use optimizer function in R, we first need to create a function that defines our likelihood (this is the same as above with the mean being \`par\[1\]\`). The likelihood is also called the cost function where worse values for the parameter have worse cost scores.
```{r}
LL_fn <- function(y, par){
#NEED NEGATIVE BECAUSE OPTIM MINIMIZES
-sum(dnorm(y, mean = par[1], sd = 1, log = TRUE))
}
```
This likelihood function is then used in an optimization function that finds the minimum (don't worry about the details)
```{r}
#| warning: FALSE
#@par are the starting values for the search
#@fn is the function that is minimized (find parameters that yield the lowest -log(LL))
#@ anything else like x are arguments needed by the LL_fn
fit <- optim(par = 4, fn = LL_fn, method = "BFGS", y = y)
fit
```
The optimal value from fit is similar to our manual calculation and equal to the actual mean.
```{r}
message("From optim")
fit$par
message("From manual calcuation")
mean_values[which.max(LL)]
message("From the mean function")
mean(y)
```
Sweet it works! I know you are thinking "Wow a complicated way to calculate a mean...great thanks!". But the power is that we can replace the direct estimation of the mean with a "process" model that predicts the mean.
Here is a simple model that predicts y using a regression `y = par[1] + par[2] * x`
First we are going to create some simulated data from the regression and add normally distributed noise.
```{r}
set.seed(100)
par_true <- c(3, 1.2) #Slope and intercept
sd_true <- 0.9 #noise term
x <- runif(100, 3, 10)
y_true <- par_true[1] + par_true[2] * x
y <- rnorm(length(y_true), mean = y_true, sd = sd_true) #This is the "fake data"
```
The fake data are shown in @fig-fake-data1.
```{r}
#| fig-cap: The "fake" data used in the following analysis
#| label: fig-fake-data1
plot(x, y)
```
Now we need a function to make the regression calculation from the parameters and inputs (x)
```{r}
pred_linear <- function(x, par){
y <- par[1] + par[2] * x
}
```
Now our likelihood function uses the result of `pred_linear` as the mean of the PDF. This asks how likely each data point is given the parameters and inputs (x). Unlike our example above where mean is constant, the mean is allowed to be different for each data point because the value of the input (x) is different for each point.
```{r}
LL_fn <- function(par, x, y){
#NEED NEGATIVE BECAUSE OPTIM MINIMIZES
-sum(dnorm(y, mean = pred_linear(x, par), sd = 1, log = TRUE))
}
```
As an example, for a value of par\[1\] = 0.2, and par\[2\] = 1.4 we can calculate the likelihood for the first two data points with the following
```{r}
dnorm(y[1], mean = pred_linear(x[1], c(0.2, 1.4)), sd = 1, log = TRUE) + dnorm(y[2], mean = pred_linear(x[2], c(0.2, 1.4)), sd = 1, log = TRUE) # + ...
```
Now use `optim` to solve
```{r}
#| warning: FALSE
pred_linear <- function(x, par){
y <- par[1] + par[2] * x
}
LL_fn <- function(par, x, y){
#NEED NEGATIVE BECAUSE OPTIM MINIMIZES
-sum(dnorm(y, mean = pred_linear(x, par), sd = 1, log = TRUE))
}
fit <- optim(par = c(0.2, 1.5), fn = LL_fn, method = "BFGS", x = x, y = y)
```
How do the estimated parameters compare to the true values?
```{r}
message("From optim")
fit$par
message("True values (those we used to generate the data)")
par_true
```
A key take home is that defining the likelihood function requires two steps
1) Write down your *process* model. This is a deterministic model (meaning that it doesn't have any random component)
```{r}
#Process model
pred_linear <- function(x, par){
y <- par[1] + par[2] * x
}
```
2) Write down your *probability* model. You can think of this as your *data* model because it represents how you think the underlying process is converted into data through the random processes of data collection, sampling, sensor uncertainty, etc. In R this will use a "d" function: `dnorm`, `dexp`, `dlnorm`, `dgamma`, etc.
```{r}
#Probability model
LL_fn <- function(par, x, y){
-sum(dnorm(y, mean = pred_linear(x, par), sd = 1, log = TRUE))
}
```
There is one more assumption in our likelihood calculation that we have not addressed: we are assuming the standard deviation in the data is known (sd = 1). In reality we don't know what the sd is. This assumption is simple to fix. All we need to do is make the sd a parameter that we estimate.
Now there is a par\[3\] in in the likelihood function (but it isn't in the process model function)
```{r}
#Probability model
LL_fn <- function(par, x, y){
-sum(dnorm(y, mean = pred_linear(x, par), sd = par[3], log = TRUE))
}
```
Fitting the sd is not necessarily clear. Think about it this way: if the sd is too low, there will be very low likelihood for data near the mean. In this case, you can increase the likelihood just by increasing the sd parameter (not changing other parameters).
How we can estimate all three parameters
```{r}
#| warning: FALSE
fit <- optim(par = c(0.2, 1.5, 2), fn = LL_fn, method = "BFGS", x = x, y = y)
```
How well to they compare to the true values used to generate the simulated data?
```{r}
message("From optim")
fit$par
message("True values (those we used to generate the data)")
c(par_true, sd_true)
```
Sweet it works! I know you are thinking "Wow a complicated way to calculate a linear regression...thanks...I can just use the lm() function in R to do this!".
```{r}
fit2 <- lm(y ~ x)
summary(fit2)
```
Notice how the "Residual standard error" is similar to our sd?
But the power is that we are not restricted to the assumptions of a linear regression model. We can use more a complicated and non-linear "process" models (not restricted to y = mx + b) and we can use more complicated "data" models than the normal distribution with constant variance (i.e., single value for sd that is used for all data point)
For example, we can simulate new data that assumes the standard deviation increases as the prediction increases (larger predictions have more variation). This might be due to sensor error increasing with the size of measurement. In this case we multiply y times a scalar (sd_scalar_true)
```{r}
set.seed(100)
x <- runif(100, 3, 10)
par_true <- c(3, 1.2)
sd_scalar_true <- 0.1
y_true <- par_true[1] + par_true[2] * x
y <- rnorm(length(y_true), mean = y_true, sd = sd_scalar_true * y) #This is the "fake data"
```
The fake data are shown in @fig-fake-data2.
```{r}
#| fig-cap: The "fake" data used in the following analysis
#| label: fig-fake-data2
plot(x, y)
```
We use the same "process model" because we have not changed how we think the underlying process works
```{r}
#Process model
pred_linear <- function(x, par){
y <- par[1] + par[2] * x
}
```
However, we do have to change our "data" or "probability" model. See how `sd` in the `dnorm` is now `sd = par[3] * pred_linear(x, par)` so that the sd increases with the size of the prediction
```{r}
#Probability model
LL_fn <- function(par, x, y){
-sum(dnorm(y, mean = pred_linear(x, par), sd = par[3] * pred_linear(x, par), log = TRUE))
}
```
We can now fit it
```{r}
#| warning: FALSE
fit <- optim(par = c(0.2, 1.5, 1), fn = LL_fn, method = "BFGS", x = x, y = y)
fit$par
```
As a result, we now better represent how data is generated from the underlying process. This will often change and improve parameter estimates for the process model.
As another example, you can easily change the process model to be non-linear. For example, we will generate simulated data from a saturating function (a Michaelis-Menten function)
```{r}
set.seed(100)
x <- runif(100, 0.1, 10)
par_true <- c(2, 0.5)
sd_true <- 0.1
y <- par_true[1] * (x / (x + par_true[2]))
y <- rnorm(length(y), mean = y, sd = sd_true)
```
The fake data are shown in @fig-fake-data3.
```{r}
#| fig-cap: The "fake" data used in the following analysis
#| label: fig-fake-data3
plot(x, y)
```
We certainly don't want to fit a linear regression to these data (though we could but it would not fit well). Fortunately, fitting it using likelihood is as simple as changing the *process* model. Here is our new process model.
```{r}
#Process model
pred_mm <- function(x, par){
y <- par[1] * (x / (par[2] + x))
}
```
The "data" or "probability" model stays the same
```{r}
#Probability model
LL_fn <- function(par, x, y){
-sum(dnorm(y, mean = pred_mm(x, par), sd = par[3], log = TRUE))
}
```
And we fit it
```{r}
#| warning: FALSE
fit <- optim(par = c(0.2, 1.5, 1), fn = LL_fn, method = "BFGS", x = x, y = y)
```
And it fits well
```{r}
fit$par
c(par_true, sd_true)
```
@fig-data-curve shows the curve from the optimized parameters with the observations
```{r}
#| fig-cap: the curve from the optimized parameters with the observations
#| label: fig-data-curve
plot(x, y)
curve(pred_mm(x, par = fit$par), from = 0.1, to = 10, add = TRUE)
```
Finally, we don't have to assume that the *data* model is a normal distribution. Here is an example assuming that the data model follows a gamma distribution (@fig-data-curve-gamma). Learn more about the gamma [here](https://en.wikipedia.org/wiki/Gamma_distribution)
```{r}
#| warning: FALSE
LL_fn <- function(par, x, y){
-sum(dgamma(y, shape = (pred_mm(x, par)^2/par[3]), rate = pred_mm(x, par)/par[3], log = TRUE))
}
fit <- optim(par = c(0.2, 1.5, 1), fn = LL_fn, x = x, y = y)
```
```{r}
#| echo: false
#| fig-cap: the curve from the optimized parameters with the observations for a gamma ditribution data model.
#| label: fig-data-curve-gamma
plot(x, y)
curve(pred_mm(x, par = fit$par), from = 0.1, to = 10, add = TRUE)
```
### Take home messages
1) Think about what process produces your data. Consider deterministic mechanisms and random processes.\
2) Write your process model as a function of inputs and parameters. Your process model should be deterministic (not have any random component). The process model can be a simple empirical model or a process model like @sec-process-model.
3) Write down your data model. What random processes converts the output of your process model to data? This could include variance in observations, variance in the process models (it doesn't represent the real world perfectly even if we could observe it perfectly), and variation among sample units (like individuals). In a likelihood analysis, all of these forms are typically combined together.
4) Embed your process model as the mean for your data model and solve with an optimizing algorithm. In the normal distribution, the mean is one of the parameters of the PDF so you can use the process model prediction directory in the data model as the mean. In other distributions the parameters of the PDF are not the mean, instead the mean is a function of the parameters. Therefore you will need to convert the mean to the parameters of the distribution. This is called "moment matching."
### Brief introduction to other components of a likelihood analysis
- Estimating uncertainty in the parameters using the shape of the likelihood curve (you can use the shape of the likelihood curve to estimate confidence (steeper = more confidence). In the @fig-ll-unc, the blue line has more data in the calculation of the mean than the black line. The dashed lines are the 2-unit likelihood level (2 log(LL) units were subtracted from the highest likelihood to determine the dashed line). The values of the dashed lines where they cross the likelihood curve with the same color are approximately the 95% confidence intervals for the parameter.
```{r}
#Create a dataset
y_more_data <- rnorm(20, mean = 0, sd = 1)
#Select only 5 of the values from the larger dataset
y_small_data <- y_more_data[1:5]
#Calculate the likelihood surface for each of the datasets
mean_values <- seq(from = -4, to = 7, by = 0.1)
LL_more <- rep(NA, length(mean_values))
for(i in 1:length(mean_values)){
LL_more[i] <- sum(dnorm(y_more_data, mean = mean_values[i], sd = 1, log = TRUE))
}
LL_small <- rep(NA, length(mean_values))
for(i in 1:length(mean_values)){
LL_small[i] <- sum(dnorm(y_small_data, mean = mean_values[i], sd = 1, log = TRUE))
}
#Find the highest likelihood
mle_LL_small <- LL_small[which.max(LL_small)]
mle_LL_more <- LL_more[which.max(LL_more)]
```
```{r}
#| echo: false
#| fig-cap: Likilihood curve with the two-unit support intervals shown
#| label: fig-ll-unc
#Plot the likelihood functions
plot(x = mean_values, y = LL_small, type = "l", xlab = "mean", ylab = "log-likelihood") #Notice negative sign
lines(x = mean_values, y = LL_more, col = "blue")
#Plot a line that shows a 2 likelihood reduction
abline(h = mle_LL_small - 2, lty = "dashed")
abline(h = mle_LL_more - 2, lty = "dashed", col = "blue")
```
- Comparing different models to select for the best model (uses the likelihood of the different models + a penalty for having more parameters). AIC is a common way to compare models (https://en.wikipedia.org/wiki/Akaike_information_criterion). It is 2 \* number of parameters - 2 \* ln (likelihood). Increasing the likelihood by adding more parameters won't necessarily result in lowest AIC (lower is better) because it includes a penalty for having more parameters.
- There are more complete packages for doing likelihood analysis that optimize parameters, calculate AIC, estimate uncertainty, etc. For example, the `likelihood` package and `mle` function in R do these calculations for you. Here is an example using the `mle` package (note that the `mle` package assumes that you have created the objects x and y and only has parameters in the function calls...see how pred_mm and LL_fn don't have x and y as arguments)
```{r}
#| warning: false
pred_mm <- function(par1, par2){
par1 * (x / (par2 + x))
}
LL_fn <- function(par1, par2, par3){
mu <- pred_mm(par1, par2)
-sum(dnorm(y, mean = mu, sd = par3, log = TRUE))
}
library(stats4)
fit <- mle(minuslogl = LL_fn, start = list(par1 = 0.2, par2 = 1.5, par3 = 1), method = "BFGS", nobs = length(y))
AIC(fit)
summary(fit)
logLik(fit)
confint(fit)
```
### Key point
Likelihood approach shown here gives us the *likelihood of the particular set of data GIVEN the model*. It does **not** give us the probability of the model given the particular set of data (i.e., the probability distributions of the model parameters). Ultimately, we want the probability distributions of the model parameters so that we can turn it around forecast the probability of yet to be collected data. For this we need to use Bayesian statistics in @sec-bayes!
## Reading
## Problem set
The problem set is located in `likelihood_problem_set.qmd` in https://github.com/frec-5174/book-problem-sets. You can fork the repository or copy the [code](https://github.com/frec-5174/book-problem-sets/blob/main/likelihood_problem_set.qmd) into your own quarto document