diff --git a/2.08-model_checking.Rmd b/2.08-model_checking.Rmd
index a293074..508b7ba 100644
--- a/2.08-model_checking.Rmd
+++ b/2.08-model_checking.Rmd
@@ -23,7 +23,7 @@ We use an analysis of the whitethroat breeding density in wildflower fields of d
Because the Stan developers have written highly convenient user friendly functions to do posterior predictive model checks, we fit the model with Stan using the function `stan_glmer` from the package `rstanarm`.
-```{r}
+```{r, message=FALSE, results="hide"}
data("wildflowerfields")
dat <- wildflowerfields
dat$size.ha <- dat$size/100 # change unit to ha
@@ -39,3 +39,85 @@ mod <- stan_glmer(bp ~ year.z + age.l + age.q + age.c + size.z +
(1|field) + offset(log(size.ha)), family=poisson, data=dat)
```
+
+The R package `shinystan` [@StanDevelopmentTeam.2017b] provides an easy way to do model checking, so there is no excuse for skipping posterior predictive model checks. The R code `launch_shinystan(mod)` opens an HTML page that contains all kinds of model diagnostics. Besides many statistics and diagnostic plots that assess how well the MCMC worked, we also find a menu "PPcheck". There, we can click through many of the plots that we produce in R below.
+
+The function `posterior_predict` simulates many different data sets from a model fit: exactly as many as there are draws from the posterior distribution of the model parameters, thus 4000 if the default number of iterations has been used in Stan. Specifically, for each set of parameter values drawn from the joint posterior distribution, it simulates one replicated data set. The result is a matrix with one row per replicated data set and one column per observation. We can look at histograms of the observed and the replicated data (Figure \@ref(fig:histpp)). The observed data (`bp`) look similar to the replicated data.
+
+```{r histpp, fig.cap="Histograms of 8 out of 4000 replicated data sets and of the observed data (dat$bp). The arguments breaks and ylim have been used in the function hist to produce the same scale of the x- and y-axis in all plots. This makes comparison among the plots easier."}
+set.seed(2352) # to make sure that the ylim and breaks of the histograms below can be used
+yrep <- posterior_predict(mod)
+par(mfrow=c(3,3), mar=c(2,1,2,1))
+for(i in 1:8) hist(yrep[i,], col="blue",
+ breaks=seq(-0.5, 18.5, by=1), ylim=c(0,85))
+hist(dat$bp, col="blue",
+ breaks=seq(-0.5, 18.5, by=1), ylim=c(0,85))
+```
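+
+To make the layout of such a matrix of replicated data concrete, here is a minimal base-R sketch of what a posterior predictive simulation does for a Poisson model. It does not use the whitethroat model: the objects `lambda_draws` and `yrep_sim` and all numbers are invented for illustration, and the "posterior draws" are simply random values around a fixed mean.
+
+```{r}
+set.seed(1)
+n_obs <- 50    # number of observations (hypothetical)
+n_draws <- 200 # number of posterior draws (hypothetical)
+# stand-in for posterior draws of the expected count of each observation
+lambda_draws <- matrix(exp(rnorm(n_draws * n_obs, mean = log(2), sd = 0.1)),
+                       nrow = n_draws, ncol = n_obs)
+# one replicated data set (row) per posterior draw
+yrep_sim <- matrix(rpois(n_draws * n_obs, lambda = lambda_draws),
+                   nrow = n_draws, ncol = n_obs)
+dim(yrep_sim) # rows: replicated data sets, columns: observations
+```
+
+Each row of `yrep_sim` plays the role of one replicated data set, so summaries per replicated data set are computed row-wise.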
+
+Let's look at specific aspects of the data. The proportion of zero counts could be a sensitive test statistic for this data set. First, we define a function `propzeros` that extracts the proportion of zero counts from a vector of count data. Then we apply this function to the observed data and to each of the 4000 replicated data sets. Finally, we extract the 1% and 99% quantiles of the proportion of zero values in the replicated data.
+
+```{r}
+propzeros <- function(x) sum(x==0)/length(x)
+propzeros(dat$bp) # prop. zero values in observed data
+
+pzeroyrep <- apply(yrep, 1, propzeros) # prop. zero values in each replicated data set (rows of yrep)
+quantile(pzeroyrep, prob=c(0.01, 0.99))
+```
+
+The observed data contain `r round(propzeros(dat$bp), 2)*100`% zero values, which is well within the 98%-range of what the model predicted (`r round(quantile(pzeroyrep, prob=c(0.01)), 2)*100` - `r round(quantile(pzeroyrep, prob=c(0.99)), 2)*100`%). The Bayesian p-value is `r round(mean(pzeroyrep>=propzeros(dat$bp)),2)`.
+
+```{r}
+mean(pzeroyrep>=propzeros(dat$bp))
+```
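+
+The same recipe works for any test statistic: compute it once for the observed data and once for each replicated data set, then take the proportion of replicated values that are at least as large as the observed one. The following self-contained sketch uses simulated toy data and a hypothetical helper `bayes_p`; neither is part of `rstanarm`.
+
+```{r}
+bayes_p <- function(y, yrep, stat) {
+  t_obs <- stat(y)              # test statistic of the observed data
+  t_rep <- apply(yrep, 1, stat) # one value per replicated data set (row)
+  mean(t_rep >= t_obs)          # proportion of replicates at least as extreme
+}
+set.seed(2)
+y_toy <- rpois(40, lambda = 3) # toy "observed" data
+yrep_toy <- matrix(rpois(500 * 40, lambda = 3), nrow = 500) # toy replicated data
+p_max <- bayes_p(y_toy, yrep_toy, max)
+p_max
+```
+
+Passing the statistic as a function keeps the comparison identical for the observed and the replicated data, which is the point of the check.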
+
+What about the upper tail of the data? Let’s look at the 90% quantile.
+
+```{r}
+quantile(dat$bp, prob=0.9) # for observed data
+
+q90yrep <- apply(yrep, 1, quantile, prob=0.9) # for each replicated data set (rows of yrep)
+table(q90yrep)
+```
+
+The 90% quantile of the observed data also lies within the range that the model predicts.
+
+We can also look at the spatial distribution of the data and the replicated data. The variables X and Y are the coordinates of the wildflower fields. We can use them to draw transparent gray dots sized according to the number of breeding pairs.
+
+```{r spatpp, fig.cap="Spatial distribution of the whitethroat breeding pair counts and of 8 randomly chosen replicated data sets simulated from the model. The smallest dots correspond to a count of 0, the largest to a count of 20 breeding pairs. The panel in the upper left corner shows the observed data; the other panels show replicated data from the model."}
+par(mfrow=c(3,3), mar=c(1,1,1,1))
+plot(dat$X, dat$Y, pch=16, cex=dat$bp+0.2, col=rgb(0,0,0,0.5), axes=FALSE)
+box()
+r <- sample(nrow(yrep), 8) # draw 8 replicated data sets at random
+for(i in r){
+plot(dat$X, dat$Y, pch=16, cex=yrep[i,]+0.2,
+col=rgb(0,0,0,0.5), axes=FALSE)
+box()
+}
+```
+
+At first glance, the spatial distribution of the replicated data sets seems similar to the observed one (Figure \@ref(fig:spatpp)). On closer inspection, we may detect that, in the middle of the study area, the model predicts slightly larger numbers than observed. This pattern may motivate us to find the reason for the imperfect fit if the main interest is whitethroat density estimates. Are there important elements in the landscape that influence whitethroat densities and that we have not yet taken into account in the model? However, our main interest is finding the optimal age of wildflower fields for the whitethroat. Therefore, we look at the mean age of the 10% of the fields with the highest breeding densities.
+To do so, we first define a function that extracts the mean field age of the 10% largest whitethroat density values, and then we apply this function to the observed data and to the 4000 replicated data sets.
+
+```{r}
+magehighest <- function(x) {
+q90 <- quantile(x/dat$size.ha, prob=0.90)
+index <- (x/dat$size.ha)>=q90
+mage <- mean(dat$age[index])
+return(mage)
+}
+
+magehighest(dat$bp)
+
+mageyrep <- apply(yrep, 1, magehighest)
+quantile(mageyrep, prob=c(0.01, 0.5,0.99))
+```
+
+The mean age of the 10% of the fields with the highest whitethroat densities is `r magehighest(dat$bp)` years in the observed data set. In the replicated data sets, it lies between `r round(quantile(mageyrep, prob=0.01),2)` and `r round(quantile(mageyrep, prob=0.99),2)` years (1% and 99% quantiles). The Bayesian p-value is `r round(mean(mageyrep>=magehighest(dat$bp)),2)`. Thus, in around `r round(mean(mageyrep>=magehighest(dat$bp)),2)*100`% of the replicated data sets, the mean age of the 10% of the fields with the highest whitethroat densities was at least as high as the observed one (Figure \@ref(fig:agepp)).
+
+```{r agepp, fig.cap="Histogram of the average age of the 10% wildflower fields with the highest breeding densities in the replicated data sets. The orange line indicates the average age for the 10% fields with the highest observed whitethroat densities."}
+
+hist(mageyrep)
+abline(v=magehighest(dat$bp), col="orange", lwd=2)
+```
+
+In a publication, we could summarize the results of the posterior predictive model checking in a table or give the plots in an appendix. Here, we conclude that the model fits well in the most important aspects. However, the model may predict whitethroat densities that are too high in the central part of the study area.
\ No newline at end of file
diff --git a/docs/1.3-distributions_files/figure-html/unnamed-chunk-1-1.png b/docs/1.3-distributions_files/figure-html/unnamed-chunk-1-1.png
index 945c073..2159d30 100644
Binary files a/docs/1.3-distributions_files/figure-html/unnamed-chunk-1-1.png and b/docs/1.3-distributions_files/figure-html/unnamed-chunk-1-1.png differ
diff --git a/docs/1.4-additional_basic_material.md b/docs/1.4-additional_basic_material.md
deleted file mode 100644
index 3d44feb..0000000
--- a/docs/1.4-additional_basic_material.md
+++ /dev/null
@@ -1,295 +0,0 @@
-
-# Additional basic material {#addbasics}
-
-THIS CHAPTER IS UNDER CONSTRUCTION!!!
-
-## Correlations among categorical variables
-
-### Chisquare test
-
-When testing for correlations between two categorical variables, then the nullhypothesis is "there is no correlation". The data can be displayed in cross-tables.
-
-
-```r
-# Example: correlation between birthday preference and car ownership
-load("RData/datacourse.RData")
-table(dat$birthday, dat$car)
-```
-
-```
-##
-## N Y
-## flowers 6 1
-## wine 9 6
-```
-
-Given the nullhypothesis was true, we expect that the distribution of the data in each column of the cross-table is similar to the distribution of the row-sums. And, the distribution of the data in each row should be similar to the distribution of the column-sums. The chisquare test statistics $\chi^2$ measures the deviation of the data from this expected distribution of the data in the cross-table.
-
-For calculating the chisquare test statistics $\chi^2$, we first have to obtain for each cell in the cross-table the expected value $E_{ij}$ = rowsum*colsum/total.
-
-$\chi^2$ measures the difference between the observed $O_{ij}$ and expected $E_{ij}$ values as:
-$\chi^2=\sum_{i=1}^{m}\sum_{j=1}^{k}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$ where $m$ is the number of rows and $k$ is the number of columns.
-The $\chi^2$-distribution has 1 parameter, the degrees of freedom $v$ = $(m-1)(k-1)$.
-
-
-
-
-(\#fig:chisqdist)Two examples of Chisquare distributions.
-
-
-R is calculating the $\chi^2$ value for specific cross-tables, and it is also giving the p-values, i.e., the probability of obtaining the observed or a higher $\chi^2$ value given the nullhypothesis was true by comparing the observed $\chi^2$ with the corresponding chisquare distribution.
-
-
-```r
-chisq.test(table(dat$birthday, dat$car))
-```
-
-```
-##
-## Pearson's Chi-squared test with Yates' continuity correction
-##
-## data: table(dat$birthday, dat$car)
-## X-squared = 0.51084, df = 1, p-value = 0.4748
-```
-
-The warning (that is suppressed in the rmarkdown version, but that you will see if you run the code on your own computer) is given, because in our example some cells have counts less than 5. In such cases, the Fisher's exact test should be preferred. This test calculates the p-value analytically using probability theory, whereas the chisquare test relies on the assumption that the $\chi^2$ value follows a chisquare distribution. The latter assumption holds better for larger sample sizes.
-
-
-```r
-fisher.test(table(dat$birthday, dat$car))
-```
-
-```
-##
-## Fisher's Exact Test for Count Data
-##
-## data: table(dat$birthday, dat$car)
-## p-value = 0.3501
-## alternative hypothesis: true odds ratio is not equal to 1
-## 95 percent confidence interval:
-## 0.3153576 213.8457248
-## sample estimates:
-## odds ratio
-## 3.778328
-```
-
-
-### Correlations among categorical variables using Bayesian methods
-
-For a Bayesian analysis of cross-table data, a data model has to be found. There are several possibilities that could be used:
-
-* a so-called log-linear model (Poisson model) for the counts in each cell of the cross-table.
-* a binomial or a multinomial model for obtaining estimates of the proportions of data in each cell
-
-These models provide possibilities to explore the patterns in the data in more details than a chisquare test.
-
-
-```r
-# We arrange the data into a cross-table in a data-frame
-# format. That is, the counts in each cell of the
-# cross-table become a variable and the row and column names
-# are also given in separate variables
-datagg <- aggregate(dat$name_fictive, list(birthday=dat$birthday, car=dat$car),
- length, drop=FALSE)
-datagg$x[is.na(datagg$x)] <- 0
-names(datagg) <- c("birthday", "car", "count")
-datagg
-```
-
-```
-## birthday car count
-## 1 flowers N 6
-## 2 wine N 9
-## 3 flowers Y 1
-## 4 wine Y 6
-```
-
-
-
-```r
-# log-linear model
-library(arm)
-nsim <- 5000
-
-mod <- glm(count~birthday+car + birthday:car,
- data=datagg, family=poisson)
-bsim <- sim(mod, n.sim=nsim)
-round(t(apply(bsim@coef, 2, quantile,
- prob=c(0.025, 0.5, 0.975))),2)
-```
-
-```
-## 2.5% 50% 97.5%
-## (Intercept) 1.00 1.79 2.58
-## birthdaywine -0.64 0.41 1.48
-## carY -3.94 -1.79 0.29
-## birthdaywine:carY -0.94 1.41 3.76
-```
-
-The interaction parameter measures the strength of the correlation. To quantitatively understand what a parameter value of 1.39 means, we have to look at the interpretation of all parameter values. We do that here quickly without a thorough explanation, because we already explained the Poisson model in chapter 8 of [@KornerNievergelt2015].
-
-The intercept 1.79 corresponds to the logarithm of the count in the cell "flowers" and "N" (number of students who prefer flowers as a birthday present and who do not have a car), i.e., $exp(\beta_0)$ = 6. The exponent of the second parameter corresponds to the multiplicative difference between the counts in the cells "flowers and N" and "wine and N", i.e., count in the cell "wine and N" = $exp(\beta_0)exp(\beta_1)$ = exp(1.79)exp(0.41) = 9. The third parameter measures the multiplicative difference in the counts between the cells "flowers and N" and "flowers and Y", i.e., count in the cell "flowers and Y" = $exp(\beta_0)exp(\beta_2)$ = exp(1.79)exp(-1.79) = 1. Thus, the third parameter is the difference in the logarithm of the counts between the car owners and the car-free students for those who prefer flowers. The interaction parameter is the difference of this difference between the students who prefer wine and those who prefer flowers. This is difficult to intuitively understand. Here is another try to formulate it: The interaction parameter measures the difference in the logarithm of the counts in the cross-table between the row-differences between the columns. Maybe it becomes clear, when we extract the count in the cell "wine and Y" from the model parameters: $exp(\beta_0)exp(\beta_1)exp(\beta_2)exp(\beta_3)$ = exp(1.79)exp(0.41)exp(-1.79)exp(1.39) = 6.
-
-
-Alternatively, we could estimate the proportions of students prefering flower and wine within each group of car owners and car-free students using a binomial model. For an explanation of the binomial model, see chapter 8 of [@KornerNievergelt2015].
-
-
-```r
-# binomial model
-tab <- table(dat$car,dat$birthday)
-mod <- glm(tab~rownames(tab), family=binomial)
-bsim <- sim(mod, n.sim=nsim)
-```
-
-
-
-
-(\#fig:unnamed-chunk-7)Estimated proportion of students that prefer flowers over wine as a birthday present among the car-free students (N) and the car owners (Y). Given are the median of the posterior distribution (circle). The bar extends between the 2.5% and 97.5% quantiles of the posterior distribution.
-
-
-
-
-
-
-
-
-## 3 methods for getting the posterior distribution
-
-* analytically
-* approximation
-* Monte Carlo simulation
-
-### Monte Carlo simulation (parametric bootstrap)
-
-Monte Carlo integration: numerical solution of $\int_{-1}^{1.5} F(x) dx$
-
-
-
-sim is solving a mathematical problem by simulation
-How sim is simulating to get the marginal distribution of $\mu$:
-
-
-
-
-
-
-### Grid approximation
-
-$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}$
-
-For example, one coin flip (Bernoulli model)
-
-data: y=0 (a tail)
-likelihood: $p(y|\theta)=\theta^y(1-\theta)^{(1-y)}$
-
-
-
-
-
-### Markov chain Monte Carlo simulations
-
-* Markov chain Monte Carlo simulation (BUGS, Jags)
-* Hamiltonian Monte Carlo (Stan)
-
-
-
-
-
-## Analysis of variance ANOVA
-The aim of an ANOVA is to compare means of groups. In a frequentist analysis, this is done by comparing the between-group with the within-group variance. The result of a Bayesian analysis is the joint posterior distribution of the group means.
-
-
-
-
-(\#fig:unnamed-chunk-12)Number of stats courses students have taken before starting a PhD in relation to their feeling about statistics.
-
-
-In the frequentist ANOVA, the following three sum of squared distances (SS) are used to calculate the total, the between- and within-group variances:
-Total sum of squares = SST = $\sum_1^n{(y_i-\bar{y})^2}$
-Within-group SS = SSW = $\sum_1^n{(y_i-\bar{y_g})^2}$: unexplained variance
-Between-group SS = SSB = $\sum_1^g{n_g(\bar{y_g}-\bar{y})^2}$: explained variance
-
-The between-group and within-group SS sum to the total sum of squares: SST=SSB+SSW. Attention: this equation is only true in any case for a simple one-way ANOVA (just one grouping factor). If the data are grouped according to more than one factor (such as in a two- or three-way ANOVA), then there is one single solution for the equation only when the data is completely balanced, i.e. when there are the same number of observations in all combinations of factor levels. For non-balanced data with more than one grouping factor, there are different ways of calculating the SSBs, and the result of the F-test described below depends on the order of the predictors in the model.
-
-
-
-
-(\#fig:unnamed-chunk-13)Visualisation of the total, between-group and within-group sum of squares. Points are observations; long horizontal line is the overall mean; short horizontal lines are group specific means.
-
-
-
-In order to make SSB and SSW comparable, we have to divide them by their degrees of freedoms. For the within-group SS, SSW, the degrees of freedom is the number of obervations minus the number of groups ($g$), because $g$ means have been estimated from the data. If the $g$ means are fixed and $n-g$ data points are known, then the last $g$ data points are defined, i.e., they cannot be chosen freely. For the between-group SS, SSB, the degrees of freedom is the number of groups minus 1 (the minus 1 stands for the overall mean).
-
-* MSB = SSB/df_between, MSW = SSW/df_within
-
-It can be shown (by mathematicians) that, given the nullhypothesis, the mean of all groups are equal $m_1 = m_2 = m_3$, then the mean squared errors between groups (MSB) is expected to be equal to the mean squared errors within the groups (MSW). Therefore, the ration MSB/MSW is expected to follow an F-distribution given the nullhypothesis is true.
-
-* MSB/MSW ~ F(df_between, df_within)
-
-
-The Bayesian analysis for comparing group means consists of calculating the posterior distribution for each group mean and then drawing inference from these posterior distributions.
-A Bayesian one-way ANOVA involves the following steps:
-1. Decide for a data model: We, here, assume that the measurements are normally distributed around the group means. In this example here, we transform the outcome variable in order to better meet the normal assumption. Note: the frequentist ANOVA makes exactly the same assumptions. We can write the data model: $y_i\sim Norm(\mu_i,\sigma)$ with $mu_i= \beta_0 + \beta_1I(group=2) +\beta_1I(group=3)$, where the $I()$-function is an indicator function taking on 1 if the expression is true and 0 otherwise. This model has 4 parameters: $\beta_0$, $\beta_1$, $\beta_2$ and $\sigma$.
-
-
-```r
-# fit a normal model with 3 different means
-mod <- lm(log(nrcourses+1)~statsfeeling, data=dat)
-```
-
-2. Choose a prior distribution for each model parameter: In this example, we choose flat prior distributions for each parameter. By using these priors, the result should not remarkably be affected by the prior distributions but almost only reflect the information in the data. We choose so-called improper prior distributions. These are completely flat distributions that give all parameter values the same probability. Such distributions are called improper because the area under the curve is not summing to 1 and therefore, they cannot be considered to be proper probability distributions. However, they can still be used to solve the Bayesian theorem.
-
-3. Solve the Bayes theorem: The solution of the Bayes theorem for the above priors and model is implemented in the function sim of the package arm.
-
-
-```r
-# calculate numerically the posterior distributions of the model
-# parameters using flat prior distributions
-nsim <- 5000
-set.seed(346346)
-bsim <- sim(mod, n.sim=nsim)
-```
-
-4. Display the joint posterior distributions of the group means
-
-
-
-```r
-# calculate group means from the model parameters
-newdat <- data.frame(statsfeeling=levels(factor(dat$statsfeeling)))
-X <- model.matrix(~statsfeeling, data=newdat)
-fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
-for(i in 1:nsim) fitmat[,i] <- X%*%bsim@coef[i,]
-hist(fitmat[1,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA, xlab="Group mean of log(number of courses +1)", las=1, ylim=c(0, 2.2))
-hist(fitmat[2,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA, xlab="", las=1, add=TRUE, col=rgb(0,0,1,0.5))
-hist(fitmat[3,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA, xlab="", las=1, add=TRUE, col=rgb(1,0,0,0.5))
-legend(2,2, fill=c("white",rgb(0,0,1,0.5), rgb(1,0,0,0.5)), legend=levels(factor(dat$statsfeeling)))
-```
-
-
-
-
-(\#fig:unnamed-chunk-16)Posterior distributions of the mean number of stats courses PhD students visited before starting the PhD grouped according to their feelings about statistics.
-
-
-Based on the posterior distributions of the group means, we can extract derived quantities depending on our interest and questions. Here, for example, we could extract the posterior probability of the hypothesis that students with a positive feeling about statistics have a better education in statistics than those with a neutral or negative feeling about statistics.
-
-
-```r
-# P(mean(positive)>mean(neutral))
-mean(fitmat[3,]>fitmat[2,])
-```
-
-```
-## [1] 0.8754
-```
-
-```r
-# P(mean(positive)>mean(negative))
-mean(fitmat[3,]>fitmat[1,])
-```
-
-```
-## [1] 0.9798
-```
-
-
-
-
-## Summary
-
diff --git a/docs/2.02-priors.md b/docs/2.02-priors.md
index 9c9187d..bf2032d 100644
--- a/docs/2.02-priors.md
+++ b/docs/2.02-priors.md
@@ -1,5 +1,5 @@
-# Prior distributions {#priors}
+# Prior distributions and prior sensitivity analyses{#priors}
## Introduction
diff --git a/docs/2.06-glm.md b/docs/2.06-glm.md
index 783471a..fed42d0 100644
--- a/docs/2.06-glm.md
+++ b/docs/2.06-glm.md
@@ -163,10 +163,10 @@ apply(bsim@coef, 2, quantile, prob=c(0.5, 0.025, 0.975))
```
```
-## (Intercept) elevation I(elevation^2) I(elevation^3) I(elevation^4)
-## 50% -24.27945 0.3953864 -0.0021756836 0.000004798096 -0.0000000037490422
-## 2.5% -35.02347 0.1887128 -0.0034319692 0.000001217286 -0.0000000070720019
-## 97.5% -12.85627 0.5910082 -0.0008527594 0.000008247819 -0.0000000003734525
+## (Intercept) elevation I(elevation^2) I(elevation^3) I(elevation^4)
+## 50% -24.39396 0.3967360 -0.0022011353 0.000004884348 -0.000000003824508
+## 2.5% -35.61303 0.1984131 -0.0035191339 0.000001354667 -0.000000007254023
+## 97.5% -13.45759 0.6020802 -0.0009058836 0.000008491873 -0.000000000448624
```
To interpret this polynomial function, an effect plot is helpful. To that end, and as we have done before, we calculate fitted values over the range of the covariate, together with compatibility intervals.
diff --git a/docs/2.06-glm_files/figure-html/fittree1-1.png b/docs/2.06-glm_files/figure-html/fittree1-1.png
index a898db5..2675076 100644
Binary files a/docs/2.06-glm_files/figure-html/fittree1-1.png and b/docs/2.06-glm_files/figure-html/fittree1-1.png differ
diff --git a/docs/2.06-glm_files/figure-html/lrgof-1.png b/docs/2.06-glm_files/figure-html/lrgof-1.png
index 3be30af..d514fe2 100644
Binary files a/docs/2.06-glm_files/figure-html/lrgof-1.png and b/docs/2.06-glm_files/figure-html/lrgof-1.png differ
diff --git a/docs/2.06-glm_files/figure-html/overdisp-1.png b/docs/2.06-glm_files/figure-html/overdisp-1.png
index 6b50264..5ef8359 100644
Binary files a/docs/2.06-glm_files/figure-html/overdisp-1.png and b/docs/2.06-glm_files/figure-html/overdisp-1.png differ
diff --git a/docs/2.06-glm_files/figure-html/unnamed-chunk-4-1.png b/docs/2.06-glm_files/figure-html/unnamed-chunk-4-1.png
index 16a84dc..79af826 100644
Binary files a/docs/2.06-glm_files/figure-html/unnamed-chunk-4-1.png and b/docs/2.06-glm_files/figure-html/unnamed-chunk-4-1.png differ
diff --git a/docs/2.07-glmm_files/figure-html/ppbinomial-1.png b/docs/2.07-glmm_files/figure-html/ppbinomial-1.png
index 2a63e24..a4789cf 100644
Binary files a/docs/2.07-glmm_files/figure-html/ppbinomial-1.png and b/docs/2.07-glmm_files/figure-html/ppbinomial-1.png differ
diff --git a/docs/2.08-model_checking.md b/docs/2.08-model_checking.md
index e1cd6ab..bfa4b42 100644
--- a/docs/2.08-model_checking.md
+++ b/docs/2.08-model_checking.md
@@ -1,12 +1,183 @@
# Posterior predictive model checking {#modelchecking}
-THIS CHAPTER IS UNDER CONSTRUCTION!!!
+
-## Introduction
+Only if the model describes the data-generating process sufficiently accurately can we draw relevant conclusions from the model. It is therefore essential to assess model fit: our goal is to describe how well the model fits the data with respect to different aspects of the model. In this book, we present three ways to assess how well a model reproduces the data-generating process: (1) [residual analysis](#residualanalysis),
+(2) posterior predictive model checking (this chapter)
+and (3) [prior sensitivity analysis](#priors).
-## Summary
-xxx
+Posterior predictive model checking is the comparison of replicated data generated under the model with the observed data. The aim of posterior predictive model checking is similar to the aim of a residual analysis, that is, to look at what data structures the model does not explain. However, the possibilities of residual analyses are limited, particularly in the case of non-normal data distributions. For example, in a logistic regression, positive residuals are always associated with $y_i = 1$ and negative residuals with $y_i = 0$. As a consequence, temporal and spatial patterns in the residuals will always look similar to these patterns in the observations and it is difficult to judge whether the model captures these processes adequately. In such cases, simulating data from the posterior predictive distribution of a model and comparing these data with the observations (i.e., predictive model checking) gives a clearer insight into the performance of a model.
+We follow the notation of @Gelman2014 in that we use “replicated
+data”, $y^{rep}$ for a set of $n$ new observations drawn from the posterior predictive distribution for the specific predictor variables $x$ of the $n$ observations in our data set. When we simulate new observations for new values of the predictor variables, for example, to show the prediction interval in an effect plot, we use $y^{new}$.
+The first step in posterior predictive model checking is to simulate a replicated data set for each set of simulated values of the joint posterior distribution of the model parameters. Thus, we produce, for example, 2000 replicated data sets. These replicated data sets are then compared graphically, or more formally by test statistics, with the observed data. The Bayesian p-value offers a way for formalized testing. It is defined as the probability that the replicated data from the model are more extreme than the observed data, as measured by a test statistic. In case of a perfect fit, we expect that the test statistic from the observed data is well in the middle of the ones from the replicated data. In other words, around 50% of the test statistics from the replicated data are higher than the one from the observed data, resulting in a Bayesian p-value close to 0.5. Bayesian p-values close to 0 or close to 1, on the contrary, indicate that the aspect of the model measured by the specific test statistic is not well represented by the model.
+Test statistics have to be chosen such that they describe important data structures that are not directly measured as a model parameter. Because model parameters are chosen so that they fit the data well, it is not surprising to find p-values close to 0.5 when using model parameters as test statistics. For example, extreme values or quantiles of $y$ are often better suited than the mean as test statistics, because they are less redundant with the model parameter that is fitted to the data. Similarly, the number of switches from 0 to 1 in binary data is suited to check for autocorrelation whereas the proportion of 1s among all the data may not give so much insight into the model fit. Other test statistics could be a measure for asymmetry, such as the relative difference between the 10 and 90% quantiles, or the proportion of zero values in a Poisson model.
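+
+A small simulation may illustrate why the choice of the test statistic matters. The sketch below uses invented data, not the whitethroat data: the observations contain more zeros than a Poisson distribution allows, and the replicated data are drawn from a Poisson distribution with the observed mean, ignoring parameter uncertainty for simplicity. All object names (`y_zi`, `yrep_pois`, `p_mean`, `p_zero`) are hypothetical.
+
+```r
+set.seed(3)
+n <- 200
+# zero-inflated counts: about half of the observations are structural zeros
+y_zi <- rbinom(n, 1, 0.5) * rpois(n, lambda = 4)
+# replicated data from a plain Poisson model matched to the mean of y_zi
+yrep_pois <- matrix(rpois(1000 * n, lambda = mean(y_zi)), nrow = 1000)
+# Bayesian p-value with the mean as test statistic
+p_mean <- mean(apply(yrep_pois, 1, mean) >= mean(y_zi))
+# Bayesian p-value with the proportion of zeros as test statistic
+p_zero <- mean(apply(yrep_pois, 1, function(x) mean(x == 0)) >= mean(y_zi == 0))
+c(p_mean = p_mean, p_zero = p_zero)
+```
+
+The mean is reproduced by construction, so its p-value should lie near 0.5 and reveals nothing about the misfit. The proportion of zeros in the replicated data should almost never reach the observed one, which gives a p-value close to 0 and exposes the excess zeros.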
+
+We like predictive model checking because it allows us to look at different, specific aspects of the model. It helps us to judge which conclusions from the model are reliable and to identify the limitations of a model. Predictive model checking also helps to understand the process that has generated the data.
+
+We use an analysis of the whitethroat breeding density in wildflower fields of different ages for illustration. The aim of this analysis was to identify an optimal age of wildflower fields that serves as good habitat for the whitethroat.
+
+Because the Stan developers have written highly convenient user friendly functions to do posterior predictive model checks, we fit the model with Stan using the function `stan_glmer` from the package `rstanarm`.
+
+
+
+```r
+data("wildflowerfields")
+dat <- wildflowerfields
+dat$size.ha <- dat$size/100 # change unit to ha
+dat$size.z <- scale(dat$size) # z-transform size
+dat$year.z <- scale(dat$year)
+age.poly <- poly(dat$age, 3) # create orthogonal polynomials
+dat$age.l <- age.poly[,1] # to ease convergence of the model fit
+dat$age.q <- age.poly[,2]
+dat$age.c <- age.poly[,3]
+
+library(rstanarm)
+mod <- stan_glmer(bp ~ year.z + age.l + age.q + age.c + size.z +
+(1|field) + offset(log(size.ha)), family=poisson, data=dat)
+```
+
+The R package `shinystan` [@StanDevelopmentTeam.2017b] provides an easy way to do model checking, so there is no excuse for skipping posterior predictive model checks. The R code `launch_shinystan(mod)` opens an HTML page that contains all kinds of model diagnostics. Besides many statistics and diagnostic plots that assess how well the MCMC worked, we also find a menu "PPcheck". There, we can click through many of the plots that we produce in R below.
+
+The function `posterior_predict` simulates many different data sets from a model fit: exactly as many as there are draws from the posterior distribution of the model parameters, thus 4000 if the default number of iterations has been used in Stan. Specifically, for each set of parameter values drawn from the joint posterior distribution, it simulates one replicated data set. The result is a matrix with one row per replicated data set and one column per observation. We can look at histograms of the observed and the replicated data (Figure \@ref(fig:histpp)). The observed data (`bp`) look similar to the replicated data.
+
+
+```r
+set.seed(2352) # to make sure that the ylim and breaks of the histograms below can be used
+yrep <- posterior_predict(mod)
+par(mfrow=c(3,3), mar=c(2,1,2,1))
+for(i in 1:8) hist(yrep[i,], col="blue",
+ breaks=seq(-0.5, 18.5, by=1), ylim=c(0,85))
+hist(dat$bp, col="blue",
+ breaks=seq(-0.5, 18.5, by=1), ylim=c(0,85))
+```
+
+
+
+
+(\#fig:histpp)Histograms of 8 out of 4000 replicated data sets and of the observed data (dat$bp). The arguments breaks and ylim have been used in the function hist to produce the same scale of the x- and y-axis in all plots. This makes comparison among the plots easier.
+
+
+Let's look at specific aspects of the data. The proportion of zero counts could be a sensitive test statistic for this data set. First, we define a function `propzeros` that extracts the proportion of zero counts from a vector of count data. Then we apply this function to the observed data and to each of the 4000 replicated data sets. Finally, we extract the 1% and 99% quantiles of the proportion of zero values in the replicated data.
+
+
+```r
+propzeros <- function(x) sum(x==0)/length(x)
+propzeros(dat$bp) # prop. zero values in observed data
+```
+
+```
+## [1] 0.4705882
+```
+
+```r
+pzeroyrep <- apply(yrep, 2, propzeros) # prop. zero values in yrep
+quantile(pzeroyrep, prob=c(0.01, 0.99))
+```
+
+```
+## 1% 99%
+## 0.0335750 0.9557625
+```
+
+The observed data contain 47% zero values, which is well within the 98%-range of what the model predicted (3 - 96%). The Bayesian p-value is 0.6.
+
+
+```r
+mean(pzeroyrep>=propzeros(dat$bp))
+```
+
+```
+## [1] 0.5955882
+```
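+
+The same three steps (statistic of the observed data, its distribution over the replicated data sets, and the Bayesian p-value) recur for every test statistic. The following is a minimal sketch of a reusable helper. The name `ppstat` is hypothetical and not part of `rstanarm`, and the toy data are assumptions that stand in for `dat$bp` and `yrep`. The helper relies on each row of the matrix holding one replicated data set, which is the layout returned by `posterior_predict`.
+
+```r
+# Hypothetical helper (not part of rstanarm): compares a test statistic
+# of the observed data with its distribution over replicated data sets.
+# y: observed outcome vector; yrep: matrix, one replicated data set per row
+ppstat <- function(y, yrep, stat, probs=c(0.01, 0.99)) {
+  obs <- stat(y)
+  sims <- apply(yrep, 1, stat)  # one value per replicated data set
+  list(observed=obs,
+       range=quantile(sims, probs=probs),
+       pvalue=mean(sims >= obs))  # proportion of replicates >= observed
+}
+
+# toy data (assumption): 40 Poisson counts and 1000 replicated data sets
+set.seed(123)
+y_toy <- rpois(40, lambda=1)
+yrep_toy <- matrix(rpois(1000*40, lambda=1), nrow=1000)
+
+propzeros <- function(x) sum(x==0)/length(x)
+res <- ppstat(y_toy, yrep_toy, propzeros)
+res
+```
+
+Any other function of a count vector, for example `function(x) quantile(x, prob=0.9)`, can be passed as `stat` in the same way.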
+
+What about the upper tail of the data? Let’s look at the 90% quantile.
+
+
+```r
+quantile(dat$bp, prob=0.9) # for observed data
+```
+
+```
+## 90%
+## 2
+```
+
+```r
+q90yrep <- apply(yrep, 1, quantile, prob=0.9) # for each replicated data set (rows of yrep)
+table(q90yrep)
+```
+
+```
+## q90yrep
+## 0 1 2 3 4 5 6 7 8
+## 10 38 47 22 8 7 1 1 2
+```
+
+The 90% quantile of the observed data also lies within the range that the model predicts.
+
+We can also look at the spatial distribution of the data and the replicated data. The variables X and Y are the coordinates of the wildflower fields. We can use them to draw transparent gray dots sized according to the number of breeding pairs.
+
+
+```r
+par(mfrow=c(3,3), mar=c(1,1,1,1))
+plot(dat$X, dat$Y, pch=16, cex=dat$bp+0.2, col=rgb(0,0,0,0.5), axes=FALSE)
+box()
+r <- sample(nrow(yrep), 8) # draw 8 replicated data sets at random
+for(i in r){
+plot(dat$X, dat$Y, pch=16, cex=yrep[i,]+0.2,
+col=rgb(0,0,0,0.5), axes=FALSE)
+box()
+}
+```
+
+
+
+
(\#fig:spatpp)Spatial distribution of the whitethroat breeding pair counts and of 8 randomly chosen replicated data sets with data simulated based on the model. The smallest dots correspond to a count of 0, the largest to a count of 20 breeding pairs. The panel in the upper left corner shows the data, the other panels are replicated data from the model.
+
+
+The spatial distribution of the replicated data sets seems to be similar to the observed one at first look (Figure \@ref(fig:spatpp)). On closer inspection, we may detect that, in the middle of the study area, the model predicts slightly larger numbers than observed. This pattern may motivate us to find the reason for the imperfect fit if the main interest is whitethroat density estimates. Are there important elements in the landscape that influence whitethroat densities and that we have not yet taken into account in the model? However, our main interest is finding the optimal age of wildflower fields for the whitethroat. Therefore, we look at the mean age of the 10% of the fields with the highest breeding densities.
+To do so, we first define a function that extracts the mean age of the 10% of the fields with the highest whitethroat densities, and then we apply this function to the observed data and to the 4000 replicated data sets.
+
+
+```r
+magehighest <- function(x) {
+q90 <- quantile(x/dat$size.ha, prob=0.90)
+index <- (x/dat$size.ha)>=q90
+mage <- mean(dat$age[index])
+return(mage)
+}
+
+magehighest(dat$bp)
+```
+
+```
+## [1] 4.4
+```
+
+```r
+mageyrep <- apply(yrep, 1, magehighest)
+quantile(mageyrep, prob=c(0.01, 0.5,0.99))
+```
+
+```
+## 1% 50% 99%
+## 3.733333 4.714286 5.785714
+```
+
+The mean age of the 10% of the fields with the highest whitethroat densities is 4.4 years in the observed data set. In the replicated data sets, it lies between 3.73 and 5.79 years (1% and 99% quantiles). The Bayesian p-value is 0.79. Thus, in around 79% of the replicated data sets the mean age of the 10% of the fields with the highest whitethroat densities was higher than the observed one (Figure \@ref(fig:agepp)).
+
+
+```r
+hist(mageyrep)
+abline(v=magehighest(dat$bp), col="orange", lwd=2)
+```
+
+
+
+
(\#fig:agepp)Histogram of the average age of the 10% of the wildflower fields with the highest breeding densities in the replicated data sets. The orange line indicates the average age of the 10% of the fields with the highest observed whitethroat densities.
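+
+A Bayesian p-value for a statistic that also depends on covariates, such as the mean age of the densest fields, is the proportion of replicated data sets in which the statistic is at least as large as in the observed data. The sketch below illustrates this on simulated toy data. The data frame `toy`, the matrix `yrep_toy` and the function `magehighest_toy` are assumptions made for illustration and do not reproduce the whitethroat results.
+
+```r
+# toy data (assumption): 60 fields with age, size and Poisson counts
+set.seed(42)
+n <- 60
+toy <- data.frame(age=sample(1:8, n, replace=TRUE),
+                  size.ha=runif(n, 0.5, 3))
+toy$bp <- rpois(n, lambda=toy$size.ha)
+
+# 500 replicated data sets, one per row (column j uses size.ha[j])
+yrep_toy <- matrix(rpois(500*n, lambda=rep(toy$size.ha, each=500)),
+                   nrow=500)
+
+# mean age of the fields whose density is at or above the 90% quantile
+magehighest_toy <- function(x, age, size) {
+  dens <- x/size
+  mean(age[dens >= quantile(dens, prob=0.90)])
+}
+
+obs <- magehighest_toy(toy$bp, toy$age, toy$size.ha)
+sims <- apply(yrep_toy, 1, magehighest_toy,
+              age=toy$age, size=toy$size.ha)
+mean(sims >= obs)  # Bayesian p-value
+```
+
+Passing the covariates as arguments, rather than reading them from a global data frame, keeps the function usable for other data sets.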
+
+
+In a publication, we could summarize the results of the posterior predictive model checking in a table or give the plots in an appendix. Here, we conclude that the model fits well in the most important aspects. However, the model may predict whitethroat densities that are too high in the central part of the study area.
diff --git a/docs/2.08-model_checking_files/figure-html/agepp-1.png b/docs/2.08-model_checking_files/figure-html/agepp-1.png
new file mode 100644
index 0000000..e6453c4
Binary files /dev/null and b/docs/2.08-model_checking_files/figure-html/agepp-1.png differ
diff --git a/docs/2.08-model_checking_files/figure-html/histpp-1.png b/docs/2.08-model_checking_files/figure-html/histpp-1.png
new file mode 100644
index 0000000..faf7db1
Binary files /dev/null and b/docs/2.08-model_checking_files/figure-html/histpp-1.png differ
diff --git a/docs/2.08-model_checking_files/figure-html/spatpp-1.png b/docs/2.08-model_checking_files/figure-html/spatpp-1.png
new file mode 100644
index 0000000..381e6d9
Binary files /dev/null and b/docs/2.08-model_checking_files/figure-html/spatpp-1.png differ
diff --git a/docs/2.2-priors.md b/docs/2.2-priors.md
deleted file mode 100644
index 1db18ef..0000000
--- a/docs/2.2-priors.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Prior distributions {#priors}
-
-## Introduction
-
-
-## How to choose a prior {#choosepriors}
-> Tabelle von Fränzi (CourseIII_glm_glmmm/course2018/presentations_handouts/presentations)
-
-## Prior sensitivity
-xxx
-
-
-
diff --git a/docs/2.4-residual_analysis.md b/docs/2.4-residual_analysis.md
deleted file mode 100644
index d9ed90b..0000000
--- a/docs/2.4-residual_analysis.md
+++ /dev/null
@@ -1,40 +0,0 @@
-# Assessing Model Assumptions {#residualanalysis}
-
-## Model Assumptions
-
-Every statistical model makes assumptions. We try to build models that reflect the data-generating process as realistically as possible. However, a model never is the truth. Yet, all inferences drawn from a model, such as estimates of effect size or derived quantities with credible intervals, are based on the assumption that the model is true. However, if a model captures the datagenerating process poorly, for example, because it misses important structures (predictors, interactions, polynomials), inferences drawn from the model are probably biased and results become unreliable. In a (hypothetical) model that captures all important structures of the data generating process, the stochastic part, the difference between the observation and the fitted value (the residuals), should only show random variation. Analyzing residuals is a very important part of the data analysis process.
-
-Residual analysis can be very exciting, because the residuals show what remains unexplained by the present model. Residuals can sometimes show surprising patterns and, thereby, provide deeper insight into the system. However, at this step of the analysis it is important not to forget the original research questions that motivated the study. Because these questions have been asked without knowledge of the data, they protect against data dredging. Of course, residual analysis may raise interesting new questions. Nonetheless, these new questions have emerged from patterns in the data, which might just be random, not systematic, patterns. The search for a model with good fit should be guided by thinking about the process that generated the data, not by trial and error (i.e., do not try all possible variable combinations until the residuals look good; that is data dredging). All changes done to the model should be scientifically justified. Usually, model complexity increases, rather than decreases, during the analysis.
-
-## Independent and Identically Distributed
-Usually, we model an outcome variable as independent and identically distributed (iid) given the model parameters. This means that all observations with the same predictor values behave like independent random numbers from the identical distribution. As a consequence, residuals should look iid. Independent means that:
-
-- The residuals do not correlate with other variables (those that are included in the model as well as any other variable not included in the model).
-
-- The residuals are not grouped (i.e., the means of any set of residuals should all be equal).
-
-- The residuals are not autocorrelated (i.e., no temporal or spatial autocorrelation exist; Sections \@ref(tempautocorrelation) and \@ref(spatialautocorrelation)).
-
-Identically distributed means that:
-
-- All residuals come from the same distribution.
-
-In the case of a linear model with normal error distribution (Chapter \@ref(lm)) the residuals are assumed to come from the same normal distribution. Particularly:
-
-- The residual variance is homogeneous (homoscedasticity), that is, it does not depend on any predictor variable, and it does not change with the fitted value.
-
-- The mean of the residuals is zero over the whole range of predictor values. When numeric predictors (covariates) are present, this implies that the relationship between x and y can be adequately described by a straight line.
-
-Residual analysis is mainly done graphically. R makes it very easy to plot residuals to look at the different aspects just listed. As a first example, we use the coal tit example from Chapter \@ref(lm):
-
-> Hier fehlt noch ein Teil aus dem BUCH.
-
-## The QQ-Plot {#qqplot}
-xxx
-
-## Temporal Autocorrelation {#tempautocorrelation}
-
-## Spatial Autocorrelation {#spatialautocorrelation}
-
-## Heteroscedasticity {#Heteroscedasticity}
-
diff --git a/docs/404.html b/docs/404.html
index f1f3838..d1ade1b 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -23,7 +23,7 @@
-
+
@@ -262,7 +262,7 @@
When testing for a correlation between two categorical variables, the null hypothesis is “there is no correlation”. The data can be displayed in cross-tables.
-
# Example: correlation between birthday preference and car ownership
-load("RData/datacourse.RData")
-table(dat$birthday, dat$car)
-
##
-## N Y
-## flowers 6 1
-## wine 9 6
-
Given the null hypothesis was true, we expect that the distribution of the data in each column of the cross-table is similar to the distribution of the row-sums. And, the distribution of the data in each row should be similar to the distribution of the column-sums. The chisquare test statistic \(\chi^2\) measures the deviation of the data from this expected distribution of the data in the cross-table.
-
For calculating the chisquare test statistic \(\chi^2\), we first have to obtain for each cell in the cross-table the expected value \(E_{ij}\) = rowsum*colsum/total.
-
\(\chi^2\) measures the difference between the observed \(O_{ij}\) and expected \(E_{ij}\) values as:
-\(\chi^2=\sum_{i=1}^{m}\sum_{j=1}^{k}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\) where \(m\) is the number of rows and \(k\) is the number of columns.
-The \(\chi^2\)-distribution has 1 parameter, the degrees of freedom \(v\) = \((m-1)(k-1)\).
-
-
-
-Figure 5.1: Two examples of Chisquare distributions.
-
-
-
R calculates the \(\chi^2\) value for specific cross-tables and also gives the p-value, i.e., the probability of obtaining the observed or a higher \(\chi^2\) value given the null hypothesis was true, by comparing the observed \(\chi^2\) with the corresponding chisquare distribution.
The warning (that is suppressed in the rmarkdown version, but that you will see if you run the code on your own computer) is given, because in our example some cells have counts less than 5. In such cases, Fisher’s exact test should be preferred. This test calculates the p-value analytically using probability theory, whereas the chisquare test relies on the assumption that the \(\chi^2\) value follows a chisquare distribution. The latter assumption holds better for larger sample sizes.
-
fisher.test(table(dat$birthday, dat$car))
-
##
-## Fisher's Exact Test for Count Data
-##
-## data: table(dat$birthday, dat$car)
-## p-value = 0.3501
-## alternative hypothesis: true odds ratio is not equal to 1
-## 95 percent confidence interval:
-## 0.3153576 213.8457248
-## sample estimates:
-## odds ratio
-## 3.778328
-
-
-
5.1.2 Correlations among categorical variables using Bayesian methods
-
For a Bayesian analysis of cross-table data, a data model has to be found. There are several possibilities that could be used:
-
-
a so-called log-linear model (Poisson model) for the counts in each cell of the cross-table.
-
-
a binomial or a multinomial model for obtaining estimates of the proportions of data in each cell
-
-
These models provide possibilities to explore the patterns in the data in more details than a chisquare test.
-
# We arrange the data into a cross-table in a data-frame
-# format. That is, the counts in each cell of the
-# cross-table become a variable and the row and column names
-# are also given in separate variables
-datagg <-aggregate(dat$name_fictive, list(birthday=dat$birthday, car=dat$car),
- length, drop=FALSE)
-datagg$x[is.na(datagg$x)] <-0
-names(datagg) <-c("birthday", "car", "count")
-datagg
-
## birthday car count
-## 1 flowers N 6
-## 2 wine N 9
-## 3 flowers Y 1
-## 4 wine Y 6
The interaction parameter measures the strength of the correlation. To quantitatively understand what a parameter value of 1.39 means, we have to look at the interpretation of all parameter values. We do that here quickly without a thorough explanation, because we already explained the Poisson model in chapter 8 of (Korner-Nievergelt et al. 2015).
-
The intercept 1.79 corresponds to the logarithm of the count in the cell “flowers” and “N” (number of students who prefer flowers as a birthday present and who do not have a car), i.e., \(exp(\beta_0)\) = 6. The exponent of the second parameter corresponds to the multiplicative difference between the counts in the cells “flowers and N” and “wine and N”, i.e., count in the cell “wine and N” = \(exp(\beta_0)exp(\beta_1)\) = exp(1.79)exp(0.41) = 9. The third parameter measures the multiplicative difference in the counts between the cells “flowers and N” and “flowers and Y”, i.e., count in the cell “flowers and Y” = \(exp(\beta_0)exp(\beta_2)\) = exp(1.79)exp(-1.79) = 1. Thus, the third parameter is the difference in the logarithm of the counts between the car owners and the car-free students for those who prefer flowers. The interaction parameter is the difference of this difference between the students who prefer wine and those who prefer flowers. This is difficult to intuitively understand. Here is another try to formulate it: The interaction parameter measures the difference in the logarithm of the counts in the cross-table between the row-differences between the columns. Maybe it becomes clear, when we extract the count in the cell “wine and Y” from the model parameters: \(exp(\beta_0)exp(\beta_1)exp(\beta_2)exp(\beta_3)\) = exp(1.79)exp(0.41)exp(-1.79)exp(1.39) = 6.
-
Alternatively, we could estimate the proportions of students preferring flowers and wine within each group of car owners and car-free students using a binomial model. For an explanation of the binomial model, see chapter 8 of (Korner-Nievergelt et al. 2015).
-Figure 5.2: Estimated proportion of students that prefer flowers over wine as a birthday present among the car-free students (N) and the car owners (Y). Given are the median of the posterior distribution (circle). The bar extends between the 2.5% and 97.5% quantiles of the posterior distribution.
-
-
-
-
-
-
5.2 3 methods for getting the posterior distribution
-
-
analytically
-
approximation
-
Monte Carlo simulation
-
-
-
5.2.1 Monte Carlo simulation (parametric bootstrap)
-
Monte Carlo integration: numerical solution of \(\int_{-1}^{1.5} F(x) dx\)
-
-
sim is solving a mathematical problem by simulation
-How sim is simulating to get the marginal distribution of \(\mu\):
data: y=0 (a tail)
-likelihood: \(p(y|\theta)=\theta^y(1-\theta)^{(1-y)}\)
-
-
-
-
5.2.3 Markov chain Monte Carlo simulations
-
-
Markov chain Monte Carlo simulation (BUGS, Jags)
-
Hamiltonian Monte Carlo (Stan)
-
-
-
-
-
-
5.3 Analysis of variance ANOVA
-
The aim of an ANOVA is to compare means of groups. In a frequentist analysis, this is done by comparing the between-group with the within-group variance. The result of a Bayesian analysis is the joint posterior distribution of the group means.
-
-
-
-Figure 5.3: Number of stats courses students have taken before starting a PhD in relation to their feeling about statistics.
-
-
-
In the frequentist ANOVA, the following three sum of squared distances (SS) are used to calculate the total, the between- and within-group variances:
-Total sum of squares = SST = \(\sum_1^n{(y_i-\bar{y})^2}\)
-Within-group SS = SSW = \(\sum_1^n{(y_i-\bar{y_g})^2}\): unexplained variance
-Between-group SS = SSB = \(\sum_1^g{n_g(\bar{y_g}-\bar{y})^2}\): explained variance
-
The between-group and within-group SS sum to the total sum of squares: SST=SSB+SSW. Attention: this equation is only true in any case for a simple one-way ANOVA (just one grouping factor). If the data are grouped according to more than one factor (such as in a two- or three-way ANOVA), then there is one single solution for the equation only when the data is completely balanced, i.e. when there are the same number of observations in all combinations of factor levels. For non-balanced data with more than one grouping factor, there are different ways of calculating the SSBs, and the result of the F-test described below depends on the order of the predictors in the model.
-
-
-
-Figure 5.4: Visualisation of the total, between-group and within-group sum of squares. Points are observations; long horizontal line is the overall mean; short horizontal lines are group specific means.
-
-
-
In order to make SSB and SSW comparable, we have to divide them by their degrees of freedom. For the within-group SS, SSW, the degrees of freedom is the number of observations minus the number of groups (\(g\)), because \(g\) means have been estimated from the data. If the \(g\) means are fixed and \(n-g\) data points are known, then the last \(g\) data points are defined, i.e., they cannot be chosen freely. For the between-group SS, SSB, the degrees of freedom is the number of groups minus 1 (the minus 1 stands for the overall mean).
-
-
MSB = SSB/df_between, MSW = SSW/df_within
-
-
It can be shown (by mathematicians) that, given the null hypothesis that the means of all groups are equal (\(m_1 = m_2 = m_3\)), the mean squared error between groups (MSB) is expected to be equal to the mean squared error within the groups (MSW). Therefore, the ratio MSB/MSW is expected to follow an F-distribution given the null hypothesis is true.
-
-
MSB/MSW ~ F(df_between, df_within)
-
-
The Bayesian analysis for comparing group means consists of calculating the posterior distribution for each group mean and then drawing inference from these posterior distributions.
-A Bayesian one-way ANOVA involves the following steps:
-1. Decide on a data model: We, here, assume that the measurements are normally distributed around the group means. In this example here, we transform the outcome variable in order to better meet the normal assumption. Note: the frequentist ANOVA makes exactly the same assumptions. We can write the data model: \(y_i\sim Norm(\mu_i,\sigma)\) with \(\mu_i= \beta_0 + \beta_1I(group=2) +\beta_2I(group=3)\), where the \(I()\)-function is an indicator function taking on 1 if the expression is true and 0 otherwise. This model has 4 parameters: \(\beta_0\), \(\beta_1\), \(\beta_2\) and \(\sigma\).
-
# fit a normal model with 3 different means
-mod <-lm(log(nrcourses+1)~statsfeeling, data=dat)
-
-
Choose a prior distribution for each model parameter: In this example, we choose flat prior distributions for each parameter. By using these priors, the result should not remarkably be affected by the prior distributions but almost only reflect the information in the data. We choose so-called improper prior distributions. These are completely flat distributions that give all parameter values the same probability. Such distributions are called improper because the area under the curve is not summing to 1 and therefore, they cannot be considered to be proper probability distributions. However, they can still be used to solve the Bayesian theorem.
-
Solve the Bayes theorem: The solution of the Bayes theorem for the above priors and model is implemented in the function sim of the package arm.
-
-
# calculate numerically the posterior distributions of the model
-# parameters using flat prior distributions
-nsim <-5000
-set.seed(346346)
-bsim <-sim(mod, n.sim=nsim)
-
-
Display the joint posterior distributions of the group means
-
-
# calculate group means from the model parameters
-newdat <-data.frame(statsfeeling=levels(factor(dat$statsfeeling)))
-X <-model.matrix(~statsfeeling, data=newdat)
-fitmat <-matrix(ncol=nsim, nrow=nrow(newdat))
-for(i in 1:nsim) fitmat[,i] <- X %*% bsim@coef[i,]
-hist(fitmat[1,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA, xlab="Group mean of log(number of courses +1)", las=1, ylim=c(0, 2.2))
-hist(fitmat[2,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA, xlab="", las=1, add=TRUE, col=rgb(0,0,1,0.5))
-hist(fitmat[3,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA, xlab="", las=1, add=TRUE, col=rgb(1,0,0,0.5))
-legend(2,2, fill=c("white",rgb(0,0,1,0.5), rgb(1,0,0,0.5)), legend=levels(factor(dat$statsfeeling)))
-
-
-
-Figure 5.5: Posterior distributions of the mean number of stats courses PhD students visited before starting the PhD grouped according to their feelings about statistics.
-
-
-
Based on the posterior distributions of the group means, we can extract derived quantities depending on our interest and questions. Here, for example, we could extract the posterior probability of the hypothesis that students with a positive feeling about statistics have a better education in statistics than those with a neutral or negative feeling about statistics.
In this chapter we provide a checklist with some guidance for data analysis. However, do not expect the list to be complete and for different studies, a different order of the steps may make more sense. We usually repeat steps 3.2 to 3.8 until we find a model that fits the data well and that is realistic enough to be useful for the intended purpose. Data analysis is always a lot of work and, often, the following steps have to be repeated many times until we find a useful model.
-
There is a chance and danger at the same time: we may find interesting results that answer different questions than we asked originally. They may be very exciting and important, however they may be biased. We can report such findings, but we should state that they appeared (more or less by chance) during the data exploration and model fitting phase, and we have to be aware that the estimates may be biased because the study was not optimally designed with respect to these findings. It is important to always keep the original aim of the study in mind. Do not adjust the study question according to the data. We also recommend reporting what the model started with at the first iteration and describing the strategy and reasoning behind the model development process.
-
-
3.1 Plausibility of Data
-
Prepare the data and check graphically, or via summary statistics, whether all the data are plausible. Prepare the data so that errors (typos, etc.) are minimal, for example, by double-checking the entries. See chapter ?? for useful R-code that can be used for data preparation and to make plausibility controls.
-
-
-
3.2 Relationships
-
Think about the direct and indirect relationships among the variables of the study. We normally start a data analysis by drawing a sketch of the model including all explanatory variables and interactions that may be biologically meaningful. We will most likely repeat this step after having looked at the model fit. To make the data analysis transparent we should report every model that was considered. A short note about why a specific model was considered and why it was discarded helps make the modeling process reproducible.
-
-
-
3.3 Data Distribution
-
What is the nature of the variable of interest (outcome, dependent variable)? At this stage, there is no use in formally comparing the distribution of the outcome variable to a statistical distribution, because the raw data are not required to follow a specific distribution. The models assume that, conditional on the explanatory variables and the model structure, the outcome variable follows a specific distribution. Therefore, checking how well the chosen distribution fits the data is done after the model fit (step 3.8). This first choice is solely done based on the nature of the data. Normally, our first choice is one of the classical distributions for which robust software for model fitting is available.
-
Here is a rough guideline for this first choice:
-
-
continuous measurements \(\Longrightarrow\) normal distribution
-> exceptions: time-to-event data \(\Longrightarrow\) see survival analysis
-
-
-
count \(\Longrightarrow\) Poisson or negative-binomial distribution
-
-
-
count with upper bound (proportion) \(\Longrightarrow\) binomial distribution
-
-
-
binary \(\Longrightarrow\) Bernoulli distribution
-
-
-
rate (count by a reference) \(\Longrightarrow\) Poisson including an offset
-
-
-
nominal \(\Longrightarrow\) multinomial distribution
-
-
Chapter 4 gives an overview of the distributions that are most relevant for ecologists.
-
-
-
3.4 Preparation of Explanatory Variables
-
-
Look at the distribution (histogram) of every explanatory variable: Linear models do not assume that the explanatory variables have any specific distribution. Thus there is no need to check for a normal distribution! However, very skewed distributions result in unequal weighting of the observations in the model. In extreme cases, the slope of a regression line is defined by one or a few observations only. We also need to check whether the variance is large enough, and to think about the shape of the expected effect. The following four questions may help with this step:
-
-
-
Is the variance (of the explanatory variable) big enough so that an effect of the variable can be measured?
-
Is the distribution skewed? If an explanatory variable is highly skewed, it may make sense to transform the variable (e.g., log, square-root).
-
Does it show a bimodal distribution? Consider making the variable binary.
-
Is it expected that a change of 1 at lower values for x has the same biological effect as a change of 1 at higher values of x? If not, a transformation (e.g., log) could linearize the relationship between x and y.
-
-
-
Centering: Centering (\(x.c = x-mean(x)\)) is a transformation that produces a variable with a mean of 0. Centering is optional. We have two reasons to center a predictor variable. First, it helps the model fitting algorithm to better converge because it reduces correlations among model parameters. Second, with centered predictors, the intercept and main effects in the linear model are better interpretable (they are measured at the center of the data instead of at the covariate value of 0 which may be far off).
-
Scaling: Scaling (\(x.s = x/c\), where \(c\) is a constant) is a transformation that changes the unit of the variable. Scaling is also optional. We have three reasons to scale a predictor variable. First, to make the effect sizes better understandable. For example, a population change from one year to the next may be very small and hard to interpret. When we give the change for a 10-year period, its ecological meaning is better understandable. Second, to make the estimate of the effect sizes comparable between variables, we may use \(x.s = x/sd(x)\). The resulting variable has a unit of one standard deviation. A standard deviation may be comparable between variables that originally are measured in different units (meters, seconds etc). A. Gelman and Hill (2007) (p. 55 f) propose to scale the variables by two times the standard deviation (\(x.s = x/(2*sd(x))\)) to make effect sizes comparable between numeric and binary variables. Third, scaling can be important for model convergence, especially when polynomials are included. Also, consider the use of orthogonal polynomials, see Chapter 4.2.9 in Korner-Nievergelt et al. (2015).
-
Collinearity: Look at the correlation among the explanatory variables (pairs plot or correlation matrix). If the explanatory variables are correlated, go back to step 2. Also, Chapter 4.2.7 in Korner-Nievergelt et al. (2015) discusses collinearity.
-
Are interactions and polynomial terms needed in the model? If not already
-done in step 2, think about the relationship between each explanatory variable and the dependent variable. Is it linear or do polynomial terms have to be included in the model? If the relationship cannot be described appropriately by polynomial terms, think of a nonlinear model or a generalized additive model (GAM). May the effect of one explanatory variable depend on the value of
-another explanatory variable (interaction)?
-
-
-
-
3.5 Data Structure
-
After having taken into account all of the (fixed effect) terms from step 4: are the observations independent or grouped/structured? What random factors are needed in the model? Are the data obviously temporally or spatially correlated? Or, are other correlation structures present, such as phylogenetic relationships?
-Our strategy is to start with a rather simple model that may not account for all correlation structures that in fact are present in the data. We first include only those that are known to be important a priori. Only when residual analyses reveal important additional correlation structures do we include them in the model.
-
-
-
3.6 Define Prior Distributions
-
Decide whether we would like to use informative prior distributions or whether we would like to use priors that only have a negligible effect on the results. When the results are later used for informing authorities or for making a decision (as usual in applied sciences), then we would like to base the results on all information available. Information from the literature is then used to construct informative prior distributions. In contrast to applied sciences, in basic research we often would like to show only the information in the data that should not be influenced by earlier results. Therefore, in basic research we look for priors that do not influence the results.
-
-
-
3.7 Fit the Model
-
Fit the model.
-
-
-
3.8 Check Model
-
We assess model fit by graphical analyses of the residuals (Chapter 6 in Korner-Nievergelt et al. (2015)), by predictive model checking (Section 10.1 in Korner-Nievergelt et al. (2015)), or by sensitivity analysis (Chapter 15 in Korner-Nievergelt et al. (2015)).
-
For non-Gaussian models it is often easier to assess model fit using posterior predictive checks (Chapter 10 in Korner-Nievergelt et al. (2015)) rather than residual analyses. Posterior predictive checks usually show clearly in which aspect the model failed so we can go back to step 2 of the analysis. Recognizing in what aspect a model does not fit the data based on residual plots improves with experience. Therefore, we list in Chapter 16 of Korner-Nievergelt et al. (2015) some patterns that can appear in residual plots together with what these patterns possibly indicate. We also indicate what could be done in the specific cases.
-
-
-
3.9 Model Uncertainty
-
If, while working through steps 1 to 8, possibly repeatedly, we came up with one or more models that fit the data reasonably well, we then turn to the methods presented in Chapter 11 (Korner-Nievergelt et al. (2015)) to draw inference from more than one model. If we have only one model, we proceed to 3.10.
-
-
-
3.10 Draw Conclusions
-
Simulate values from the joint posterior distribution of the model parameters (sim or Stan). Use these samples to present parameter uncertainty, to obtain posterior distributions for predictions, probabilities of specific hypotheses, and derived quantities.
25 Capture-mark recapture model with a mixture structure to account for missing sex-variable for parts of the individuals
-
-
25.1 Introduction
-
In some species the identification of the sex is not possible for all individuals without sampling DNA. For example, morphological dimorphism is absent or so weak that some of the individuals cannot be assigned to one of the sexes. Particularly in ornithological long-term capture-recapture data sets, which are typically obtained by voluntary bird ringers who normally do not have the possibility to analyse DNA, the sex identification is often missing for some of the individuals. For estimating survival, it would nevertheless be valuable to include data of all individuals and to use the information on sex-specific effects on survival wherever possible, while accounting for the fact that the sex of some individuals is not known. We here explain how a Cormack-Jolly-Seber model can be integrated with a mixture model in order to allow for a combined analysis of individuals with and without identified sex.
-We gave an introduction to the Cormack-Jolly-Seber model in Chapter 14.5 of the book Korner-Nievergelt et al. (2015). We here expand this model by a mixture structure that allows including individuals with a missing categorical predictor variable, such as sex.
-
-
-
25.2 Data description
-
## simulate data
-# true parameter values
-theta <-0.6# proportion of males
-nocc <-15# number of years in the data set
-b0 <-matrix(NA, ncol=nocc-1, nrow=2)
-b0[1,] <-rbeta((nocc-1), 3, 4) # capture probability of males
-b0[2,] <-rbeta((nocc-1), 2, 4) # capture probability of females
-a0 <-matrix(NA, ncol=2, nrow=2)
-a1 <-matrix(NA, ncol=2, nrow=2)
-a0[1,1]<-qlogis(0.7) # average annual survival for adult males
-a0[1,2]<-qlogis(0.3) # average annual survival for juveniles
-a0[2,1] <-qlogis(0.55) # average annual survival for adult females
-a0[2,2] <- a0[1,2]
-a1[1,1] <-0
-a1[1,2] <--0.5
-a1[2,1] <--0.8
-a1[2,2] <- a1[1,2]
-
-nindi <-1000# number of individuals with identified sex
-nindni <-1500# number of individuals with non-identified sex
-nind <- nindi + nindni # total number of individuals
-y <-matrix(ncol=nocc, nrow=nind)
-z <-matrix(ncol=nocc, nrow=nind)
-first <-sample(1:(nocc-1), nind, replace=TRUE)
-sex <-sample(c(1,2), nind, prob=c(theta, 1-theta), replace=TRUE)
-juvfirst <-sample(c(0,1), nind, prob=c(0.5, 0.5), replace=TRUE)
-juv <-matrix(0, nrow=nind, ncol=nocc)
-for(i in 1:nind) juv[i,first[i]] <- juvfirst[i] # age class at first capture (1 = juvenile)
-
-x <-runif(nocc-1, -2, 2) # a time-dependent covariate
-p <- b0 # recapture probability
-phi <-array(NA, dim=c(2, 2, nocc-1))
-# for ad males
-phi[1,1,] <-plogis(a0[1,1]+a1[1,1]*x)
-# for ad females
-phi[2,1,] <-plogis(a0[2,1]+a1[2,1]*x)
-# for juvs
-phi[1,2,] <- phi[2,2,] <-plogis(a0[2,2]+a1[2,2]*x)
-for(i in 1:nind){
- z[i,first[i]] <-1
- y[i, first[i]] <-1
-for(t in (first[i]+1):nocc){
- z[i, t] <-rbinom(1, size=1, prob=z[i,t-1]*phi[sex[i],juv[i,t-1]+1, t-1])
- y[i, t] <-rbinom(1, size=1, prob=z[i,t]*p[sex[i],t-1])
- }
-}
-y[is.na(y)] <-0
-
The mark-recapture data set consists of capture histories of 2500 individuals over 15 time periods. For each time period \(t\) and individual \(i\) the capture history matrix \(y\) contains \(y_{it}=1\) if the individual \(i\) is captured during time period \(t\), or \(y_{it}=0\) if the individual \(i\) is not captured during time period \(t\). The marking time period varies between individuals from 1 to 14. At the marking time period, the age of the individuals was classified either as juvenile or as adult. Juveniles turn into adults after one time period, thus age is known for all individuals during all time periods after marking. For 1000 individuals of the 2500 individuals, the sex is identified, whereas for 1500 individuals, the sex is unknown. The example data contain one covariate \(x\) that takes on one value for each time period.
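With field data, the occasion of first capture is usually not given but has to be derived from the capture histories. The following sketch shows one way to do this in R. The toy matrix `y_toy` and the names `first_toy` and `last_toy` are hypothetical and not part of the example data set.

```r
# toy capture histories: 3 individuals (rows) x 5 occasions (columns)
y_toy <- matrix(c(0, 1, 0, 1, 0,
                  1, 0, 0, 0, 0,
                  0, 0, 1, 1, 1),
                nrow = 3, byrow = TRUE)

# occasion of first capture (marking) for each individual
first_toy <- apply(y_toy, 1, function(h) min(which(h == 1)))

# occasion of last capture for each individual
last_toy <- apply(y_toy, 1, function(h) max(which(h == 1)))
```

Both vectors assume that every individual was captured at least once, which holds by construction in a mark-recapture data set.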
-
# bundle the data for Stan
-i <-1:nindi
-ni <- (nindi+1):nind
-datax <-list(yi=y[i,], nindi=nindi, sex=sex[i], nocc=nocc,
-yni=y[ni,], nindni=nindni, firsti=first[i], firstni=first[ni],
-juvi=juv[i,]+1, juvni=juv[ni,]+1, year=1:nocc, x=x)
-
-
-
25.3 Model description
-
The observation \(y_{it}\), an indicator of whether individual \(i\) was recaptured during time period \(t\), is modelled as a Bernoulli variable conditional on the latent true state of the individual, \(z_{it}\) (0 = dead or permanently emigrated, 1 = alive and at the study site). The probability \(P(y_{it} = 1)\) is the product of the probability that an alive individual is recaptured, \(p_{it}\), and the state of the bird, \(z_{it}\). Thus, a dead bird cannot be recaptured, whereas for a bird alive during time period \(t\), the recapture probability equals \(p_{it}\):
-\[y_{it} \sim Bernoulli(z_{it}p_{it})\]
-The latent state variable \(z_{it}\) is a Markovian variable with the state at time \(t\) being dependent on the state at time \(t-1\) and the apparent survival probability \(\phi_{it}\):
-\[z_{it} \sim Bernoulli(z_{i,t-1}\phi_{it})\]
-We use the term apparent survival in order to indicate that the parameter \(\phi\) is a product of site fidelity and survival. Thus, individuals that permanently emigrated from the study area cannot be distinguished from dead individuals.
-In both models, the parameters \(\phi\) and \(p\) were modelled as sex-specific. However, for parts of the individuals, sex could not be identified, i.e. sex was missing. Ignoring these missing values would most likely lead to a bias because they were not missing at random. The probability that sex can be identified is increasing with age and most likely differs between sexes. Therefore, we included a mixture model for the sex:
-\[Sex_i \sim Categorical(q_i)\]
-where \(q_i\) is a vector of length 2, containing the probability of being a male and a female, respectively. In this way, the sex of the non-identified individuals was assumed to be male or female with probability \(q[1]\) and \(q[2]=1-q[1]\), respectively. This model corresponds to the finite mixture model introduced by Pledger, Pollock, and Norris (2003) in order to account for unknown classes of birds (heterogeneity). However, in our case, for parts of the individuals the class (sex) was known.
-
In the example model, we constrain apparent survival to be linearly dependent on a covariate \(x\), which takes one value per time period, with different slopes for males, females and juveniles, using the logit link function.
-\[logit(\phi_{it}) = a0_{sex-age-class[it]} + a1_{sex-age-class[it]}x_t\]
-
Annual recapture probability was modelled for each year and age and sex class independently:
-\[p_{it} = b0_{t,sex-age-class[it]}\]
-Uniform prior distributions on the interval from 0 to 1 were used for all probability parameters. Normal prior distributions with a mean of 0 were used for the coefficients of the linear predictor, with a standard deviation of 1.5 for the intercepts \(a0\) and a standard deviation of 3 for the slopes \(a1\), as specified in the Stan code.
-
-
-
25.4 The Stan code
-
The trick for coding the CMR-mixture model in Stan is to formulate the model 3 times:
-1. For the individuals with identified sex
-2. For the males that were not identified
-3. For the females that were not identified
-
Then for the non-identified individuals a mixture model is formulated that assigns a probability of being a female or a male to each individual.
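The mixture is evaluated in the Stan code with `log_mix(theta, l1, l2)`, which returns \(\log(\theta e^{l_1} + (1-\theta) e^{l_2})\). The R function below is a sketch of this calculation, written to illustrate what happens for a single individual. The name `log_mix_r` and the two likelihood values are hypothetical.

```r
# log of a two-component mixture density, computed on the log scale
# theta: probability of component 1 (here: male)
# l1, l2: log-likelihood of the capture history given male / given female
log_mix_r <- function(theta, l1, l2) {
  a <- log(theta) + l1
  b <- log1p(-theta) + l2
  m <- max(a, b)                      # log-sum-exp trick for numerical stability
  m + log(exp(a - m) + exp(b - m))
}

# hypothetical likelihoods of one capture history under each sex
l_male   <- log(0.02)
l_female <- log(0.05)
ll <- log_mix_r(0.6, l_male, l_female)

# probability that this individual is a male, given its capture history
p_male_given_y <- exp(log(0.6) + l_male - ll)
```

Working on the log scale matters because the likelihood of a long capture history is a product of many probabilities and can underflow when computed directly.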
-
data {
-int<lower=2> nocc; // number of capture events
-int<lower=0> nindi; // number of individuals with identified sex
-int<lower=0> nindni; // number of individuals with non-identified sex
-int<lower=0,upper=2> yi[nindi,nocc]; // CH[i,k]: individual i captured at k
-int<lower=0,upper=nocc-1> firsti[nindi]; // year of first capture
-int<lower=0,upper=2> yni[nindni,nocc]; // CH[i,k]: individual i captured at k
-int<lower=0,upper=nocc-1> firstni[nindni]; // year of first capture
-int<lower=1, upper=2> sex[nindi];
-int<lower=1, upper=2> juvi[nindi, nocc];
-int<lower=1, upper=2> juvni[nindni, nocc];
-int<lower=1> year[nocc];
-real x[nocc-1]; // a covariate
-}
-
-transformed data {
-int<lower=0,upper=nocc+1> lasti[nindi]; // last[i]: ind i last capture
-int<lower=0,upper=nocc+1> lastni[nindni]; // last[i]: ind i last capture
- lasti = rep_array(0,nindi);
- lastni = rep_array(0,nindni);
-for (i in 1:nindi) {
-for (k in firsti[i]:nocc) {
-if (yi[i,k] == 1) {
-if (k > lasti[i]) lasti[i] = k;
- }
- }
- }
-for (ii in 1:nindni) {
-for (kk in firstni[ii]:nocc) {
-if (yni[ii,kk] == 1) {
-if (kk > lastni[ii]) lastni[ii] = kk;
- }
- }
- }
-
-}
-
-
-parameters {
-real<lower=0, upper=1> theta[nindni]; // probability of being male for non-identified individuals
-real<lower=0, upper=1> b0[2,nocc-1]; // intercept of p
-real a0[2,2]; // intercept for phi
-real a1[2,2]; // coefficient for phi
-}
-
-transformed parameters {
-real<lower=0,upper=1>p_male[nindni,nocc]; // capture probability
-real<lower=0,upper=1>p_female[nindni,nocc]; // capture probability
-real<lower=0,upper=1>p[nindi,nocc]; // capture probability
-
-real<lower=0,upper=1>phi_male[nindni,nocc-1]; // survival probability
-real<lower=0,upper=1>chi_male[nindni,nocc+1]; // probability that an individual
-// is never recaptured after its
-// last capture
-real<lower=0,upper=1>phi_female[nindni,nocc-1]; // survival probability
-real<lower=0,upper=1>chi_female[nindni,nocc+1]; // probability that an individual
-// is never recaptured after its
-// last capture
-real<lower=0,upper=1>phi[nindi,nocc-1]; // survival probability
-real<lower=0,upper=1>chi[nindi,nocc+1]; // probability that an individual
-// is never recaptured after its
-// last capture
-
- {
-int k;
-int kk;
-for(ii in 1:nindi){
-if (firsti[ii]>1) {
-for (z in 1:(firsti[ii]-1)){
- phi[ii,z] = 1;
- }
- }
-for(tt in firsti[ii]:(nocc-1)) {
-// linear predictor for phi:
- phi[ii,tt] = inv_logit(a0[sex[ii], juvi[ii,tt]] + a1[sex[ii], juvi[ii,tt]]*x[tt]);
-
- }
- }
-
-for(ii in 1:nindni){
-if (firstni[ii]>1) {
-for (z in 1:(firstni[ii]-1)){
- phi_female[ii,z] = 1;
- phi_male[ii,z] = 1;
- }
- }
-for(tt in firstni[ii]:(nocc-1)) {
-// linear predictor for phi:
- phi_male[ii,tt] = inv_logit(a0[1, juvni[ii,tt]] + a1[1, juvni[ii,tt]]*x[tt]);
- phi_female[ii,tt] = inv_logit(a0[2, juvni[ii,tt]]+ a1[2, juvni[ii,tt]]*x[tt]);
-
- }
- }
-
-for(i in 1:nindi) {
-// linear predictor for p for identified individuals
-for(w in 1:firsti[i]){
- p[i,w] = 1;
- }
-for(kkk in (firsti[i]+1):nocc)
- p[i,kkk] = b0[sex[i],year[kkk-1]];
- chi[i,nocc+1] = 1.0;
- k = nocc;
-while (k > firsti[i]) {
- chi[i,k] = (1 - phi[i,k-1]) + phi[i,k-1] * (1 - p[i,k]) * chi[i,k+1];
- k = k - 1;
- }
-if (firsti[i]>1) {
-for (u in 1:(firsti[i]-1)){
- chi[i,u] = 0;
- }
- }
- chi[i,firsti[i]] = (1 - p[i,firsti[i]]) * chi[i,firsti[i]+1];
- }// close definition of transformed parameters for identified individuals
-
-for(i in 1:nindni) {
-// linear predictor for p for non-identified individuals
-for(w in 1:firstni[i]){
- p_male[i,w] = 1;
- p_female[i,w] = 1;
- }
-for(kkkk in (firstni[i]+1):nocc){
- p_male[i,kkkk] = b0[1,year[kkkk-1]];
- p_female[i,kkkk] = b0[2,year[kkkk-1]];
- }
- chi_male[i,nocc+1] = 1.0;
- chi_female[i,nocc+1] = 1.0;
- k = nocc;
-while (k > firstni[i]) {
- chi_male[i,k] = (1 - phi_male[i,k-1]) + phi_male[i,k-1] * (1 - p_male[i,k]) * chi_male[i,k+1];
- chi_female[i,k] = (1 - phi_female[i,k-1]) + phi_female[i,k-1] * (1 - p_female[i,k]) * chi_female[i,k+1];
- k = k - 1;
- }
-if (firstni[i]>1) {
-for (u in 1:(firstni[i]-1)){
- chi_male[i,u] = 0;
- chi_female[i,u] = 0;
- }
- }
- chi_male[i,firstni[i]] = (1 - p_male[i,firstni[i]]) * chi_male[i,firstni[i]+1];
- chi_female[i,firstni[i]] = (1 - p_female[i,firstni[i]]) * chi_female[i,firstni[i]+1];
- } // close definition of transformed parameters for non-identified individuals
-
-
- } // close block of transformed parameters exclusive parameter declarations
-} // close transformed parameters
-
-model {
-// priors
- theta ~ beta(1, 1);
-for (g in 1:(nocc-1)){
- b0[1,g]~beta(1,1);
- b0[2,g]~beta(1,1);
- }
- a0[1,1]~normal(0,1.5);
- a0[1,2]~normal(0,1.5);
- a1[1,1]~normal(0,3);
- a1[1,2]~normal(0,3);
-
- a0[2,1]~normal(0,1.5);
- a0[2,2]~normal(a0[1,2],0.01); // for juveniles, we assume that the intercept and the effect of the covariate are independent of sex
- a1[2,1]~normal(0,3);
- a1[2,2]~normal(a1[1,2],0.01);
-
-// likelihood for identified individuals
-for (i in 1:nindi) {
-if (lasti[i]>0) {
-for (k in firsti[i]:lasti[i]) {
-if(k>1) target+= (log(phi[i, k-1]));
-if (yi[i,k] == 1) target+=(log(p[i,k]));
-else target+=(log1m(p[i,k]));
- }
- }
-target+=(log(chi[i,lasti[i]+1]));
- }
-
-// likelihood for non-identified individuals
-for (i in 1:nindni) {
-real log_like_male = 0;
-real log_like_female = 0;
-
-if (lastni[i]>0) {
-for (k in firstni[i]:lastni[i]) {
-if(k>1){
- log_like_male += (log(phi_male[i, k-1]));
- log_like_female += (log(phi_female[i, k-1]));
- }
-if (yni[i,k] == 1){
- log_like_male+=(log(p_male[i,k]));
- log_like_female+=(log(p_female[i,k]));
- }
-else{
- log_like_male+=(log1m(p_male[i,k]));
- log_like_female+=(log1m(p_female[i,k]));
- }
-
- }
- }
- log_like_male += (log(chi_male[i,lastni[i]+1]));
- log_like_female += (log(chi_female[i,lastni[i]+1]));
-
-target += log_mix(theta[i], log_like_male, log_like_female);
- }
-
-}
-
-
-
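The quantity `chi` in the Stan code is the probability that an individual is never recaptured after a given occasion. It is obtained by a backward recursion: after the final occasion the probability is 1, and at each earlier step the individual either dies, or survives, is missed, and is never seen again. The R function below is a sketch of that recursion for one individual. The name `chi_r` is hypothetical, and the indexing is simplified compared with the Stan code, which stores the same quantity shifted by one position (`chi[i,k+1]` there corresponds to `chi[k]` here).

```r
# chi[k]: probability of never being recaptured after occasion k,
#         given that the individual is alive at occasion k
# phi: survival probabilities for the intervals 1..(nocc-1)
# p:   recapture probabilities for the occasions 1..nocc (p[1] is not used)
chi_r <- function(phi, p) {
  nocc <- length(p)
  stopifnot(nocc >= 2, length(phi) == nocc - 1)
  chi <- numeric(nocc)
  chi[nocc] <- 1                       # nothing can be observed after the last occasion
  for (k in (nocc - 1):1) {
    # dies during interval k, or survives, is missed at k+1, and is never seen again
    chi[k] <- (1 - phi[k]) + phi[k] * (1 - p[k + 1]) * chi[k + 1]
  }
  chi
}

chi_toy <- chi_r(phi = c(0.5, 0.5), p = c(NA, 0.5, 0.5))
```

In the likelihood, this term is evaluated at the last occasion on which the individual was captured, so that the unobserved fate after that occasion is integrated out.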
25.5 Call Stan from R, check convergence and look at results
-
# Run STAN
-library(rstan)
-fit <-stan(file ="stanmodels/cmr_mixture_model.stan", data=datax, verbose =FALSE)
-# for above simulated data (2500 individuals x 15 time periods)
-# computing time is around 48 hours on an Intel Core i7 laptop
-# for larger data sets, we recommend moving the transformed parameters block
-# to the model block in order to avoid monitoring of p_male, p_female,
-# phi_male and phi_female producing memory problems
-
-# launch_shinystan(fit) # diagnostic plots
-summary(fit)
25 Capture-mark recapture model with a mixture structure to account for missing sex-variable for parts of the individuals
-
-
25.1 Introduction
+
+
25.1 Introduction
In some species the identification of the sex is not possible for all individuals without sampling DNA. For example, morphological dimorphism is absent or so weak that some individuals cannot be assigned to one of the sexes. Particularly in ornithological long-term capture-recapture data sets, which are typically collected by volunteer bird ringers who normally do not have the means to analyse DNA, the sex identification is often missing for part of the individuals. For estimating survival, it would nevertheless be valuable to include the data of all individuals, using the information on sex-specific effects on survival wherever possible while accounting for the fact that the sex of some individuals is not known. We here explain how a Cormack-Jolly-Seber model can be combined with a mixture model in order to allow for a joint analysis of individuals with and without identified sex.
We gave an introduction to the Cormack-Jolly-Seber model in Chapter 14.5 of Korner-Nievergelt et al. (2015). Here we extend this model with a mixture structure that allows including individuals with a missing categorical predictor variable, such as sex.
25.2 Data description
-
## simulate data
-# true parameter values
-theta <-0.6# proportion of males
-nocc <-15# number of years in the data set
-b0 <-matrix(NA, ncol=nocc-1, nrow=2)
-b0[1,] <-rbeta((nocc-1), 3, 4) # capture probability of males
-b0[2,] <-rbeta((nocc-1), 2, 4) # capture probability of females
-a0 <-matrix(NA, ncol=2, nrow=2)
-a1 <-matrix(NA, ncol=2, nrow=2)
-a0[1,1]<-qlogis(0.7) # average annual survival for adult males
-a0[1,2]<-qlogis(0.3) # average annual survival for juveniles
-a0[2,1] <-qlogis(0.55) # average annual survival for adult females
-a0[2,2] <- a0[1,2]
-a1[1,1] <-0
-a1[1,2] <--0.5
-a1[2,1] <--0.8
-a1[2,2] <- a1[1,2]
-
-nindi <-1000# number of individuals with identified sex
-nindni <-1500# number of individuals with non-identified sex
-nind <- nindi + nindni # total number of individuals
-y <-matrix(ncol=nocc, nrow=nind)
-z <-matrix(ncol=nocc, nrow=nind)
-first <-sample(1:(nocc-1), nind, replace=TRUE)
-sex <-sample(c(1,2), nind, prob=c(theta, 1-theta), replace=TRUE)
-juvfirst <-sample(c(0,1), nind, prob=c(0.5, 0.5), replace=TRUE)
-juv <-matrix(0, nrow=nind, ncol=nocc)
-for(i in 1:nind) juv[i,first[i]] <- juvfirst[i] # age class at first capture (1 = juvenile)
-
-x <-runif(nocc-1, -2, 2) # a time-dependent covariate
-p <- b0 # recapture probability
-phi <-array(NA, dim=c(2, 2, nocc-1))
-# for ad males
-phi[1,1,] <-plogis(a0[1,1]+a1[1,1]*x)
-# for ad females
-phi[2,1,] <-plogis(a0[2,1]+a1[2,1]*x)
-# for juvs
-phi[1,2,] <- phi[2,2,] <-plogis(a0[2,2]+a1[2,2]*x)
-for(i in 1:nind){
- z[i,first[i]] <-1
- y[i, first[i]] <-1
-for(t in (first[i]+1):nocc){
- z[i, t] <-rbinom(1, size=1, prob=z[i,t-1]*phi[sex[i],juv[i,t-1]+1, t-1])
- y[i, t] <-rbinom(1, size=1, prob=z[i,t]*p[sex[i],t-1])
- }
-}
-y[is.na(y)] <-0
+
## simulate data
+# true parameter values
+theta <-0.6# proportion of males
+nocc <-15# number of years in the data set
+b0 <-matrix(NA, ncol=nocc-1, nrow=2)
+b0[1,] <-rbeta((nocc-1), 3, 4) # capture probability of males
+b0[2,] <-rbeta((nocc-1), 2, 4) # capture probability of females
+a0 <-matrix(NA, ncol=2, nrow=2)
+a1 <-matrix(NA, ncol=2, nrow=2)
+a0[1,1]<-qlogis(0.7) # average annual survival for adult males
+a0[1,2]<-qlogis(0.3) # average annual survival for juveniles
+a0[2,1] <-qlogis(0.55) # average annual survival for adult females
+a0[2,2] <- a0[1,2]
+a1[1,1] <-0
+a1[1,2] <--0.5
+a1[2,1] <--0.8
+a1[2,2] <- a1[1,2]
+
+nindi <-1000# number of individuals with identified sex
+nindni <-1500# number of individuals with non-identified sex
+nind <- nindi + nindni # total number of individuals
+y <-matrix(ncol=nocc, nrow=nind)
+z <-matrix(ncol=nocc, nrow=nind)
+first <-sample(1:(nocc-1), nind, replace=TRUE)
+sex <-sample(c(1,2), nind, prob=c(theta, 1-theta), replace=TRUE)
+juvfirst <-sample(c(0,1), nind, prob=c(0.5, 0.5), replace=TRUE)
+juv <-matrix(0, nrow=nind, ncol=nocc)
+for(i in 1:nind) juv[i,first[i]] <- juvfirst[i] # age class at first capture (1 = juvenile)
+
+x <-runif(nocc-1, -2, 2) # a time-dependent covariate
+p <- b0 # recapture probability
+phi <-array(NA, dim=c(2, 2, nocc-1))
+# for ad males
+phi[1,1,] <-plogis(a0[1,1]+a1[1,1]*x)
+# for ad females
+phi[2,1,] <-plogis(a0[2,1]+a1[2,1]*x)
+# for juvs
+phi[1,2,] <- phi[2,2,] <-plogis(a0[2,2]+a1[2,2]*x)
+for(i in 1:nind){
+ z[i,first[i]] <-1
+ y[i, first[i]] <-1
+for(t in (first[i]+1):nocc){
+ z[i, t] <-rbinom(1, size=1, prob=z[i,t-1]*phi[sex[i],juv[i,t-1]+1, t-1])
+ y[i, t] <-rbinom(1, size=1, prob=z[i,t]*p[sex[i],t-1])
+ }
+}
+y[is.na(y)] <-0
The mark-recapture data set consists of capture histories of 2500 individuals over 15 time periods. For each time period \(t\) and individual \(i\) the capture history matrix \(y\) contains \(y_{it}=1\) if the individual \(i\) is captured during time period \(t\), or \(y_{it}=0\) if the individual \(i\) is not captured during time period \(t\). The marking time period varies between individuals from 1 to 14. At the marking time period, the age of the individuals was classified either as juvenile or as adult. Juveniles turn into adults after one time period, thus age is known for all individuals during all time periods after marking. For 1000 individuals of the 2500 individuals, the sex is identified, whereas for 1500 individuals, the sex is unknown. The example data contain one covariate \(x\) that takes on one value for each time period.
-
# bundle the data for Stan
-i <-1:nindi
-ni <- (nindi+1):nind
-datax <-list(yi=y[i,], nindi=nindi, sex=sex[i], nocc=nocc,
-yni=y[ni,], nindni=nindni, firsti=first[i], firstni=first[ni],
-juvi=juv[i,]+1, juvni=juv[ni,]+1, year=1:nocc, x=x)
+
# bundle the data for Stan
+i <-1:nindi
+ni <- (nindi+1):nind
+datax <-list(yi=y[i,], nindi=nindi, sex=sex[i], nocc=nocc,
+yni=y[ni,], nindni=nindni, firsti=first[i], firstni=first[ni],
+juvi=juv[i,]+1, juvni=juv[ni,]+1, year=1:nocc, x=x)
25.3 Model description
@@ -520,225 +516,225 @@
25.4 The Stan code
data {
-int<lower=2> nocc; // number of capture events
-int<lower=0> nindi; // number of individuals with identified sex
-int<lower=0> nindni; // number of individuals with non-identified sex
-int<lower=0,upper=2> yi[nindi,nocc]; // CH[i,k]: individual i captured at k
-int<lower=0,upper=nocc-1> firsti[nindi]; // year of first capture
-int<lower=0,upper=2> yni[nindni,nocc]; // CH[i,k]: individual i captured at k
-int<lower=0,upper=nocc-1> firstni[nindni]; // year of first capture
-int<lower=1, upper=2> sex[nindi];
-int<lower=1, upper=2> juvi[nindi, nocc];
-int<lower=1, upper=2> juvni[nindni, nocc];
-int<lower=1> year[nocc];
-real x[nocc-1]; // a covariate
-}
-
-transformed data {
-int<lower=0,upper=nocc+1> lasti[nindi]; // last[i]: ind i last capture
-int<lower=0,upper=nocc+1> lastni[nindni]; // last[i]: ind i last capture
- lasti = rep_array(0,nindi);
- lastni = rep_array(0,nindni);
-for (i in 1:nindi) {
-for (k in firsti[i]:nocc) {
-if (yi[i,k] == 1) {
-if (k > lasti[i]) lasti[i] = k;
- }
- }
- }
-for (ii in 1:nindni) {
-for (kk in firstni[ii]:nocc) {
-if (yni[ii,kk] == 1) {
-if (kk > lastni[ii]) lastni[ii] = kk;
- }
- }
- }
-
-}
-
-
-parameters {
-real<lower=0, upper=1> theta[nindni]; // probability of being male for non-identified individuals
-real<lower=0, upper=1> b0[2,nocc-1]; // intercept of p
-real a0[2,2]; // intercept for phi
-real a1[2,2]; // coefficient for phi
-}
-
-transformed parameters {
-real<lower=0,upper=1>p_male[nindni,nocc]; // capture probability
-real<lower=0,upper=1>p_female[nindni,nocc]; // capture probability
-real<lower=0,upper=1>p[nindi,nocc]; // capture probability
-
-real<lower=0,upper=1>phi_male[nindni,nocc-1]; // survival probability
-real<lower=0,upper=1>chi_male[nindni,nocc+1]; // probability that an individual
-// is never recaptured after its
-// last capture
-real<lower=0,upper=1>phi_female[nindni,nocc-1]; // survival probability
-real<lower=0,upper=1>chi_female[nindni,nocc+1]; // probability that an individual
-// is never recaptured after its
-// last capture
-real<lower=0,upper=1>phi[nindi,nocc-1]; // survival probability
-real<lower=0,upper=1>chi[nindi,nocc+1]; // probability that an individual
-// is never recaptured after its
-// last capture
-
- {
-int k;
-int kk;
-for(ii in 1:nindi){
-if (firsti[ii]>1) {
-for (z in 1:(firsti[ii]-1)){
- phi[ii,z] = 1;
- }
- }
-for(tt in firsti[ii]:(nocc-1)) {
-// linear predictor for phi:
- phi[ii,tt] = inv_logit(a0[sex[ii], juvi[ii,tt]] + a1[sex[ii], juvi[ii,tt]]*x[tt]);
-
- }
- }
-
-for(ii in 1:nindni){
-if (firstni[ii]>1) {
-for (z in 1:(firstni[ii]-1)){
- phi_female[ii,z] = 1;
- phi_male[ii,z] = 1;
- }
- }
-for(tt in firstni[ii]:(nocc-1)) {
-// linear predictor for phi:
- phi_male[ii,tt] = inv_logit(a0[1, juvni[ii,tt]] + a1[1, juvni[ii,tt]]*x[tt]);
- phi_female[ii,tt] = inv_logit(a0[2, juvni[ii,tt]]+ a1[2, juvni[ii,tt]]*x[tt]);
-
- }
- }
-
-for(i in 1:nindi) {
-// linear predictor for p for identified individuals
-for(w in 1:firsti[i]){
- p[i,w] = 1;
- }
-for(kkk in (firsti[i]+1):nocc)
- p[i,kkk] = b0[sex[i],year[kkk-1]];
- chi[i,nocc+1] = 1.0;
- k = nocc;
-while (k > firsti[i]) {
- chi[i,k] = (1 - phi[i,k-1]) + phi[i,k-1] * (1 - p[i,k]) * chi[i,k+1];
- k = k - 1;
- }
-if (firsti[i]>1) {
-for (u in 1:(firsti[i]-1)){
- chi[i,u] = 0;
- }
- }
- chi[i,firsti[i]] = (1 - p[i,firsti[i]]) * chi[i,firsti[i]+1];
- }// close definition of transformed parameters for identified individuals
-
-for(i in 1:nindni) {
-// linear predictor for p for non-identified individuals
-for(w in 1:firstni[i]){
- p_male[i,w] = 1;
- p_female[i,w] = 1;
- }
-for(kkkk in (firstni[i]+1):nocc){
- p_male[i,kkkk] = b0[1,year[kkkk-1]];
- p_female[i,kkkk] = b0[2,year[kkkk-1]];
- }
- chi_male[i,nocc+1] = 1.0;
- chi_female[i,nocc+1] = 1.0;
- k = nocc;
-while (k > firstni[i]) {
- chi_male[i,k] = (1 - phi_male[i,k-1]) + phi_male[i,k-1] * (1 - p_male[i,k]) * chi_male[i,k+1];
- chi_female[i,k] = (1 - phi_female[i,k-1]) + phi_female[i,k-1] * (1 - p_female[i,k]) * chi_female[i,k+1];
- k = k - 1;
- }
-if (firstni[i]>1) {
-for (u in 1:(firstni[i]-1)){
- chi_male[i,u] = 0;
- chi_female[i,u] = 0;
- }
- }
- chi_male[i,firstni[i]] = (1 - p_male[i,firstni[i]]) * chi_male[i,firstni[i]+1];
- chi_female[i,firstni[i]] = (1 - p_female[i,firstni[i]]) * chi_female[i,firstni[i]+1];
- } // close definition of transformed parameters for non-identified individuals
-
-
- } // close block of transformed parameters exclusive parameter declarations
-} // close transformed parameters
-
-model {
-// priors
- theta ~ beta(1, 1);
-for (g in 1:(nocc-1)){
- b0[1,g]~beta(1,1);
- b0[2,g]~beta(1,1);
- }
- a0[1,1]~normal(0,1.5);
- a0[1,2]~normal(0,1.5);
- a1[1,1]~normal(0,3);
- a1[1,2]~normal(0,3);
-
- a0[2,1]~normal(0,1.5);
- a0[2,2]~normal(a0[1,2],0.01); // for juveniles, we assume that the intercept and the effect of the covariate are independent of sex
- a1[2,1]~normal(0,3);
- a1[2,2]~normal(a1[1,2],0.01);
-
-// likelihood for identified individuals
-for (i in 1:nindi) {
-if (lasti[i]>0) {
-for (k in firsti[i]:lasti[i]) {
-if(k>1) target+= (log(phi[i, k-1]));
-if (yi[i,k] == 1) target+=(log(p[i,k]));
-else target+=(log1m(p[i,k]));
- }
- }
-target+=(log(chi[i,lasti[i]+1]));
- }
-
-// likelihood for non-identified individuals
-for (i in 1:nindni) {
-real log_like_male = 0;
-real log_like_female = 0;
-
-if (lastni[i]>0) {
-for (k in firstni[i]:lastni[i]) {
-if(k>1){
- log_like_male += (log(phi_male[i, k-1]));
- log_like_female += (log(phi_female[i, k-1]));
- }
-if (yni[i,k] == 1){
- log_like_male+=(log(p_male[i,k]));
- log_like_female+=(log(p_female[i,k]));
- }
-else{
- log_like_male+=(log1m(p_male[i,k]));
- log_like_female+=(log1m(p_female[i,k]));
- }
-
- }
- }
- log_like_male += (log(chi_male[i,lastni[i]+1]));
- log_like_female += (log(chi_female[i,lastni[i]+1]));
-
-target += log_mix(theta[i], log_like_male, log_like_female);
- }
-
-}
+
data {
+int<lower=2> nocc; // number of capture events
+int<lower=0> nindi; // number of individuals with identified sex
+int<lower=0> nindni; // number of individuals with non-identified sex
+int<lower=0,upper=2> yi[nindi,nocc]; // CH[i,k]: individual i captured at k
+int<lower=0,upper=nocc-1> firsti[nindi]; // year of first capture
+int<lower=0,upper=2> yni[nindni,nocc]; // CH[i,k]: individual i captured at k
+int<lower=0,upper=nocc-1> firstni[nindni]; // year of first capture
+int<lower=1, upper=2> sex[nindi];
+int<lower=1, upper=2> juvi[nindi, nocc];
+int<lower=1, upper=2> juvni[nindni, nocc];
+int<lower=1> year[nocc];
+real x[nocc-1]; // a covariate
+}
+
+transformed data {
+int<lower=0,upper=nocc+1> lasti[nindi]; // last[i]: ind i last capture
+int<lower=0,upper=nocc+1> lastni[nindni]; // last[i]: ind i last capture
+ lasti = rep_array(0,nindi);
+ lastni = rep_array(0,nindni);
+for (i in 1:nindi) {
+for (k in firsti[i]:nocc) {
+if (yi[i,k] == 1) {
+if (k > lasti[i]) lasti[i] = k;
+ }
+ }
+ }
+for (ii in 1:nindni) {
+for (kk in firstni[ii]:nocc) {
+if (yni[ii,kk] == 1) {
+if (kk > lastni[ii]) lastni[ii] = kk;
+ }
+ }
+ }
+
+}
+
+
+parameters {
+real<lower=0, upper=1> theta[nindni]; // probability of being male for non-identified individuals
+real<lower=0, upper=1> b0[2,nocc-1]; // intercept of p
+real a0[2,2]; // intercept for phi
+real a1[2,2]; // coefficient for phi
+}
+
+transformed parameters {
+real<lower=0,upper=1>p_male[nindni,nocc]; // capture probability
+real<lower=0,upper=1>p_female[nindni,nocc]; // capture probability
+real<lower=0,upper=1>p[nindi,nocc]; // capture probability
+
+real<lower=0,upper=1>phi_male[nindni,nocc-1]; // survival probability
+real<lower=0,upper=1>chi_male[nindni,nocc+1]; // probability that an individual
+// is never recaptured after its
+// last capture
+real<lower=0,upper=1>phi_female[nindni,nocc-1]; // survival probability
+real<lower=0,upper=1>chi_female[nindni,nocc+1]; // probability that an individual
+// is never recaptured after its
+// last capture
+real<lower=0,upper=1>phi[nindi,nocc-1]; // survival probability
+real<lower=0,upper=1>chi[nindi,nocc+1]; // probability that an individual
+// is never recaptured after its
+// last capture
+
+ {
+int k;
+int kk;
+for(ii in 1:nindi){
+if (firsti[ii]>1) {
+for (z in 1:(firsti[ii]-1)){
+ phi[ii,z] = 1;
+ }
+ }
+for(tt in firsti[ii]:(nocc-1)) {
+// linear predictor for phi:
+ phi[ii,tt] = inv_logit(a0[sex[ii], juvi[ii,tt]] + a1[sex[ii], juvi[ii,tt]]*x[tt]);
+
+ }
+ }
+
+for(ii in 1:nindni){
+if (firstni[ii]>1) {
+for (z in 1:(firstni[ii]-1)){
+ phi_female[ii,z] = 1;
+ phi_male[ii,z] = 1;
+ }
+ }
+for(tt in firstni[ii]:(nocc-1)) {
+// linear predictor for phi:
+ phi_male[ii,tt] = inv_logit(a0[1, juvni[ii,tt]] + a1[1, juvni[ii,tt]]*x[tt]);
+ phi_female[ii,tt] = inv_logit(a0[2, juvni[ii,tt]]+ a1[2, juvni[ii,tt]]*x[tt]);
+
+ }
+ }
+
+for(i in 1:nindi) {
+// linear predictor for p for identified individuals
+for(w in 1:firsti[i]){
+ p[i,w] = 1;
+ }
+for(kkk in (firsti[i]+1):nocc)
+ p[i,kkk] = b0[sex[i],year[kkk-1]];
+ chi[i,nocc+1] = 1.0;
+ k = nocc;
+while (k > firsti[i]) {
+ chi[i,k] = (1 - phi[i,k-1]) + phi[i,k-1] * (1 - p[i,k]) * chi[i,k+1];
+ k = k - 1;
+ }
+if (firsti[i]>1) {
+for (u in 1:(firsti[i]-1)){
+ chi[i,u] = 0;
+ }
+ }
+ chi[i,firsti[i]] = (1 - p[i,firsti[i]]) * chi[i,firsti[i]+1];
+ }// close definition of transformed parameters for identified individuals
+
+for(i in 1:nindni) {
+// linear predictor for p for non-identified individuals
+for(w in 1:firstni[i]){
+ p_male[i,w] = 1;
+ p_female[i,w] = 1;
+ }
+for(kkkk in (firstni[i]+1):nocc){
+ p_male[i,kkkk] = b0[1,year[kkkk-1]];
+ p_female[i,kkkk] = b0[2,year[kkkk-1]];
+ }
+ chi_male[i,nocc+1] = 1.0;
+ chi_female[i,nocc+1] = 1.0;
+ k = nocc;
+while (k > firstni[i]) {
+ chi_male[i,k] = (1 - phi_male[i,k-1]) + phi_male[i,k-1] * (1 - p_male[i,k]) * chi_male[i,k+1];
+ chi_female[i,k] = (1 - phi_female[i,k-1]) + phi_female[i,k-1] * (1 - p_female[i,k]) * chi_female[i,k+1];
+ k = k - 1;
+ }
+if (firstni[i]>1) {
+for (u in 1:(firstni[i]-1)){
+ chi_male[i,u] = 0;
+ chi_female[i,u] = 0;
+ }
+ }
+ chi_male[i,firstni[i]] = (1 - p_male[i,firstni[i]]) * chi_male[i,firstni[i]+1];
+ chi_female[i,firstni[i]] = (1 - p_female[i,firstni[i]]) * chi_female[i,firstni[i]+1];
+ } // close definition of transformed parameters for non-identified individuals
+
+
+ } // close block of transformed parameters exclusive parameter declarations
+} // close transformed parameters
+
+model {
+// priors
+ theta ~ beta(1, 1);
+for (g in 1:(nocc-1)){
+ b0[1,g]~beta(1,1);
+ b0[2,g]~beta(1,1);
+ }
+ a0[1,1]~normal(0,1.5);
+ a0[1,2]~normal(0,1.5);
+ a1[1,1]~normal(0,3);
+ a1[1,2]~normal(0,3);
+
+ a0[2,1]~normal(0,1.5);
+ a0[2,2]~normal(a0[1,2],0.01); // for juveniles, we assume that the intercept and the effect of the covariate are independent of sex
+ a1[2,1]~normal(0,3);
+ a1[2,2]~normal(a1[1,2],0.01);
+
+// likelihood for identified individuals
+for (i in 1:nindi) {
+if (lasti[i]>0) {
+for (k in firsti[i]:lasti[i]) {
+if(k>1) target+= (log(phi[i, k-1]));
+if (yi[i,k] == 1) target+=(log(p[i,k]));
+else target+=(log1m(p[i,k]));
+ }
+ }
+target+=(log(chi[i,lasti[i]+1]));
+ }
+
+// likelihood for non-identified individuals
+for (i in 1:nindni) {
+real log_like_male = 0;
+real log_like_female = 0;
+
+if (lastni[i]>0) {
+for (k in firstni[i]:lastni[i]) {
+if(k>1){
+ log_like_male += (log(phi_male[i, k-1]));
+ log_like_female += (log(phi_female[i, k-1]));
+ }
+if (yni[i,k] == 1){
+ log_like_male+=(log(p_male[i,k]));
+ log_like_female+=(log(p_female[i,k]));
+ }
+else{
+ log_like_male+=(log1m(p_male[i,k]));
+ log_like_female+=(log1m(p_female[i,k]));
+ }
+
+ }
+ }
+ log_like_male += (log(chi_male[i,lastni[i]+1]));
+ log_like_female += (log(chi_female[i,lastni[i]+1]));
+
+target += log_mix(theta[i], log_like_male, log_like_female);
+ }
+
+}
25.5 Call Stan from R, check convergence and look at results
-
# Run STAN
-library(rstan)
-fit <-stan(file ="stanmodels/cmr_mixture_model.stan", data=datax, verbose =FALSE)
-# for above simulated data (2500 individuals x 15 time periods)
-# computing time is around 48 hours on an Intel Core i7 laptop
-# for larger data sets, we recommend moving the transformed parameters block
-# to the model block in order to avoid monitoring of p_male, p_female,
-# phi_male and phi_female producing memory problems
-
-# launch_shinystan(fit) # diagnostic plots
-summary(fit)
+
# Run STAN
+library(rstan)
+fit <-stan(file ="stanmodels/cmr_mixture_model.stan", data=datax, verbose =FALSE)
+# for above simulated data (2500 individuals x 15 time periods)
+# computing time is around 48 hours on an Intel Core i7 laptop
+# for larger data sets, we recommend moving the transformed parameters block
+# to the model block in order to avoid monitoring of p_male, p_female,
+# phi_male and phi_female producing memory problems
+
+# launch_shinystan(fit) # diagnostic plots
+summary(fit)
The following Stan model code is saved as daily_nest_survival.stan.
-
data {
-int<lower=0> Nnests; // number of nests
-int<lower=0> last[Nnests]; // day of last observation (alive or dead)
-int<lower=0> first[Nnests]; // day of first observation (alive or dead)
-int<lower=0> maxage; // maximum of last
-int<lower=0> y[Nnests, maxage]; // indicator of alive nests
-real cover[Nnests]; // a covariate of the nest
-real age[maxage]; // a covariate of the date
-}
-
-parameters {
-vector[3] b; // coef of linear pred for S
-}
-
-model {
-real S[Nnests, maxage-1]; // survival probability
-
-for(i in1:Nnests){
-for(t in first[i]:(last[i]-1)){
- S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]);
- }
- }
-
-// priors
- b[1]~normal(0,5);
- b[2]~normal(0,3);
- b[3]~normal(0,3);
-
-// likelihood
-for (i in1:Nnests) {
-for(t in (first[i]+1):last[i]){
- y[i,t]~bernoulli(y[i,t-1]*S[i,t-1]);
- }
- }
-}
+
data {
+  int<lower=0> Nnests;              // number of nests
+  int<lower=0> last[Nnests];        // day of last observation (alive or dead)
+  int<lower=0> first[Nnests];       // day of first observation (alive or dead)
+  int<lower=0> maxage;              // maximum of last
+  int<lower=0> y[Nnests, maxage];   // indicator of alive nests
+  real cover[Nnests];               // a covariate of the nest
+  real age[maxage];                 // a covariate of the date
+}
+
+parameters {
+  vector[3] b;                      // coef of linear pred for S
+}
+
+model {
+  real S[Nnests, maxage-1];         // survival probability
+
+  for (i in 1:Nnests) {
+    for (t in first[i]:(last[i]-1)) {
+      S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]);
+    }
+  }
+
+  // priors
+  b[1] ~ normal(0, 5);
+  b[2] ~ normal(0, 3);
+  b[3] ~ normal(0, 3);
+
+  // likelihood
+  for (i in 1:Nnests) {
+    for (t in (first[i]+1):last[i]) {
+      y[i,t] ~ bernoulli(y[i,t-1]*S[i,t-1]);
+    }
+  }
+}
It looks like cover does not affect daily nest survival, but daily nest survival decreases with the age of the nestlings.
-
#launch_shinystan(mod)
-print(mod)
+
#launch_shinystan(mod)
+print(mod)
## Inference for Stan model: anon_model.
## 5 chains, each with iter=2500; warmup=1250; thin=1;
## post-warmup draws per chain=1250, total post-warmup draws=6250.
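To see what coefficients on the logit scale imply on the probability scale, the linear predictor can be back-transformed with the inverse-logit function. The following is a minimal sketch and not part of the original analysis: the values in `b_hyp`, `cover_hyp` and `age_hyp` are made up for illustration and are not the estimates from the fitted model, and the age covariate is assumed to be on its raw scale.

```r
# Hypothetical coefficients (NOT the fitted estimates): intercept, cover, age
b_hyp <- c(4, 0, -0.05)
cover_hyp <- 0        # hypothetical cover value
age_hyp <- 1:20       # hypothetical nest ages in days

# daily survival probability, same linear predictor as in the Stan model
S_daily <- plogis(b_hyp[1] + b_hyp[2] * cover_hyp + b_hyp[3] * age_hyp)

# probability of surviving from the first day up to each age
S_cum <- cumprod(S_daily)

round(cbind(age = age_hyp, S_daily, S_cum), 3)
```

With a negative age coefficient, the daily survival probability declines with age, and the cumulative product shows how this adds up over the nesting period.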
@@ -586,23 +582,23 @@
When nests are checked only irregularly, a nest may be found depredated or dead after a longer gap between checks. In such cases, we know only that the nest failed, through predation or another cause, at some time between the last check at which it was still alive and the check at which it was found dead. We therefore need to tell the model that the nest could have failed at any time during the interval in which it was not checked.
To do so, we create a variable that indicates the time (e.g., day since first egg) when the nest was last seen alive (lastlive). A second variable, lastcheck, indicates the time of the last check, which either equals lastlive when the nest survived until the last check or is larger than lastlive when a nest failure has been recorded. A third variable, gap, measures the length of the time interval in which the nest failure occurred. A gap of zero means that the nest was still alive at the last check, a gap of 1 means that the nest failure occurred during the first day after lastlive, and a gap of 2 means that the nest failure occurred on either the first or the second day after lastlive.
-
# time when nest was last observed alive
-lastlive <-apply(datax$y, 1, function(x) max(c(1:length(x))[x==1]))
-
-# time when nest was last checked (alive or dead)
-lastcheck <- lastlive+1
-# here, we turn the above data into a format that can be used for
-# irregular nest controls. WOULD BE NICE TO HAVE A REAL DATA EXAMPLE!
-
-# when nest was observed alive at the last check, then lastcheck equals lastlive
-lastcheck[lastlive==datax$last] <- datax$last[lastlive==datax$last]
-
-datax1 <-list(Nnests=datax$Nnests,
-lastlive = lastlive,
-lastcheck= lastcheck,
-first=datax$first,
-cover=datax$cover,
-age=datax$age,
-maxage=datax$maxage)
-# time between last seen alive and first seen dead (= lastcheck)
-datax1$gap <- datax1$lastcheck-datax1$lastlive
+
# time when nest was last observed alive
+lastlive <- apply(datax$y, 1, function(x) max(c(1:length(x))[x==1]))
+
+# time when nest was last checked (alive or dead)
+lastcheck <- lastlive + 1
+# here, we turn the above data into a format that can be used for
+# irregular nest controls. WOULD BE NICE TO HAVE A REAL DATA EXAMPLE!
+
+# when nest was observed alive at the last check, then lastcheck equals lastlive
+lastcheck[lastlive == datax$last] <- datax$last[lastlive == datax$last]
+
+datax1 <- list(Nnests = datax$Nnests,
+               lastlive = lastlive,
+               lastcheck = lastcheck,
+               first = datax$first,
+               cover = datax$cover,
+               age = datax$age,
+               maxage = datax$maxage)
+# time between last seen alive and first seen dead (= lastcheck)
+datax1$gap <- datax1$lastcheck - datax1$lastlive
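The meaning of gap may be easier to see with a tiny example. The values below are invented for three hypothetical nests and are not taken from the data used here; the object names ending in `_toy` are likewise only illustrative.

```r
# hypothetical check data for three nests (days since first egg)
lastlive_toy  <- c(5, 8, 3)    # day the nest was last seen alive
lastcheck_toy <- c(5, 10, 4)   # day of the last check

# length of the interval in which a failure could have occurred
gap_toy <- lastcheck_toy - lastlive_toy
gap_toy   # 0 2 1
```

The first nest was still alive at its last check (gap 0), the second failed on day 9 or day 10 (gap 2), and the third failed on day 4 (gap 1).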
In the Stan model code, we specify the likelihood for each gap separately.
-
data {
-int<lower=0> Nnests; // number of nests
-int<lower=0> lastlive[Nnests]; // day of last observation (alive)
-int<lower=0> lastcheck[Nnests]; // day of observed death or, if alive, last day of study
-int<lower=0> first[Nnests]; // day of first observation (alive or dead)
-int<lower=0> maxage; // maximum of last
-real cover[Nnests]; // a covariate of the nest
-real age[maxage]; // a covariate of the date
-int<lower=0> gap[Nnests]; // obsdead - lastlive
-}
-
-parameters {
-vector[3] b; // coef of linear pred for S
-}
-
-model {
-real S[Nnests, maxage-1]; // survival probability
-
-for(i in1:Nnests){
-for(t in first[i]:(lastcheck[i]-1)){
- S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]);
- }
- }
-
-// priors
- b[1]~normal(0,1.5);
- b[2]~normal(0,3);
- b[3]~normal(0,3);
-
-// likelihood
-for (i in1:Nnests) {
-for(t in (first[i]+1):lastlive[i]){
-1~bernoulli(S[i,t-1]);
- }
-if(gap[i]==1){
-target += log(1-S[i,lastlive[i]]); //
- }
-if(gap[i]==2){
-target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1])); //
- }
-if(gap[i]==3){
-target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) +
- prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2])); //
- }
-if(gap[i]==4){
-target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) +
- prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2]) +
- prod(S[i,lastlive[i]:(lastlive[i]+2)])*(1-S[i,lastlive[i]+3])); //
- }
-
- }
-}
-
# Run STAN
-mod1 <-stan(file ="stanmodels/daily_nest_survival_irreg.stan", data=datax1,
-chains=5, iter=2500, control=list(adapt_delta=0.9), verbose =FALSE)
+
data {
+  int<lower=0> Nnests;              // number of nests
+  int<lower=0> lastlive[Nnests];    // day of last observation (alive)
+  int<lower=0> lastcheck[Nnests];   // day of observed death or, if alive, last day of study
+  int<lower=0> first[Nnests];       // day of first observation (alive or dead)
+  int<lower=0> maxage;              // maximum of last
+  real cover[Nnests];               // a covariate of the nest
+  real age[maxage];                 // a covariate of the date
+  int<lower=0> gap[Nnests];         // obsdead - lastlive
+}
+
+parameters {
+  vector[3] b;                      // coef of linear pred for S
+}
+
+model {
+  real S[Nnests, maxage-1];         // survival probability
+
+  for (i in 1:Nnests) {
+    for (t in first[i]:(lastcheck[i]-1)) {
+      S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]);
+    }
+  }
+
+  // priors
+  b[1] ~ normal(0, 1.5);
+  b[2] ~ normal(0, 3);
+  b[3] ~ normal(0, 3);
+
+  // likelihood
+  for (i in 1:Nnests) {
+    for (t in (first[i]+1):lastlive[i]) {
+      1 ~ bernoulli(S[i,t-1]);
+    }
+    if (gap[i]==1) {
+      target += log(1-S[i,lastlive[i]]);
+    }
+    if (gap[i]==2) {
+      target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]));
+    }
+    if (gap[i]==3) {
+      target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) +
+                    prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2]));
+    }
+    if (gap[i]==4) {
+      target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) +
+                    prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2]) +
+                    prod(S[i,lastlive[i]:(lastlive[i]+2)])*(1-S[i,lastlive[i]+3]));
+    }
+
+  }
+}
+
# Run STAN
+mod1 <- stan(file = "stanmodels/daily_nest_survival_irreg.stan", data = datax1,
+             chains = 5, iter = 2500, control = list(adapt_delta = 0.9), verbose = FALSE)
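The gap-specific likelihood terms in the Stan model all follow one pattern: the probability that a nest fails at some time during a gap is the sum, over the days in the gap, of the probability of surviving until that day and then failing on it. This sum equals one minus the probability of surviving the whole gap. The sketch below checks this identity numerically in R; the function name `fail_prob_sum` and the survival values in `S_toy` are hypothetical and chosen only for illustration.

```r
# probability of failing on some day of the gap, written as in the Stan code:
# (1-S1) + S1*(1-S2) + S1*S2*(1-S3) + ...
fail_prob_sum <- function(S) {
  # probability of still being alive at the start of each gap day
  alive_before <- c(1, cumprod(S)[-length(S)])
  sum(alive_before * (1 - S))
}

S_toy <- c(0.95, 0.90, 0.85)   # hypothetical daily survival over a gap of 3 days

fail_prob_sum(S_toy)           # sum over the possible failure days
1 - prod(S_toy)                # one minus survival over the whole gap
all.equal(fail_prob_sum(S_toy), 1 - prod(S_toy))
```

Under this identity, the separate `if` branches could in principle be replaced by a single expression of the form one minus the product of the daily survival probabilities over the gap, which would also cover gaps longer than 4 days. This is a suggestion, not what the model code above does.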
To introduce the linear mixed model, we use repeated hormone measures of nestling Barn Owls (Tyto alba). The cortbowl data set contains stress hormone data (corticosterone, variable ‘totCort’) of nestling Barn Owls that were treated either with a corticosterone implant or, as the control group, with a placebo implant. The aim of the study was to quantify the corticosterone increase due to the corticosterone implants (Almasi et al. 2009). In each brood, one or two nestlings were implanted with a corticosterone implant and one or two nestlings with a placebo implant (variable ‘Implant’). Blood samples were taken just before implantation and at days 2 and 20 after implantation.
-
data(cortbowl)
-dat <- cortbowl
-dat$days <-factor(dat$days, levels=c("before", "2", "20"))
-str(dat) # the data was sampled in 2004,2005, and 2005 by the Swiss Ornithological Institute
In total, there are 287 measurements of 151 individuals (variable ‘Ring’) of 54 broods. Because the measurements from the same individual are non-independent, we use a mixed model to analyze these data. Two additional arguments for a mixed model are: a) the mixed model allows prediction of corticosterone levels for an ‘average’ individual, whereas the fixed effect model allows prediction of corticosterone levels only for the 151 individuals that were sampled; and b) fewer parameters are needed. If we included individual as a fixed factor, we would use 150 parameters, whereas the random factor needs far fewer.
-We first create a graphic to show the development for each individual, separately for owls receiving corticosterone versus owls receiving a placebo (Figure 14.1).
-
-
-
-Figure 14.1: Total corticosterone before and at day 2 and 20 after implantation of a corticosterone or a placebo implant. Lines connect measurements of the same individual.
-
To interpret this polynomial function, an effect plot is helpful. To that end, and as we have done before, we calculate fitted values over the range of the covariate, together with compatibility intervals.
Only if the model describes the data-generating process sufficiently accurately can we draw relevant conclusions from the model. It is therefore essential to assess model fit: our goal is to describe how well the model fits the data with respect to different aspects of the model. In this book, we present three ways to assess how well a model reproduces the data-generating process: (1) residual analysis,
+(2) posterior predictive model checking (this chapter)
+and (3) prior sensitivity analysis.
+
Posterior predictive model checking is the comparison of replicated data generated under the model with the observed data. The aim of posterior predictive model checking is similar to the aim of a residual analysis, that is, to look at what data structures the model does not explain. However, the possibilities of residual analyses are limited, particularly in the case of non-normal data distributions. For example, in a logistic regression, positive residuals are always associated with \(y_i = 1\) and negative residuals with \(y_i = 0\). As a consequence, temporal and spatial patterns in the residuals will always look similar to these patterns in the observations and it is difficult to judge whether the model captures these processes adequately. In such cases, simulating data from the posterior predictive distribution of a model and comparing these data with the observations (i.e., predictive model checking) gives a clearer insight into the performance of a model.
+
We follow the notation of A. Gelman et al. (2014b) in that we use “replicated
+data”, \(y^{rep}\) for a set of \(n\) new observations drawn from the posterior predictive distribution for the specific predictor variables \(x\) of the \(n\) observations in our data set. When we simulate new observations for new values of the predictor variables, for example, to show the prediction interval in an effect plot, we use \(y^{new}\).
+
The first step in posterior predictive model checking is to simulate a replicated data set for each set of simulated values of the joint posterior distribution of the model parameters. Thus, we produce, for example, 2000 replicated data sets. These replicated data sets are then compared graphically, or more formally by test statistics, with the observed data. The Bayesian p-value offers a way for formalized testing. It is defined as the probability that the replicated data from the model are more extreme than the observed data, as measured by a test statistic. In case of a perfect fit, we expect that the test statistic from the observed data is well in the middle of the ones from the replicated data. In other words, around 50% of the test statistics from the replicated data are higher than the one from the observed data, resulting in a Bayesian p-value close to 0.5. Bayesian p-values close to 0 or close to 1, on the contrary, indicate that the aspect of the model measured by the specific test statistic is not well represented by the model.
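The logic of the Bayesian p-value can be sketched without Stan. The example below is an illustration under simplifying assumptions and not the analysis of this chapter: the counts `y_toy` are simulated, the model is a plain Poisson model with a conjugate Gamma(0.5, 0.5) prior chosen only for convenience, and the maximum count serves as the test statistic.

```r
set.seed(1)
y_toy <- rpois(50, lambda = 2)   # hypothetical observed counts

# posterior draws of the Poisson mean under a Gamma(0.5, 0.5) prior
nsim <- 2000
lambda_draws <- rgamma(nsim, shape = 0.5 + sum(y_toy), rate = 0.5 + length(y_toy))

# one replicated data set per posterior draw:
# rows = draws, columns = observations
yrep_toy <- t(sapply(lambda_draws, function(l) rpois(length(y_toy), l)))

# test statistic: maximum count
T_obs <- max(y_toy)
T_rep <- apply(yrep_toy, 1, max)   # one value per replicated data set

# Bayesian p-value: proportion of replicated statistics
# at least as large as the observed one
p_bayes <- mean(T_rep >= T_obs)
p_bayes
```

Because the data were simulated from the model that is fitted, the p-value is not expected to be extreme here; values near 0 or 1 would point to a misfit in the aspect measured by the test statistic.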
+
Test statistics have to be chosen such that they describe important data structures that are not directly measured as a model parameter. Because model parameters are chosen so that they fit the data well, it is not surprising to find p-values close to 0.5 when using model parameters as test statistics. For example, extreme values or quantiles of \(y\) are often better suited than the mean as test statistics, because they are less redundant with the model parameter that is fitted to the data. Similarly, the number of switches from 0 to 1 in binary data is suited to check for autocorrelation whereas the proportion of 1s among all the data may not give so much insight into the model fit. Other test statistics could be a measure for asymmetry, such as the relative difference between the 10 and 90% quantiles, or the proportion of zero values in a Poisson model.
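As a small illustration of such a test statistic, the number of switches from 0 to 1 in a binary series can be counted with `diff`. The function name `n_switch01` and the series `x_toy` are made up for this sketch.

```r
# number of switches from 0 to 1 in a binary sequence
n_switch01 <- function(x) sum(diff(x) == 1)

x_toy <- c(0, 1, 1, 0, 1, 0, 0, 1)   # hypothetical binary series

n_switch01(x_toy)   # 3 switches from 0 to 1
mean(x_toy)         # proportion of 1s: 0.5
```

Two series can share the same proportion of 1s but differ strongly in the number of switches, which is why the latter is more informative about autocorrelation.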
+
We like predictive model checking because it allows us to look at different, specific aspects of the model. It helps us to judge which conclusions from the model are reliable and to identify the limitation of a model. Predictive model checking also helps to understand the process that has generated the data.
+
We use an analysis of the whitethroat breeding density in wildflower fields of different ages for illustration. The aim of this analysis was to identify an optimal age of wildflower fields that serves as good habitat for the whitethroat.
+
Because the Stan developers have written convenient, user-friendly functions for posterior predictive model checks, we fit the model with Stan using the function stan_glmer from the package rstanarm.
+
data("wildflowerfields")
+dat <- wildflowerfields
+dat$size.ha <- dat$size/100       # change unit to ha
+dat$size.z <- scale(dat$size)     # z-transform size
+dat$year.z <- scale(dat$year)
+age.poly <- poly(dat$age, 3)      # create orthogonal polynomials
+dat$age.l <- age.poly[,1]         # to ease convergence of the model fit
+dat$age.q <- age.poly[,2]
+dat$age.c <- age.poly[,3]
+
+library(rstanarm)
+mod <- stan_glmer(bp ~ year.z + age.l + age.q + age.c + size.z +
+                  (1|field) + offset(log(size.ha)), family=poisson, data=dat)
+
The R package shinystan (Gabry 2017) provides an easy way to do model checking. There is therefore no excuse for skipping posterior predictive model checking. The R code launch_shinystan(mod) opens an HTML page that contains all kinds of model diagnostics. Besides many statistics and diagnostic plots for assessing how well the MCMC worked, we also find a menu “PPcheck”. There, we can click through many of the plots that we produce in R below.
+
The function posterior_predict simulates many different data sets from a model fit: exactly as many as there are draws from the posterior distributions of the model parameters, thus 4000 if the default number of iterations has been used in Stan. Specifically, for each set of parameter values from the joint posterior distribution, it simulates one replicated data set. We can look at histograms of the observed data and the replicated data (Figure 16.1). The real data (bp) look similar to the replicated data.
+
set.seed(2352) # to make sure that the ylim and breaks of the histograms below can be used
+yrep <- posterior_predict(mod)
+par(mfrow=c(3,3), mar=c(2,1,2,1))
+for(i in 1:8) hist(yrep[i,], col="blue",
+                   breaks=seq(-0.5, 18.5, by=1), ylim=c(0,85))
+hist(dat$bp, col="blue",
+     breaks=seq(-0.5, 18.5, by=1), ylim=c(0,85))
+
+
+
+Figure 16.1: Histograms of 8 out of 4000 replicated data sets and of the observed data (dat$bp). The arguments breaks and ylim have been used in the function hist to produce the same scale of the x- and y-axis in all plots. This makes comparison among the plots easier.
+
-
-
16.2 Summary
-
xxx
-
+
Let’s look at specific aspects of the data. The proportion of zero counts could be a sensitive test statistic for this data set. First, we define a function “propzeros” that extracts the proportion of zero counts from a vector of count data. Then we apply this function to the observed data and to each of the 4000 replicated data sets. Finally, we extract the 1% and 99% quantiles of the proportion of zero values in the replicated data.
+
propzeros <- function(x) sum(x==0)/length(x)
+propzeros(dat$bp) # prop. zero values in observed data
+
## [1] 0.4705882
+
pzeroyrep <- apply(yrep, 2, propzeros) # prop. zero values in yrep
+quantile(pzeroyrep, prob=c(0.01, 0.99))
+
## 1% 99%
+## 0.0335750 0.9557625
+
The observed data contain 47% zero values, which is well within the 98% range of what the model predicted (3% to 96%). The Bayesian p-value is 0.6.
+
mean(pzeroyrep>=propzeros(dat$bp))
+
## [1] 0.5955882
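A note on orientation: posterior_predict returns a matrix with one row per posterior draw and one column per observation. A statistic per replicated data set, as described in the text, is therefore obtained by applying the function over rows (MARGIN = 1), whereas MARGIN = 2 summarises each observation across the draws. The small made-up matrix below (not model output) illustrates the difference; recomputing the quantiles and the Bayesian p-value row-wise may give numbers that differ from those printed above.

```r
propzeros <- function(x) sum(x == 0) / length(x)

# hypothetical stand-in for yrep: 3 replicated data sets (rows), 4 observations (columns)
yrep_toy <- matrix(c(0, 1, 0, 2,
                     0, 0, 3, 1,
                     1, 0, 0, 0), nrow = 3, byrow = TRUE)

# one value per replicated data set (length 3): 0.50 0.50 0.75
by_dataset <- apply(yrep_toy, 1, propzeros)

# one value per observation (length 4): 2/3 2/3 2/3 1/3
by_observation <- apply(yrep_toy, 2, propzeros)

by_dataset
by_observation
```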
+
What about the upper tail of the data? Let’s look at the 90% quantile.
+
quantile(dat$bp, prob=0.9) # for observed data
+
## 90%
+## 2
+
q90yrep <- apply(yrep, 2, quantile, prob=0.9) # for simulated data
+table(q90yrep)
Also, the 90% quantile of the data is within what the model predicts.
+
We can also look at the spatial distribution of the data and the replicated data. The variables X and Y are the coordinates of the wildflower fields. We can use them to draw transparent gray dots sized according to the number of breeding pairs.
+
par(mfrow=c(3,3), mar=c(1,1,1,1))
+plot(dat$X, dat$Y, pch=16, cex=dat$bp+0.2, col=rgb(0,0,0,0.5), axes=FALSE)
+box()
+# random start index; the 8 consecutive replicated data sets from there are plotted
+r <- sample(1:(nrow(yrep)-7), 1)
+for(i in r:(r+7)){
+  plot(dat$X, dat$Y, pch=16, cex=yrep[i,]+0.2,
+       col=rgb(0,0,0,0.5), axes=FALSE)
+  box()
+}
+
+
+
+Figure 16.2: Spatial distribution of the whitethroat breeding pair counts and of 8 randomly chosen replicated data sets simulated from the model. The smallest dots correspond to a count of 0, the largest to a count of 20 breeding pairs. The panel in the upper left corner shows the data; the other panels show replicated data from the model.
+
+
The spatial distribution of the replicated data sets seems to be similar to the observed one at first look (Figure 16.2). On a second look, we may detect that, in the middle of the study area, the model predicts slightly larger numbers than observed. This pattern may motivate us to find the reason for the imperfect fit if the main interest is whitethroat density estimates. Are there important elements in the landscape that influence whitethroat densities and that we have not yet taken into account in the model? However, our main interest is finding the optimal age of wildflower fields for the whitethroat. Therefore, we look at the mean age of the 10% of the fields with the highest breeding densities.
+To do so, we first define a function that extracts the mean field age of the 10% largest whitethroat density values, and then we apply this function to the observed data and to the 4000 replicated data sets.
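The code for this step is not shown here. A minimal sketch of how such a test statistic could be defined and applied is given below. It uses simulated stand-in data rather than the wildflower field data, and the function name `mean_age_top10` and all objects ending in `_toy` are hypothetical; the function actually used for the results reported here may differ in detail.

```r
set.seed(3)
n <- 40
age_toy     <- sample(1:8, n, replace = TRUE)   # hypothetical field ages (years)
size_ha_toy <- runif(n, 0.5, 3)                 # hypothetical field sizes (ha)
bp_toy      <- rpois(n, lambda = size_ha_toy)   # hypothetical breeding pair counts

# mean age of the fields whose density is at or above the 90% quantile
mean_age_top10 <- function(counts, size_ha, age) {
  dens <- counts / size_ha
  mean(age[dens >= quantile(dens, probs = 0.9)])
}

obs_stat <- mean_age_top10(bp_toy, size_ha_toy, age_toy)

# stand-in for replicated data: rows = replicated data sets, columns = fields
nrep <- 200
yrep_toy <- matrix(rpois(nrep * n, lambda = rep(size_ha_toy, each = nrep)),
                   nrow = nrep)

# statistic for each replicated data set (applied over rows)
rep_stat <- apply(yrep_toy, 1, mean_age_top10,
                  size_ha = size_ha_toy, age = age_toy)

range(rep_stat)
mean(rep_stat >= obs_stat)   # Bayesian p-value for this test statistic
```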
The mean age of the 10% of the fields with the highest whitethroat densities is 4.4 years in the observed data set. In the replicated data sets, it lies between 3.73 and 5.79 years. The Bayesian p-value is 0.79. Thus, in around 79% of the replicated data sets, the mean age of the 10% of the fields with the highest whitethroat densities was higher than the observed one (Figure 16.3).
+Figure 16.3: Histogram of the average age of the 10% of the wildflower fields with the highest breeding densities in the replicated data sets. The orange line indicates the average age of the 10% of the fields with the highest observed whitethroat densities.
+
+
+
In a publication, we could summarize the results of the posterior predictive model checking in a table or give the plots in an appendix. Here, we conclude that the model fits well in the most important aspects. However, the model may predict too high whitethroat densities in the central part of the study area.
+
diff --git a/docs/priors.html b/docs/priors.html
index 429c254..020d910 100644
--- a/docs/priors.html
+++ b/docs/priors.html
@@ -4,18 +4,18 @@
- 10 Prior distributions | Bayesian Data Analysis in Ecology with R and Stan
+ 10 Prior distributions and prior sensitivity analyses | Bayesian Data Analysis in Ecology with R and Stan
-
+
-
+
@@ -23,7 +23,7 @@
-
+
@@ -262,7 +262,7 @@
What sample size is needed is an important question when planning an empirical study.
Some authorities even ask for a justification for the planned sample size of an animal experiment.
diff --git a/docs/search_index.json b/docs/search_index.json
index 424d051..6739f67 100644
--- a/docs/search_index.json
+++ b/docs/search_index.json
@@ -1 +1 @@
-[["index.html", "Bayesian Data Analysis in Ecology with R and Stan Preface Why this book? About this book How to contribute? Acknowledgments", " Bayesian Data Analysis in Ecology with R and Stan Fränzi Korner-Nievergelt, Tobias Roth, Stefanie von Felten, Jerôme Guélat, Bettina Almasi, Pius Korner-Nievergelt 2024-09-29 Preface Why this book? In 2015, we wrote a statistics book for Master/PhD level Bayesian data analyses in ecology (Korner-Nievergelt et al. 2015). You can order it here. People seemed to like it (e.g. (Harju 2016)). Since then, two parallel processes happen. First, we learn more and we become more confident in what we do, or what we do not, and why we do what we do. Second, several really clever people develop software that broaden the spectrum of ecological models that now easily can be applied by ecologists used to work with R. With this e-book, we open the possibility to add new or substantially revised material. In most of the time, it should be in a state that it can be printed and used together with the book as handout for our stats courses. About this book We do not copy text from the book into the e-book. Therefore, we refer to the book (Korner-Nievergelt et al. 2015) for reading about the basic theory on doing Bayesian data analyses using linear models. However, Chapters 1 to 17 of this dynamic e-book correspond to the book chapters. In each chapter, we may provide updated R-codes and/or additional material. The following chapters contain completely new material that we think may be useful for ecologists. While we show the R-code behind most of the analyses, we sometimes choose not to show all the code in the html version of the book. This is particularly the case for some of the illustrations. An intrested reader can always consult the public GitHub repository with the rmarkdown-files that were used to generate the book. How to contribute? It is open so that everybody with a GitHub account can make comments and suggestions for improvement. 
Readers can contribute in two ways. One way is to add an issue. The second way is to contribute content directly through the edit button at the top of the page (i.e. a symbol showing a pencil in a square). That button is linked to the rmarkdown source file of each page. You can correct typos or add new text and then submit a GitHub pull request. We try to respond to you as quickly as possible. We are looking forward to your contribution! Acknowledgments We thank Yihui Xie for providing bookdown which makes it much fun to write open books such as ours. We thank many anonymous students and collaborators who searched information on new software, reported updates and gave feedback on earlier versions of the book. Specifically, we thank Carole Niffenegger for looking up the difference between the bulk and tail ESS in the brm output, Martin Küblbeck for using the conditional logistic regression in rstanarm, "],["PART-I.html", "1 Introduction to PART I 1.1 Further reading", " 1 Introduction to PART I During our courses we are sometimes asked to give an introduction to some R-related stuff covering data analysis, presentation of results or rather specialist topics in ecology. In this part we present collected these introduction and try to keep them updated. This is also a commented collection of R-code that we documented for our own work. We hope this might be useful olso for other readers. 1.1 Further reading R for Data Science by Garrett Grolemund and Hadley Wickham: Introduces the tidyverse framwork. It explains how to get data into R, get it into the most useful structure, transform it, visualise it and model it. 
"],["basics.html", "2 Basics of statistics 2.1 Variables and observations 2.2 Displaying and summarizing data 2.3 Inferential statistics 2.4 Bayes theorem and the common aim of frequentist and Bayesian methods 2.5 Classical frequentist tests and alternatives 2.6 Summary", " 2 Basics of statistics This chapter introduces some important terms useful for doing data analyses. It also introduces the essentials of the classical frequentist tests such as t-test. Even though we will not use nullhypotheses tests later (Amrhein, Greenland, and McShane 2019), we introduce them here because we need to understand the scientific literature. For each classical test, we provide a suggestion how to present the statistical results without using null hypothesis tests. We further discuss some differences between the Bayesian and frequentist statistics. 2.1 Variables and observations Empirical research involves data collection. Data are collected by recording measurements of variables for observational units. An observational unit may be, for example, an individual, a plot, a time interval or a combination of those. The collection of all units ideally build a random sample of the entire population of units in that we are interested. The measurements (or observations) of the random sample are stored in a data table (sometimes also called data set, but a data set may include several data tables. A collection of data tables belonging to the same study or system is normally bundled and stored in a data base). A data table is a collection of variables (columns). Data tables normally are handled as objects of class data.frame in R. All measurements on a row in a data table belong to the same observational unit. The variables can be of different scales (Table 2.1). 
Table 2.1: Scales of measurements Scale Examples Properties Coding in R Nominal Sex, genotype, habitat Identity (values have a unique meaning) factor() Ordinal Elevational zones Identity and magnitude (values have an ordered relationship) ordered() Numeric Discrete: counts; continuous: body weight, wing length Identity, magnitude, and intervals or ratios intgeger() numeric() The aim of many studies is to describe how a variable of interest (\\(y\\)) is related to one or more predictor variables (\\(x\\)). How these variables are named differs between authors. The y-variable is called outcome variable, response or dependent variable. The x-variables are called predictors, explanatory variables or independent variables. The choose of the terms for x and y is a matter of taste. We avoid the terms dependent and independent variables because often we do not know whether the variable \\(y\\) is in fact depending on the \\(x\\) variables and also, often the x-variables are not independent of each other. In this book, we try to use outcome and predictor variables because these terms sound most neutral to us in that they refer to how the statistical model is constructed rather than to a real life relationship. 2.2 Displaying and summarizing data 2.2.1 Histogram While nominal and ordinal variables are summarized by giving the absolute number or the proportion of observations for each category, numeric variables normally are summarized by a location and a scatter statistics, such as the mean and the standard deviation or the median and some quantiles. The distribution of a numeric variable is graphically displayed in a histogram (Fig. 2.1). Figure 2.1: Histogram of the length of ell of statistics course participants. To draw a histogram, the variable is displayed on the x-axis and the \\(x_i\\)-values are assigned to classes. The edges of the classes are called breaks. They can be set with the argument breaks= within the function hist. 
The values given in the breaks= argument must at least span the values of the variable. If the argument breaks= is not specified, R searches for breaks-values that make the histogram look smooth. The number of observations falling in each class is given on the y-axis. The y-axis can be re-scaled so that the area of the histogram equals 1 by setting the argument density=TRUE. In that case, the values on the y-axis correspond to the density values of a probability distribution (Chapter 4). 2.2.2 Location and scatter Location statistics are mean, median or mode. A common mean is the arithmetic mean: \\(\\hat{\\mu} = \\bar{x} = \\frac{i=1}{n} x_i \\sum_{1}^{n}\\) (R function mean), where \\(n\\) is the sample size. The parameter \\(\\mu\\) is the (unknown) true mean of the entire population of which the \\(1,...,n\\) measurements are a random sample of. \\(\\bar{x}\\) is called the sample mean and used as an estimate for \\(\\mu\\). The \\(^\\) above any parameter indicates that the parameter value is obtained from a sample and, therefore, it may be different from the true value. The median is the 50% quantile. We find 50% of the measurements below and 50% above the median. If \\(x_1,..., x_n\\) are the ordered measurements of a variable, then the median is: median \\(= x_{(n+1)/2}\\) for uneven \\(n\\), and median \\(= \\frac{1}{2}(x_{n/2} + x_{n/2+1})\\) for even \\(n\\) (R function median). The mode is the value that is occurring with highest frequency or that has the highest density. Scatter also is called spread, scale or variance. Variance parameters describe how far away from the location parameter single observations can be found, or how the measurements are scattered around their mean. The variance is defined as the average squared difference between the observations and the mean: variance \\(\\hat{\\sigma^2} = s^2 = \\frac{1}{n-1}\\sum_{i=1}^{n}(x_i-\\bar{x})^2\\) The term \\((n-1)\\) is called the degrees of freedom. 
It is used in the denominator of the variance formula instead of \\(n\\) to prevent underestimating the variance. Because \\(\\bar{x}\\) is in average closer to \\(x_i\\) than the unknown true mean \\(\\mu\\) would be, the variance would be underestimated if \\(n\\) is used in the denominator. The variance is used to compare the degree of scatter among different groups. However, its values are difficult to interpret because of the squared unit. Therefore, the square root of the variance, the standard deviation is normally reported: standard deviation \\(\\hat{\\sigma} = s = \\sqrt{s^2}\\) (R Function sd) The standard deviation is approximately the average deviation of an observation from the sample mean. In the case of a [normal distribution][normdist], about two thirds (68%) of the data are expected within one standard deviation around the mean. The variance and standard deviation each describe the scatter with a single value. Thus, we have to assume that the observations are scattered symmetrically around their mean in order to get a picture of the distribution of the measurements. When the measurements are spread asymmetrically (skewed distribution), then it may be more precise to describe the scatter with more than one value. Such statistics could be quantiles from the lower and upper tail of the data. Quantiles inform us about both location and spread of a distribution. The \\(p\\)th-quantile is the value with the property that a proportion \\(p\\) of all values are less than or equal to the value of the quantile. The median is the 50% quantile. The 25% quantile and the 75% quantile are also called the lower and upper quartiles, respectively. The range between the 25% and the 75% quartiles is called the interquartile range. This range includes 50% of the distribution and is also used as a measure of scatter. The R function quantile extracts sample quantiles. 
The median, the quartiles, and the interquartile range can be graphically displayed using box-and-whisker plots (boxplots in short, R function boxplot). The thick horizontal bars are the medians (Fig. 2.2). The boxes mark the interquartile range. The whiskers reach out to the last observation within 1.5 times the interquartile range from the quartile. Circles mark observations beyond 1.5 times the interquartile range from the quartile.

    par(mar=c(5,4,1,1))
    boxplot(ell~car, data=dat, las=1, ylab="Length of ell [cm]", col="tomato", xaxt="n")
    axis(1, at=c(1,2), labels=c("Not owning a car", "Car owner"))
    n <- table(dat$car)
    axis(1, at=c(1,2), labels=paste("n=", n, sep=""), mgp=c(3,2, 0))

Figure 2.2: Boxplot of the length of ell of statistics course participants who do or do not own a car.

The boxplot is an appealing tool for comparing location, variance and distribution of measurements among groups.

2.2.3 Correlations

A correlation measures the strength with which characteristics of two variables are associated with each other (co-occur). If both variables are numeric, we can visualize the correlation using a scatterplot.

    par(mar=c(5,4,1,1))
    plot(temp~ell, data=dat, las=1, xlab="Length of ell [cm]", ylab="Comfort temperature [°C]", pch=16)

Figure 2.3: Scatterplot of the length of ell and the comfort temperature of statistics course participants.

The covariance between the variables \\(x\\) and \\(y\\) is defined as:

covariance \\(q = \\frac{1}{n-1}\\sum_{i=1}^{n}(x_i-\\bar{x})(y_i-\\bar{y})\\) (R function cov)

As for the variance, the units of the covariance are squared, and therefore covariance values are difficult to interpret. A standardized covariance is the Pearson correlation coefficient:

Pearson correlation coefficient: \\(r=\\frac{\\sum_{i=1}^{n}(x_i-\\bar{x})(y_i-\\bar{y})}{\\sqrt{\\sum_{i=1}^{n}(x_i-\\bar{x})^2\\sum_{i=1}^{n}(y_i-\\bar{y})^2}}\\) (R function cor)

Means, variances, standard deviations, covariances and correlations are sensitive to outliers.
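As a minimal sketch, the covariance and the Pearson correlation can be computed directly from the formulas above and compared with the R functions cov and cor. The two variables are hypothetical simulated values (not the course data), and the single extreme observation added at the end is made up to illustrate the sensitivity to outliers.

```r
# hypothetical data (simulated for illustration only)
set.seed(2)
x <- rnorm(30, mean = 45, sd = 3)       # e.g. ell lengths [cm]
y <- 20 + 0.1 * x + rnorm(30, sd = 1)   # e.g. comfort temperatures [degrees C]
n <- length(x)

# covariance and Pearson correlation from the formulas
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
r_xy   <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# the correlation is the covariance standardized by both standard deviations
r_std <- cov_xy / (sd(x) * sd(y))

# one made-up extreme observation changes the Pearson correlation strongly
x_out <- c(x, 80)
y_out <- c(y, 10)
r_out <- cor(x_out, y_out)

c(r_without_outlier = r_xy, r_with_outlier = r_out)
```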
Single observations containing extreme values normally have a disproportionate influence on these statistics. When outliers are present in the data, we may prefer a more robust correlation measure such as the Spearman correlation or Kendall's tau. Both are based on the ranks of the measurements instead of the measurements themselves.

Spearman correlation coefficient: correlation between rank(x) and rank(y) (R function cor(x,y, method="spearman"))

Kendall's tau: \\(\\tau = 1-\\frac{4I}{n(n-1)}\\), where \\(I\\) = number of pairs \\((i,k)\\) for which \\((x_i < x_k)\\) & \\((y_i > y_k)\\) or vice versa. (R function cor(x,y, method="kendall"))

2.2.4 Principal components analysis (PCA)

The principal components analysis (PCA) is a multivariate correlation analysis. A multidimensional data set with \\(k\\) variables can be seen as a cloud of points (observations) in a \\(k\\)-dimensional space. Imagine we could move around in that space and look at the cloud from different locations. From some locations, the data look highly correlated, whereas from others, we cannot see the correlation. That is what PCA does. It rotates the coordinate system (defined by the original variables) of the data cloud so that the correlations are no longer visible. The axes of the new coordinate system are linear combinations of the original variables. They are called principal components. There are as many principal components as there are original variables, i.e. \\(k\\): \\(p_1, ..., p_k\\).
The principal components meet further requirements:

- the first component explains most of the variance
- the second component explains most of the remaining variance and is perpendicular (= uncorrelated) to the first one
- the third component explains most of the remaining variance and is perpendicular to the first two

For example, in a two-dimensional data set \\((x_1, x_2)\\) the principal components become

\\(pc_{1i} = b_{11}x_{1i} + b_{12}x_{2i}\\)
\\(pc_{2i} = b_{21}x_{1i} + b_{22}x_{2i}\\)

with \\(b_{jk}\\) being the loading of original variable \\(k\\) on principal component \\(j\\).

Fig. 2.4 shows the two principal components for a two-dimensional data set. They can be calculated using matrix algebra: principal components are eigenvectors of the covariance or correlation matrix.

Figure 2.4: Principal components of a two dimensional data set based on the covariance matrix (green) and the correlation matrix (brown).

The choice between the correlation and the covariance matrix is essential. The covariance matrix is an unstandardized correlation matrix. Therefore, if the PCA is based on the covariance matrix, the units influence the results: they change when we change the unit of one variable, e.g., from cm to m. Basing the PCA on the covariance matrix only makes sense when the variances are comparable among the variables, i.e., if all variables are measured in the same unit and we would like to weight each variable according to its variance. If this is not the case, the PCA must be based on the correlation matrix.
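The influence of the units can be illustrated with a small sketch. The two variables are hypothetical simulated values, and the object names are chosen only for this example. The first variable is once given in cm and once in m; the loadings of the covariance-based PCA change, whereas those of the correlation-based PCA do not.

```r
# hypothetical data (simulated for illustration only)
set.seed(3)
x1 <- rnorm(100, mean = 45, sd = 3)    # first variable, in cm
x2 <- 0.5 * x1 + rnorm(100, sd = 2)    # second variable, in cm

# helper: absolute loadings (signs of loadings are arbitrary)
abs_loadings <- function(pca) abs(unclass(loadings(pca)))

# PCA based on the covariance matrix: x1 in cm versus x1 in m
l_cov_cm <- abs_loadings(princomp(cbind(x1 = x1, x2 = x2)))
l_cov_m  <- abs_loadings(princomp(cbind(x1 = x1 / 100, x2 = x2)))

# PCA based on the correlation matrix: x1 in cm versus x1 in m
l_cor_cm <- abs_loadings(princomp(cbind(x1 = x1, x2 = x2), cor = TRUE))
l_cor_m  <- abs_loadings(princomp(cbind(x1 = x1 / 100, x2 = x2), cor = TRUE))

round(l_cov_cm, 3)  # loadings with x1 in cm
round(l_cov_m, 3)   # different loadings after changing the unit of x1
round(l_cor_cm, 3)  # correlation-based loadings ...
round(l_cor_m, 3)   # ... are unaffected by the unit
```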
    pca <- princomp(cbind(x1,x2)) # PCA based on covariance matrix
    pca <- princomp(cbind(x1,x2), cor=TRUE) # PCA based on correlation matrix
    loadings(pca)
    ## 
    ## Loadings:
    ##    Comp.1 Comp.2
    ## x1  0.707  0.707
    ## x2  0.707 -0.707
    ## 
    ##                Comp.1 Comp.2
    ## SS loadings       1.0    1.0
    ## Proportion Var    0.5    0.5
    ## Cumulative Var    0.5    1.0

The loadings measure the correlation of each variable with the principal components. They inform us about what aspects of the data each component is measuring. The signs of the loadings are arbitrary, thus we can multiply them by -1 without changing the PCA. Sometimes this can be handy for describing the meaning of the principal component in a paper. For example, Zbinden et al. (2018) combined the number of hunting licenses, the duration of the hunting period and the number of black grouse cocks that were allowed to be hunted per hunter in a principal component in order to measure hunting pressure. All three variables had a negative loading on the first component, so that high values of the component meant low hunting pressure. Before the subsequent analyses, for which a measure of hunting pressure was of interest, the authors changed the signs of the loadings so that this component measured hunting pressure.

Besides the loadings, the proportion of variance explained by each component is an important piece of information. If the first few components explain the main part of the variance, it means that maybe not all \\(k\\) variables are necessary to describe the data, or, in other words, the original \\(k\\) variables contain a lot of redundant information.

    # extract the variance captured by each component
    summary(pca)
    ## Importance of components:
    ##                           Comp.1    Comp.2
    ## Standard deviation     1.2679406 0.6263598
    ## Proportion of Variance 0.8038367 0.1961633
    ## Cumulative Proportion  0.8038367 1.0000000

Ridge regression is similar to doing a PCA within a linear model, where components with a low variance are shrunk to a higher degree than components with a high variance.
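The statement that principal components are eigenvectors of the correlation matrix can be checked numerically. The following is a sketch with hypothetical simulated data (the variables x1 and x2 here are not the ones used for the output above): the eigenvectors obtained with the base R function eigen match the loadings from princomp up to their sign, and the eigenvalues are the variances of the components.

```r
# hypothetical data (simulated for illustration only)
set.seed(4)
x1 <- rnorm(100)
x2 <- 0.6 * x1 + rnorm(100, sd = 0.8)
X  <- cbind(x1, x2)

pca <- princomp(X, cor = TRUE)   # PCA based on the correlation matrix
eig <- eigen(cor(X))             # eigen decomposition of the correlation matrix

# loadings equal the eigenvectors up to their (arbitrary) sign
load_pca <- unclass(loadings(pca))
max_diff <- max(abs(abs(load_pca) - abs(eig$vectors)))

# eigenvalues equal the variances of the components,
# so they give the proportion of variance explained
prop_var <- eig$values / sum(eig$values)

max_diff
prop_var
```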
2.3 Inferential statistics

2.3.1 Uncertainty

- there is never a yes-or-no answer
- there will always be uncertainty

Amrhein (2017), https://peerj.com/preprints/26857

The decision whether an effect is important or not cannot be made based on data alone. To make a decision we should, besides the data, carefully consider the consequences of each decision, the aims we would like to achieve, and the risk, i.e. how bad it is to make the wrong decision. Structured decision making or decision analyses provide methods to combine consequences of decisions, objectives of different stakeholders and risk attitudes of decision makers to make optimal decisions (Hemming et al. 2022, Runge2020). In most data analyses, particularly in basic research and when working on case studies, we normally do not consider consequences of decisions. However, the results will be more useful when presented in a way that other scientists can use them for a meta-analysis, or stakeholders and politicians can use them for making better decisions. Useful results always include information on the size of a parameter of interest, e.g. an effect of a drug or an average survival, together with an uncertainty measure.

Therefore, statistics describes patterns of the process that presumably has generated the data, and it quantifies the uncertainty of the described patterns. This uncertainty is due to the fact that the data are just a random sample from the larger population whose patterns we would like to know.

Quantification of uncertainty is only possible if:

1. the mechanisms that generated the data are known
2. the observations are a random sample from the population of interest

Most studies aim at understanding the mechanisms that generated the data, thus they are most likely not known beforehand. To overcome that problem, we construct models, e.g. statistical models, that are (strong) abstractions of the data generating process. And we report the model assumptions.
All uncertainty measures are conditional on the model we used to analyze the data, i.e., they are only reliable if the model describes the data generating process realistically. Because most statistical models do not describe the data generating process well, the true uncertainty almost always is much higher than the one we report. In order to obtain a random sample from the population under study, a good study design is a prerequisite.

To illustrate how inference about a big population is drawn from a small sample, we here use simulated data. The advantage of using simulated data is that the mechanism that generated the data is known, as is the big population.

Imagine there are 300000 PhD students in the world and we would like to know how many statistics courses they have taken on average before they started their PhD (Fig. 2.5). We use random number generators (rpois and rgamma) to simulate a number for each of the 300000 virtual students. We here use these 300000 numbers as the big population that in real life we almost never can sample in total. Normally, we know the number of courses students have taken just for a small sample of students. To simulate that situation we draw 12 numbers at random from the 300000 (R function sample). Then, we estimate the average number of statistics courses students take before they start a PhD from the sample of 12 students and we compare that mean to the true mean of the 300000 students.

    # simulate the virtual true population
    set.seed(235325) # set seed for random number generator
    # simulate fake data of the whole population
    # using an overdispersed Poisson distribution,
    # i.e. a Poisson distribution of which the mean
    # has a gamma distribution
    statscourses <- rpois(300000, rgamma(300000, 2, 3))
    # draw a random sample from the population
    n <- 12 # sample size
    y <- sample(statscourses, n, replace=FALSE)

Figure 2.5: Histogram of the number of statistics courses that 300000 virtual PhD students have taken before their PhD started. The rugs on the x-axis indicate the random sample of 12 out of the 300000 students. The black vertical line indicates the mean of the 300000 students (true mean) and the blue line indicates the mean of the sample (sample mean).

We observe the sample mean. What do we know about the population mean? There are two different approaches to answer this question. 1) We could ask how much the sample mean would scatter if we repeated the study many times. This approach is called frequentist statistics. 2) We could ask, for any possible value, what the probability is that it is the true population mean. To do so, we use probability theory, and that is called Bayesian statistics.

Both approaches use (essentially similar) models. Only the mathematical techniques to calculate uncertainty measures differ between the two approaches. In cases when, besides the data, no other information is used to construct the model, the results are approximately identical (at least for large enough sample sizes).

A frequentist 95% confidence interval (blue horizontal segment in Fig. 2.6) is constructed such that, if you were to (hypothetically) repeat the experiment or sampling many times, 95% of the intervals constructed would contain the true value of the parameter (here the mean number of courses). From the Bayesian posterior distribution (pink in Fig. 2.6) we could construct a 95% interval (e.g., by using the 2.5% and 97.5% quantiles). This interval has traditionally been called the credible interval. It can be interpreted as follows: we are 95% sure that the true mean is inside that interval.
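The repeated-sampling interpretation of the confidence interval can be sketched by simulation. The following code re-creates the virtual population from above, repeatedly draws samples of 12 students, and counts how often a 95% interval contains the true mean. It assumes that the interval is computed from a t-distribution (the text does not state how the interval in Fig. 2.6 was obtained), and the number of repetitions and the object names are chosen only for this illustration. Because the population is skewed and the samples are small, the observed proportion may fall somewhat below the nominal 95%.

```r
# virtual population as simulated above
set.seed(235325)
statscourses <- rpois(300000, rgamma(300000, 2, 3))
true_mean <- mean(statscourses)

n    <- 12     # sample size
nrep <- 2000   # number of hypothetical repetitions of the study (assumption)
covered <- logical(nrep)

for (i in seq_len(nrep)) {
  y_i  <- sample(statscourses, n, replace = FALSE)
  se_i <- sd(y_i) / sqrt(n)
  # 95% confidence interval based on a t-distribution (assumption)
  ci_i <- mean(y_i) + qt(c(0.025, 0.975), df = n - 1) * se_i
  covered[i] <- ci_i[1] <= true_mean && true_mean <= ci_i[2]
}

# proportion of intervals that contain the true mean
coverage <- mean(covered)
coverage
```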
Both the confidence interval and the posterior distribution correspond to the statistical uncertainty of the sample mean, i.e., they measure how far away the sample mean could be from the true mean. In this virtual example, we know the true mean is 0.66, thus somewhere in the lower part of the 95% CI or in the lower quantiles of the posterior distribution. In real life, we do not know the true mean. The grey histogram in Fig. 2.6 shows how means of many different virtual samples of 12 students scatter around the true mean. The 95% interval of these virtual means corresponds to the 95% CI, and the variance of these virtual means corresponds to the variance of the posterior distribution. This virtual example shows that the posterior distribution and the 95% CI correctly measure the statistical uncertainty (variance, width of the interval); however, we never know exactly how far the sample mean is from the true mean.

Figure 2.6: Histogram of means of repeated samples from the true population. The scatter of these means visualizes the true uncertainty of the mean in this example. The blue vertical line indicates the mean of the original sample. The blue segment shows the 95% confidence interval (obtained by frequentist methods) and the violet line shows the posterior distribution of the mean (obtained by Bayesian methods).

Uncertainty intervals are only reliable if the model is a realistic abstraction of the data generating process (or if the model assumptions are realistic).

Because both terms, confidence and credible interval, suggest that the interval indicates confidence or credibility, whereas the intervals actually show uncertainty, it has been suggested to rename the interval to compatibility or uncertainty interval (Andrew Gelman and Greenland 2019).

2.3.2 Standard error

The standard error SE is, besides the uncertainty interval, an alternative way to measure uncertainty. It measures an average deviation of the sample mean from the (unknown) true population mean.
The frequentist method for obtaining the SE is based on the central limit theorem. According to the central limit theorem, the sum of independent, not necessarily normally distributed random numbers is approximately normally distributed when the sample size is large enough (Chapter 4). Because the mean is a sum (divided by a constant, the sample size), it can be assumed that the distribution of many means of samples is normal. The standard deviation SD of the many means is called the standard error SE. It can be shown mathematically that the standard error SE equals the standard deviation SD of the sample divided by the square root of the sample size:

frequentist SE = SD/sqrt(n) = \\(\\frac{\\hat{\\sigma}}{\\sqrt{n}}\\)

Bayesian SE: Using Bayesian methods, the SE is the SD of the posterior distribution.

It is very important to keep the difference between SE and SD in mind! SD measures the scatter of the data, whereas SE measures the statistical uncertainty of the mean (or of another estimated parameter, Fig. 2.7). SD is a descriptive statistic describing a characteristic of the data, whereas SE is an inferential statistic showing us how far away the sample mean possibly is from the true mean. When the sample size increases, SE becomes smaller, whereas SD does not change (given the added observations are drawn at random from the same big population as the ones already in the sample).

Figure 2.7: Illustration of the difference between SD and SE. The SD measures the scatter in the data (sample, tickmarks on the x-axis). The SD is an estimate for the scatter in the big population (grey histogram, normally not known). The SE measures the uncertainty of the sample mean (in blue). The SE measures approximately how far, on average, the sample mean (blue) is from the true mean (brown).
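The relationship between SD and SE can be checked with a short simulation. This is a sketch under assumptions: the population is simulated in the same way as the virtual PhD students above but with a different, arbitrary seed, and the number of repeated samples and the two sample sizes are chosen only for illustration.

```r
# hypothetical big population (simulated for illustration only)
set.seed(5)
population <- rpois(300000, rgamma(300000, 2, 3))

n <- 12
# SD of the means of many samples of size n: the "true" standard error
means  <- replicate(2000, mean(sample(population, n)))
se_sim <- sd(means)

# the same quantity from the formula SD/sqrt(n), using the population SD
se_formula <- sd(population) / sqrt(n)
c(simulated = se_sim, formula = se_formula)

# a larger sample has a similar SD but a much smaller SE
y_small  <- sample(population, 12)
y_large  <- sample(population, 1200)
sd_both  <- c(small = sd(y_small), large = sd(y_large))
se_small <- sd(y_small) / sqrt(length(y_small))
se_large <- sd(y_large) / sqrt(length(y_large))
sd_both
c(small = se_small, large = se_large)
```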
2.4 Bayes theorem and the common aim of frequentist and Bayesian methods

2.4.1 Bayes theorem for discrete events

The Bayes theorem describes the probability of event A conditional on event B (the probability of A after B has already occurred) from the probability of B conditional on A and the two probabilities of the events A and B:

\\(P(A|B) = \\frac{P(B|A)P(A)}{P(B)}\\)

Imagine event A is "The person likes wine as a birthday present." and event B is "The person has no car.". The conditional probability of A given B is the probability that a person not owning a car likes wine. Answers from students on whether they have a car and what they like as a birthday present are summarized in Table 2.2.

Table 2.2: Cross table of the students' birthday preference and car ownership.

| car/birthday | flowers | wine | sum |
|---|---|---|---|
| no car | 6 | 9 | 15 |
| car | 1 | 6 | 7 |
| sum | 7 | 15 | 22 |

We can apply the Bayes theorem to obtain the probability that a student likes wine given they have no car, \\(P(A|B)\\). Let's assume that only the ones who prefer wine go together for a glass of wine at the bar after the statistics course. While they drink wine, they tell each other about their cars and they obtain the probability that a student who likes wine has no car, \\(P(B|A) = 0.6\\). During the statistics class the teacher asked the students about their car ownership and birthday preference. Therefore, they know that \\(P(A) =\\) likes wine \\(= 0.68\\) and \\(P(B) =\\) no car \\(= 0.68\\). With this information, they can obtain the probability that a student likes wine given they have no car, even if not all students without cars went to the bar: \\(P(A|B) = \\frac{0.6*0.68}{0.68} = 0.6\\).

2.4.2 Bayes theorem for continuous parameters

When we use the Bayes theorem for analyzing data, the aim is to make probability statements about parameters. Because most parameters are measured on a continuous scale, we use probability density functions to describe what we know about them.
The Bayes theorem can be formulated for probability density functions, denoted with \\(p(\\theta)\\), e.g. for a parameter \\(\\theta\\) (for examples of probability density functions see Chapter 4).

What we are interested in is the probability of the parameter \\(\\theta\\) given the data, i.e., \\(p(\\theta|y)\\). This probability density function is called the posterior distribution of the parameter \\(\\theta\\). Here is the Bayes theorem formulated for obtaining the posterior distribution of a parameter from the data \\(y\\), the prior distribution of the parameter \\(p(\\theta)\\) and an assumed model for the data generating process. The data model is defined by the likelihood \\(p(y|\\theta)\\), which specifies how the data \\(y\\) are distributed given the parameters:

\\(p(\\theta|y) = \\frac{p(y|\\theta)p(\\theta)}{p(y)} = \\frac{p(y|\\theta)p(\\theta)}{\\int p(y|\\theta)p(\\theta) d\\theta}\\)

The probability of the data \\(p(y)\\) is also called the scaling constant, because it does not depend on \\(\\theta\\). It is the integral of the likelihood times the prior over all possible values of the parameter(s) of the model.

2.4.3 Estimating a mean assuming that the variance is known

For illustration, we first describe a simple (unrealistic) example for which it is almost possible to follow the mathematical steps for solving the Bayes theorem even for non-mathematicians. Even if we cannot follow all steps, this example will illustrate the principle of how the Bayes theorem works for continuous parameters. The example is unrealistic because we assume that the variance \\(\\sigma^2\\) in the data \\(y\\) is known. We construct a data model by assuming that \\(y\\) is normally distributed:

\\(p(y|\\theta) = normal(\\theta, \\sigma)\\), with \\(\\sigma\\) known.

The function \\(normal\\) defines the probability density function of the normal distribution (Chapter 4). The parameter for which we would like to get the posterior distribution is \\(\\theta\\), the mean.
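The Bayes theorem for a continuous parameter can also be evaluated numerically on a grid of candidate values, which makes the role of the scaling constant visible. The following is a minimal sketch for the normal data model with known \\(\\sigma\\). The data values, the value of \\(\\sigma\\) and the normal prior with its two parameters are hypothetical and chosen only for illustration.

```r
# hypothetical data and settings (assumptions for illustration only)
y     <- c(42.1, 44.3, 43.0, 45.2, 41.8, 44.9)  # made-up measurements
sigma <- 2    # standard deviation of the data, assumed to be known
mu0   <- 40   # prior mean (hypothetical)
tau0  <- 5    # prior standard deviation (hypothetical)

# grid of candidate values for theta
theta <- seq(20, 60, length.out = 4001)
step  <- theta[2] - theta[1]

# log-likelihood of the data for every candidate value of theta
loglik <- sapply(theta, function(t) sum(dnorm(y, mean = t, sd = sigma, log = TRUE)))

# numerator of the Bayes theorem: likelihood times prior
# (the likelihood is rescaled by a constant to avoid numerical underflow;
#  this constant cancels when dividing by the integral)
numerator <- exp(loglik - max(loglik)) * dnorm(theta, mean = mu0, sd = tau0)

# denominator: the integral over theta, approximated by a sum over the grid
posterior <- numerator / sum(numerator * step)

# mean and standard deviation of the posterior distribution
post_mean <- sum(theta * posterior * step)
post_sd   <- sqrt(sum((theta - post_mean)^2 * posterior * step))
c(post_mean = post_mean, post_sd = post_sd)
```

Such a grid only works for models with very few parameters, but it shows that the denominator merely rescales the product of likelihood and prior so that the posterior density integrates to one.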
We specify a prior distribution for \\(\\theta\\). Because the normal distribution is a conjugate prior for a normal data model with known variance, we use the normal distribution. Conjugate priors have nice mathematical properties (see Chapter 10) and are therefore preferred when the posterior distribution is obtained algebraically.

This is the prior: \\(p(\\theta) = normal(\\mu_0, \\tau_0)\\)

With the above data, data model and prior, the posterior distribution of the mean \\(\\theta\\) is defined by:

\\(p(\\theta|y) = normal(\\mu_n, \\tau_n)\\), where \\(\\mu_n= \\frac{\\frac{1}{\\tau_0^2}\\mu_0 + \\frac{n}{\\sigma^2}\\bar{y}}{\\frac{1}{\\tau_0^2}+\\frac{n}{\\sigma^2}}\\) and \\(\\frac{1}{\\tau_n^2} = \\frac{1}{\\tau_0^2} + \\frac{n}{\\sigma^2}\\)

\\(\\bar{y}\\) is the arithmetic mean of the data. Because only this value is needed in order to obtain the posterior distribution, it is called a sufficient statistic.

From the mathematical formulas above and also from Fig. 2.8 we see that the mean of the posterior distribution is a weighted average of the prior mean and \\(\\bar{y}\\) with weights equal to the precisions (\\(\\frac{1}{\\tau_0^2}\\) and \\(\\frac{n}{\\sigma^2}\\)).

Figure 2.8: Hypothetical example showing the result of applying the Bayes theorem for obtaining a posterior distribution of a continuous parameter. The likelihood is defined by the data and the model; the prior expresses the knowledge about the parameter before looking at the data. Combining the two distributions using the Bayes theorem results in the posterior distribution.

2.4.4 Estimating the mean and the variance

We now move to a more realistic example, which is estimating the mean and the variance of a sample of weights of Snowfinches Montifringilla nivalis (Fig. 2.9). To analyze those data, a model with two parameters (the mean and the variance or standard deviation) is needed.
The data model (or likelihood) is specified as \\(p(y|\\theta, \\sigma) = normal(\\theta, \\sigma)\\).

Figure 2.9: Snowfinches stay above the treeline for winter. They come to feeders.

    # weight (g)
    y <- c(47.5, 43, 43, 44, 48.5, 37.5, 41.5, 45.5)
    n <- length(y)

Because there are two parameters, we need to specify a two-dimensional prior distribution. We looked up in A. Gelman et al. (2014b) that the conjugate prior distribution in our case is a Normal-Inverse-Chisquare distribution:

\\(p(\\theta, \\sigma) = \\text{N-Inv-}\\chi^2(\\mu_0, \\sigma_0^2/\\kappa_0; v_0, \\sigma_0^2)\\)

From the same reference we looked up what the posterior distribution looks like in our case:

\\(p(\\theta,\\sigma|y) = \\frac{p(y|\\theta, \\sigma)p(\\theta, \\sigma)}{p(y)} = \\text{N-Inv-}\\chi^2(\\mu_n, \\sigma_n^2/\\kappa_n; v_n, \\sigma_n^2)\\), with

\\(\\mu_n= \\frac{\\kappa_0}{\\kappa_0+n}\\mu_0 + \\frac{n}{\\kappa_0+n}\\bar{y}\\),

\\(\\kappa_n = \\kappa_0+n\\), \\(v_n = v_0 +n\\),

\\(v_n\\sigma_n^2=v_0\\sigma_0^2+(n-1)s^2+\\frac{\\kappa_0n}{\\kappa_0+n}(\\bar{y}-\\mu_0)^2\\)

For this example, we need the arithmetic mean \\(\\bar{y}\\) and the variance \\(s^2\\) of the sample for obtaining the posterior distribution. Therefore, these two statistics are the sufficient statistics.

The above formulas look intimidating, but we never really do these calculations ourselves. In most cases, we let R do them for us by simulating many numbers from the posterior distribution, e.g., using the function sim from the package arm (Andrew Gelman and Hill 2007). In the end, we can visualize the distribution of these many numbers to have a look at the posterior distribution.

In Fig. 2.10 the two-dimensional \\((\\theta, \\sigma)\\) posterior distribution is visualized by using simulated values. The two-dimensional distribution is called the joint posterior distribution. The mountain of dots in Fig. 2.10 visualizes the Normal-Inverse-Chisquare distribution that we calculated above.
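Simulating from such a joint posterior distribution needs only a few lines of base R. The sketch below does not use the function sim and is not necessarily how Fig. 2.10 was produced. It assumes the standard noninformative prior \\(p(\\theta, \\sigma^2) \\propto 1/\\sigma^2\\), for which the variance has a scaled inverse chi-square posterior with \\(n-1\\) degrees of freedom and scale \\(s^2\\), and the mean, given the variance, is normally distributed around \\(\\bar{y}\\). The number of simulations, the seed and the object names are chosen only for this illustration.

```r
# Snowfinch weights (g) as given in the text
y <- c(47.5, 43, 43, 44, 48.5, 37.5, 41.5, 45.5)
n <- length(y)

nsim <- 20000   # number of simulated values (assumption)
set.seed(6)

# step 1: draw the variance from its scaled inverse chi-square posterior
sigma2 <- (n - 1) * var(y) / rchisq(nsim, df = n - 1)
# step 2: draw the mean given each simulated variance
theta <- rnorm(nsim, mean = mean(y), sd = sqrt(sigma2 / n))
sigma <- sqrt(sigma2)

# each pair (theta[i], sigma[i]) is one draw from the joint posterior distribution
# plot(theta, sigma, pch = 16, cex = 0.3)   # cloud of dots of the joint posterior

# looking at one parameter alone gives its marginal posterior distribution
ci_theta <- quantile(theta, probs = c(0.025, 0.975))
ci_sigma <- quantile(sigma, probs = c(0.025, 0.975))
ci_theta
ci_sigma
```

Ignoring the simulated values of one parameter is all that is needed to integrate it out: the vector theta alone describes the posterior distribution of the mean.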
When all values of one parameter are displayed in a histogram ignoring the values of the other parameter, the result is called the marginal posterior distribution. Algebraically, the marginal distribution is obtained by integrating one of the two parameters out of the joint posterior distribution. This step is definitely much easier when simulated values from the posterior distribution are available!

Figure 2.10: Visualization of the joint posterior distribution for the mean and standard deviation of Snowfinch weights. The lower left panel shows the two-dimensional joint posterior distribution, whereas the upper and right panels show the marginal posterior distributions of each parameter separately.

The marginal posterior distribution of each parameter is what we normally report in a paper to describe statistical uncertainty.

In our example, the marginal distribution of the mean is a t-distribution (Chapter 4). Frequentist statistical methods also use a t-distribution to describe the uncertainty of an estimated mean for the case when the variance is not known. Thus, frequentist methods arrive at the same solution using a completely different approach and different techniques. Doesn't that dramatically increase our trust in statistical methods?

2.5 Classical frequentist tests and alternatives

2.5.1 Null hypothesis testing

Null hypothesis testing means constructing a model that describes how the data would look if what we expect to be the case were not true. Then, the data are compared to how the model thinks the data should look. If the data do not look like the model thinks they should, we reject that model and accept that our expectation may be true.

To decide whether the data look like the null model thinks they should, the p-value is used. The p-value is the probability of observing the data or more extreme data given the null hypothesis is true.
Small p-values indicate that it is rather unlikely to observe the data or more extreme data given the null hypothesis \\(H_0\\) is true.

Null hypothesis testing is problematic. We discuss some of the problems after having introduced the most commonly used classical tests.

2.5.2 Comparison of a sample with a fixed value (one-sample t-test)

In some studies, we would like to compare the data to a theoretical value. The theoretical value is a fixed value, e.g. calculated based on physical, biochemical, ecological or any other theory. The statistical task is then to compare the mean of the data, including its uncertainty, with the theoretical value. The result of such a comparison may be an estimate of the mean of the data with its uncertainty, or an estimate of the difference between the mean of the data and the theoretical value with the uncertainty of this difference.

For example, a null hypothesis could be \\(H_0:\\) "The mean of Snowfinch weights is exactly 40g." A normal distribution with a mean of \\(\\mu_0=40\\) and a variance equal to the estimated variance in the data \\(s^2\\) is then assumed to describe how we would expect the data to look given the null hypothesis was true. From that model it is possible to calculate the distribution of hypothetical means of many different hypothetical samples of sample size \\(n\\). The result is a t-distribution (Fig. 2.11). In classical tests, the distribution is standardized so that its scale is one. Then the sample mean, or in classical tests a standardized difference between the mean and the hypothetical mean of the null hypothesis (here 40g), called the test statistic \\(t = \\frac{\\bar{y}-\\mu_0}{\\frac{s}{\\sqrt{n}}}\\), is compared to that (standardized) t-distribution. If the test statistic falls well within the expected distribution, the null hypothesis is not rejected. Then, the data are well compatible with the null hypothesis.
However, if the test statistic falls in the tails or outside the distribution, then the null hypothesis is rejected and we could write that the mean weight of Snowfinches is statistically significantly different from 40g. Unfortunately, we cannot draw conclusions about the probability of the null hypothesis or about how relevant the result is.

Figure 2.11: Illustration of a one-sample t-test. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the t-distribution that shows how hypothetical sample means are expected to be distributed if the big population of Snowfinches has a mean weight of 40g (i.e., if the null hypothesis was true). The orange area shows the area of the t-distribution that lies as far or farther away from 40g than the sample mean. The orange area is the p-value.

We can use the R function t.test to calculate the p-value of a one-sample t-test.

    t.test(y, mu=40)
    ## 
    ##  One Sample t-test
    ## 
    ## data:  y
    ## t = 3.0951, df = 7, p-value = 0.01744
    ## alternative hypothesis: true mean is not equal to 40
    ## 95 percent confidence interval:
    ##  40.89979 46.72521
    ## sample estimates:
    ## mean of x 
    ##   43.8125

The output of the R function t.test also includes the mean and the 95% confidence interval (or compatibility or uncertainty interval) of the mean. The CI could, alternatively, be obtained as the 2.5% and 97.5% quantiles of a t-distribution with a mean equal to the sample mean, degrees of freedom equal to the sample size minus one and a standard deviation equal to the standard error of the mean.
    # lower limit of 95% CI
    mean(y) + qt(0.025, df=length(y)-1)*sd(y)/sqrt(n)
    ## [1] 40.89979
    # upper limit of 95% CI
    mean(y) + qt(0.975, df=length(y)-1)*sd(y)/sqrt(n)
    ## [1] 46.72521

When applying the Bayes theorem for obtaining the posterior distribution of the mean, we end up with the same t-distribution as described above, provided we use flat prior distributions for the mean and the standard deviation. Thus, the two different approaches end up with the same result!

    par(mar=c(4.5, 5, 2, 2))
    hist(y, col="blue", xlim=c(30,52), las=1, freq=FALSE, main=NA, ylim=c(0, 0.3))
    abline(v=mean(y), lwd=2, col="lightblue")
    abline(v=40, lwd=2)
    lines(density(bsim@coef))
    text(45, 0.3, "posterior distribution\\nof the mean of y", cex=0.8, adj=c(0,1), xpd=NA)

Figure 2.12: Illustration of the posterior distribution of the mean. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the posterior distribution that shows what we know about the mean after having looked at the data. The area under the posterior density function that is larger than 40 is the posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g.

The posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g, \\(P(H:\\mu>40)\\), is equal to the proportion of simulated random values from the posterior distribution, saved in the vector bsim@coef, that are larger than 40.

    # Two ways of calculating the proportion of values
    # larger than a specific value within a vector of values
    round(sum(bsim@coef[,1]>40)/nrow(bsim@coef),2)
    ## [1] 0.99
    round(mean(bsim@coef[,1]>40),2)
    ## [1] 0.99
    # Note: logical values TRUE and FALSE become
    # the numeric values 1 and 0 within the functions sum() and mean()

We can thus be pretty sure that the mean Snowfinch weight (in the big world population) is larger than 40g.
Such a conclusion is not very informative, because it does not tell us how much larger we can expect the mean Snowfinch weight to be. Therefore, we prefer reporting a credible interval (or compatibility interval or uncertainty interval) that tells us what values for the mean Snowfinch weight are compatible with the data (given the data model we used realistically reflects the data generating process). Based on such an interval, we can conclude that we are pretty sure that the mean Snowfinch weight is between 40 and 48g.

    # 80% credible interval, compatibility interval, uncertainty interval
    quantile(bsim@coef[,1], probs=c(0.1, 0.9))
    ##      10%      90% 
    ## 42.07725 45.54080
    # 95% credible interval, compatibility interval, uncertainty interval
    quantile(bsim@coef[,1], probs=c(0.025, 0.975))
    ##     2.5%    97.5% 
    ## 40.90717 46.69152
    # 99% credible interval, compatibility interval, uncertainty interval
    quantile(bsim@coef[,1], probs=c(0.005, 0.995))
    ##     0.5%    99.5% 
    ## 39.66181 48.10269

2.5.3 Comparison of the locations between two groups (two-sample t-test)

Many research questions aim at measuring differences between groups. For example, we could be curious to know how different in size car owners are from people not owning a car.

A boxplot is a nice way to visualize the ell length measurements of two (or more) groups (Fig. 2.13). From the boxplot, we do not see how many observations are in the two samples. We can add that information to the plot. The boxplot visualizes the samples but it does not show what we know about the big (unmeasured) population mean. To show that, we need to add a compatibility interval (or uncertainty interval, credible interval, confidence interval, in brown in Fig. 2.13).

Figure 2.13: Ell length of car owners (Y) and people not owning a car (N).
Horizontal bar = median, box = interquartile range, whiskers = most extreme observation within 1.5 times the interquartile range from the quartile, circles = observations farther than 1.5 times the interquartile range from the quartile. Filled brown circles = means, vertical brown bars = 95% compatibility interval. When we add the two means with a compatibility interval, we see what we know about the two means, but we still do not see what we know about the difference between the two means. The uncertainties of the means do not show the uncertainty of the difference between the means. To do so, we need to extract the difference between the two means from a model that describes (abstractly) how the data have been generated. Such a model is a linear model that we will introduce in Chapter 11. The second parameter measures the difference in the means of the two groups. And from the simulated posterior distribution we can extract a 95% compatibility interval. Thus, we can conclude that the average ell length of car owners is with high probability between 0.5 cm smaller and 2.5 cm larger than the average ell length of people not having a car. mod <- lm(ell~car, data=dat) mod ## ## Call: ## lm(formula = ell ~ car, data = dat) ## ## Coefficients: ## (Intercept) carY ## 43.267 1.019 bsim <- sim(mod, n.sim=nsim) quantile(bsim@coef[,"carY"], prob=c(0.025, 0.5, 0.975)) ## 2.5% 50% 97.5% ## -0.501348 1.014478 2.494324 The corresponding two-sample t-test gives a p-value for the null hypothesis that the difference between the two means equals zero, a confidence interval for the difference, and the two means. While the function lm gives the difference Y minus N, the function t.test gives the difference N minus Y. Luckily the two means are also given in the output, so we know which group mean is the larger one. 
t.test(ell~car, data=dat, var.equal=TRUE) ## ## Two Sample t-test ## ## data: ell by car ## t = -1.4317, df = 20, p-value = 0.1677 ## alternative hypothesis: true difference in means between group N and group Y is not equal to 0 ## 95 percent confidence interval: ## -2.5038207 0.4657255 ## sample estimates: ## mean in group N mean in group Y ## 43.26667 44.28571 In both approaches that we used to compare the two means, the Bayesian posterior distribution of the difference and the t-test or the confidence interval of the difference, we used a data model. We thus assumed that the observations are normally distributed. In some cases, such an assumption is not reasonable. Then the result is not reliable. In such cases, we can either search for a more realistic model or use non-parametric (also called distribution free) methods. Nowadays, we have almost infinite possibilities to construct data models (e.g. generalized linear models and beyond). Therefore, we normally start looking for a model that fits the data better. However, in former days, all these possibilities did not exist (or were not easily available for non-mathematicians). Therefore, we here introduce two of such non-parametric methods, the Wilcoxon-test (or Mann-Whitney-U-test) and the randomisation test. Some of the distribution free statistical methods are based on the rank instead of the value of the observations. The principle of the Wilcoxon-test is to rank the observations and sum the ranks per group. It is not completely true that the non-parametric methods do not have a model. The model of the Wilcoxon-test knows how the difference in the sum of the ranks between two groups is distributed given that the means of the two groups do not differ (null hypothesis). Therefore, it is possible to get a p-value, e.g. by the function wilcox.test. 
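The ranking principle can be sketched in a few lines of R. The vectors ell_ex and car_ex are hypothetical values made up for this illustration (they are not the data analysed in the text); the sketch only shows how joint ranks and rank sums per group relate to the statistic W that wilcox.test reports for the first group.

```r
# hypothetical measurements, NOT the data analysed in the text
ell_ex <- c(41.0, 42.5, 43.0, 44.5, 45.0, 46.5, 47.0)
car_ex <- factor(c("N", "N", "Y", "N", "Y", "Y", "Y"))

# rank all observations jointly, then sum the ranks per group
r <- rank(ell_ex)
ranksum <- tapply(r, car_ex, sum)
ranksum

# W: rank sum of the first group (N) minus its smallest
# possible value n1*(n1+1)/2
n1 <- sum(car_ex == "N")
W <- unname(ranksum["N"] - n1 * (n1 + 1) / 2)
W

# the same statistic as computed by wilcox.test
unname(wilcox.test(ell_ex ~ car_ex)$statistic)
```

Because the example values contain no ties, the ranks are unambiguous here.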
wilcox.test(ell~car, data=dat) ## ## Wilcoxon rank sum test with continuity correction ## ## data: ell by car ## W = 34.5, p-value = 0.2075 ## alternative hypothesis: true location shift is not equal to 0 The note that R prints together with the output (not shown here) tells us that ranking is ambiguous when some values are equal. Equal values are called ties. The result of the Wilcoxon-test tells us how probable it is to observe the difference in the rank sum between the two samples or a more extreme difference given the means of the two groups are equal. That is at least something. A similar result is obtained by using a randomisation test. This test is not based on ranks but on the original values. The aim of the randomisation is to simulate a distribution of the difference in the arithmetic mean between the two groups assuming this difference would be zero. To do so, the observed values are randomly distributed among the two groups. Because of the random distribution among the two groups, we expect that, if we repeat that virtual experiment many times, the average difference between the group means would be zero (both virtual samples are drawn from the same big population). We can use a loop in R for repeating the random re-assignment to the two groups and, each time, extracting the difference between the group means. As a result, we have a vector of many (nsim) values that all are possible differences between group means given the two samples were drawn from the same population. The proportion of these values whose absolute value is equal to or larger than the observed absolute difference gives the probability of observing the observed or a larger difference between the group means given the null hypothesis is true; thus, it is a p-value. 
diffH0 <- numeric(nsim) for(i in 1:nsim){ randomcars <- sample(dat$car) rmod <- lm(ell~randomcars, data=dat) diffH0[i] <- coef(rmod)["randomcarsY"] } mean(abs(diffH0)>abs(coef(mod)["carY"])) # p-value ## [1] 0.1858 Visualizing the possible differences between the group means given the null hypothesis was true shows that the observed difference is well within what is expected if the two groups did not differ in their means (Fig. 2.14). Figure 2.14: Histogram of differences between the means of randomly assigned groups (grey) and the difference between the means of the two observed groups (red) The randomization test results in a p-value, and we could also report the observed difference between the group means. However, it does not tell us which values of the difference would be compatible with the data. We do not get an uncertainty measurement for the difference. In order to get a compatibility interval without assuming a distribution for the data (thus non-parametric) we could bootstrap the samples. Bootstrapping is sampling observations from the data with replacement. For example, if we have a sample of 8 observations, we draw 8 times a random observation from the 8 observations. Each time, we assume that all 8 observations are available. Thus a bootstrapped sample could include some observations several times, whereas others are missing. In this way, we simulate the variance in the data that is due to the fact that our data do not contain the whole big population. Also bootstrapping can be programmed in R using a loop. 
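A single bootstrap draw for the example with 8 observations can be sketched as follows. The vector obs and the seed are arbitrary, hypothetical choices made only for this illustration.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
# hypothetical sample of 8 observations
obs <- c(12, 15, 9, 20, 14, 11, 17, 13)

# draw 8 times one of the 8 observations, with replacement
idx <- sample(seq_along(obs), replace = TRUE)
bootsample <- obs[idx]

table(idx)                      # how often each observation was drawn
setdiff(seq_along(obs), idx)    # observations missing from this sample
```

Repeating such a draw many times, and computing the statistic of interest for each bootstrapped sample, gives the distribution from which the interval is taken.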
diffboot <- numeric(nsim) for(i in 1:nsim){ ngroups <- 1 while(ngroups==1){ bootrows <- sample(1:nrow(dat), replace=TRUE) ngroups <- length(unique(dat$car[bootrows])) } rmod <- lm(ell~car, data=dat[bootrows,]) diffboot[i] <- coef(rmod)[2] } quantile(diffboot, prob=c(0.025, 0.975)) ## 2.5% 97.5% ## -0.3395643 2.4273810 The resulting values for the difference between the two group means can be interpreted as the distribution of those differences, if we had repeated the study many times (Fig. 2.15). A 95% interval of the distribution corresponds to a 95% compatibility interval (or confidence interval or uncertainty interval). hist(diffboot); abline(v=coef(mod)[2], lwd=2, col="red") Figure 2.15: Histogram of the bootstrapped differences between the group means (grey) and the observed difference. For both methods, randomisation test and bootstrapping, we have to assume that all observations are independent. Randomization and bootstrapping become complicated or even unfeasible when data are structured. 2.6 Summary Bayesian data analysis is applying the Bayes theorem for summarizing knowledge based on data, priors and the model assumptions. Frequentist statistics is quantifying uncertainty by hypothetical repetitions. "],["analyses_steps.html", "3 Data analysis step by step 3.1 Plausibility of Data 3.2 Relationships 3.3 Data Distribution 3.4 Preparation of Explanatory Variables 3.5 Data Structure 3.6 Define Prior Distributions 3.7 Fit the Model 3.8 Check Model 3.9 Model Uncertainty 3.10 Draw Conclusions Further reading", " 3 Data analysis step by step In this chapter we provide a checklist with some guidance for data analysis. However, do not expect the list to be complete and for different studies, a different order of the steps may make more sense. We usually repeat steps 3.2 to 3.8 until we find a model that fits the data well and that is realistic enough to be useful for the intended purpose. 
Data analysis is always a lot of work and, often, the following steps have to be repeated many times until we find a useful model. There is a chance and danger at the same time: we may find interesting results that answer different questions than we asked originally. They may be very exciting and important, however they may be biased. We can report such findings, but we should state that they appeared (more or less by chance) during the data exploration and model fitting phase, and we have to be aware that the estimates may be biased because the study was not optimally designed with respect to these findings. It is important to always keep the original aim of the study in mind. Do not adjust the study question according to the data. We also recommend reporting what the model started with at the first iteration and describing the strategy and reasoning behind the model development process. 3.1 Plausibility of Data Prepare the data and check graphically, or via summary statistics, whether all the data are plausible. Prepare the data so that errors (typos, etc.) are minimal, for example, by double-checking the entries. See chapter 5 for useful R-code that can be used for data preparation and to make plausibility controls. 3.2 Relationships Think about the direct and indirect relationships among the variables of the study. We normally start a data analysis by drawing a sketch of the model including all explanatory variables and interactions that may be biologically meaningful. We will most likely repeat this step after having looked at the model fit. To make the data analysis transparent we should report every model that was considered. A short note about why a specific model was considered and why it was discarded helps make the modeling process reproducible. 3.3 Data Distribution What is the nature of the variable of interest (outcome, dependent variable)? 
At this stage, there is no use in formally comparing the distribution of the outcome variable to a statistical distribution, because the raw data are not required to follow a specific distribution. The models assume that conditional on the explanatory variables and the model structure, the outcome variable follows a specific distribution. Therefore, checking how well the chosen distribution fits the data is done after the model fit (step 3.8). This first choice is solely done based on the nature of the data. Normally, our first choice is one of the classical distributions for which robust software for model fitting is available. Here is a rough guideline for this first choice: continuous measurements \\(\\Longrightarrow\\) normal distribution (exception: time-to-event data \\(\\Longrightarrow\\) see survival analysis); count \\(\\Longrightarrow\\) Poisson or negative-binomial distribution; count with upper bound (proportion) \\(\\Longrightarrow\\) binomial distribution; binary \\(\\Longrightarrow\\) Bernoulli distribution; rate (count by a reference) \\(\\Longrightarrow\\) Poisson including an offset; nominal \\(\\Longrightarrow\\) multinomial distribution. Chapter 4 gives an overview of the distributions that are most relevant for ecologists. 3.4 Preparation of Explanatory Variables Look at the distribution (histogram) of every explanatory variable: Linear models do not assume that the explanatory variables have any specific distribution. Thus there is no need to check for a normal distribution! However, very skewed distributions result in unequal weighting of the observations in the model. In extreme cases, the slope of a regression line is defined by one or a few observations only. We also need to check whether the variance is large enough, and to think about the shape of the expected effect. The following four questions may help with this step: Is the variance (of the explanatory variable) big enough so that an effect of the variable can be measured? 
Is the distribution skewed? If an explanatory variable is highly skewed, it may make sense to transform the variable (e.g., log, square-root). Does it show a bimodal distribution? Consider making the variable binary. Is it expected that a change of 1 at lower values for x has the same biological effect as a change of 1 at higher values of x? If not, a transformation (e.g., log) could linearize the relationship between x and y. Centering: Centering (\\(x.c = x-mean(x)\\)) is a transformation that produces a variable with a mean of 0. Centering is optional. We have two reasons to center a predictor variable. First, it helps the model fitting algorithm to better converge because it reduces correlations among model parameters. Second, with centered predictors, the intercept and main effects in the linear model are easier to interpret (they are measured at the center of the data instead of at the covariate value of 0 which may be far off). Scaling: Scaling (\\(x.s = x/c\\), where \\(c\\) is a constant) is a transformation that changes the unit of the variable. Scaling is also optional. We have three reasons to scale a predictor variable. First, to make the effect sizes easier to understand. For example, a population change from one year to the next may be very small and hard to interpret. When we give the change for a 10-year period, its ecological meaning is easier to understand. Second, to make the estimate of the effect sizes comparable between variables, we may use \\(x.s = x/sd(x)\\). The resulting variable has a unit of one standard deviation. A standard deviation may be comparable between variables that originally are measured in different units (meters, seconds etc). A. Gelman and Hill (2007) (p. 55 f) propose to scale the variables by two times the standard deviation (\\(x.s = x/(2*sd(x))\\)) to make effect sizes comparable between numeric and binary variables. 
Third, scaling can be important for model convergence, especially when polynomials are included. Also, consider the use of orthogonal polynomials, see Chapter 4.2.9 in Korner-Nievergelt et al. (2015). Collinearity: Look at the correlation among the explanatory variables (pairs plot or correlation matrix). If the explanatory variables are correlated, go back to step 2. Also, Chapter 4.2.7 in Korner-Nievergelt et al. (2015) discusses collinearity. Are interactions and polynomial terms needed in the model? If not already done in step 2, think about the relationship between each explanatory variable and the dependent variable. Is it linear or do polynomial terms have to be included in the model? If the relationship cannot be described appropriately by polynomial terms, think of a nonlinear model or a generalized additive model (GAM). May the effect of one explanatory variable depend on the value of another explanatory variable (interaction)? 3.5 Data Structure After having taken into account all of the (fixed effect) terms from step 4: are the observations independent or grouped/structured? What random factors are needed in the model? Are the data obviously temporally or spatially correlated? Or, are other correlation structures present, such as phylogenetic relationships? Our strategy is to start with a rather simple model that may not account for all correlation structures that in fact are present in the data. We first include only those that are known to be important a priori. Only when residual analyses reveal important additional correlation structures do we include them in the model. 3.6 Define Prior Distributions Decide whether we would like to use informative prior distributions or whether we would like to use priors that only have a negligible effect on the results. When the results are later used for informing authorities or for making a decision (as usual in applied sciences), then we would like to base the results on all information available. 
Information from the literature is then used to construct informative prior distributions. In contrast to applied sciences, in basic research we often would like to show only the information in the data that should not be influenced by earlier results. Therefore, in basic research we look for priors that do not influence the results. 3.7 Fit the Model Fit the model. 3.8 Check Model We assess model fit by graphical analyses of the residuals (Chapter 6 in Korner-Nievergelt et al. (2015)), by predictive model checking (Section 10.1 in Korner-Nievergelt et al. (2015)), or by sensitivity analysis (Chapter 15 in Korner-Nievergelt et al. (2015)). For non-Gaussian models it is often easier to assess model fit using posterior predictive checks (Chapter 10 in Korner-Nievergelt et al. (2015)) rather than residual analyses. Posterior predictive checks usually show clearly in which aspect the model failed so we can go back to step 2 of the analysis. Recognizing in what aspect a model does not fit the data based on residual plots improves with experience. Therefore, we list in Chapter 16 of Korner-Nievergelt et al. (2015) some patterns that can appear in residual plots together with what these patterns possibly indicate. We also indicate what could be done in the specific cases. 3.9 Model Uncertainty If, while working through steps 1 to 8, possibly repeatedly, we came up with more than one model that fits the data reasonably well, we then turn to the methods presented in Chapter 11 (Korner-Nievergelt et al. (2015)) to draw inference from more than one model. If we have only one model, we proceed to 3.10. 3.10 Draw Conclusions Simulate values from the joint posterior distribution of the model parameters (sim or Stan). Use these samples to present parameter uncertainty, to obtain posterior distributions for predictions, probabilities of specific hypotheses, and derived quantities. 
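How simulated values from a joint posterior distribution are turned into conclusions can be sketched in base R. The sketch assumes a normal linear model with flat prior distributions, for which the joint posterior can be simulated directly: first the residual standard deviation, then the coefficients given that standard deviation. It is an illustration with hypothetical simulated data and is not the implementation of sim or Stan; all object names (x, y, beta_sim, sigma_sim, fit1) are made up for the example.

```r
set.seed(42)                    # arbitrary seed
# hypothetical data, only for illustration
x <- rnorm(30)
y <- 2 + 0.5 * x + rnorm(30, sd = 1)
mod <- lm(y ~ x)

nsim <- 2000
n <- length(y)
k <- length(coef(mod))
s <- summary(mod)

# draws of sigma (flat prior): scaled inverse chi-square for sigma^2
sigma_sim <- s$sigma * sqrt((n - k) / rchisq(nsim, df = n - k))

# draws of the coefficients given sigma: multivariate normal
# around the least-squares estimates
L <- t(chol(s$cov.unscaled))               # lower Cholesky factor
z <- matrix(rnorm(nsim * k), nrow = k)
beta_sim <- t(coef(mod) + (L %*% z) * rep(sigma_sim, each = k))
colnames(beta_sim) <- names(coef(mod))

# parameter uncertainty: 95% compatibility interval of the slope
quantile(beta_sim[, "x"], probs = c(0.025, 0.5, 0.975))

# probability of a specific hypothesis: slope larger than zero
mean(beta_sim[, "x"] > 0)

# derived quantity: expected value of y at x = 1
fit1 <- beta_sim[, "(Intercept)"] + beta_sim[, "x"] * 1
quantile(fit1, probs = c(0.025, 0.975))
```

Each row of beta_sim is one draw from the joint posterior, so any quantity computed row by row inherits the full parameter uncertainty.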
Further reading R for Data Science by Garrett Grolemund and Hadley Wickham: Introduces the tidyverse framework. It explains how to get data into R, get it into the most useful structure, transform it, visualise it and model it. "],["distributions.html", "4 Probability distributions 4.1 Introduction 4.2 Discrete distributions 4.3 Continuous distributions", " 4 Probability distributions 4.1 Introduction In Bayesian statistics, probability distributions are used for two fundamentally different purposes. First, they are used to describe distributions of data. These distributions are also called data distributions. Second, probability distributions are used to express information or knowledge about parameters. Such distributions are called prior or posterior distributions. The data distributions are part of descriptive statistics, whereas prior and posterior distributions are part of inferential statistics. The usage of probability distributions for describing data does not differ between frequentist and Bayesian statistics. Classically, the data distribution is known as the model assumption. Specific to Bayesian statistics is the formal expression of statistical uncertainty (or information or knowledge) by prior and posterior distributions. We here introduce some of the most often used probability distributions and present how they are used in statistics. Probability distributions are grouped into discrete and continuous distributions. Discrete distributions define for any discrete value the probability that exactly this value occurs. They are usually used as data distributions for discrete data such as counts. The function that describes a discrete distribution is called a probability function (its values are probabilities, i.e., numbers between 0 and 1). Continuous distributions describe how continuous values are distributed. 
They are used as data distributions for continuous measurements such as body size and also as prior or posterior distributions for parameters such as the mean body size. Most parameters are measured on a continuous scale. The function that describes continuous distributions is called density function. Its values are non-negative and the area under the density function equals one. The area under a density function corresponds to probabilities. For example, the area under the density function above the value 2 corresponds to the proportion of data with values above 2 if the density function describes data, or it corresponds to the probability that the parameter takes on a value bigger than 2 if the density function is a posterior distribution. 4.2 Discrete distributions 4.2.1 Bernoulli distribution Bernoulli distributed data take on the exact values 0 or 1. The value 1 occurs with probability \\(p\\). \\(x \\sim Bernoulli(p)\\) The probability function is \\(p(x) = p^x(1-p)^{1-x}\\). The expected value is \\(E(x) = p\\) and the variance is \\(Var(x) = p(1-p)\\). The flipping experiment of a fair coin produces Bernoulli distributed data with \\(p=0.5\\) if head is taken as one and tail is taken as zero. The Bernoulli distribution is usually used as a data model for binary data such as whether a nest box is used or not, whether a seed germinated or not, whether a species occurs or not in a plot etc. 4.2.2 Binomial distribution The binomial distribution describes the number of ones among a predefined number of Bernoulli trials. For example, the number of heads among 20 coin flips, the number of used nest boxes among the 50 nest boxes of the study area, or the number of seed that germinated among the 10 seeds in the pot. Binomially distributed data are counts with an upper limit (\\(n\\)). \\(x \\sim binomial(p,n)\\) The probability function is \\(p(x) = {n\\choose x} p^x(1-p)^{(n-x)}\\). The expected value is \\(E(x) = np\\) and the variance is \\(Var(x) = np(1-p)\\). 
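The formulas for the binomial distribution can be checked numerically with dbinom. The values of n and p are arbitrary example values chosen for this sketch.

```r
n <- 20     # number of trials (arbitrary example value)
p <- 0.3    # success probability (arbitrary example value)
x <- 0:n
px <- dbinom(x, size = n, prob = p)

# the probability function coded by hand agrees with dbinom
px_hand <- choose(n, x) * p^x * (1 - p)^(n - x)
all.equal(px, px_hand)

sum(px)                        # the probabilities sum to one
Ex <- sum(x * px)              # expected value, compare with n*p
Vx <- sum((x - Ex)^2 * px)     # variance, compare with n*p*(1-p)
c(Ex, n * p)
c(Vx, n * p * (1 - p))
```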
Figure 4.1: Two examples of a binomial distribution. size: number of trials (the argument in the corresponding R function, for example in rbinom, is called size). p: success probability. 4.2.3 Poisson distribution The Poisson distribution describes the distribution of counts without upper boundary, i.e., when we know how many times something happened but we do not know how many times it did not happen. A typical Poisson distributed variable is the number of raindrops in equally-sized grid cells on the floor, if we can assume that every raindrop falls down completely independent of the other raindrops and at a completely random point (Figure 4.2). \\(x \\sim Poisson(\\lambda)\\) The probability function is \\(p(x) = \\frac{1}{x!}\\lambda^x exp(-\\lambda)\\). It is implemented in the R-function dpois. The expected value is \\(E(x) = \\lambda\\) and the variance is \\(Var(x) = \\lambda\\). set.seed(1338) n <- 500 # simulate 500 raindrops x <- runif(n) # they fall at some random point (x,y) in space y <- runif(n) par(mfrow=c(1,2)) par(mar=c(1,1,1,1)) plot(c(0,1), c(0,1), type="n", xaxs="i", yaxs="i", xlab="", ylab="", axes=FALSE) box() points(x,y, pch=16) # add a grid grid(10, 10, col=1) # number of points per grid-cell xcell <- cut(x, breaks=seq(0,1, by=0.1)) ycell <- cut(y, breaks=seq(0,1, by=0.1)) counts <- as.numeric(table(xcell, ycell)) par(mar=c(4,4,1,1)) hist(counts, col="blue", cex.lab=1.4, las=1, cex.axis=1.2, main="") Figure 4.2: A natural process that produces Poisson distributed data is the number of raindrops falling (at random) into equally sized cells of a grid. Left: spatial distribution of raindrops, right: corresponding distribution of the number of raindrops per cell. An important property of the Poisson distribution is that it has only one parameter \\(\\lambda\\). As a consequence, it does not allow for any combination of means and variances. In fact, they are assumed to be the same. 
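That mean and variance coincide can be seen in simulated Poisson counts. The value of lambda, the number of draws and the seed are arbitrary choices for this sketch.

```r
set.seed(123)                        # arbitrary seed
lambda <- 5                          # arbitrary example value
counts_sim <- rpois(10000, lambda)

mean(counts_sim)                     # close to lambda
var(counts_sim)                      # also close to lambda
var(counts_sim) / mean(counts_sim)   # dispersion ratio, close to 1
```

A ratio clearly above one in real count data would indicate overdispersion relative to the Poisson assumption.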
In the real world, most count data do not behave like rain drops, that means variances of count data are in most real world examples not equal to the mean as assumed by the Poisson distribution. Therefore, when using the Poisson distribution as a data model, it is important to check for overdispersion. The property that in a Poisson distribution the mean equals the variance can be used to quickly assess whether the spatial distribution of observations, for example, nest locations, is clustered, random, or equally spaced. Animal locations could be clustered due to coloniality, social, or other attraction. More equally spaced location may be due to territoriality. Let \\(x\\) be the number of observations per grid cell; if \\(var(x)/mean(x)>>1\\) the observations are clustered, whereas if \\(var(x)/mean(x)<<1\\), the observations are more equally spaced than expected by chance. Clustering will lead to overdispersion in the counts whereas more equally spaced locations will lead to underdispersion. Further, note that not all variables measured as an integer number are count data! For example, the number of days an animal spends in a specific area before moving away looks like a count. However, it is a continuous measurement. The duration an animal spends in a specific areas could also be measured in hours or minutes. The Poisson model assumes that the counts are all events that happened. However, an emigration of an animal is just one event, independent of how long it stayed. 4.2.4 Negative-binomial distribution The negative-binomial distribution represents the number of zeros which occur in a sequence of Bernoulli trials before a target number of ones is reached. It is hard to see this situation in, e.g., the number of individuals counted on plots. Therefore, we were reluctant to introduce this distribution in our old book (Korner-Nievergelt et al. 2015). 
However, the negative-binomial distribution often fits count data much better than the Poisson model because it has two parameters and therefore allows for fitting both the mean and the variance to the data. Therefore, we started using the negative-binomial distribution as a data model more often. \\(x \\sim negative-binomial(p,n)\\) Its probability function is rather complex: \\(p(x) = \\frac{\\Gamma(x+n)}{\\Gamma(n) x!} p^n (1-p)^x\\) with \\(\\Gamma\\) being the Gamma-function. Luckily, the negative-binomial probability function is implemented in the R-function dnbinom. The expected value of the negative-binomial distribution is \\(E(x) = n\\frac{(1-p)}{p}\\) and the variance is \\(Var(x) = n\\frac{(1-p)}{p^2}\\). We like to specify the distribution using the mean and the scale parameter \\(x \\sim negative-binomial(\\mu,\\theta)\\), because in practice we often specify a linear predictor for the logarithm of the mean \\(\\mu\\). 4.3 Continuous distributions 4.3.1 Beta distribution The beta distribution is restricted to the range [0,1]. It describes the knowledge about a probability parameter. Therefore, it is usually used as a prior or posterior distribution for probabilities. The beta distribution sometimes is used as a data model for continuous probabilities. However, it is difficult to get a good fit of such models, because measured proportions often take on values of zero and one, which are not allowed in most (but not all) beta distributions, thus this distribution does not describe the variance of measured proportions correctly. However, for describing knowledge of a proportion parameter, it is a very convenient distribution with two parameters. \\(x \\sim beta(a,b)\\) Its density function is \\(p(x) = \\frac{\\Gamma(a+b)}{\\Gamma(a)\\Gamma(b)}x^{a-1}(1-x)^{b-1}\\). The R-function dbeta does the rather complicated calculations for us. 
The expected value of a beta distribution is \\(E(x) = \\frac{a}{(a+b)}\\) and the variance is \\(Var(x) = \\frac{ab}{(a+b)^2(a+b+1)}\\). The \\(beta(1,1)\\) distribution is equal to the \\(uniform(0,1)\\) distribution. The higher the sum of \\(a\\) and \\(b\\), the narrower is the distribution (Figure 4.3). Figure 4.3: Beta distributions with different parameter values. 4.3.2 Normal distribution The normal, or Gaussian, distribution has long been widely used in statistics. It describes the distribution of measurements that vary because of a sum of random errors. Based on the central limit theorem, sample averages are approximately normally distributed (see Chapter 2). \\(x \\sim normal(\\mu, \\sigma^2)\\) The density function is \\(p(x) = \\frac{1}{\\sqrt{2\\pi}\\sigma}exp(-\\frac{1}{2\\sigma^2}(x -\\mu)^2)\\) and it is implemented in the R-function dnorm. The expected value is \\(E(x) = \\mu\\) and the variance is \\(Var(x) = \\sigma^2\\). The variance parameter can be specified to be a variance, a standard deviation or a precision. Different software (or authors) have different habits, e.g., R and Stan use the standard deviation \\(\\sigma\\), whereas BUGS (WinBUGS, OpenBUGS or JAGS) use the precision, which is the inverse of the variance, \\(\\tau = \\frac{1}{\\sigma^2}\\). The normal distribution is used as a data model for measurements that scatter symmetrically around a mean, such as body size (in m), food consumption (in g), or body temperature (°C). The normal distribution also serves as prior distribution for parameters that can take on negative or positive values. The larger the variance, the flatter (less informative) is the distribution. The standard normal distribution is a normal distribution with a mean of zero and a variance of one, \\(z \\sim normal(0, 1)\\). The standard normal distribution is also called the z-distribution. Or, a z-variable is a variable with a mean of zero and a standard deviation of one. 
x <- seq(-3, 3, length=100) y <- dnorm(x) # density function of a standard normal distribution dat <- tibble(x=x,y=y) plot(x,y, type="l", lwd=2, col="#d95f0e", las=1, ylab="normal density of x") segments(0, dnorm(1), 1, dnorm(1), lwd=2) segments(0, dnorm(0), 0, 0) text(0.5, 0.23, expression(sigma)) Figure 4.4: Standard normal distribution Plus minus one times the standard deviation (\\(\\sigma\\)) from the mean includes around 68% of the area under the curve (corresponding to around 68% of the data points in case the normal distribution is used as a data model, or 68% of the prior or posterior mass if the normal distribution is used to describe the knowledge about a parameter). Plus minus two times the standard deviation includes around 95% of the area under the curve. 4.3.3 Gamma distribution The gamma distribution is a continuous probability distribution for strictly positive values (zero is not included). The shape of the gamma distribution is right skewed with a long upper tail, whereas most of the mass is centered around a usually small value. It has two parameters, the shape \\(\\alpha\\) and the inverse scale \\(\\beta\\). \\(x \\sim gamma(\\alpha,\\beta)\\) Its density function is \\(p(x) = \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{(\\alpha-1)} exp(-\\beta x)\\), or dgamma in R. The expected value is \\(E(x) = \\frac{\\alpha}{\\beta}\\) and the variance is \\(Var(x) = \\frac{\\alpha}{\\beta^2}\\). The gamma distribution is becoming more and more popular as a data model for durations (time to event) or other highly right skewed continuous measurements that do not have values of zero. The gamma distribution is a conjugate prior distribution for the mean of a Poisson distribution and for the precision parameter of a normal distribution. However, in hierarchical models with normally distributed random effects, it is not recommended to use the gamma distribution as a prior distribution for the among-group variance (A. Gelman 2006). 
The Cauchy or the folded t-distribution seems to have less influence on the posterior distributions of the variance parameters. 4.3.4 Cauchy distribution The Cauchy distribution is a symmetric distribution with much heavier tails compared to the normal distribution. \\(x \\sim Cauchy(a,b)\\) Its probability density function is \\(p(x) = \\frac{1}{\\pi b[1+(\\frac{x-a}{b})^2]}\\). The mean and the variance of the Cauchy distribution are not defined. The median is \\(a\\). The part of the Cauchy distribution for positive values, i.e., half of the Cauchy distribution, is often used as a prior distribution for variance parameters. 4.3.5 t-distribution The t-distribution is the marginal posterior distribution of the mean of a sample with unknown variance when conjugate prior distributions are used to obtain the posterior distribution. The t-distribution has three parameters, the degrees of freedom \\(v\\), the location \\(\\mu\\) and the scale \\(\\sigma\\). \\(x \\sim t(v, \\mu, \\sigma)\\) Its density function is \\(p(x) = \\frac{\\Gamma((v+1)/2)}{\\Gamma(v/2)\\sqrt{v\\pi}\\sigma}(1+\\frac{1}{v}(\\frac{x-\\mu}{\\sigma})^2)^{-(v+1)/2}\\). Its expected value is \\(E(x) = \\mu\\) for \\(v>1\\) and the variance is \\(Var(x) = \\frac{v}{v-2}\\sigma ^2\\) for \\(v>2\\). The t-distribution is sometimes used as a data model. Because of its heavier tails compared to the normal model, the model parameters are less influenced by measurement errors when a t-distribution is used instead of a normal distribution. This is called robust statistics. Similar to the Cauchy distribution, the folded t-distribution, i.e., the positive part of the t-distribution, can serve as a prior distribution for variance parameters. 4.3.6 F-distribution The F-distribution is not important in Bayesian statistics. Ratios of sample variances drawn from populations with equal variances follow an F-distribution. The density function of the F-distribution is even more complicated than the one of the t-distribution! 
We do not copy it here. Further, we have not yet met any Bayesian example where the F-distribution is used (that does not mean that there is none). It is used in frequentist analyses in order to compare variances, e.g. within ANOVAs. If two variances only differ because of natural variance in the data (null hypothesis) then \\(\\frac{Var(X_1)}{Var(X_2)}\\sim F_{df_1,df_2}\\). Figure 4.5: Different density functions of the F statistic "],["rfunctions.html", "5 Important R-functions 5.1 Data preparation 5.2 Figures 5.3 Summary", " 5 Important R-functions THIS CHAPTER IS UNDER CONSTRUCTION!!! 5.1 Data preparation 5.2 Figures 5.3 Summary "],["reproducibleresearch.html", "6 Reproducible research 6.1 Summary 6.2 Further reading", " 6 Reproducible research THIS CHAPTER IS UNDER CONSTRUCTION!!! 6.1 Summary 6.2 Further reading Rmarkdown: The first official book authored by the core R Markdown developers that provides a comprehensive and accurate reference to the R Markdown ecosystem. With R Markdown, you can easily create reproducible data analysis reports, presentations, dashboards, interactive applications, books, dissertations, websites, and journal articles, while enjoying the simplicity of Markdown and the great power of R and other languages. Bookdown by Yihui Xie: A guide to authoring books with R Markdown, including how to generate figures and tables, and insert cross-references, citations, HTML widgets, and Shiny apps in R Markdown. The book can be exported to HTML, PDF, and e-books (e.g. EPUB). The book style is customizable. You can easily write and preview the book in RStudio IDE or other editors, and host the book wherever you want (e.g. bookdown.org). Our book is written using bookdown. "],["furthertopics.html", "7 Further topics 7.1 Bioacoustic analyse 7.2 Python", " 7 Further topics This is a collection of short introductions or links with commented R code that cover other topics that might be useful for ecologists. 
7.1 Bioacoustic analyse Bioacoustic analyses are nicely covered in a blog by Marcelo Araya-Salas. 7.2 Python Like R, python is a high-level programming language that is used by many ecologists. The reticulate package provides a comprehensive set of tools for interoperability between Python and R. "],["PART-II.html", "8 Introduction to PART II Further reading", " 8 Introduction to PART II Further reading A really good introductory book to Bayesian data analyses is (McElreath 2016). This book starts with a thorough introduction to applying the Bayes theorem for drawing inference from data. In addition, it carefully discusses what can and what cannot be concluded from statistical results. We like this very much. We like looking up statistical methods in papers and books written by Andrew Gelman (e.g. A. Gelman et al. 2014b) and Trevor Hastie (e.g. Efron and Hastie (2016)) because both explain complicated things in a concise and understandable way. "],["bayesian_paradigm.html", "9 The Bayesian paradigm 9.1 Introduction 9.2 Summary", " 9 The Bayesian paradigm THIS CHAPTER IS UNDER CONSTRUCTION!!! 9.1 Introduction 9.2 Summary xxx "],["priors.html", "10 Prior distributions 10.1 Introduction 10.2 How to choose a prior 10.3 Prior sensitivity", " 10 Prior distributions 10.1 Introduction The prior is an integral part of a Bayesian model. We must specify one. When to use informative priors: In practice (management, politics etc.) we would like to base our decisions on all information available. Therefore, we consider it to be responsible including informative priors in applied research whenever possible. Priors allow combining information from the literature with information in data or combining information from different data sets. When using non-informative, flat or weakly informative priors: in basic research when results should only report the information in the current data set it may be reasonable to use non-informative priors. 
Results from a case study may later be used in a meta-analysis that assumes independence across the different studies included. However, flat priors are not always non-informative, may lead to overconfidence in spuriously large effects (similar to frequentist methods) and may be accompanied by computational difficulties. Therefore, weakly informative priors are recommended (Lemoine 2019). 10.2 How to choose a prior The Stan development team gives a profound and up-to-date prior choice recommendation. We are not yet sure what we can further add here that may be useful, as we normally check the prior choice recommendation by the Stan development team. Further references: Lemoine (2019) A. Gelman (2006) 10.2.1 Priors for variance parameters A. Gelman (2006) discusses advantages of using folded t-distributions or Cauchy distributions as prior distributions for variance parameters in hierarchical models. When specifying t-distributions, we find it hard to imagine what the distribution looks like for given parameter values. Therefore, we simulate values from different distributions and look at the histograms. Because the parameterisation of the t-distribution differs among software languages, it is important to use the software the model is finally fitted in. In Figure 10.1 we give some examples of folded t-distributions specified in JAGS using different values for the precision (second parameter) and degrees of freedom (third parameter). Figure 10.1: Folded t-distributions with different precisions and degrees of freedom. The panel titles give the JAGS code of the distribution. Dark blue vertical lines indicate 90% quantiles, light-blue lines indicate 98% quantiles. Todo: give examples for Stan 10.3 Prior sensitivity Todo: it may be helpful to present a worked-through example of a prior sensitivity analysis? 
"],["lm.html", "11 Normal Linear Models 11.1 Linear regression 11.2 Linear model with one categorical predictor (one-way ANOVA) 11.3 Other variants of normal linear models: Two-way anova, analysis of covariance and multiple regression 11.4 Partial coefficients and some comments on collinearity 11.5 Ordered Factors and Contrasts 11.6 Quadratic and Higher Polynomial Terms", " 11 Normal Linear Models 11.1 Linear regression 11.1.1 Background Linear regression is the basis of a large part of applied statistical analysis. Analysis of variance (ANOVA) and analysis of covariance (ANCOVA) can be considered special cases of linear regression, and generalized linear models are extensions of linear regression. Typical questions that can be answered using linear regression are: How does \\(y\\) change with changes in \\(x\\)? How is y predicted from \\(x\\)? An ordinary linear regression (i.e., one numeric \\(x\\) and one numeric \\(y\\) variable) can be represented by a scatterplot of \\(y\\) against \\(x\\). We search for the line that ts best and describe how the observations scatter around this regression line (see Fig. 11.2 for an example). The model formula of a simple linear regression with one continuous predictor variable \\(x_i\\) (the subscript \\(i\\) denotes the \\(i=1,\\dots,n\\) data points) is: \\[\\begin{align} \\mu_i &=\\beta_0 + \\beta_1 x_i \\\\ y_i &\\sim normal(\\mu_i, \\sigma^2) \\tag{11.1} \\end{align}\\] While the first part of Equation (11.1) describes the regression line, the second part describes how the data points, also called observations, are distributed around the regression line (Figure 11.1). In other words: the observation \\(y_i\\) stems from a normal distribution with mean \\(\\mu_i\\) and variance \\(\\sigma^2\\). The mean of the normal distribution, \\(\\mu_i\\) , equals the sum of the intercept (\\(b_0\\) ) and the product of the slope (\\(b_1\\)) and the continuous predictor value, \\(x_i\\). 
Equation (11.1) is called the data model, because it describes mathematically the process that has (or, better, that we think has) produced the data. This nomenclature also helps to distinguish data models from models for parameters such as prior or posterior distributions. The differences between the observations \\(y_i\\) and the predicted values \\(\\mu_i\\) are the residuals (i.e., \\(\\epsilon_i=y_i-\\mu_i\\)). Equivalently to Equation (11.1), the regression could thus be written as: \\[\\begin{align} y_i &= \\beta_0 + \\beta_1 x_i + \\epsilon_i\\\\ \\epsilon_i &\\sim normal(0, \\sigma^2) \\tag{11.2} \\end{align}\\] We prefer the notation in Equation (11.1) because, in this formula, the stochastic part (second row) is nicely separated from the deterministic part (first row) of the model, whereas, in the second notation (11.2), the first row contains both stochastic and deterministic parts. For illustration, we here simulate a data set and below fit a linear regression to these simulated data. The advantage of simulating data is that the following analyses can be reproduced without having to read data into R. Further, for simulating data, we need to translate the algebraic model formula into R language, which helps us understand the model structure. set.seed(34) # set a seed for the random number generator # define the data structure n <- 50 # sample size x <- runif(n, 10, 30) # sample values of the predictor variable # define values for each model parameter sigma <- 5 # standard deviation of the residuals b0 <- 2 # intercept b1 <- 0.7 # slope # simulate y-values from the model mu <- b0 + b1 * x # define the regression line (deterministic part) y <- rnorm(n, mu, sd = sigma) # simulate y-values # save data in a data.frame dat <- tibble(x = x, y = y) Figure 11.1: Illustration of a linear regression. The blue line represents the deterministic part of the model, i.e., here the regression line. 
The stochastic part is represented by a probability distribution, here the normal distribution. The normal distribution changes its mean but not the variance along the x-axis, and it describes how the data are distributed. The blue line and the orange distribution together are a statistical model, i.e., an abstract representation of the data, which are given in black. Using matrix notation, Equation (11.1) can also be written in one row: \\[\\boldsymbol{y} \\sim normal(\\boldsymbol{X} \\boldsymbol{\\beta}, \\sigma^2\\boldsymbol{I})\\] where \\(\\boldsymbol{ I}\\) is the \\(n \\times n\\) identity matrix (it transforms the variance parameter to an \\(n \\times n\\) matrix with its diagonal elements equal to \\(\\sigma^2\\); \\(n\\) is the sample size). The multiplication by \\(\\boldsymbol{ I}\\) is necessary because we use vector notation, \\(\\boldsymbol{y}\\) instead of \\(y_{i}\\). Here, \\(\\boldsymbol{y}\\) is the vector of all observations, whereas \\(y_{i}\\) is a single observation, \\(i\\). When using vector notation, we can write the linear predictor of the model, \\(\\beta_0 + \\beta_1 x_i\\), as a multiplication of the vector of the model coefficients \\[\\boldsymbol{\\beta} = \\begin{pmatrix} \\beta_0 \\\\ \\beta_1 \\end{pmatrix}\\] times the model matrix \\[\\boldsymbol{X} = \\begin{pmatrix} 1 & x_1 \\\\ \\dots & \\dots \\\\ 1 & x_n \\end{pmatrix}\\] where \\(x_1 , \\dots, x_n\\) are the observed values for the predictor variable, \\(x\\). The first column of \\(\\boldsymbol{X}\\) contains only ones because the values in this column are multiplied with the intercept, \\(\\beta_0\\). 
To the intercept, the product of the second element of \\(\\boldsymbol{\\beta}\\), \\(\\beta_1\\), with each element in the second column of \\(\\boldsymbol{X}\\) is added to obtain the predicted value for each observation, \\(\\boldsymbol{\\mu}\\): \\[\\begin{align} \\boldsymbol{X \\beta}= \\begin{pmatrix} 1 & x_1 \\\\ \\dots & \\dots \\\\ 1 & x_n \\end{pmatrix} \\times \\begin{pmatrix} \\beta_0 \\\\ \\beta_1 \\end{pmatrix} = \\begin{pmatrix} \\beta_0 + \\beta_1x_1 \\\\ \\dots \\\\ \\beta_0 + \\beta_1x_n \\end{pmatrix}= \\begin{pmatrix} \\hat{y}_1 \\\\ \\dots \\\\ \\hat{y}_n \\end{pmatrix} = \\boldsymbol{\\mu} \\tag{11.3} \\end{align}\\] 11.1.2 Fitting a Linear Regression in R In Equation (11.1), the fitted values \\(\\mu_i\\) are directly defined by the model coefficients, \\(\\beta_{0}\\) and \\(\\beta_{1}\\). Therefore, when we can estimate \\(\\beta_{0}\\), \\(\\beta_{1}\\), and \\(\\sigma^2\\), the model is fully defined. The last parameter \\(\\sigma^2\\) describes how the observations scatter around the regression line and relies on the assumption that the residuals are normally distributed. The estimates for the model parameters of a linear regression are obtained by searching for the best fitting regression line. To do so, we search for the regression line that minimizes the sum of the squared residuals. This model fitting method is called the least-squares method, abbreviated as LS. It has a very simple solution using matrix algebra (see e.g., Aitkin et al. 2009). The least-squares estimates for the model parameters of a linear regression are obtained in R using the function lm. mod <- lm(y ~ x, data = dat) coef(mod) ## (Intercept) x ## 2.0049517 0.6880415 summary(mod)$sigma ## [1] 5.04918 The object mod produced by lm contains the estimates for the intercept, \\(\\beta_0\\), and the slope, \\(\\beta_1\\). The residual standard deviation \\(\\sigma\\) is extracted using the function summary. 
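The matrix-algebra solution of the least-squares problem can be checked against lm. The following is a minimal sketch under some assumptions: it re-simulates data with the same settings as above but stores them in a base R data.frame instead of a tibble, and the names X, beta_hat, res and sigma_hat are only illustrative. It solves the normal equations, i.e., the system (X'X) b = X'y, and compares the result with the output of lm.

```r
# re-simulate the example data (base R only; a data.frame replaces the tibble)
set.seed(34)
n <- 50
x <- runif(n, 10, 30)
y <- rnorm(n, mean = 2 + 0.7 * x, sd = 5)
dat <- data.frame(x = x, y = y)

# model matrix: a column of ones (intercept) and the predictor x
X <- model.matrix(~ x, data = dat)

# least-squares estimates: solve the normal equations (X'X) b = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# residual standard deviation: squared residuals divided by n - k
res <- y - X %*% beta_hat
sigma_hat <- sqrt(sum(res^2) / (n - ncol(X)))

# compare with lm
mod <- lm(y ~ x, data = dat)
cbind(by_hand = drop(beta_hat), lm = coef(mod))
c(by_hand = sigma_hat, lm = summary(mod)$sigma)
```

Both columns should agree up to numerical precision. In practice lm uses a QR decomposition rather than the normal equations, which is numerically more stable, so the hand calculation is meant for understanding only.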
We can show the result of the linear regression as a line in a scatter plot with the covariate (x) on the x-axis and the observations (y) on the y-axis (Fig. 11.2). Figure 11.2: Linear regression. Black dots = observations, blue solid line = regression line, orange dotted lines = residuals. The fitted values lie where the orange dotted lines touch the blue regression line. Conclusions drawn from a model depend on the model assumptions. When model assumptions are violated, estimates usually are biased and inappropriate conclusions can be drawn. We devote Chapter 12 to the assessment of model assumptions, given its importance. 11.1.3 Drawing Conclusions To answer the question about how strongly \\(y\\) is related to \\(x\\) taking into account statistical uncertainty, we look at the joint posterior distribution of \\(\\boldsymbol{\\beta}\\) (the vector that contains \\(\\beta_{0}\\) and \\(\\beta_{1}\\)) and \\(\\sigma^2\\), the residual variance. The function sim calculates the joint posterior distribution and returns simulated values from this distribution. What does sim do? It simulates parameter values from the joint posterior distribution of a model assuming flat prior distributions. For a normal linear regression, it first draws a random value, \\(\\sigma^*\\), from the marginal posterior distribution of \\(\\sigma\\), and then draws random values from the conditional posterior distribution for \\(\\boldsymbol{\\beta}\\) given \\(\\sigma^*\\) (A. Gelman et al. 2014a). The conditional posterior distribution of the parameter vector \\(\\boldsymbol{\\beta}\\), \\(p(\\boldsymbol{\\beta}|\\sigma^*,\\boldsymbol{y,X})\\), can be analytically derived. 
With flat prior distributions, it is a uni- or multivariate normal distribution \\(p(\\boldsymbol{\\beta}|\\sigma^*,\\boldsymbol{y,X})=normal(\\boldsymbol{\\hat{\\beta}},V_\\beta(\\sigma^*)^2)\\) with: \\[\\begin{align} \\boldsymbol{\\hat{\\beta}=(\\boldsymbol{X^TX})^{-1}X^Ty} \\tag{11.4} \\end{align}\\] and \\(V_\\beta = (\\boldsymbol{X^T X})^{-1}\\). The marginal posterior distribution of \\(\\sigma^2\\) is independent of specific values of \\(\\boldsymbol{\\beta}\\). It is, for flat prior distributions, an inverse chi-square distribution \\(p(\\sigma^2|\\boldsymbol{y,X})=Inv-\\chi^2(n-k,\\hat{\\sigma}^2)\\), where \\(\\hat{\\sigma}^2 = \\frac{1}{n-k}(\\boldsymbol{y}-\\boldsymbol{X\\hat{\\beta}})^T(\\boldsymbol{y}-\\boldsymbol{X\\hat{\\beta}})\\), and \\(k\\) is the number of parameters in \\(\\boldsymbol{\\beta}\\). The marginal posterior distribution of \\(\\boldsymbol{\\beta}\\) can be obtained by integrating the conditional posterior distribution \\(p(\\boldsymbol{\\beta}|\\sigma^2,\\boldsymbol{y,X})=normal(\\boldsymbol{\\hat{\\beta}},V_\\beta\\sigma^2)\\) over the distribution of \\(\\sigma^2\\). This results in a uni- or multivariate \\(t\\)-distribution. Because sim simulates values \\(\\beta_0^*\\) and \\(\\beta_1^*\\) always conditional on \\(\\sigma^*\\), a triplet of values (\\(\\beta_0^*\\), \\(\\beta_1^*\\), \\(\\sigma^*\\)) is one draw of the joint posterior distribution. When we visualize the distribution of the simulated values for one parameter only, ignoring the values for the other, we display the marginal posterior distribution of that parameter. Thus, the distribution of all simulated values for the parameter \\(\\beta_0\\) is a \\(t\\)-distribution even if a normal distribution has been used for simulating the values. The \\(t\\)-distribution is a consequence of using a different \\(\\sigma^2\\)-value for every draw of \\(\\beta_0\\). 
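The two-step scheme described above can be written out in a few lines of base R. This is a sketch for illustration under stated assumptions, not the source code of sim: the data are freshly simulated, and the names V_beta, sigma_star and beta_star are hypothetical. It first draws values of the residual standard deviation from its marginal posterior distribution (scaled inverse chi-square for the variance) and then, for each of these values, one coefficient vector from the conditional normal distribution.

```r
# simulate data and fit the model (assumed example, base R only)
set.seed(1)
n <- 50
x <- runif(n, 10, 30)
y <- rnorm(n, mean = 2 + 0.7 * x, sd = 5)
mod <- lm(y ~ x)

X <- model.matrix(mod)
k <- ncol(X)                          # number of coefficients
beta_hat <- unname(coef(mod))         # least-squares estimates
V_beta <- solve(t(X) %*% X)           # unscaled covariance matrix
sigma_hat <- summary(mod)$sigma       # residual standard deviation
nsim <- 2000

# step 1: draw sigma from its marginal posterior distribution
# (sigma^2 follows a scaled inverse chi-square distribution)
sigma_star <- sigma_hat * sqrt((n - k) / rchisq(nsim, df = n - k))

# step 2: for each sigma_star, draw beta from
# normal(beta_hat, V_beta * sigma_star^2) via the Cholesky factor
L <- t(chol(V_beta))                  # L %*% t(L) equals V_beta
z <- matrix(rnorm(k * nsim), nrow = k)
beta_star <- t(beta_hat + (L %*% z) * rep(sigma_star, each = k))
colnames(beta_star) <- names(coef(mod))

# 95% intervals from the marginal posterior distributions
apply(beta_star, 2, quantile, probs = c(0.025, 0.975))
quantile(sigma_star, probs = c(0.025, 0.975))
```

Each row of beta_star together with the corresponding element of sigma_star is one draw from the joint posterior distribution. Looking at one column alone shows the marginal posterior distribution of that coefficient.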
Using the function sim from the package arm, we can draw values from the joint posterior distribution of the model parameters and describe the marginal posterior distribution of each model parameter using these simulated values. library(arm) nsim <- 1000 bsim <- sim(mod, n.sim = nsim) The function sim simulates (in our example) 1000 values from the joint posterior distribution of the three model parameters \\(\\beta_0\\), \\(\\beta_1\\), and \\(\\sigma\\). These simulated values are shown in Figure 11.3. Figure 11.3: Joint (scatterplots) and marginal (histograms) posterior distribution of the model parameters. The six scatterplots show, using different axes, the three-dimensional cloud of 1000 simulations from the joint posterior distribution of the three parameters. The posterior distribution describes, given the data and the model, which values relative to each other are more likely to correspond to the parameter value we aim at measuring. It expresses the uncertainty of the parameter estimate. It shows what we know about the model parameter after having looked at the data and given the model is realistic. The 2.5% and 97.5% quantiles of the marginal posterior distributions can be used as 95% uncertainty intervals of the model parameters. The function coef extracts the simulated values for the beta coefficients, returning a matrix with nsim rows and the number of columns corresponding to the number of parameters. In our example, the first column contains the simulated values from the posterior distribution of the intercept and the second column contains values from the posterior distribution of the slope. The 2 in the second argument of the apply-function (see Chapter ??) indicates that the quantile function is applied columnwise. 
apply(X = coef(bsim), MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)) %>% round(2) ## (Intercept) x ## 2.5% -2.95 0.44 ## 97.5% 7.17 0.92 We also can calculate an uncertainty interval of the estimated residual standard deviation, \\(\\hat{\\sigma}\\). quantile(bsim@sigma, probs = c(0.025, 0.975)) %>% round(1) ## 2.5% 97.5% ## 4.2 6.3 We can further get a posterior probability for specific hypotheses, such as the hypothesis that the slope parameter is larger than 1 or that the slope parameter is larger than 0.5. These probabilities are the proportion of simulated values from the posterior distribution that are larger than 1 and 0.5, respectively. sum(coef(bsim)[,2] > 1) / nsim # alternatively: mean(coef(bsim)[,2]>1) ## [1] 0.008 sum(coef(bsim)[,2] > 0.5) / nsim ## [1] 0.936 From this, there is very little evidence in the data that the slope is larger than 1, but we are quite confident that the slope is larger than 0.5 (assuming that our model is realistic). We often want to show the effect of \\(x\\) on \\(y\\) graphically, with information about the uncertainty of the parameter estimates included in the graph. To draw such effect plots, we use the simulated values from the posterior distribution of the model parameters. From the deterministic part of the model, we know the regression line \\(\\mu_i = \\beta_0 + \\beta_1 x_i\\). The simulation from the joint posterior distribution of \\(\\beta_0\\) and \\(\\beta_1\\) gives 1000 pairs of intercepts and slopes that describe 1000 different regression lines. We can draw these regression lines in an x-y plot (scatter plot) to show the uncertainty in the regression line estimation (Fig. 11.4). Note that in this case it is not advisable to use ggplot because we draw many lines in one plot, which makes ggplot rather slow. 
par(mar = c(4, 4, 0, 0)) plot(x, y, pch = 16, las = 1, xlab = "Predictor (x)", ylab = "Outcome (y)") for(i in 1:nsim) { abline(coef(bsim)[i,1], coef(bsim)[i,2], col = rgb(0, 0, 0, 0.05)) } Figure 11.4: Regression with 1000 lines based on draws from the joint posterior distribution for the intercept and slope parameters to visualize the uncertainty of the estimated regression line. A more convenient way to show uncertainty is to draw the 95% uncertainty interval, CrI, of the regression line. To this end, we first define new x-values for which we would like to have the fitted values (about 100 points across the range of x will produce smooth-looking lines when connected by line segments). We save these new x-values within the new tibble newdat. Then, we create a new model matrix that contains these new x-values (newmodmat) using the function model.matrix. We then calculate the 1000 fitted values for each element of the new x (one value for each of the 1000 simulated regressions, Fig. 11.4), using matrix multiplication (%*%). We save these values in the matrix fitmat. Finally, we extract the 2.5% and 97.5% quantiles for each x-value from fitmat, and draw the lines for the lower and upper limits of the credible interval (Fig. 11.5). # Calculate 95% credible interval newdat <- tibble(x = seq(10, 30, by = 0.1)) newmodmat <- model.matrix( ~ x, data = newdat) fitmat <- matrix(ncol = nsim, nrow = nrow(newdat)) for(i in 1:nsim) {fitmat[,i] <- newmodmat %*% coef(bsim)[i,]} newdat$CrI_lo <- apply(fitmat, 1, quantile, probs = 0.025) newdat$CrI_up <- apply(fitmat, 1, quantile, probs = 0.975) # Make plot regplot <- ggplot(dat, aes(x = x, y = y)) + geom_point() + geom_smooth(method = lm, se = FALSE) + geom_line(data = newdat, aes(x = x, y = CrI_lo), lty = 3) + geom_line(data = newdat, aes(x = x, y = CrI_up), lty = 3) + labs(x = "Predictor (x)", y = "Outcome (y)") regplot Figure 11.5: Regression with 95% credible interval of the posterior distribution of the fitted values. 
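The loop that fills the matrix of fitted values can also be replaced by one matrix multiplication. The sketch below makes one loud assumption so that it runs without the package arm: the matrix beta_draws is a hypothetical stand-in for coef(bsim), filled with independent random numbers (real posterior draws of intercept and slope are correlated). Only the mechanics of the calculation are illustrated, not the numerical results.

```r
set.seed(2)
nsim <- 1000

# hypothetical stand-in for coef(bsim): one row per draw,
# first column = intercept, second column = slope
beta_draws <- cbind(rnorm(nsim, 2, 2.5), rnorm(nsim, 0.7, 0.12))

# new x-values and the corresponding model matrix
newdat <- data.frame(x = seq(10, 30, by = 0.1))
newmodmat <- model.matrix(~ x, data = newdat)

# loop version: one column of fitted values per draw
fitmat_loop <- matrix(ncol = nsim, nrow = nrow(newdat))
for (i in 1:nsim) fitmat_loop[, i] <- newmodmat %*% beta_draws[i, ]

# vectorized version: all draws in one matrix multiplication
fitmat <- newmodmat %*% t(beta_draws)
all.equal(fitmat_loop, fitmat, check.attributes = FALSE)

# 2.5% and 97.5% quantiles for each new x-value
CrI <- t(apply(fitmat, 1, quantile, probs = c(0.025, 0.975)))
head(CrI)
```

With the real draws, replacing beta_draws by coef(bsim) should give the same fitmat as the loop in the text. The vectorized form mainly pays off when nsim is large.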
The interpretation of the 95% uncertainty interval is straightforward: We are 95% sure that the true regression line is within the credible interval (given the data and the model). As with all statistical results, this interpretation is only valid in the model world (if the world looked like the model). The larger the sample size, the narrower the interval, because each additional data point increases information about the true regression line. The uncertainty interval measures statistical uncertainty of the regression line, but it does not describe how new observations would scatter around the regression line. If we want to describe where future observations will be, we have to report the posterior predictive distribution. We can get a sample of random draws from the posterior predictive distribution \\(\\hat{y}|\\boldsymbol{\\beta},\\sigma^2,\\boldsymbol{X}\\sim normal( \\boldsymbol{X \\beta, \\sigma^2})\\) using the simulated joint posterior distributions of the model parameters, thus taking the uncertainty of the parameter estimates into account. We draw a new \\(\\hat{y}\\)-value from \\(normal( \\boldsymbol{X \\beta, \\sigma^2})\\) for each simulated set of model parameters. Then, we can visualize the 2.5% and 97.5% quantiles of this distribution for each new x-value. 
# increase number of simulations to produce smooth lines of the posterior # predictive distribution set.seed(34) nsim <- 50000 bsim <- sim(mod, n.sim=nsim) fitmat <- matrix(ncol=nsim, nrow=nrow(newdat)) for(i in 1:nsim) fitmat[,i] <- newmodmat%*%coef(bsim)[i,] # prepare matrix for simulated new data newy <- matrix(ncol=nsim, nrow=nrow(newdat)) # for each simulated fitted value, simulate one new y-value for(i in 1:nsim) { newy[,i] <- rnorm(nrow(newdat), mean = fitmat[,i], sd = bsim@sigma[i]) } # Calculate 2.5% and 97.5% quantiles newdat$pred_lo <- apply(newy, 1, quantile, probs = 0.025) newdat$pred_up <- apply(newy, 1, quantile, probs = 0.975) # Add the posterior predictive distribution to plot regplot + geom_line(data = newdat, aes(x = x, y = pred_lo), lty = 2) + geom_line(data = newdat, aes(x = x, y = pred_up), lty = 2) Figure 11.6: Regression line with 95% uncertainty interval (dotted lines) and the 95% interval of the simulated predictive distribution (broken lines). Note that we increased the number of simulations to 50,000 to produce smooth lines. Of future observations, 95% are expected to be within the interval defined by the broken lines in Fig. 11.6. Increasing sample size will not give a narrower predictive distribution because the predictive distribution primarily depends on the residual variance \\(\\sigma^2\\), which is a property of the data that is independent of sample size. The way we produced Fig. 11.6 is somewhat tedious compared to how easily we could have obtained the same figure using frequentist methods: predict(mod, newdata = newdat, interval = \"prediction\") would have produced the y-values for the lower and upper lines in Fig. 11.6 in one line of R code. However, once we have a simulated sample of the posterior predictive distribution, we have much more information than is contained in the frequentist prediction interval. For example, we could give an estimate for the proportion of observations greater than 20, given \\(x = 25\\). 
sum(newy[newdat$x == 25, ] > 20) / nsim ## [1] 0.44504 Thus, we expect around 44.5% of future observations with \\(x = 25\\) to be higher than 20. We can extract similar information for any relevant threshold value. Another reason to learn the more complicated R code we presented here, compared to the frequentist methods, is that, for more complicated models such as mixed models, the frequentist methods to obtain confidence intervals of fitted values are much more complicated than the Bayesian method just presented. The latter can be used with only slight adaptations for mixed models and also for generalized linear mixed models. 11.1.4 Interpretation of the R summary output The solution for \\(\\boldsymbol{\\beta}\\) is given in Equation (11.4). Most statistical software, including R, returns an estimated frequentist standard error for each \\(\\beta_k\\). We extract these standard errors together with the estimates for the model parameters using the summary function. summary(mod) ## ## Call: ## lm(formula = y ~ x, data = dat) ## ## Residuals: ## Min 1Q Median 3Q Max ## -11.5777 -3.6280 -0.0532 3.9873 12.1374 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.0050 2.5349 0.791 0.433 ## x 0.6880 0.1186 5.800 0.000000507 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5.049 on 48 degrees of freedom ## Multiple R-squared: 0.412, Adjusted R-squared: 0.3998 ## F-statistic: 33.63 on 1 and 48 DF, p-value: 0.0000005067 The summary output first gives a rough summary of the residual distribution. However, we will do more rigorous residual analyses in Chapter 12. The estimates of the model coefficients follow. The column Estimate contains the estimates for the intercept \\(\\beta_0\\) and the slope \\(\\beta_1\\). The column Std. Error contains the estimated (frequentist) standard errors of the estimates. 
The last two columns contain the t-value and the p-value of the classical t-test for the null hypothesis that the coefficient equals zero. The last part of the summary output gives the parameter \\(\\sigma\\) of the model, named residual standard error, and the residual degrees of freedom. We think the name residual standard error for sigma is confusing, because \\(\\sigma\\) is not a measurement of uncertainty of a parameter estimate like the standard errors of the model coefficients are. \\(\\sigma\\) is a model parameter that describes how the observations scatter around the fitted values, that is, it is a standard deviation. It is independent of sample size, whereas the standard errors of the estimates for the model parameters will decrease with increasing sample size. Such a standard error of the estimate of \\(\\sigma\\), however, is not given in the summary output. Note that, by using Bayesian methods, we could easily obtain the standard error of the estimated \\(\\sigma\\) by calculating the standard deviation of the posterior distribution of \\(\\sigma\\). The \\(R^2\\) and the adjusted \\(R^2\\) measure the proportion of variance in the outcome variable \\(y\\) that is explained by the predictors in the model. \\(R^2\\) is calculated from the sum of squared residuals, \\(SSR = \\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2\\), and the total sum of squares, \\(SST = \\sum_{i=1}^{n}(y_i - \\bar{y})^2\\), where \\(\\bar{y}\\) is the mean of \\(y\\). \\(SST\\) is a measure of total variance in \\(y\\) and \\(SSR\\) is a measure of variance that cannot be explained by the model, thus \\(R^2 = 1- \\frac{SSR}{SST}\\) is a measure of variance that can be explained by the model. If \\(SSR\\) is close to \\(SST\\), \\(R^2\\) is close to zero and the model cannot explain a lot of variance. The smaller \\(SSR\\), the closer \\(R^2\\) is to 1. 
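Both measures can be reproduced by hand from the two sums of squares. The following sketch assumes freshly simulated data with the same settings as in the regression example; the names SSR, SST, R2 and R2_adj are illustrative. For the adjusted version, it divides each sum of squares by its degrees of freedom, which is the usual definition and is what summary reports.

```r
# simulate data and fit the model (assumed example, base R only)
set.seed(34)
n <- 50
x <- runif(n, 10, 30)
y <- rnorm(n, mean = 2 + 0.7 * x, sd = 5)
mod <- lm(y ~ x)

SSR <- sum(resid(mod)^2)        # sum of squared residuals
SST <- sum((y - mean(y))^2)     # total sum of squares
k <- length(coef(mod))          # number of model coefficients

R2 <- 1 - SSR / SST
# adjusted version: residual variance relative to the variance of y
R2_adj <- 1 - (SSR / (n - k)) / (SST / (n - 1))

# compare with the values reported by summary
c(by_hand = R2, summary = summary(mod)$r.squared)
c(by_hand = R2_adj, summary = summary(mod)$adj.r.squared)
```

Note that SSR / (n - k) is the squared residual standard deviation and SST / (n - 1) is var(y), so the adjusted version compares two variance estimates rather than two raw sums of squares.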
This version of \\(R^2\\) approaches 1 if the number of model parameters approaches sample size, even if none of the predictor variables correlates with the outcome. It is exactly 1 when the number of model parameters equals sample size, because \\(n\\) measurements can be exactly described by \\(n\\) parameters. The adjusted \\(R^2\\), \\(R^2_{adj} = \\frac{var(y)-\\hat\\sigma^2}{var(y)}\\), takes sample size \\(n\\) and the number of model parameters \\(k\\) into account (see explanation to variance in chapter 2). Therefore, the adjusted \\(R^2\\) is recommended as a measurement of the proportion of explained variance. 11.2 Linear model with one categorical predictor (one-way ANOVA) The aim of analysis of variance (ANOVA) is to compare means of an outcome variable \\(y\\) between different groups. To do so in the frequentist framework, variances between and within the groups are compared using F-tests (hence the name analysis of variance). When doing an ANOVA in a Bayesian way, inference is based on the posterior distributions of the group means and the differences between the group means. One-way ANOVA means that we only have one predictor variable, specifically a categorical predictor variable (in R defined as a factor). We illustrate the one-way ANOVA based on an example of simulated data (Fig. 11.7). We have measured weights of 30 virtual individuals for each of 3 groups. Possible research questions could be: How big are the differences between the group means? Are individuals from group 2 heavier than the ones from group 1? Which group mean is higher than 7.5 g? 
# settings for the simulation set.seed(626436) b0 <- 12 # mean of group 1 (reference group) sigma <- 2 # residual standard deviation b1 <- 3 # difference between group 1 and group 2 b2 <- -5 # difference between group 1 and group 3 n <- 90 # sample size # generate data group <- factor(rep(c("group 1","group 2", "group 3"), each=30)) simresid <- rnorm(n, mean=0, sd=sigma) # simulate residuals y <- b0 + as.numeric(group=="group 2") * b1 + as.numeric(group=="group 3") * b2 + simresid dat <- tibble(y, group) # make figure dat %>% ggplot(aes(x = group, y = y)) + geom_boxplot(fill = "orange") + labs(y = "Weight (g)", x = "") + ylim(0, NA) Figure 11.7: Weights (g) of the 30 individuals in each group. The dark horizontal line is the median, the box contains 50% of the observations (i.e., the interquartile range), the whiskers mark the range of all observations that are less than 1.5 times the interquartile range away from the edge of the box. An ANOVA is a linear regression with a categorical predictor variable instead of a continuous one. The categorical predictor variable with \\(k\\) levels is (as a default in R) transformed to \\(k-1\\) indicator variables. An indicator variable is a binary variable containing 0 and 1 where 1 indicates a specific level (a category of the predictor variable). Often, one indicator variable is constructed for every level except for the reference level. In our example, the categorical variable is group with the three levels group 1, group 2, and group 3 (\\(k = 3\\)). Group 1 is taken as the reference level (by default, R takes the level that comes first alphabetically), and for each of the other two groups an indicator variable is constructed, \\(I(group_i = 2)\\) and \\(I(group_i = 3)\\). The function \\(I()\\) returns 1 if the expression is true and 0 otherwise. 
We can write the model as a formula:

\\[\\begin{align} \\mu_i &=\\beta_0 + \\beta_1 I(group_i=2) + \\beta_2 I(group_i=3) \\\\ y_i &\\sim normal(\\mu_i, \\sigma^2) \\tag{11.5} \\end{align}\\]

where \\(y_i\\) is the \\(i\\)-th observation (weight measurement for individual \\(i\\) in our example), and \\(\\beta_{0,1,2}\\) are the model coefficients. The residual variance is \\(\\sigma^2\\). The model coefficients \\(\\beta_{0,1,2}\\) constitute the deterministic part of the model. From the model formula it follows that the group means, \\(m_g\\), are:

\\[\\begin{align} m_1 &=\\beta_0 \\\\ m_2 &=\\beta_0 + \\beta_1 \\\\ m_3 &=\\beta_0 + \\beta_2 \\\\ \\tag{11.6} \\end{align}\\]

There are other possibilities to describe three group means with three parameters, for example:

\\[\\begin{align} m_1 &=\\beta_1 \\\\ m_2 &=\\beta_2 \\\\ m_3 &=\\beta_3 \\\\ \\tag{11.7} \\end{align}\\]

In this case, the model formula would be:

\\[\\begin{align} \\mu_i &= \\beta_1 I(group_i=1) + \\beta_2 I(group_i=2) + \\beta_3 I(group_i=3) \\\\ y_i &\\sim normal(\\mu_i, \\sigma^2) \\tag{11.8} \\end{align}\\]

The way the group means are calculated within a model is called the parameterization of the model. Different statistical software use different parameterizations. The parameterization used by R by default is the one shown in Equation (11.5). R automatically takes the first level as the reference (the first level is the first one alphabetically unless the user defines a different order for the levels). The mean of the first group (i.e., of the first factor level) is the intercept, \\(\\beta_0\\), of the model. The mean of another factor level is obtained by adding, to the intercept, the estimate of the corresponding parameter (which is the difference from the reference group mean).

The parameterization of the model is defined by the model matrix.
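That the two parameterizations describe the same three group means can be checked with a small simulation. The following is a sketch under stated assumptions: the data are freshly simulated with an arbitrary seed, and all object names (`g_toy`, `y_toy`, `mod_ref`, `mod_means`) are hypothetical and chosen so that they do not overwrite objects used elsewhere in the chapter.

```r
set.seed(1)  # arbitrary seed, chosen only for this illustration
g_toy <- factor(rep(c("group 1", "group 2", "group 3"), each = 10))
y_toy <- 12 + 3 * (g_toy == "group 2") - 5 * (g_toy == "group 3") +
  rnorm(30, mean = 0, sd = 2)

mod_ref   <- lm(y_toy ~ g_toy)      # intercept = mean of the reference group
mod_means <- lm(y_toy ~ g_toy - 1)  # one coefficient per group mean

# Group means from the reference-level parameterization
b <- unname(coef(mod_ref))
means_ref <- c(b[1], b[1] + b[2], b[1] + b[3])

# In the second parameterization the coefficients are the group means
means_direct <- unname(coef(mod_means))

# Both agree with the arithmetic group means
means_arith <- as.numeric(tapply(y_toy, g_toy, mean))
all.equal(means_ref, means_direct)
all.equal(means_direct, means_arith)
```

The choice between the two formulas therefore changes what each coefficient means, but not the fitted group means.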
In the case of a one-way ANOVA, there are as many columns in the model matrix as there are factor levels (i.e., groups); thus there are \\(k\\) factor levels and \\(k\\) model coefficients. Recall from Equation (11.3) that for each observation, the entry in the \\(j\\)-th column of the model matrix is multiplied by the \\(j\\)-th element of the model coefficients and the \\(k\\) products are summed to obtain the fitted values. For a data set with \\(n = 5\\) observations of which the first two are from group 1, the third from group 2, and the last two from group 3, the model matrix used for the parameterization described in Equation (11.6) and defined in R by the formula ~ group is

\\[\\begin{align} \\boldsymbol{X}= \\begin{pmatrix} 1 & 0 & 0 \\\\ 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 0 & 1 \\\\ 1 & 0 & 1 \\\\ \\end{pmatrix} \\end{align}\\]

If the parameterization of Equation (11.7) (corresponding R formula: ~ group - 1) were used, the model matrix would be

\\[\\begin{align} \\boldsymbol{X}= \\begin{pmatrix} 1 & 0 & 0 \\\\ 1 & 0 & 0 \\\\ 0 & 1 & 0 \\\\ 0 & 0 & 1 \\\\ 0 & 0 & 1 \\\\ \\end{pmatrix} \\end{align}\\]

To obtain the parameter estimates for the model parameterized according to Equation (11.6) we fit the model in R:

# fit the model
mod <- lm(y~group, data=dat)
# parameter estimates
mod
##
## Call:
## lm(formula = y ~ group, data = dat)
##
## Coefficients:
##  (Intercept)  groupgroup 2  groupgroup 3
##       12.367         2.215        -5.430
summary(mod)$sigma
## [1] 1.684949

The Intercept is \\(\\beta_0\\). The other coefficients are named with the factor name (group) and the factor level (either group 2 or group 3). These are \\(\\beta_1\\) and \\(\\beta_2\\), respectively. Before drawing conclusions from an R output we need to examine whether the model assumptions are met, that is, we need to do a residual analysis as described in Chapter 12.

Different questions can be answered using the above ANOVA: What are the group means? What is the difference in the means between group 1 and group 2?
What is the difference between the means of the heaviest and lightest group? In a Bayesian framework we can directly assess how strongly the data support the hypothesis that the mean of group 2 is larger than the mean of group 1. We first simulate from the posterior distribution of the model parameters.

library(arm)
nsim <- 1000
bsim <- sim(mod, n.sim=nsim)

Then we obtain the posterior distributions for the group means according to the parameterization of the model formula (Equation (11.6)).

m.g1 <- coef(bsim)[,1]
m.g2 <- coef(bsim)[,1] + coef(bsim)[,2]
m.g3 <- coef(bsim)[,1] + coef(bsim)[,3]

The histograms of the simulated values from the posterior distributions of the three means are given in Fig. 11.8. The three means are well separated and, based on our data, we are confident that the group means differ. From these simulated posterior distributions we obtain the means and use the 2.5% and 97.5% quantiles as limits of the 95% uncertainty intervals (Fig. 11.8, right).

# save simulated values from posterior distribution in tibble
post <- tibble(`group 1` = m.g1, `group 2` = m.g2, `group 3` = m.g3) %>% gather("groups", "Group means")
# histograms per group
leftplot <- ggplot(post, aes(x = `Group means`, fill = groups)) + geom_histogram(aes(y = after_stat(density)), binwidth = 0.5, col = "black") + labs(y = "Density") + theme(legend.position = "top", legend.title = element_blank())
# plot mean and 95%-CrI
rightplot <- post %>% group_by(groups) %>% dplyr::summarise( mean = mean(`Group means`), CrI_lo = quantile(`Group means`, probs = 0.025), CrI_up = quantile(`Group means`, probs = 0.975)) %>% ggplot(aes(x = groups, y = mean)) + geom_point() + geom_errorbar(aes(ymin = CrI_lo, ymax = CrI_up), width = 0.1) + ylim(0, NA) + labs(y = "Weight (g)", x ="")
# arrange both plots side by side
multiplot(leftplot, rightplot, cols = 2)

Figure 11.8: Distribution of the simulated values from the posterior distributions of the group means (left); group means with 95% uncertainty intervals obtained from the simulated
distributions (right).

To obtain the posterior distribution of the difference between the means of group 1 and group 2, we simply calculate this difference for each draw from the joint posterior distribution of the group means.

d.g1.2 <- m.g1 - m.g2
mean(d.g1.2)
## [1] -2.209551
quantile(d.g1.2, probs = c(0.025, 0.975))
##      2.5%     97.5%
## -3.128721 -1.342693

The estimated difference is -2.21. In the small model world, we are 95% sure that the difference between the means of group 1 and 2 is between -3.13 and -1.34.

How strongly do the data support the hypothesis that the mean of group 2 is larger than the mean of group 1? To answer this question we calculate the proportion of the draws from the joint posterior distribution for which the mean of group 2 is larger than the mean of group 1.

sum(m.g2 > m.g1) / nsim
## [1] 1

This means that in all of the 1000 simulations from the joint posterior distribution, the mean of group 2 was larger than the mean of group 1. Therefore, there is a very high probability (i.e., it is close to 1; because probabilities are never exactly 1, we write >0.999) that the mean of group 2 is larger than the mean of group 1.

11.3 Other variants of normal linear models: Two-way ANOVA, analysis of covariance and multiple regression

Up to now, we introduced normal linear models with one predictor only. We can add more predictors to the model and these can be numerical or categorical ones. Traditionally, models with 2 or 3 categorical predictors are called two-way or three-way ANOVA, respectively. Models with a mixture of categorical and numerical predictors are called ANCOVA. And models containing only numerical predictors are called multiple regressions. Nowadays, we only use the term normal linear model as an umbrella term for all these types of models. While it is easy to add additional predictors in the R formula of the model, it becomes more difficult to interpret the coefficients of such multi-dimensional models.
Two important topics arise with multi-dimensional models: interactions and partial effects. We dedicate the full next chapter to partial effects and introduce interactions in this chapter using two examples. The first is a model including two categorical predictors and the second is a model with one categorical and one numeric predictor.

11.3.1 Linear model with two categorical predictors (two-way ANOVA)

In the first example, we ask how large the differences in wing length are between age and sex classes of the Coal tit Periparus ater. Wing lengths were measured on 19 coal tit museum skins with known sex and age class.

data(periparusater)
dat <- tibble(periparusater) # give the data a short handy name
dat$age <- recode_factor(dat$age, "4"="adult", "3"="juvenile") # replace EURING code
dat$sex <- recode_factor(dat$sex, "2"="female", "1"="male") # replace EURING code

To describe differences in wing length between the age classes or between the sexes a normal linear model with two categorical predictors is fitted to the data. The two predictors are specified on the right side of the model formula separated by the + sign, which means that the model is an additive combination of the two effects (as opposed to an interaction, see following).

mod <- lm(wing ~ sex + age, data=dat)

After having seen that the residual distribution does not appear to violate the model assumptions (as assessed with diagnostic residual plots, see Chapter 12), we can draw inferences. We first have a look at the model parameter estimates:

mod
##
## Call:
## lm(formula = wing ~ sex + age, data = dat)
##
## Coefficients:
## (Intercept)      sexmale  agejuvenile
##     61.3784       3.3423      -0.8829
summary(mod)$sigma
## [1] 2.134682

R has taken the first level of the factors age and sex (as defined in the data.frame dat) as the reference levels. The intercept is the expected wing length for individuals having the reference level in age and sex, thus adult female.
The other two parameters provide estimates of what is to be added to the intercept to get the expected wing length for the other levels. The parameter sexmale is the average difference between females and males. We can conclude that males have, on average, a 3.3 mm longer wing than females. Similarly, the parameter agejuvenile measures the differences between the age classes and we can conclude that, on average, juveniles have a 0.9 mm shorter wing than adults. When we insert the parameter estimates into the model formula, we get the recipe to calculate expected values for each age and sex combination: \\(\\hat{y_i} = \\hat{\\beta_0} + \\hat{\\beta_1}I(sex=male) + \\hat{\\beta_2}I(age=juvenile)\\) which yields \\(\\hat{y_i} = 61.4 + 3.3 I(sex=male) - 0.9 I(age=juvenile)\\).

Alternatively, we could use matrix notation. We construct a new data set that contains one virtual individual for each age and sex class.

newdat <- tibble(expand.grid(sex=factor(levels(dat$sex)), age=factor(levels(dat$age)))) # expand.grid creates a data frame with all combinations of the values given
newdat
## # A tibble: 4 × 2
##   sex    age
##   <fct>  <fct>
## 1 female adult
## 2 male   adult
## 3 female juvenile
## 4 male   juvenile
newdat$fit <- predict(mod, newdata=newdat) # fast way of getting fitted values
# or
Xmat <- model.matrix(~sex+age, data=newdat) # creates a model matrix
newdat$fit <- Xmat %*% coef(mod)

For this new data set the model matrix contains four rows (one for each combination of age class and sex) and three columns. The first column contains only ones because the values of this column are multiplied by the intercept (\\(\\beta_0\\)) in the matrix multiplication. The second column contains an indicator variable for males (so only the rows corresponding to males contain a one) and the third column has ones for juveniles.
\\[\\begin{align} \\hat{y} = \\boldsymbol{X \\hat{\\beta}} = \\begin{pmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 0 & 1 \\\\ 1 & 1 & 1 \\\\ \\end{pmatrix} \\times \\begin{pmatrix} 61.4 \\\\ 3.3 \\\\ -0.9 \\end{pmatrix} = \\begin{pmatrix} 61.4 \\\\ 64.7 \\\\ 60.5 \\\\ 63.8 \\end{pmatrix} = \\boldsymbol{\\mu} \\tag{11.3} \\end{align}\\]

The result of the matrix multiplication is a vector containing the expected wing lengths, in the order of the rows of the model matrix: adult females, adult males, juvenile females and juvenile males.

When creating the model matrix with model.matrix, care has to be taken that the columns in the model matrix match the parameters in the vector of model coefficients. To achieve that, the formula given to model.matrix must be identical to the right-hand side of the formula of the fitted model (same order of terms!), and the factors in newdat must have the same levels in the same order as in the data the model was fitted to.

To describe the uncertainty of the fitted values, we use 2000 sets of parameter values of the joint posterior distribution to obtain 2000 values for each of the four fitted values. These are stored in the object fitmat. In the end, we extract for every fitted value, i.e., for every row in fitmat, the 2.5% and 97.5% quantiles as the lower and upper limits of the 95% uncertainty interval.
# simulate from the joint posterior distribution of the model parameters
nsim <- 2000
bsim <- sim(mod, n.sim=nsim)
# one column of fitted values per simulated set of parameter values
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim) fitmat[,i] <- Xmat %*% coef(bsim)[i,]
newdat$lwr <- apply(fitmat, 1, quantile, probs=0.025)
newdat$upr <- apply(fitmat, 1, quantile, probs=0.975)
# plot the data together with fitted values and uncertainty intervals
dat$sexage <- factor(paste(dat$sex, dat$age))
newdat$sexage <- factor(paste(newdat$sex, newdat$age))
dat$pch <- 21
dat$pch[dat$sex=="male"] <- 22
dat$col <- "blue"
dat$col[dat$age=="adult"] <- "orange"
par(mar=c(4,4,0.5,0.5))
plot(wing~jitter(as.numeric(sexage), amount=0.05), data=dat, las=1, ylab="Wing length (mm)", xlab="Sex and age", xaxt="n", pch=dat$pch, bg=dat$col, cex.lab=1.2, cex=1, cex.axis=1, xlim=c(0.5, 4.5))
axis(1, at=c(1:4), labels=levels(dat$sexage), cex.axis=1)
segments(as.numeric(newdat$sexage), newdat$lwr, as.numeric(newdat$sexage), newdat$upr, lwd=2, lend="butt")
points(as.numeric(newdat$sexage), newdat$fit, pch=17)

Figure 11.9: Wing length measurements on 19 museum skins of coal tits per age class and sex. Fitted values are from the additive model (black triangles) and from the model including an interaction (black dots). Vertical bars = 95% uncertainty intervals.

We can see that the fitted values are not equal to the arithmetic means of the groups; this is especially clear for juvenile males. The fitted values are constrained because only three parameters were used to estimate four means. In other words, this model assumes that the age difference is equal in both sexes and, vice versa, that the difference between the sexes does not change with age. If the effect of sex changes with age, we would include an interaction between sex and age in the model. Including an interaction adds a fourth parameter enabling us to estimate the group means exactly. In R, an interaction is indicated with the : sign.
mod2 <- lm(wing ~ sex + age + sex:age, data=dat)
# alternative formulations of the same model:
# mod2 <- lm(wing ~ sex * age, data=dat)
# mod2 <- lm(wing ~ (sex + age)^2, data=dat)

The formula for this model is \\(\\hat{y_i} = \\hat{\\beta_0} + \\hat{\\beta_1}I(sex=male) + \\hat{\\beta_2}I(age=juvenile) + \\hat{\\beta_3}I(age=juvenile)I(sex=male)\\). From this formula we get the following expected values for the sexes and age classes:

for adult females: \\(\\hat{y} = \\beta_0\\)
for adult males: \\(\\hat{y} = \\beta_0 + \\beta_1\\)
for juvenile females: \\(\\hat{y} = \\beta_0 + \\beta_2\\)
for juvenile males: \\(\\hat{y} = \\beta_0 + \\beta_1 + \\beta_2 + \\beta_3\\)

The interaction parameter measures how much the difference between the sexes differs between the age classes. To obtain the fitted values the R-code above can be recycled with two adaptations. First, the model name needs to be changed to mod2. Second, importantly, the model matrix needs to be adapted to the new model formula.

newdat$fit2 <- predict(mod2, newdata=newdat)
bsim <- sim(mod2, n.sim=nsim)
Xmat <- model.matrix(~ sex + age + sex:age, data=newdat)
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim) fitmat[,i] <- Xmat %*% coef(bsim)[i,]
newdat$lwr2 <- apply(fitmat, 1, quantile, probs=0.025)
newdat$upr2 <- apply(fitmat, 1, quantile, probs=0.975)
print(newdat[,c(1:5,7:9)], digits=3)
## # A tibble: 4 × 8
##   sex    age      fit[,1]   lwr   upr  fit2  lwr2  upr2
##   <fct>  <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female adult       61.4  59.3  63.3  61.1  58.8  63.5
## 2 male   adult       64.7  63.3  66.2  64.8  63.3  66.4
## 3 female juvenile    60.5  58.4  62.6  60.8  58.2  63.4
## 4 male   juvenile    63.8  61.7  66.0  63.5  60.7  66.2

These fitted values are now exactly equal to the arithmetic means of each group.
tapply(dat$wing, list(dat$age, dat$sex), mean) # arithmetic mean per group
##            female     male
## adult    61.12500 64.83333
## juvenile 60.83333 63.50000

We can also see that the uncertainty of the fitted values is larger for the model with an interaction than for the additive model. This is because, in the model including the interaction, an additional parameter has to be estimated based on the same amount of data. Therefore, the information available per parameter is smaller than in the additive model. In the additive model, some information is pooled between the groups by making the assumption that the difference between the sexes does not depend on age.

The degree to which a difference in wing length is important depends on the context of the study. Here, for example, we could consider effects of wing length on flight energetics and maneuverability or methodological aspects like measurement error. Mean between-observer difference in wing length measurement is around 0.3 mm (Jenni and Winkler 1989). Therefore, we may consider that the interaction is important because its point estimate is larger than 0.3 mm.

mod2
##
## Call:
## lm(formula = wing ~ sex + age + sex:age, data = dat)
##
## Coefficients:
##         (Intercept)              sexmale          agejuvenile
##             61.1250               3.7083              -0.2917
## sexmale:agejuvenile
##             -1.0417
summary(mod2)$sigma
## [1] 2.18867

Further, we think a difference of 1 mm in wing length may be relevant compared to the among-individual variation of which the standard deviation is around 2 mm. Therefore, we report the parameter estimates of the model including the interaction together with their uncertainty intervals.

Table 11.1: Parameter estimates of the model for wing length of Coal tits with 95% uncertainty interval.
| Parameter | Estimate | lwr | upr |
|---|---|---|---|
| (Intercept) | 61.12 | 58.85 | 63.53 |
| sexmale | 3.71 | 0.93 | 6.59 |
| agejuvenile | -0.29 | -3.93 | 3.36 |
| sexmale:agejuvenile | -1.04 | -5.96 | 3.90 |

From these parameters we obtain an estimated difference in wing length between the sexes for adults of 3.7 mm. The posterior probability of the hypothesis that adult males have an average wing length that is at least 1 mm larger than that of adult females is mean(bsim@coef[,2]>1), which is 0.97. Thus, there is some evidence that adult Coal tit males have substantially larger wings than adult females in these data. However, we do not draw further conclusions on other differences from these data because statistical uncertainty is large due to the low sample size.

11.3.2 A linear model with a categorical and a numeric predictor (ANCOVA)

An analysis of covariance, ANCOVA, is a normal linear model that contains at least one factor and one continuous variable as predictor variables. The continuous variable is also called a covariate, hence the name analysis of covariance. An ANCOVA can be used, for example, when we are interested in how the biomass of grass depends on the distance from the surface of the soil to the ground water in two different species (Alopecurus pratensis, Dactylis glomerata). The two species were grown by Ellenberg (1953) in tanks that showed a gradient in distance from the soil surface to the ground water. The distance from the soil surface to the ground water is used as a covariate (water). We further assume that the species react differently to the water conditions. Therefore, we include an interaction between species and water. The model formula is then

\\(\\hat{y_i} = \\beta_0 + \\beta_1I(species=Dg) + \\beta_2water_i + \\beta_3I(species=Dg)water_i\\)
\\(y_i \\sim normal(\\hat{y_i}, \\sigma^2)\\)

To fit the model, it is important to first check whether the factor is indeed defined as a factor and the continuous variable contains numbers (i.e., numeric or integer values) in the data frame.
data(ellenberg)
index <- is.element(ellenberg$Species, c("Ap", "Dg")) & complete.cases(ellenberg$Yi.g)
dat <- ellenberg[index,c("Water", "Species", "Yi.g")] # select two species
dat <- droplevels(dat)
str(dat)
## 'data.frame':    84 obs. of  3 variables:
##  $ Water  : int  5 20 35 50 65 80 95 110 125 140 ...
##  $ Species: Factor w/ 2 levels "Ap","Dg": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Yi.g   : num  34.8 28 44.5 24.8 37.5 ...

Species is a factor with two levels and Water is an integer variable, so we can fit the model.

mod <- lm(log(Yi.g) ~ Species + Water + Species:Water, data=dat)
# plot(mod) # 4 standard residual plots

We log-transform the biomass to make the residuals closer to normally distributed. With this transformation, the normal distribution assumption is met well. However, a slight banana-shaped relationship exists between the residuals and the fitted values, indicating a slight non-linear relationship between biomass and water. Further, residuals showed substantial autocorrelation because the grass biomass was measured in different tanks. Measurements from the same tank were more similar than measurements from different tanks after correcting for the distance to water. Thus, the analysis we have done here suffers from pseudoreplication. We will re-analyze the example data in a more appropriate way in Chapter 13.

Let's have a look at the model matrix (first and last six rows only).

head(model.matrix(mod)) # print the first 6 rows of the matrix
##    (Intercept) SpeciesDg Water SpeciesDg:Water
## 24           1         0     5               0
## 25           1         0    20               0
## 26           1         0    35               0
## 27           1         0    50               0
## 28           1         0    65               0
## 29           1         0    80               0
tail(model.matrix(mod)) # print the last 6 rows of the matrix
##     (Intercept) SpeciesDg Water SpeciesDg:Water
## 193           1         1    65              65
## 194           1         1    80              80
## 195           1         1    95              95
## 196           1         1   110             110
## 197           1         1   125             125
## 198           1         1   140             140

The first column of the model matrix contains only 1s. These are multiplied by the intercept in the matrix multiplication that yields the fitted values.
The second column contains the indicator variable for species Dactylis glomerata (Dg). Species Alopecurus pratensis (Ap) is the reference level. The third column contains the values for the covariate. The last column contains the product of the indicator for species Dg and water. This column specifies the interaction between species and water.

The parameters are the intercept, the difference between the species, a slope for water and the interaction parameter.

mod
##
## Call:
## lm(formula = log(Yi.g) ~ Species + Water + Species:Water, data = dat)
##
## Coefficients:
##     (Intercept)        SpeciesDg            Water  SpeciesDg:Water
##         4.33041         -0.23700         -0.01791          0.01894
summary(mod)$sigma
## [1] 0.9001547

These four parameters define two regression lines, one for each species (Figure 11.10 Left). For Ap, it is \\(\\hat{y_i} = \\beta_0 + \\beta_2water_i\\), and for Dg it is \\(\\hat{y_i} = (\\beta_0 + \\beta_1) + (\\beta_2 + \\beta_3)water_i\\). Thus, \\(\\beta_1\\) is the difference in the intercept between the species and \\(\\beta_3\\) is the difference in the slope.

Figure 11.10: Aboveground biomass (g, log-transformed) in relation to distance to ground water and species (two grass species). Fitted values from a model including an interaction species x water (left) and a model without interaction (right) are added. The dotted line indicates water=0.

As a consequence of including an interaction in the model, the interpretation of the main effects becomes difficult. From the above model output, we read that the intercept of the species Dg is lower than the intercept of the species Ap. However, from a graphical inspection of the data, we would expect that the average biomass of species Dg is higher than the one of species Ap. The estimated main effect of species is counter-intuitive because it is measured where water is zero (i.e., it is the difference in the intercepts and not between the mean biomasses of the species).
Therefore, the main effect of species in the above model does not have a biologically meaningful interpretation. We have two possibilities to get a meaningful species effect. First, we could delete the interaction from the model (Figure 11.10 Right). Then the difference in the intercept reflects an average difference between the species. However, the fit for such an additive model is much worse compared to the model with interaction, and an average difference between the species may not make much sense because this difference depends so much on water. Therefore, we prefer to use a model including the interaction and may opt for the second possibility. Second, we could move the location where water equals 0 to the center of the data by transforming, specifically centering, the variable water: \\(water.c = water - mean(water)\\). When the predictor variable (water) is centered, the difference in the intercepts (the main effect of species) corresponds to the difference in fitted values measured in the center of the data.

For drawing biological conclusions from these data, we refer to Chapter 13, where we use a more appropriate model.

11.4 Partial coefficients and some comments on collinearity

Many biologists think that it is forbidden to include correlated predictor variables in a model. They use variance inflation factors (VIF) to omit some of the variables. However, omitting important variables from the model just because a correlation coefficient exceeds a threshold value can have undesirable effects. Here, we explain why and we present the usefulness and limits of partial coefficients (also called partial correlation or partial effects). We start with an example illustrating the usefulness of partial coefficients and then give some guidelines on how to deal with collinearity.

As an example, we look at hatching dates of Snowfinches and how these dates relate to the date when snow melt started (first date in the season when a minimum of 5% ground is snow free).
A thorough analysis of the data is presented by Schano et al. (2021). An important question is how well Snowfinches can adjust their hatching dates to the snow conditions. For Snowfinches, it is important to raise their nestlings during snow melt. Their nestlings grow faster when they are reared during the snow melt compared to after snow has completely melted, because their parents find nutrient-rich insect larvae at the edges of melting snow patches.

load("RData/snowfinch_hatching_date.rda")
# Pearson's correlation coefficient
cor(datsf$elevation, datsf$meltstart, use = "pairwise.complete")
## [1] 0.3274635
mod <- lm(meltstart~elevation, data=datsf)
100*coef(mod)[2] # change in meltstart with 100m change in elevation
## elevation
##   2.97768

Hatching dates of Snowfinch broods were inferred from citizen science data from the Alps, where snow melt starts later at higher elevations compared to lower elevations. Thus, the start of snow melt is correlated with elevation (Pearson's correlation coefficient 0.33). On average, snow starts melting 3 days later with every 100 m increase in elevation.

mod1 <- lm(hatchday.mean~meltstart, data=datsf)
mod1
##
## Call:
## lm(formula = hatchday.mean ~ meltstart, data = datsf)
##
## Coefficients:
## (Intercept)    meltstart
##   167.99457      0.06325

From a normal linear regression of hatching date on the snow melt date, we obtain an estimate of 0.06 days delay in hatching date with one day later snow melt. This effect size describes the relationship in the data that were collected along an elevational gradient. Along the elevational gradient there are many factors that change, such as average temperature, air pressure or sun radiation. All these factors may have an influence on the birds' decision to start breeding.
Consequently, from the raw correlation between hatching dates and start of snow melt we cannot conclude how Snowfinches react to changes in the start of snow melt because the correlation seen in the data may be caused by other factors changing with elevation (such a correlation is called pseudocorrelation). However, we are interested in the correlation between hatching date and date of snow melt independent of other factors changing with elevation. In other words, we would like to measure how much, on average, hatching date is delayed when snow melt starts one day later while all other factors are kept constant. This is called the partial effect of snow melt date. Therefore, we include elevation as a covariate in the model.

library(arm)
mod <- lm(hatchday.mean~elevation + meltstart, data=datsf)
mod
##
## Call:
## lm(formula = hatchday.mean ~ elevation + meltstart, data = datsf)
##
## Coefficients:
## (Intercept)    elevation    meltstart
##  154.383936     0.007079     0.037757

From this model, we obtain an estimate of 0.04 days delay in hatching date with one day later snow melt at a given elevation. That gives a difference in hatching date between early and late years (around one month difference in snow melt date) at a given elevation of 1.13 days (Figure 11.11). We further get an estimate of 0.71 days later hatching date for each 100 m shift in elevation. Thus, an 18.75 days later snow melt corresponds to the same delay in average hatching date as a 100 m increase in elevation (0.71 / 0.038 = 18.75).

When we estimate the coefficient within a constant elevation (coloured regression lines in Figure 11.11), it is lower than the raw correlation and closer to a causal relationship, because it is corrected for elevation. However, in observational studies, we can never be sure whether the partial coefficients can be interpreted as a causal relationship unless we include all factors that influence hatching date.
Nevertheless, partial effects give much more insight into a system compared to univariate analyses because we can separate the effects of simultaneously acting variables (that we have measured). The result indicates that Snowfinches may not react very sensitively to varying timing of snow melt, whereas at higher elevations they clearly breed later compared to lower elevations.

Figure 11.11: Illustration of the partial coefficient of snow melt date in a model of hatching date. Panel A shows the entire raw data together with the regression lines drawn for three different elevations. The regression lines span the range of snow melt dates occurring at the respective elevation (shown in panel C). Panel B is the same as panel A, but zoomed in to better see the regression lines and with an additional regression line (in black) from the model that does not take elevation into account.

We have seen that it can be very useful to include more than one predictor variable in a model even if they are correlated with each other. In fact, there is nothing wrong with that. However, correlated predictors (collinearity) make things more complicated. For example, partial regression lines should not be drawn across the whole range of values of a variable, to avoid extrapolating beyond the data. At 2800 m asl snow melt never starts in the beginning of March. Therefore, the blue regression line would not make sense for snow melt dates in March.

Further, sometimes correlations among predictors indicate that these predictors measure the same underlying aspect and we are actually interested in the effect of this underlying aspect on our response. For example, we could also include the date of the end of snow melt. Both variables, the start and the end of the snow melt, measure the timing of snow melt. Including both as predictors in the model would result in partial coefficients that measure how much hatching date changes when the snow melt starts one day later, while the end date is constant.
That interpretation is a mixture of the effect of timing and duration rather than of snow melt timing alone. Similarly, the coefficient of the end of snow melt measures a mixture of duration and timing. Thus, if we include two variables that are correlated because they measure the same aspect (just a little bit differently), we get coefficients that are hard to interpret and may not measure what we actually are interested in. In such a cases, we get easier to interpret model coefficients, if we include just one variable of each aspect that we are interested in, e.g. we could include one timing variable (e.g. start of snow melt) and the duration of snow melt that may or may not be correlated with the start of snow melt. To summarize, the decision of what to do with correlated predictors primarily relies on the question we are interested in, i.e., what exactly should the partial coefficients be an estimate for. A further drawback of collinearity is that model fitting can become difficult. When strong correlations are present, model fitting algorithms may fail. If they do not fail, the statistical uncertainty of the estimates often becomes large. This is because the partial coefficient of one variable needs to be estimated for constant values of the other predictors in the model which means that a reduced range of values is available as illustrated in Figure 11.11 C. However, if uncertainty intervals (confidence, credible or compatibility intervals) are reported alongside the estimates, then using correlated predictors in the same model is absolutely fine, if the fitting algorithm was successful. The correlations per se can be interesting. Further readings on how to visualize and analyse data with complex correlation structures: principal component analysis (Manly 1994) path analyses, e.g. 
Shipley (2009), and structural equation models (Hoyle 2012). 11.5 Ordered Factors and Contrasts In this chapter, we have seen that the model matrix is an \\(n \\times k\\) matrix (with \\(n\\) = sample size and \\(k\\) = number of model coefficients) that is multiplied by the vector of the \\(k\\) model coefficients to obtain the fitted values of a normal linear model. The first column of the model matrix normally contains only ones. This column is multiplied by the intercept. The other columns contain the observed values of the predictor variables if these are numeric variables, or indicator variables (= dummy variables) for factor levels if the predictors are categorical variables (= factors). For categorical variables the model matrix can be constructed in a number of ways. How it is constructed determines how the model coefficients can be interpreted. For example, coefficients could represent differences between the means of specific factor levels and the mean of the reference level. That is what we have introduced above. However, they could also represent a linear, quadratic or cubic effect of an ordered factor. Here, we show how this works. An ordered factor is a categorical variable with levels that have a natural order, for example, low, medium and high. How do we tell R that a factor is ordered? The swallow data contain a factor nesting_aid that contains the type of aid provided in a barn for the nesting swallows. The natural order of the levels is none < support (e.g., a wooden stick in the wall that helps support a nest built by the swallow) < artif_nest (an artificial nest) < both (support and artificial nest). However, when we read in the data, R orders these levels alphabetically rather than according to the logical order. data(swallows) levels(swallows$nesting_aid) ## [1] "artif_nest" "both" "none" "support" And with the function contrasts we see how R will construct the model matrix. 
contrasts(swallows$nesting_aid) ## both none support ## artif_nest 0 0 0 ## both 1 0 0 ## none 0 1 0 ## support 0 0 1 R will construct three dummy variables and call them both, none, and support. The variable both will have an entry of one when the observation is both and zero otherwise. Similarly, the other two dummy variables are indicator variables of the other two levels and artif_nest is the reference level. The model coefficients can then be interpreted as the difference between artif_nest and each of the other levels. The instruction for how to transform a factor into columns of a model matrix is called the contrasts. Now, let's bring the levels into their natural order and define the factor as an ordered factor. swallows$nesting_aid <- factor(swallows$nesting_aid, levels=c("none", "support", "artif_nest", "both"), ordered=TRUE) levels(swallows$nesting_aid) ## [1] "none" "support" "artif_nest" "both" The levels are now in the natural order. R will, from now on, use this order for analyses, tables, and plots, and because we defined the factor to be an ordered factor, R will use polynomial contrasts: contrasts(swallows$nesting_aid) ## .L .Q .C ## [1,] -0.6708204 0.5 -0.2236068 ## [2,] -0.2236068 -0.5 0.6708204 ## [3,] 0.2236068 -0.5 -0.6708204 ## [4,] 0.6708204 0.5 0.2236068 When using polynomial contrasts, R will construct three (= number of levels minus one) variables that are called .L, .Q, and .C for linear, quadratic and cubic effects. The contrast matrix defines which numeric value will be inserted in each of the three corresponding columns in the model matrix for each observation. For example, an observation with support in the factor nesting_aid will get the values -0.224, -0.5 and 0.671 in the columns .L, .Q and .C of the model matrix. 
These contrasts define yet another way to get 4 different group means: \\(m_1 = \\beta_0 - 0.671\\beta_1 + 0.5\\beta_2 - 0.224\\beta_3\\) \\(m_2 = \\beta_0 - 0.224\\beta_1 - 0.5\\beta_2 + 0.671\\beta_3\\) \\(m_3 = \\beta_0 + 0.224\\beta_1 - 0.5\\beta_2 - 0.671\\beta_3\\) \\(m_4 = \\beta_0 + 0.671\\beta_1 + 0.5\\beta_2 + 0.224\\beta_3\\) The group means are the same, independent of whether a factor is defined as ordered or not. The ordering also has no effect on the variance that is explained by the factor nesting_aid or the overall model fit. Only the model coefficients and their interpretation depend on whether a factor is defined as ordered or not. When we define a factor as ordered, the coefficients can be interpreted as linear, quadratic, cubic, or higher order polynomial effects. The number of polynomials will always be the number of factor levels minus one (unless the intercept is omitted from the model, in which case it is the number of factor levels). Linear, quadratic, and further polynomial effects normally are more interesting for ordered factors than single differences from a reference level because linear and polynomial trends tell us something about consistent changes in the outcome along the ordered factor levels. Therefore, an ordered factor with k levels is treated like a covariate consisting of the centered level numbers (-1.5, -0.5, 0.5, 1.5 in our case with four levels) and k-1 orthogonal polynomials of this covariate are included in the model. Thus, if we have an ordered factor A with three levels, y~A is equivalent to y~x+I(x^2), with x=-1 for the lowest, x=0 for the middle and x=1 for the highest level. Note that it is also possible to define our own contrasts if we are interested in specific differences or trends. However, it is not trivial to find meaningful and orthogonal (= uncorrelated) contrasts. 
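The claim that only the coefficients, not the group means, depend on the contrasts can be checked with a small sketch. The data below are simulated and the object names (aid, aid_ord, mod_trt, mod_poly) are made up for illustration; only base R functions are used.

```r
# Simulated example; all names and numbers are hypothetical
set.seed(1)
lev <- c("none", "support", "artif_nest", "both")
aid <- factor(rep(lev, each = 20), levels = lev)          # unordered factor
y <- rnorm(80, mean = c(2, 3, 3.5, 5)[as.integer(aid)], sd = 1)
aid_ord <- factor(aid, levels = lev, ordered = TRUE)      # same factor, ordered

mod_trt  <- lm(y ~ aid)      # treatment contrasts (default for unordered factors)
mod_poly <- lm(y ~ aid_ord)  # polynomial contrasts (default for ordered factors)

# polynomial contrast matrix for four levels
round(contr.poly(4), 3)

# the coefficients differ between the two parameterizations ...
coef(mod_trt)
coef(mod_poly)

# ... but the fitted values (the group means) are identical
all.equal(unname(fitted(mod_trt)), unname(fitted(mod_poly)))

# group means reconstructed from the polynomial coefficients:
# intercept column plus contrast matrix, multiplied by the coefficients
means_poly <- drop(cbind(1, contr.poly(4)) %*% coef(mod_poly))
means_raw  <- tapply(y, aid, mean)
round(cbind(means_poly, means_raw), 3)
```

The last two objects should agree up to numerical precision, which is the statement made by the four equations for the group means.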
11.6 Quadratic and Higher Polynomial Terms The straight regression line for the biomass of the grass species Alopecurus pratensis (Ap) dependent on the distance to the ground water does not fit well (Figure 11.10). The residuals at low and high values of water tend to be positive and intermediate water levels are associated with negative residuals. This points to a possible violation of the model assumptions. The problem is that the relationship between distance to water and biomass of species Ap is not linear. In real life, we often find non-linear relationships, but if the shape of the relationship is quadratic (plus, potentially, a few more polynomials) we can still use linear modeling (the term linear refers to the function being linear in the coefficients: \\(f(x) = \\beta_0 + \\beta_1x + \\beta_2x^2\\) is a linear function of the \\(\\beta\\)s, whereas, e.g., \\(f(x) = \\beta^x\\) is not). We simply add the quadratic term of the predictor variable, that is, water in our example, as a further predictor in the linear predictor: \\(\\hat{y_i} = \\beta_0+\\beta_1water_i+\\beta_2water_i^2\\). A quadratic term can be fitted in R using the function I(), which tells R that we want the squared values of distance to water. If we do not use I(), the ^2 indicates a two-way interaction. The model specification is then lm(log(Yi.g) ~ Water + I(Water^2), data=...). The cubic term would be added by +I(Water^3). As with interactions, a polynomial term changes the interpretation of lower level polynomials. Therefore, we normally include all polynomials up to a specific degree. Furthermore, polynomials are normally correlated (if no special transformation is used, see below), which could cause problems when fitting the model, such as non-convergence. To avoid collinearity among polynomials, so-called orthogonal polynomials can be used. These are polynomials that are uncorrelated. 
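How strongly raw polynomial terms are correlated, and that orthogonal polynomials are not, can be verified with a few lines of base R. The predictor values below are made up for illustration; poly is the function from the stats package that computes orthogonal polynomials.

```r
# Hypothetical predictor values, for illustration only
water <- seq(5, 130, by = 5)

# the raw linear and quadratic terms are strongly correlated
cor_raw <- cor(water, water^2)

# the orthogonal polynomials returned by poly() are uncorrelated
p <- poly(water, 2)
cor_orth <- cor(p[, 1], p[, 2])

round(c(raw = cor_raw, orthogonal = cor_orth), 3)
```

The first correlation is close to one, the second is zero up to numerical precision.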
To that end, we can use the function poly, which creates as many orthogonal polynomials of the variable as we want: poly(dat$Water, 2) creates two columns, the first one can be used to model the linear effect of water, the second one to model the quadratic term of water: t.poly <- poly(dat$Water, 2) dat$Water.l <- t.poly[,1] # linear term for water dat$Water.q <- t.poly[,2] # quadratic term for water mod <- lm(log(Yi.g) ~ Water.l + Water.q, data=dat) When orthogonal polynomials are used, the estimated linear and quadratic effects can be interpreted as purely linear and purely quadratic influences of the predictor on the outcome. The function poly applies a specific transformation to the original variables. To reproduce the transformation (e.g. for getting the corresponding orthogonal polynomials for new data used to draw an effect plot), the function predict can be used with the poly-object created based on the original data. newdat <- data.frame(Water = seq(0,130)) # transformation analogous to the one used to fit the model: newdat$Water.l <- predict(t.poly, newdat$Water)[,1] newdat$Water.q <- predict(t.poly, newdat$Water)[,2] These transformed variables can then be used to calculate fitted values that correspond to the water values specified in the new data. "],["residualanalysis.html", "12 Assessing Model Assumptions 12.1 Model Assumptions 12.2 Independent and Identically Distributed 12.3 The QQ-Plot 12.4 Temporal Autocorrelation 12.5 Spatial Autocorrelation 12.6 Heteroscedasticity", " 12 Assessing Model Assumptions 12.1 Model Assumptions Every statistical model makes assumptions. We try to build models that reflect the data-generating process as realistically as possible. However, a model never is the truth. Yet, all inferences drawn from a model, such as estimates of effect size or derived quantities with credible intervals, are based on the assumption that the model is true. 
However, if a model captures the data-generating process poorly, for example, because it misses important structures (predictors, interactions, polynomials), inferences drawn from the model are probably biased and results become unreliable. In a (hypothetical) model that captures all important structures of the data-generating process, the stochastic part, the difference between the observation and the fitted value (the residuals), should only show random variation. Analyzing residuals is a very important part of the data analysis process. Residual analysis can be very exciting, because the residuals show what remains unexplained by the present model. Residuals can sometimes show surprising patterns and, thereby, provide deeper insight into the system. However, at this step of the analysis it is important not to forget the original research questions that motivated the study. Because these questions have been asked without knowledge of the data, they protect against data dredging. Of course, residual analysis may raise interesting new questions. Nonetheless, these new questions have emerged from patterns in the data, which might just be random, not systematic, patterns. The search for a model with good fit should be guided by thinking about the process that generated the data, not by trial and error (i.e., do not try all possible variable combinations until the residuals look good; that is data dredging). All changes done to the model should be scientifically justified. Usually, model complexity increases, rather than decreases, during the analysis. 12.2 Independent and Identically Distributed Usually, we model an outcome variable as independent and identically distributed (iid) given the model parameters. This means that all observations with the same predictor values behave like independent random numbers from the identical distribution. As a consequence, residuals should look iid. 
Independent means that: The residuals do not correlate with other variables (those that are included in the model as well as any other variable not included in the model). The residuals are not grouped (i.e., the means of any set of residuals should all be equal). The residuals are not autocorrelated (i.e., no temporal or spatial autocorrelation exists; Sections 12.4 and 12.5). Identically distributed means that: All residuals come from the same distribution. In the case of a linear model with normal error distribution (Chapter 11) the residuals are assumed to come from the same normal distribution. Particularly: The residual variance is homogeneous (homoscedasticity), that is, it does not depend on any predictor variable, and it does not change with the fitted value. The mean of the residuals is zero over the whole range of predictor values. When numeric predictors (covariates) are present, this implies that the relationship between x and y can be adequately described by a straight line. Residual analysis is mainly done graphically. R makes it easy to plot residuals to look at the different aspects just listed. As an example, we use a linear regression for the biomass of the grass species Dactylis glomerata in relation to water conditions in the soil. The first panel in Fig. 12.1 shows the residuals against the fitted values together with a smoother (red line). This plot is called the Tukey-Anscombe plot. The mean of the residuals should be around zero along the whole range of fitted values. Note that smoothers are very sensitive to random structures in the data, especially for low sample sizes and toward the edges of the data range. Often, curves at the edges of the data do not worry us because the edges of smoothers are based on small sample sizes. The second panel is a normal quantile-quantile (QQ) plot of the residuals. When the residuals are normally distributed, the points lie along the diagonal line. This plot is explained in more detail below. 
The third panel shows the square root of the absolute values of the standardized residuals, a measure of residual variance, versus the fitted values, together with a smoother. When the residual variance is homogeneous along the range of fitted values, the smoother is horizontal. The fourth panel shows the residuals against the leverage. An observation with a measurement of a predictor variable far from the others has a large leverage. When all predictors are factors, observations with a rare combination of factor levels have higher leverages than observations with a common combination of factor levels. Such observations have the potential to have a large influence on the regression line. A high leverage does not necessarily mean that this observation has a big influence on the model. If that observation fits well to the pattern of all other data points, the observation does not have an unduly large influence on the model estimates, despite its large leverage. However, if it does not fit into the picture, this observation has a strong influence on the parameter estimates. The influence of one observation on the parameter estimates is measured by the Cook's distance. Observations with large Cook's distances lie beyond the red dashed lines in the fourth of the residual plots (the 0.5 and 1 isolines for Cook's distances are given as dashed lines). Observations with a Cook's distance larger than 1 are usually considered to be overly influential and should be checked. The diagnostic plots (Fig. 12.1) of the residuals of the model fitted to the data of the species Dactylis glomerata look quite acceptable. 1. The average residual value is around zero along the range of fitted values, 2. the points are aligned diagonally in the QQ-plot, 3. the variance does not noticeably change along the fitted values, and 4. no observation has a large Cook's distance. 
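Leverage and Cook's distance can also be inspected numerically instead of reading them off the fourth panel. The following sketch uses the data set cars that ships with R purely as a stand-in; hatvalues and cooks.distance are functions of the stats package, while the object names are made up.

```r
# Built-in data set 'cars' used purely for illustration
mod_cars <- lm(dist ~ speed, data = cars)

h  <- hatvalues(mod_cars)        # leverage of each observation
cd <- cooks.distance(mod_cars)   # influence of each observation on the estimates

# the leverages sum to the number of model coefficients (here 2)
sum(h)

# the three observations with the largest Cook's distances
head(sort(cd, decreasing = TRUE), 3)

# is any observation beyond the conventional threshold of 1?
any(cd > 1)
```

Observations flagged this way are candidates for a closer look, not for automatic exclusion.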
data(ellenberg) mod <- lm(Yi.g~Water, data=ellenberg[ellenberg$Species=="Dg",]) par(mfrow=c(2,2)) plot(mod) Figure 12.1: Standard diagnostic residual plots of a linear regression for the biomass data of D. glomerata. However, when the same model is fitted to data of Alopecurus pratensis, the model assumptions may not be met well (Fig. 12.2). The average of the residuals decreases with increasing fitted values (panel 1). A few observations, in particular observation 133, do not fit to a normal distribution (panel 2). The residual variance increases with increasing fitted values (panel 3). Observation 133 has too high a Cook's distance (panel 4). mod <- lm(Yi.g~Water, data=ellenberg[ellenberg$Species=="Ap",]) par(mfrow=c(2,2)) plot(mod) Figure 12.2: Standard diagnostic residual plots of a linear regression for the biomass data of A. pratensis. An increasing variance with increasing fitted values is a widespread case. The logarithm or square-root transformation of the response variable often is a quick and simple solution. Also, in this case, the log transformation improved the diagnostic plots (Fig. 12.3). mod <- lm(log(Yi.g)~Water, data=ellenberg[ellenberg$Species=="Ap",]) par(mfrow=c(2,2)) plot(mod) Figure 12.3: Standard diagnostic residual plots of a linear regression for the logarithm of the biomass data of A. pratensis. The four plots produced by plot(mod) show the most important aspects of the model fit. However, often these four plots are not sufficient. In addition, we recommend plotting the residuals against all variables in the data set (including those not used in the current model). It is further recommended to think about the data structure. Can we assume that all observations are independent of each other? May there be spatial or temporal correlation? 12.3 The QQ-Plot Each residual represents a quantile of the sample of \\(n\\) residuals. These quantiles are defined by the sample size \\(n\\). A useful choice is the \\(((1,...,n)-0.5)/n\\)-th quantiles. 
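These quantiles can be computed by hand, which makes clear what a normal QQ-plot displays. The sketch below uses simulated numbers as a stand-in for model residuals; the object names are made up. For more than ten values, the x-coordinates agree with those used by the base R function qqnorm.

```r
set.seed(42)
n <- 50
res <- rnorm(n, mean = 0, sd = 2)   # simulated stand-in for model residuals

p <- ((1:n) - 0.5) / n              # probabilities of the n quantiles
theo <- qnorm(p)                    # corresponding standard normal quantiles

# QQ-plot by hand: sorted residuals against theoretical quantiles
plot(theo, sort(res), xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(res)

# qqnorm() uses the same theoretical quantiles when n > 10
qq <- qqnorm(res, plot.it = FALSE)
all.equal(sort(qq$x), theo)
```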
A QQ-plot shows the residuals on the y-axis and the values of the \\(((1,...,n)-0.5)/n\\)-th quantiles of a theoretical normal distribution on the x-axis. A QQ-plot could also be used to compare the distribution of any variable with any distribution, but we want to use the normal distribution here because that is the assumed distribution of the residuals in the model. If the residuals are normally distributed, the points are expected to lie along the diagonal line in the QQ-plot. It is often rather difficult to decide whether a deviation from the line is tolerable or not. The function compareqqnorm may help. It draws, eight times, a random sample of \\(n\\) values from a normal distribution with a mean of zero and a standard deviation equal to the residual standard deviation of the model. It then creates a QQ-plot for all eight random samples and for the residuals in a random order. If the QQ-plot of the residuals can easily be identified among the nine QQ-plots, there is reason to think the distribution of the residuals deviates from normal. Otherwise, there is no indication to suspect violation of the normality assumption. The position of the residual plot of the model in the nine panels is printed to the R console. 12.4 Temporal Autocorrelation 12.5 Spatial Autocorrelation 12.6 Heteroscedasticity "],["lmer.html", "13 Linear Mixed Effect Models 13.1 Background 13.2 Fitting a normal linear mixed model in R 13.3 Restricted maximum likelihood estimation (REML)", " 13 Linear Mixed Effect Models 13.1 Background 13.1.1 Why Mixed Effects Models? Mixed effects models (or hierarchical models; see A. Gelman and Hill (2007) for a discussion of the terminology) are used to analyze nonindependent, grouped, or hierarchical data. 
For example, when we measure growth rates of nestlings in different nests by taking mass measurements of each nestling several times during the nestling phase, the measurements are grouped within nestlings (because there are repeated measurements of each) and the nestlings are grouped within nests. Measurements from the same individual are likely to be more similar than measurements from different individuals, and nestlings from the same nest are likely to be more similar than nestlings from different nests. Measurements of the same group (here, the groups are individuals or nests) are not independent. If the grouping structure of the data is ignored in the model, the residuals do not fulfill the independence assumption. Further, predictor variables can be measured on different hierarchical levels. For example, in each nest some nestlings were treated with a hormone implant whereas others received a placebo. Thus, the treatment is measured at the level of the individual, while clutch size is measured at the level of the nest. Clutch size was measured only once per nest but entered in the data file more than once (namely for each individual from the same nest). Repeated measures result in pseudoreplication if we do not account for the hierarchical data structure in the model. Mixed models allow modeling of the hierarchical structure of the data and, therefore, account for pseudoreplication. Mixed models are further used to analyze variance components. For example, when the nestlings were cross-fostered so that they were not raised by their genetic parents, we would like to estimate the proportions of the variance (in a measurement, e.g., wing length) that can be assigned to genetic versus environmental differences. The three problems (grouped data, repeated measures, and interest in variances) are solved by adding further variance parameters to the model. 
As a result, the linear predictor of such models contains parameters that are fixed and parameters that vary among levels of a grouping variable. The latter are called random effects. Thus, a mixed model contains fixed and random effects. Often the grouping variable, which is a categorical variable, i.e., a factor, is called the random effect, even though it is not the factor that is random. The levels of the factor are seen as a random sample from a bigger population of levels, and a distribution, usually the normal distribution, is fitted to the level-specific parameter values. Thus, a random effect in a model can be seen as a model (for a parameter) that is nested within the model for the data. Predictors that are defined as fixed effects are either numeric or, if they are categorical, they have a finite (fixed) number of levels. For example, the factor treatment in the Barn owl study below has exactly two levels, placebo and corticosterone, and nothing more. In contrast, random effects have a theoretically infinite number of levels of which we have measured a random sample. For example, we have measured 10 nests, but there are many more nests in the world that we have not measured. Normally, fixed effects have a low number of levels whereas random effects have a large number of levels (at least 3!). For fixed effects we are interested in the specific differences between levels (e.g., between males and females), whereas for random effects we are only interested in the between-level (between-group, e.g., between-nest) variance rather than in differences between specific levels (e.g., nest A versus nest B). Typical fixed effects are: treatment, sex, age classes, or season. Typical random effects are: nest, individual, field, school, or study plot. It sometimes depends on the aim of the study whether a factor should be treated as fixed or random. 
When we would like to compare the average size of a corn cob between specific regions, then we include region as a fixed factor. However, when we would like to know how the size of a corn cob is related to the irrigation system and we have several measurements within each of a sample of regions, then we treat region as a random factor. 13.1.2 Random Factors and Partial Pooling In a model with fixed factors, the differences of the group means to the mean of the reference group are separately estimated as model parameters. This produces \\(k-1\\) (independent) model parameters, where \\(k\\) is the number of groups (or number of factor levels). In contrast, for a random factor, the between-group variance is estimated and the \\(k\\) group-specific means are assumed to be normally distributed around the population mean. These \\(k\\) means are thus not independent. We usually call the differences between the specific mean of group \\(g\\) and the mean of all groups \\(b_g\\). They are assumed to be realizations of the same (in most cases normal) distribution with a mean of zero. They are like residuals. The variance of the \\(b_g\\) values is the among-group variance. Treating a factor as a random factor is equivalent to partial pooling of the data. There are three different ways to obtain means for grouped data. First, the grouping structure of the data can be ignored. This is called complete pooling (left panel in Figure 13.1). Second, group means may be estimated separately for each group. In this case, the data from all other groups are ignored when estimating a group mean. No pooling occurs in this case (right panel in Figure 13.1). Third, the data of the different groups can be partially pooled (i.e., treated as a random effect). Thereby, the group means are weighted averages of the population mean and the unpooled group means. The weights are proportional to sample size and the inverse of the variance (see A. Gelman and Hill (2007), p. 252). 
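The weighted average can be written out as a small function. This is only a sketch of the approximation described by Gelman and Hill (2007): it treats the population mean and both standard deviations as known, whereas a mixed model estimates them from the data. The function name partial_pool, its argument names, and all numbers are made up for illustration.

```r
# Approximate partial-pooling estimate of a group mean:
# weighted average of the unpooled group mean and the population mean
partial_pool <- function(ybar_g, n_g, mu, sigma_y, sigma_b) {
  w_group <- n_g / sigma_y^2   # weight of the group's own data
  w_pop   <- 1 / sigma_b^2     # weight of the population mean
  (w_group * ybar_g + w_pop * mu) / (w_group + w_pop)
}

# Hypothetical numbers: unpooled group mean 12, population mean 20,
# within-group sd 2, among-group sd 2
est_n1 <- partial_pool(ybar_g = 12, n_g = 1, mu = 20, sigma_y = 2, sigma_b = 2)
est_n6 <- partial_pool(ybar_g = 12, n_g = 6, mu = 20, sigma_y = 2, sigma_b = 2)
c(n1 = est_n1, n6 = est_n6)   # 16 and about 13.1
```

With a single observation the estimate is pulled halfway towards the population mean; with six observations the group's own data dominate and the estimate stays much closer to 12.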
Further, the estimated mean of all groups equals the mean of the group-specific means; thus, every group is weighted equally for calculating the overall mean. In contrast, in the complete pooling case, the groups get weights proportional to their sample sizes. Complete pooling: \\(\\hat{y_i} = \\beta_0\\), \\(y_i \\sim normal(\\hat{y_i}, \\sigma^2)\\). Partial pooling: \\(\\hat{y_i} = \\beta_0 + b_{g[i]}\\), \\(b_g \\sim normal(0, \\sigma_b^2)\\), \\(y_i \\sim normal(\\hat{y_i}, \\sigma^2)\\). No pooling: \\(\\hat{y_i} = \\beta_{0[g[i]]}\\), \\(y_i \\sim normal(\\hat{y_i}, \\sigma_g^2)\\). Figure 13.1: Three possibilities to obtain group means for grouped data: complete pooling, partial pooling, and no pooling. Open symbols = data, orange dots with vertical bars = group means with 95% uncertainty intervals, horizontal black line with shaded interval = population mean with 95% uncertainty interval. What is the advantage of analyses using partial pooling (i.e., mixed, hierarchical, or multilevel modelling) compared to the complete or no pooling analyses? Complete pooling ignores the grouping structure of the data. As a result, the uncertainty interval of the population mean may be too narrow. We are too confident in the result because we assume that all observations are independent when they are not. This is a typical case of pseudoreplication. On the other hand, the no pooling method (which is equivalent to treating the factor as fixed) carries the danger of overestimating the among-group variance because the group means are estimated independently of each other. The danger of overestimating the among-group variance is particularly large when sample sizes per group are low and within-group variance is large. In contrast, the partial pooling method assumes that the group means are a random sample from a common distribution. Therefore, information is exchanged between groups. 
Estimated means for groups with low sample sizes, large variances, and means far away from the population mean are shrunk towards the population mean. Thus, group means that are estimated with a lot of imprecision (because of low sample size and high variance) are shrunk towards the population mean. How strongly they are shrunk depends on the precision of the estimates for the group-specific means and the population mean. An example will help make this clear. Imagine that we measured 60 nestling birds from 10 nests (6 nestlings per nest) and found that the average nestling mass at day 10 was around 20 g with an among-nest standard deviation of 2 g. Then, we measure only one nestling from one additional nest (from the same population) whose mass was 12 g. What do we know about the average mass of this new nest? The mean of the measurements for this nest is 12 g, but with n = 1 uncertainty is high. Because we know that the average mass of the other nests was 20 g, and because the new nest belonged to the same population, a value higher than 12 g is a better estimate for the average nestling mass of the new nest than the 12 g measurement of one single nestling, which could, by chance, have been an exceptionally light individual. This is the shrinkage that partial pooling allows in a mixed model. Because of this shrinkage, the estimates for group means from a mixed model are sometimes called shrinkage estimators. A consequence of the shrinkage is that the residuals are positively correlated with the fitted values. To summarize, mixed models are used to appropriately estimate among-group variance, and to account for non-independence among data points. 13.2 Fitting a normal linear mixed model in R To introduce the linear mixed model, we use repeated hormone measures of nestling Barn Owls Tyto alba. 
The cortbowl data set contains stress hormone data (corticosterone, variable totCort) of nestling Barn owls which were either treated with a corticosterone-implant, or with a placebo-implant as the control group. The aim of the study was to quantify the corticosterone increase due to the corticosterone implants (Almasi et al. 2009). In each brood, one or two nestlings were implanted with a corticosterone-implant and one or two nestlings with a placebo-implant (variable Implant). Blood samples were taken just before implantation, and at days 2 and 20 after implantation. data(cortbowl) dat <- cortbowl dat$days <- factor(dat$days, levels=c("before", "2", "20")) str(dat) # the data were sampled in 2004, 2005, and 2005 by the Swiss Ornithological Institute ## 'data.frame': 287 obs. of 6 variables: ## $ Brood : Factor w/ 54 levels "231","232","233",..: 7 7 7 7 8 8 9 9 10 10 ... ## $ Ring : Factor w/ 151 levels "898054","898055",..: 44 45 45 46 31 32 9 9 18 19 ... ## $ Implant: Factor w/ 2 levels "C","P": 2 2 2 1 2 1 1 1 2 1 ... ## $ Age : int 49 29 47 25 57 28 35 53 35 31 ... ## $ days : Factor w/ 3 levels "before","2","20": 3 2 3 2 3 1 2 3 2 2 ... ## $ totCort: num 5.76 8.42 8.05 25.74 8.04 ... In total, there are 287 measurements of 151 individuals (variable Ring) of 54 broods. Because the measurements from the same individual are non-independent, we use a mixed model to analyze these data. Two additional arguments for a mixed model are: a) the mixed model allows prediction of corticosterone levels for an average individual, whereas the fixed effect model allows prediction of corticosterone levels only for the 151 individuals that were sampled; and b) fewer parameters are needed. If we include individual as a fixed factor, we would use 150 parameters, while the random factor needs a much lower number of parameters. 
We first create a graphic to show the development for each individual, separately for owls receiving corticosterone versus owls receiving a placebo (Figure 13.2). Figure 13.2: Total corticosterone before and at day 2 and 20 after implantation of a corticosterone or a placebo implant. Lines connect measurements of the same individual. We fit a normal linear model with Ring as a random factor, and Implant, days and the interaction of Implant \\(\\times\\) days as fixed effects. Note that both Implant and days are defined as factors, thus R creates indicator variables for all levels except the reference level. Later, we will also include Brood as a grouping level; for now, we ignore this level and start with a simpler (less perfect) model for illustrative purposes. \\(\\hat{y_i} = \\beta_0 + b_{Ring[i]} + \\beta_1I(Implant=P) + \\beta_2I(days=2) + \\beta_3I(days=20) + \\beta_4I(Implant=P)I(days=2) + \\beta_5I(Implant=P)I(days=20)\\) \\(b_{Ring} \\sim normal(0, \\sigma_b)\\) \\(y_i \\sim normal(\\hat{y_i}, \\sigma)\\) Several different functions to fit a mixed model have been written in R: lme, gls, and gee were the first ones. Then lmer followed, and now, stan_lmer and brm allow fitting a large variety of hierarchical models. We start here with lmer from the package lme4 (which is automatically loaded to the R-console when loading arm), because it is a kind of basis function also for stan_lmer and brm. Further, sim can treat lmer-objects but none of the earlier ones. The function lmer is used similarly to the function lm. The only difference is that the random factors are added in the model formula within parentheses. The 1 stands for the intercept and the | means grouped by. (1|Ring), therefore, adds the random deviations for each individual to the average intercept. These deviations are the \\(b_{Ring}\\) in the model formula above. Corticosterone data are log transformed to achieve normally distributed residuals. 
After having fitted the model, in real life, we always first inspect the residuals, before we look at the model output. However, that is a dilemma for this text book. Here, we would like to explain how the model is constructed just after having shown the model code. Therefore, we do the residual analyses later, but in real life, we would do it now. mod <- lmer(log(totCort) ~ Implant + days + Implant:days + (1|Ring), data=dat, REML=TRUE) mod ## Linear mixed model fit by REML ['lmerMod'] ## Formula: log(totCort) ~ Implant + days + Implant:days + (1 | Ring) ## Data: dat ## REML criterion at convergence: 611.9053 ## Random effects: ## Groups Name Std.Dev. ## Ring (Intercept) 0.3384 ## Residual 0.6134 ## Number of obs: 287, groups: Ring, 151 ## Fixed Effects: ## (Intercept) ImplantP days2 days20 ## 1.91446 -0.08523 1.65307 0.26278 ## ImplantP:days2 ImplantP:days20 ## -1.71999 -0.09514 The output of the lmer-object tells us that the model was fitted using the REML-method, which is the restricted maximum likelihood method. The REML criterion is the statistic describing the model fit for a model fitted by REML. The model output further contains the parameter estimates. These are grouped into a random effects and fixed effects section. The random effects section gives the estimates for the among-individual standard deviation of the intercept (\\(\\sigma_{Ring} =\\) 0.34) and the residual standard deviation (\\(\\sigma =\\) 0.61). The fixed effects section gives the estimates for the intercept (\\(\\beta_0 =\\) 1.91), which is the mean logarithm of corticosterone for an average individual that received a corticosterone implant at the day of implantation. 
The other model coefficients are defined as follows (numbered as in the model formula above): the difference in the logarithm of corticosterone between placebo- and corticosterone-treated individuals before implantation (\\(\\beta_3 =\\) -0.09), the difference between day 2 and before implantation for the corticosterone-treated individuals (\\(\\beta_1 =\\) 1.65), the difference between day 20 and before implantation for the corticosterone-treated individuals (\\(\\beta_2 =\\) 0.26), and the interaction parameters which tell us how the differences between day 2 and before implantation (\\(\\beta_4 =\\) -1.72), and day 20 and before implantation (\\(\\beta_5 =\\) -0.1), differ for the placebo-treated individuals compared to the corticosterone-treated individuals. Neither the model output shown above nor the summary function (not shown) gives any information about the proportion of variance explained by the model such as an \\(R^2\\). The reason is that it is not straightforward to obtain a measure of model fit in a mixed model, and different definitions of \\(R^2\\) exist (Nakagawa and Schielzeth 2013). The function fixef extracts the estimates for the fixed effects; the function ranef extracts the estimates for the random deviations from the population intercept for each individual. The ranef-object is a list with one element for each random factor in the model. We can extract the random effects for each ring using the $Ring notation. round(fixef(mod), 3) ## (Intercept) ImplantP days2 days20 ImplantP:days2 ## 1.914 -0.085 1.653 0.263 -1.720 ## ImplantP:days20 ## -0.095 head(ranef(mod)$Ring) # print the first 6 Ring effects ## (Intercept) ## 898054 0.24884979 ## 898055 0.11845863 ## 898057 -0.10788277 ## 898058 0.06998959 ## 898059 -0.08086498 ## 898061 -0.08396839 13.3 Restricted maximum likelihood estimation (REML) For a mixed model the restricted maximum likelihood method is used by default instead of the maximum likelihood (ML) method. 
The reason is that the ML-method underestimates the variance parameters because this method assumes that the fixed parameters are known without uncertainty when estimating the variance parameters. However, the estimates of the fixed effects have uncertainty. The REML method uses a mathematical trick to make the estimates for the variance parameters independent of the estimates for the fixed effects. We recommend reading the very understandable description of the REML method in Zuur et al. (2009). For our purposes, the relevant difference between the two methods is that the ML-estimates are unbiased for the fixed effects but biased for the random effects, whereas the REML-estimates are biased for the fixed effects and unbiased for the random effects. However, when sample size is large compared to the number of model parameters, the differences between the ML- and REML-estimates become negligible. As a guideline, use REML if the interest is in the random effects (variance parameters), and ML if the interest is in the fixed effects. The estimation method can be chosen by setting the argument REML to FALSE (default is TRUE). mod <- lmer(log(totCort) ~ Implant + days + Implant:days + (1|Ring), data=dat, REML=FALSE) # using ML When we fit the model by stan_lmer from the rstanarm-package or brm from the brms-package, i.e., using the Bayes theorem instead of ML or REML, we do not have to care about this choice (of course!). The result from a Bayesian analysis is unbiased for all parameters (at least from a mathematical point of view; parameters from a Bayesian model can also be biased if the model violates assumptions or is confounded). 14 Generalized linear models 14.1 Introduction Up to now, we have dealt with models that assume normally distributed residuals. 
Sometimes the nature of the outcome variable makes it impossible to fulfill this assumption, as might occur with binary variables (e.g., alive/dead, a specific behavior occurred/did not occur), proportions (which are confined to be between 0 and 1), or counts that cannot have negative values. For such cases, models for distributions other than the normal distribution are needed; such models are called generalized linear models (GLM). They consist of three elements: (1) the linear predictor \\(\\bf X \\boldsymbol \\beta\\), (2) the link function \\(g()\\), and (3) the data distribution. The linear predictor is exactly the same as in normal linear models. It is a linear function that defines the relationship between the dependent and the explanatory variables. The link function transforms the expected values of the outcome variable into the range of the linear predictor, which ranges from \\(-\\infty\\) to \\(+\\infty\\). Or, perhaps more intuitively, the inverse link function transforms the values of the linear predictor into the range of the outcome variable. Table 14.1 gives a list of possible link functions that work with different data distributions. Then, a specific data distribution, for example, binomial or Poisson, is used to describe how the observations scatter around the expected values. A general model formula for a generalized linear model is: \\[\\bf y \\sim ExpDist(\\bf\\hat y, \\boldsymbol\\theta)\\] \\[g(\\bf\\hat y) = \\bf X\\boldsymbol \\beta \\] where ExpDist is a distribution of the exponential family and \\(g()\\) is the link function. The vector \\(\\bf y\\) contains the observed values of the outcome variable, \\(\\boldsymbol \\beta\\) contains the model parameters in the linear predictor (also called the model coefficients), and \\(\\bf X\\) is the model matrix containing the values of the predictor variables. 
\\(\\boldsymbol \\theta\\) is an optional vector of additional parameters needed to define the data distribution (e.g., the number of trials in the binomial distribution or the variance in the normal distribution). The normal linear model is a specific case of a generalized linear model, namely when ExpDist equals the normal distribution and \\(g()\\) is the identity function (\\(g(x) = x\\)). Statistical distributions of the exponential family are normal, Bernoulli, binomial, Poisson, inverse-normal, gamma, negative binomial, among others. The normal, Bernoulli, binomial, Poisson or negative binomial distributions are by far the most often used distributions. Most, but not all, data we gather in the life sciences can be analyzed assuming one of these few distributions. Table 14.1: Frequently used distributions for the glm function with their default (D) link functions and other link functions that are possible. link Gaussian Binomial Gamma Inv_Gauss Poisson Negative_binomial logit D probit x cloglog x identity D x x x inverse D log x D D 1/mu^2 D sqrt x cauchit x x exponent (mu^a) x Paul Buerkner has implemented many different distributions and link functions in the package brms. 14.2 Bernoulli model 14.2.1 Background If the outcome variable can only take one of two values (e.g., a species is present or absent, or the individual survived or died; coded as 1 or 0) we use a Bernoulli model, also called logistic regression. The Bernoulli distribution only allows for the values zero and one, and it has only one parameter \\(p\\), which defines the probability that the value is 1. When fitting a Bernoulli model to data, we have to estimate \\(p\\). Often we are interested in correlations between \\(p\\) and one or several explanatory variables. Therefore, we model \\(p\\) as linearly dependent on the explanatory variables. 
Because the values of \\(p\\) are squeezed between 0 and 1 (because it is a probability), \\(p\\) is transformed by the link function before the linear relationship is modeled. \\[g(p_i) = \\bf X\\boldsymbol \\beta \\] Functions that can transform a probability into the scale of the linear predictor (\\(-\\infty\\) to \\(+\\infty\\)) are, for example, logit, probit, cloglog, or cauchit. These link functions differ slightly in the way they link the outcome variable to the explanatory variables (Figure 14.1). The logit link function is the most often used link function in binomial models. However, sometimes another link function might fit the data better. Kevin S. Van Horn gives useful tips on when to use which link function. Figure 14.1: Left panel: Shape of different link functions commonly used for modelling probabilities. Right panel: The relationship between the predictor x (x-axis) and p on the scale of the link function (y-axis) is assumed to be linear. 14.2.2 Fitting a Bernoulli model in R Functions to fit a Bernoulli model are glm, stan_glm, brm, and there are many more that we do not know as well as the three we focus on in this book. We start by using the function glm. It uses the iteratively reweighted least-squares method, which is an adaptation of the least-squares (LS) method for fitting generalized linear models. The argument family allows choosing a data distribution. For fitting a Bernoulli model, we need to specify binomial. That is because the Bernoulli distribution is equal to the binomial distribution with only one trial (size parameter = 1). Note, if we forget the family argument, we fit a normal linear model, and there is no warning by R! With the specification of the distribution, we also choose the link function. The default link function for the binomial or Bernoulli model is the logit function. To change the link function, use e.g. family=binomial(link=cloglog). 
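The behaviour of the four link functions mentioned above can be explored directly in R. The sketch below uses make.link from the stats package to map a few probabilities onto the scale of the linear predictor and back again, and it checks that the Bernoulli distribution is the binomial distribution with one trial. The chosen probabilities and object names are arbitrary and only serve as illustration.

```r
# Sketch: link functions for probabilities and their inverses
p     <- c(0.01, 0.1, 0.5, 0.9, 0.99)          # arbitrary example probabilities
links <- c("logit", "probit", "cloglog", "cauchit")

# link function: probability scale -> scale of the linear predictor
eta <- sapply(links, function(l) make.link(l)$linkfun(p))
rownames(eta) <- p
round(eta, 2)

# inverse link function: linear predictor -> probability scale (round trip)
back <- sapply(links, function(l) {
  lk <- make.link(l)
  lk$linkinv(lk$linkfun(p))
})
max(abs(back - p))                              # numerically zero

# the logit link is qlogis, its inverse is plogis
all.equal(unname(eta[, "logit"]), qlogis(p))

# Bernoulli = binomial with size 1: P(y = 0) = 1 - p, P(y = 1) = p
dbinom(c(0, 1), size = 1, prob = 0.3)
```

The logit, probit and cauchit links are symmetric around p = 0.5 (where they equal 0), whereas the cloglog link is asymmetric, which is one reason why it can fit some data sets better than the logit.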
As an example, we use presence-absence data of little owls Athene noctua in nest boxes during the breeding season. The original data are published in Gottschalk, Ekschmitt, and Wolters (2011); here we use only parts of these data. The variable PA contains the presence of a little owl: 1 indicates a nestbox used by little owls, whereas 0 stands for an empty nestbox. The variable elevation has the elevation in meters above sea level. We are interested in how the presence of the little owl is associated with elevation within the study area, that is, how the probability of presence changes with elevation. Our primary interest, therefore, is the slope \\(\\beta_1\\) of the regression line. \\[ y_i \\sim Bernoulli(p_i) \\] \\[ logit(p_i) = \\beta_0 + \\beta_1 elevation_i\\] where \\(logit(p_i) = log(p_i/(1-p_i))\\). data(anoctua) # Athene noctua data in the blmeco package mod <- glm(PA~elevation, data=anoctua, family=binomial) mod ## ## Call: glm(formula = PA ~ elevation, family = binomial, data = anoctua) ## ## Coefficients: ## (Intercept) elevation ## 0.579449 -0.006106 ## ## Degrees of Freedom: 360 Total (i.e. Null); 359 Residual ## Null Deviance: 465.8 ## Residual Deviance: 445.6 AIC: 449.6 14.2.3 Assessing model assumptions in a Bernoulli model As for the normal linear model, the Bernoulli model (and any other statistical model) assumes that the residuals are independent and identically distributed (iid). Independent means that every observation \\(i\\) is independent of the other observations. Particularly, there are no groups in the data and no temporal or spatial correlation. For generalized linear models, different types of residuals exist. The standard residual plots obtained by plot(mod) produce the same four plots as for an lm object, but it uses the deviance residuals for the first three plots (residuals versus fitted values, QQ plot, and residual variance versus fitted values) and the Pearson residuals for the last (residuals versus leverage). 
The deviance residuals are the contribution of each observation to the deviance of the model. This is the default type when the residuals are extracted from the model using the function resid. The Pearson residual for observation \\(i\\) is the difference between the observed and the fitted number of successes divided by the standard deviation given the number of trials and the fitted success probability: \\(\\epsilon_i = \\frac{y_i-n_i \\hat{p_i}}{\\sqrt{n_i \\hat{p_i}(1-\\hat{p_i})}}\\). Other types of residuals are working, response, or partial (see Davison and Snell (1991)). For the residual plots, R chooses the type of residuals so that each plot should look roughly like the analogous plot for the normal linear model. However, in most cases the plots look awkward due to the discreteness of the data, especially when success probabilities are close to 0 or 1. We recommend thinking about why they do not look perfect; with experience, serious violations of model assumptions can be recognized. But often posterior predictive model checking or graphical comparison of fitted values to the data are better suited to assess model fit in GLMs. For Bernoulli models, the residual plots normally look quite awful because the residual distribution very often has two peaks, a negative and a positive one, resulting from the binary nature of the outcome variable. However, it is still good to have a look at these plots using plot(mod). At least the average should roughly be around zero and not show a trend. An often more informative way to judge model fit for a binary logistic regression is to plot the data against the fitted values. To better see the observations, we slightly jitter them in the vertical direction. If the model fitted the data well, the data would be, on average, equal to the fitted values. Thus, we add the \\(y = x\\)-line to the plot using the abline function with intercept 0 and slope 1. 
Of course, binary data cannot lie on this line because they can only take on the two discrete values 0 or 1. However, the mean of the 0 and 1 values should lie on the line if the model fits well. Therefore, we calculate the mean for suitably selected classes of fitted values. In our example, we choose a class width of 0.1. Then, we calculate means per class and add these to the plot, together with a classical standard error that tells us how reliable the means are. This can be an indication whether our arbitrarily chosen class width is reasonable. plot(fitted(mod), jitter(anoctua$PA, amount=0.05), xlab="Fitted values", ylab="Probability of presence", las=1, cex.lab=1.2, cex=0.8) abline(0,1, lty=3) t.breaks <- cut(fitted(mod), seq(0,1, by=0.1)) means <- tapply(anoctua$PA, t.breaks, mean) semean <- function(x) sd(x)/sqrt(length(x)) means.se <- tapply(anoctua$PA, t.breaks, semean) points(seq(0.05, 0.95, by=0.1), means, pch=16, col="orange") segments(seq(0.05, 0.95, by=0.1), means-2*means.se, seq(0.05, 0.95,by=0.1), means+2*means.se,lwd=2, col="orange") mod <- glm(PA ~ elevation + I(elevation^2) + I(elevation^3) + I(elevation^4), data=anoctua, family=binomial) t.breaks <- cut(fitted(mod), seq(0,1, by=0.1)) means <- tapply(anoctua$PA, t.breaks, mean) semean <- function(x) sd(x)/sqrt(length(x)) means.se <- tapply(anoctua$PA, t.breaks, semean) points(seq(0.05, 0.95, by=0.1)+0.01, means, pch=16, col="lightblue", cex=0.7) segments(seq(0.05, 0.95, by=0.1)+0.01, means-2*means.se, seq(0.05, 0.95,by=0.1)+0.01, means+2*means.se,lwd=2, col="lightblue") Figure 14.2: Goodness of fit plot for the Bernoulli model fitted to little owl presence-absence data. Open circles = observed presence (1) or absence (0) jittered in the vertical direction; orange dots = mean (and 95% compatibility intervals given as vertical bars) of the observations within classes of width 0.1 along the x-axis. The dotted line indicates perfect coincidence between observation and fitted values. 
Orange larger points are from the model assuming a linear effect of elevation, whereas the smaller light blue points are from a model assuming a non-linear effect. The means of the observed data (orange dots) do not match the fitted values well, that is, they do not lie on the dotted line (Figure 14.2). For low presence probabilities, the model overestimates presence probabilities whereas, for medium presence probabilities, the model underestimates presence probability. This indicates that the relationship between little owl presence and elevation may not be linear. After including polynomials up to the fourth degree, we obtained a reasonable fit (light blue dots in Figure 14.2). Further aspects of model fit that may be checked in Bernoulli models: Are all observations independent? May spatial or temporal correlation be an issue? Are all parameters well informed by the data? Some parameters may not be identifiable due to complete separation, i.e. when there is no overlap between the 0s and 1s regarding one of the predictor variables. In such cases glm may fail to fit the model. Bayesian methods (stan_glm or brm) do not fail, but the result may be highly influenced by the prior distributions. A prior sensitivity analysis is recommended. Note, we do not have to worry about overdispersion when the outcome variable is binary, even though the variance of the Bernoulli distribution is defined by p and no separate variance parameter exists. Because the data can only take the values 0 and 1, there is no possibility that the data can show a higher variance than the one assumed by the Bernoulli distribution. 14.2.4 Visualising the results We are ready to report and visualise the results after assessing the model fit, i.e. when we think the model describes the data generating process reasonably well. We can simulate the posterior distribution of the model coefficients and obtain their 95% compatibility intervals. 
library(arm) nsim <- 5000 bsim <- sim(mod, n.sim=nsim) # sim from package arm apply(bsim@coef, 2, quantile, probs=c(0.5, 0.025, 0.975)) ## (Intercept) elevation I(elevation^2) I(elevation^3) I(elevation^4) ## 50% -24.27945 0.3953864 -0.0021756836 0.000004798096 -0.0000000037490422 ## 2.5% -35.02347 0.1887128 -0.0034319692 0.000001217286 -0.0000000070720019 ## 97.5% -12.85627 0.5910082 -0.0008527594 0.000008247819 -0.0000000003734525 To interpret this polynomial function, an effect plot is helpful. To that end, and as we have done before, we calculate fitted values over the range of the covariate, together with compatibility intervals. newdat <- data.frame(elevation = seq(80,600,by=1)) Xmat <- model.matrix(~elevation+I(elevation^2)+I(elevation^3)+ I(elevation^4), data=newdat) # the model matrix fitmat <- matrix(nrow=nrow(newdat), ncol=nsim) for(i in 1:nsim) fitmat[,i] <- plogis(Xmat %*% bsim@coef[i,]) newdat$lwr <- apply(fitmat,1,quantile,probs=0.025) newdat$fit <- plogis(Xmat %*% coef(mod)) newdat$upr <- apply(fitmat,1,quantile,probs=0.975) We now can plot the data together with the estimate and its compatibility interval. We, again, use the function jitter to slightly scatter the points along the y-axis to make overlaying points visible. plot(anoctua$elevation, jitter(anoctua$PA, amount=0.05), las=1, cex.lab=1.4, cex.axis=1.2, xlab="Elevation", ylab="Probability of presence") lines(newdat$elevation, newdat$fit, lwd=2) lines(newdat$elevation, newdat$lwr, lty=3) lines(newdat$elevation, newdat$upr, lty=3) Figure 14.3: Little owl presence data versus elevation with regression line and 95% compatibility interval (dotted lines). Open circles = observed presence (1) or absence (0) jittered in the vertical direction. 14.2.5 Some remarks Binary data do not contain a lot of information. Therefore, large sample sizes are needed to obtain robust results. 
Often presence/absence data are obtained by visiting plots several times during a distinct period, for example, a breeding period, and then it is reported whether a species has been seen or not. If it has been seen and if there is no misidentification in the data, it is present. However, if it has not been seen, we are usually not sure whether we have not detected it or whether it is absent. In the case of repeated visits to the same plot, it is possible to estimate the detection probability using occupancy models (MacKenzie et al. 2002) or point count models (Royle 2004). Finally, logistic regression can be used in the sense of a discriminant function analysis that aims to find predictors that discriminate members of two groups (Anderson 1974). However, if one wants to use the fitted value from such an analysis to assign group membership of a new subject, one has to take the prevalence of the two groups in the data into account. 14.3 Binomial model 14.3.1 Background The binomial model is used when the response variable is a count with an upper limit, e.g., the number of seeds that germinated among a total number of seeds in a pot, or the number of chicks hatching from the total number of eggs. Thus, we can use the binomial model whenever the response is the sum of a predefined number of Bernoulli trials. Whether a seed germinates or not is a Bernoulli trial. If we have more than one seed, the number of germinated seeds follows a binomial distribution. As an example, we use data from a study on the effects of anthropogenic fire regimes traditionally applied to savanna habitat in Gabon, Central Africa (Walters 2012). Young trees survive fires better or worse depending, among other factors, on the fuel load, which, in turn, depends heavily on the time since the last fire happened. Thus, plots were burned after different lengths of time since the previous fire (4, 9, or 12 months ago). 
Trees that resprouted after the previous (first) fire were counted before and after the experimental (second) fire to estimate their survival of the experimental fire depending on the time since the previous fire. The outcome variable is the number of surviving trees \\(y_i\\) among the total number of trees per plot. The explanatory variable is the time since the previous fire, a factor with three levels: 4m, 9m, and 12m. Assuming that the data follow a binomial distribution, the following model can be fitted to the data: \\[ y_i \\sim binomial(p_i, n_i) \\] \\[ logit(p_i) = \\beta_0 + \\beta_1 I(treatment_i=9m) + \\beta_2 I(treatment_i=12m)\\] where \\(p_i\\) is the survival probability and \\(n_i\\) the total number of tree sprouts on plot \\(i\\). Note that \\(n_i\\) should not be confused with the sample size of the data set, i.e. the number of rows in the data table. 14.3.2 Fitting a binomial model in R We normally use glm, stan_glm or brm for fitting a binomial model depending on the complexity of the predictors and correlation structure. Here, we again start with the glm function. A peculiarity of binomial models is that the outcome is not just one number: it is the number of trees still alive \\(y_i\\) out of \\(n_i\\) trees that were alive before the experimental fire. Therefore, the outcome variable has to be given as a matrix with two columns. The first column contains the number of successes (number of survivors \\(y_i\\)) and the second column contains the number of failures (number of trees killed by the fire, \\(n_i - y_i\\)). We build this matrix using cbind (column bind). 
data(resprouts) # example data from package blmeco resprouts$succ <- resprouts$post resprouts$fail <- resprouts$pre - resprouts$post mod <- glm(cbind(succ, fail) ~ treatment, data=resprouts, family=binomial) mod ## ## Call: glm(formula = cbind(succ, fail) ~ treatment, family = binomial, ## data = resprouts) ## ## Coefficients: ## (Intercept) treatment9m treatment12m ## -1.241 1.159 -2.300 ## ## Degrees of Freedom: 40 Total (i.e. Null); 38 Residual ## Null Deviance: 845.8 ## Residual Deviance: 395 AIC: 514.4 Experienced readers will be alarmed because the residual deviance is much larger than the residual degrees of freedom, which indicates overdispersion. We will soon discuss overdispersion, but, for now, we continue with the analysis for the sake of illustration. The estimated model parameters are \\(\\hat{\\beta}_0 =\\) -1.24, \\(\\hat{\\beta}_1 =\\) 1.16, and \\(\\hat{\\beta}_2 =\\) -2.3. These estimates tell us that tree survival was higher for the 9-month fire lag treatment compared to the 4-month treatment (which is the reference level), but lowest in the 12-month treatment. To obtain the mean survival probabilities per treatment, some math is needed because we have to back-transform the linear predictor to the scale of the outcome variable. The mean survival probability for the 4-month treatment is \\(logit^{-1}(-1.24) = 0.22\\), for the 9-month treatment it is \\(logit^{-1}(-1.24 + 1.16) = 0.48\\), and for the 12-month treatment it is \\(logit^{-1}(-1.24 - 2.3) = 0.03\\). The function plogis gives the inverse of the logit function and can be used to estimate the survival probabilities, for example: plogis(coef(mod)[1]+ coef(mod)[2]) # for the 9month treatment ## (Intercept) ## 0.4795799 The direct interpretation of the model coefficients \\(\\beta_1\\) and \\(\\beta_2\\) is that they are the log of the ratio of the odds of two treatment levels (i.e., the log odds ratio). 
The odds for treatment 4 months are 0.22/(1-0.22)=0.29 (calculated using non-rounded values), which is the estimated ratio of survived to killed trees in this treatment. For treatment 9 months, the odds are 0.48/(1-0.48) = 0.92, and the log odds ratio is log(0.92/0.29) = 1.16 = \\(\\beta_1\\). The model output includes the null deviance and the residual deviance. Deviance is a measure of the difference between the data and a model. It corresponds to the sum of squares in the normal linear model. The smaller the residual deviance the better the model fits to the data. Adding a predictor reduces the deviance, even if the predictor does not have any relation to the outcome variable. The Akaike information criterion (AIC) value in the model output (last line) is a deviance measure that is penalized for the number of model parameters. It can be used for model comparison. The residual deviance is defined as two times the difference between the log-likelihood of the saturated model and the log-likelihood of our model. The saturated model is a model that uses the observed proportion of successes as the success probability for each observation \\(y_i \\sim binomial(y_i/n_i, n_i)\\). The saturated model has the highest possible likelihood (given the data set and the binomial model). This highest possible likelihood is compared to the likelihood of the model at hand, \\(y_i \\sim binomial(p_i, n_i)\\) with \\(p_i\\) dependent on some predictor variables. The null deviance is two times the difference between the log-likelihood of the saturated model and the log-likelihood of a model that contains only one overall mean success probability, the null model \\(y_i \\sim binomial(p, n_i)\\). The null deviance corresponds to the total sum of squares, that is, it is a measure of the total variance in the data. 
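The definitions of the log odds ratio and of the two deviances can be verified numerically. The following sketch does this on simulated data, because it is meant to run without the blmeco package: the data frame d is a made-up stand-in for the resprouts data, and the survival probabilities used to simulate it are assumptions. The helper function ll is a hypothetical name introduced here.

```r
# Sketch: log odds ratio and deviance by hand (simulated stand-in data)
set.seed(1)
trt <- c("4m", "9m", "12m")
d <- data.frame(treatment = factor(rep(trt, each = 10), levels = trt),
                n = rpois(30, 20) + 5)              # trees per plot before the fire
p_true <- c(0.2, 0.5, 0.05)[as.integer(d$treatment)] # assumed survival probabilities
d$succ <- rbinom(30, size = d$n, prob = p_true)      # survivors
d$fail <- d$n - d$succ                               # trees killed

mod <- glm(cbind(succ, fail) ~ treatment, data = d, family = binomial)

# odds per treatment and log odds ratio 9m versus 4m
p_hat <- tapply(fitted(mod), d$treatment, mean)      # fitted p is constant within treatment
odds  <- p_hat / (1 - p_hat)
log(odds["9m"] / odds["4m"])                         # same value as coef(mod)["treatment9m"]

# binomial log-likelihood of the data for given success probabilities
ll <- function(p) sum(dbinom(d$succ, size = d$n, prob = p, log = TRUE))
ll_sat  <- ll(d$succ / d$n)                          # saturated model
ll_mod  <- ll(fitted(mod))                           # model at hand
ll_null <- ll(sum(d$succ) / sum(d$n))                # one overall success probability

dev_resid <- 2 * (ll_sat - ll_mod)
dev_null  <- 2 * (ll_sat - ll_null)
c(dev_resid, deviance(mod))                          # the two values agree
c(dev_null, mod$null.deviance)                       # the two values agree
```

Because the saturated model has the highest possible likelihood, both differences are non-negative, which is why the deviance can be read like a sum of squares.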
14.3.3 Assessing assumptions in a binomial model In the standard residual plots, we see that in our example data there are obviously a number of influential points (especially the data points with row numbers 7, 20, and 26; Figure 14.4). The corresponding data points may be inspected for errors, or additional predictors may be identified that help to explain why these points are extreme (Are they close/far from the village? Were they grazed? etc.). par(mfrow=c(2,2)) plot(mod) Figure 14.4: The four standard residual plots obtained by using the plot-function. For whatever reason, the variance in the data is larger than assumed by the binomial distribution. We detect this higher variance in the mean of the absolute values of the standardized residuals that is clearly larger than one (lower left panel in Figure 14.4). This is called overdispersion, which we mentioned earlier and deal with next. The variance of a binomial model is defined by \\(n\\) and \\(p\\), that is, there is no separate variance parameter. In our example \\(p\\) is fully defined by \\(\\beta_0\\), \\(\\beta_1\\), and \\(\\beta_2\\): \\(p_i = logit^{-1}(\\beta_0 + \\beta_1 I(treatment_i = 9m) + \\beta_2 I(treatment_i = 12m))\\), and \\(n_i\\) is part of the data. Similarly, in a Poisson model (which we will introduce in the next chapter) the variance is defined by the mean. Unfortunately, real data, as in our example, often show higher and sometimes lower variance than expected by a binomial (or a Poisson) distribution (Figure 14.5). When the variance in the data is higher than expected by the binomial (or the Poisson) distribution we have overdispersion. The uncertainties for the parameter estimates will be underestimated if we do not take overdispersion into account. Overdispersion is indicated when the residual deviance is substantially larger than the residual degrees of freedom. This always has to be checked in the output of a binomial or a Poisson model. 
In our example, the residual deviance is 10 times larger than the residual degrees of freedom; thus, we have strong overdispersion. Figure 14.5: Histogram of a binomial distribution without overdispersion (orange) and one with the same total number of trials and average success probability, but with overdispersion (blue). What can we do when we have overdispersion? The best way to deal with overdispersion is to find the reason for it. Overdispersion is common in biological data because animals do not behave like random objects but their behavior is sensitive to many factors that we cannot always measure such as social relationships, weather, habitat, experience, and genetics. In most cases, overdispersion is caused by influential factors that were not included in the model. If we find them and can include them in the model (as fixed or as random variables) overdispersion may disappear. If we do not find such predictor variables, we have at least three options: (1) use a quasi-binomial model; (2) add an observation-level random factor; (3) use a beta-binomial model or, in the case of an overdispersed Poisson model, a negative binomial model. Fit a quasibinomial or quasi-Poisson model by specifying quasibinomial or quasipoisson in the family-argument. mod <- glm(cbind(succ,fail) ~ treatment, data=resprouts, family=quasibinomial) This will fit a binomial model that estimates, in addition to the other model parameters, a dispersion parameter, \\(u\\), that is multiplied by the binomial or Poisson variance to obtain the residual variance: \\(var(y_i) = u n_i p_i(1 - p_i)\\), or \\(var(y_i)= u\\lambda_i\\), respectively. This inflated variance is then used to obtain the standard errors of the parameter estimates. However, the quasi-distributions are unnatural distributions (there is no physical justification for these distributions, such as number of coin flips that are tails among a defined number of coin flips). 
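The check for overdispersion and the effect of the quasi-binomial family can be illustrated with simulated data. In the sketch below, the extra variance is created by adding normally distributed plot-level noise on the logit scale; the sample size, the probabilities and the noise standard deviation are assumptions chosen for illustration. The dispersion estimate is computed as the sum of the squared Pearson residuals divided by the residual degrees of freedom, which is the estimate reported by summary for quasi families.

```r
# Sketch: detecting overdispersion and fitting a quasi-binomial model (simulated data)
set.seed(2)
n_plot <- 40
n   <- rep(30, n_plot)                               # trials per plot
x   <- factor(rep(c("a", "b"), each = 20))           # a two-level predictor
eta <- qlogis(c(0.3, 0.6))[as.integer(x)] +
       rnorm(n_plot, mean = 0, sd = 1)               # unmeasured plot-level variation
y   <- rbinom(n_plot, size = n, prob = plogis(eta))

mod_bin <- glm(cbind(y, n - y) ~ x, family = binomial)
deviance(mod_bin) / df.residual(mod_bin)             # clearly larger than 1

# dispersion parameter u estimated from the Pearson residuals
u_hat <- sum(resid(mod_bin, type = "pearson")^2) / df.residual(mod_bin)

mod_quasi <- glm(cbind(y, n - y) ~ x, family = quasibinomial)
summary(mod_quasi)$dispersion                        # same value as u_hat

# point estimates are identical; standard errors are inflated by sqrt(u_hat)
se_bin   <- summary(mod_bin)$coefficients[, "Std. Error"]
se_quasi <- summary(mod_quasi)$coefficients[, "Std. Error"]
se_quasi / se_bin
sqrt(u_hat)
```

The comparison of the standard errors shows that the quasi-model changes nothing but the scale of the uncertainty.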
Quasi-models do not differ from the binomial or the Poisson model in any parameter except that the variance is stretched so that it fits the variance in the data. We can see quasi-models as a kind of post-hoc correction for overdispersion. Thus, it is better to use the quasi-model instead of an overdispersed model to draw inference. However, the point estimates may be highly influenced by a few extreme observations. Therefore, we prefer to use options that explicitly model the additional variance. Adding an observation-level random factor (i.e., a factor with the levels 1 to \\(n\\), the sample size) models the additional variance as a normal distribution (on the scale of the link function). Adding such an additional variance parameter to the model allows and accounts for extra variance in the data (Harrison 2014). To do that, we have to fit a generalized linear mixed model (GLMM) instead of a GLM. What do we have to do when the residual deviance is smaller than the residual degrees of freedom, that is, when we have underdispersion? Some statisticians do not bother about underdispersion, because, when the variance in the data is smaller than assumed by the model, uncertainty is overestimated. This means that conclusions will be conservative (i.e., on the safe side). However, we think that underdispersion should bother us as biologists (or other applied scientists). In most cases, underdispersion means that the variance in the data is smaller than expected by a random process, that is, the variance may be constrained by something. Thus, we should be interested in thinking about the factors that constrain the variance in the data. An example is the number of surviving young in some raptor species (e.g., in the lesser spotted eagle Aquila pomarina). Most of the time two eggs are laid, but the first hatched young will usually kill the second (which was only a backup in case the first egg does not yield a healthy young). 
Because of this behavior, the number of survivors among the number of eggs laid will show much less variance than expected from \\(n_i\\) and \\(p_i\\), leading to underdispersion. Clutch size is another example of data that often produces underdispersion (but it is a Poisson rather than a binomial process, because there is no \\(n_i\\)). Sometimes, apparent under- or overdispersion can be caused by more 0s in the data than assumed by the binomial or Poisson model. For example, the number of black stork Ciconia nigra nestlings that survived the nestling phase is very often 0, because the whole nest was depredated or fell from the tree (black storks nest in trees). If the nest survives, the number of survivors varies between 0 and 5 depending on other factors such as food availability or weather conditions. A histogram of these data shows a bimodal distribution with one peak at 0 and another peak around 2.5. It looks like a Poisson distribution, but with a lot of additional 0 values. This is called zero-inflation. Zero-inflation is often the result of two different processes being involved in producing the data. The process that determines whether a nest survives differs from the process that determines how many nestlings survive, given the nest survives. When we analyze such data using a model that assumes only one single process, it will be very hard to understand the system and the results are likely to be biased because the distributional assumptions are violated. In such cases, we will be more successful when our model explicitly models the two different processes. Such models are zero-inflated binomial or zero-inflated Poisson models. We normally check whether zero-inflation may be an issue by posterior predictive model checking. If we find zero-inflation in binomial data, we try using a zero-inflated binomial model as provided by Paul Buerkner in the package brms. 
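The idea of comparing the proportion of 0 values in the data with that in data replicated from the model can be sketched in base R. This is a simplified illustration, not the posterior predictive check itself: the data are simulated from two made-up processes (nest survival and number of nestlings), the object names (nest_survived, prop0_obs, prop0_rep) are hypothetical, and the replicated data are drawn from the fitted values only, so parameter uncertainty is ignored.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
# process 1: does the nest survive? (assumed survival probability 0.6)
nest_survived <- rbinom(n, size = 1, prob = 0.6)
# process 2: number of nestlings, given that the nest survived
y <- nest_survived * rpois(n, lambda = exp(1 + 0.3 * x))

# Poisson model that wrongly assumes one single process
mod <- glm(y ~ x, family = poisson)

# replicate data sets from the fitted model (plug-in fitted values)
nrep <- 1000
prop0_rep <- replicate(nrep, mean(rpois(n, lambda = fitted(mod)) == 0))
prop0_obs <- mean(y == 0)

# proportion of replicated data sets with at least as many zeros as observed
mean(prop0_rep >= prop0_obs)

hist(prop0_rep, xlim = range(c(prop0_rep, prop0_obs)),
     main = NA, xlab = "Proportion of zeros")
abline(v = prop0_obs, col = "red", lwd = 2)   # observed proportion of zeros
```

If the observed proportion of zeros lies far in the right tail of the replicated proportions, as it should for these simulated data, the single-process Poisson model does not reproduce the zeros and a zero-inflated model is worth considering.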
14.3.4 Visualising the results For the moment, we use the binomial GLM to analyze the tree sprout data. This model suffers from overdispersion and thus, the uncertainty intervals will be too small. We will provide a more appropriate analysis in a later chapter. We simulate 2000 values from the joint posterior distribution of the model parameters. mod <- glm(cbind(succ,fail) ~ treatment, data=resprouts, family=binomial) nsim <- 2000 bsim <- sim(mod, n.sim=nsim) # simulate from the posterior distr. For each set of simulated model parameters, we derive the linear predictor by multiplying the model matrix with the corresponding set of model parameters. Then, the inverse logit function (\\(logit^{-1}(x) = \\frac{e^x}{(1+ e^x)}\\); R function plogis) is used to obtain the fitted value for each fire lag treatment. Lastly, we extract, for each treatment level, the 2.5% and 97.5% quantiles of the posterior distribution of the fitted values and plot them together with the estimates (the fitted values) per treatment and the raw data. 
newdat <- data.frame(treatment=factor(c("4m","9m","12m"),levels=c("4m","9m","12m"))) Xmat <- model.matrix(~treatment, newdat) fitmat <- matrix(nrow=nrow(newdat), ncol=nsim) for(i in 1:nsim) fitmat[,i] <- plogis(Xmat %*% bsim@coef[i,]) newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025) newdat$upr <- apply(fitmat, 1, quantile, prob=0.975) newdat$fit <- plogis(Xmat%*%coef(mod)) newdat$lag <- c(4,9,12) # used for plotting resprouts$lag <- c(4,9,12)[match(resprouts$treatment,c("4m","9m","12m"))] # used for plotting plot(newdat$lag, newdat$fit, type="n", xlab="Fire lag [months]", ylab="Tree survival", las=1, cex.lab=1.4, cex.axis=1, xaxt="n", xlim=c(0, 13), ylim=c(0,0.6)) axis(1, at=c(0,4,9,12), labels=c("0","4","9","12")) segments(newdat$lag, newdat$lwr, newdat$lag, newdat$upr, lwd=2) points(newdat$lag, newdat$fit, pch=21, bg="gray") points(resprouts$lag+0.3,resprouts$succ/resprouts$pre, cex=0.7) # adds the raw data to the plot Figure 14.6: Proportion of surviving trees (circles) for three fire lag treatments with estimated mean proportion of survivors using an inappropriate binomial model. Because of overdispersion, the 95% compatibility intervals are way too small. Gray dots = fitted values. Vertical bars = 95% compatibility intervals. 14.4 Poisson model 14.4.1 Background The Poisson distribution is a discrete probability distribution that naturally describes the distribution of count data. If we know how many times something happened, but we do not know how many times it did not happen (in contrast to the binomial model, where we know the number of trials), such counts usually follow a Poisson distribution. Count data are positive integers ranging from 0 to \\(+\\infty\\). A Poisson distribution is positive-skewed (long tail to the right) if the mean \\(\\lambda\\) is small and it approximates a normal distribution for large \\(\\lambda\\). The Poisson distribution constitutes the stochastic part of a Poisson model. 
The deterministic part describes how \\(\\lambda\\) is related to predictors. \\(\\lambda\\) can only take on positive values. Therefore, we need a link function that transforms \\(\\lambda\\) into the scale of the linear predictor (or, alternatively, an inverse link function that transforms the value from the linear predictor to nonnegative values). The most often used link function is the natural logarithm (log-link function). This link function transforms all \\(\\lambda\\)-values between 0 and 1 to the interval \\(-\\infty\\) to 0, and all \\(\\lambda\\)-values higher than 1 are projected into the interval 0 to \\(+\\infty\\). Sometimes, the identity link function is used instead of the log-link function, particularly when the predictor variable only contains positive values and the effect of the predictor is additive rather than multiplicative, that is, when a change in the predictor produces an addition of a specific value in the outcome rather than a multiplication by a specific value. Further, in R the square-root function can also be used as a link function for Poisson models. 14.4.2 Fitting a Poisson model in R The same R functions that fit binomial models also fit Poisson models. As an example, we fit a Poisson model with log-link function to a simulated data set containing the number of (virtual) aphids on a square centimeter (\\(y\\)) and a numeric predictor variable representing, for example, an aridity index (\\(x\\)). Real ecological data without overdispersion or zero-inflation and with no random structure are rather rare. Therefore, we illustrate this model, which is the basis for more complex models, with simulated data. The model is: \\[y_i \\sim Poisson(\\lambda_i)\\] \\[log(\\lambda_i) = \\bf X_i \\boldsymbol \\beta\\] We use, similar to the R function log, the notation \\(log\\) for the natural logarithm. We fit the model in R using the function glm and use the argument family to specify that we assume a Poisson distribution as the error distribution. 
The log-link is used as the default link function. Then we add the regression line to the plot using the function curve. Further add the compatibility interval to the plot (of course only after having checked the model assumptions). set.seed(196855) n <- 50 # simulate 50 sampling sites, where we count aphids x <- rnorm(n) # the number of aphids depends, among others, on the aridity index x b0 <- 1 # intercept and b1 <- 0.5 # slope of the linear predictor y <- rpois(n, lambda=exp(b0+b1*x)) mod <- glm(y~x, family="poisson") n.sim <- 2000 bsim <- sim(mod, n.sim=n.sim) par(mar=c(4,4,1,1)) plot(x,y, pch=16, las=1, cex.lab=1.4, cex.axis=1.2) curve(exp(coef(mod)[1] + coef(mod)[2]*x), add=TRUE, lwd=2) newdat <- data.frame(x=seq(-3, 2.5, length=100)) Xmat <- model.matrix(~x, data=newdat) b <- coef(mod) newdat$fit <- exp(Xmat%*%b) fitmat <- matrix(ncol=n.sim, nrow=nrow(newdat)) for(i in 1:n.sim) fitmat[,i] <- exp(Xmat%*%bsim@coef[i,]) newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025) newdat$upr <- apply(fitmat, 1, quantile, prob=0.975) lines(newdat$x, newdat$fit, lwd=2) lines(newdat$x, newdat$lwr, lty=3) lines(newdat$x, newdat$upr, lty=3) Figure 14.7: Simulated data (dots) with a Poisson regression line (solid) and the lower and upper bound of the 95% compatibility interval. 14.4.3 Assessing model assumptions Because the residual variance in the Poisson model is defined by \\(\\lambda\\) (the fitted value), it is not estimated as a separate parameter from the data. Therefore, we always have to check whether overdispersion is present. Ecological data are often overdispersed because not all influencing factors can be measured and included in the model. As with the binomial model, in a Poisson model overdispersion is present when the residual deviance is larger than the residual degrees of freedom. This is because if we add one independent observation to the data, the deviance increases, on average, by one if the variance equals \\(\\lambda\\). 
If the variance is larger, the contribution of each observation to the deviance is, on average, larger than one. We can check this in the model output: mod ## ## Call: glm(formula = y ~ x, family = "poisson") ## ## Coefficients: ## (Intercept) x ## 1.1329 0.4574 ## ## Degrees of Freedom: 49 Total (i.e. Null); 48 Residual ## Null Deviance: 85.35 ## Residual Deviance: 52.12 AIC: 198.9 The residual deviance is 52 compared to 48 degrees of freedom. This is perfect (of course, because the model is fit to simulated data). If we are not sure, we could do a posterior predictive model checking and compare the variance in the data with the variance in data that were simulated from the model. If there is substantial overdispersion, we could fit a quasi-Poisson model that includes a dispersion parameter. However, as explained previously, we prefer to explicitly model the variance. A good alternative for overdispersed count data that we now like very much (in contrast to what we wrote in the first printed edition of this book) is the negative binomial model. The standard residual plots (Figure 14.8) are obtained in the usual way. par(mfrow=c(2,2)) plot(mod) Figure 14.8: Standard residual plots for the Poisson model fitted to simulated data, thus they fit perfectly. Of course, again, they look perfect because we used simulated data. In a Poisson model, as for the binomial model, it is easier to detect lack of model fit using posterior predictive model checking. For example, data could be simulated from the model and the proportion of 0 values in the simulated data could be compared to the proportion of 0 values in the observations to assess whether zero-inflation is present or not. 14.4.4 Visualising results We can look at the posterior distributions of the model parameters. 
apply(bsim@coef, 2, quantile, prob=c(0.5, 0.025, 0.975)) ## (Intercept) x ## 50% 1.1370430 0.4569485 ## 2.5% 0.9692098 0.2974446 ## 97.5% 1.3000149 0.6143244 The 95% compatibility interval of \\(\\beta_1\\) is 0.3-0.6. Given that an effect of 0.2 or larger on the aridity scale would be considered biologically relevant, we can be quite confident that aridity has a relevant effect on aphid abundance given our data and our model. With the simulations from the posterior distributions of the model parameters (stored in the object bsim) we obtained samples of the posterior distributions of fitted values for each of 100 x-values along the x-axis and we have drawn the 95% compatibility interval of the regression line in Figure 14.7. 14.4.5 Modeling rates and densities: Poisson model with an offset Many count data are measured in relation to a reference, such as an area or a time period or a population. For example, when we count animals on plots of different sizes, the most important predictor variable will likely be the size of the plot. Or, in other words, the absolute counts do not make much sense when they are not corrected for plot size: the relevant measure is animal density. Similarly, when we count how many times a specific behavior occurs and we follow the focal animals during time periods of different lengths, then the interest is in the rate of occurrence rather than in the absolute number counted. One way to analyze rates and densities is to divide the counts by the reference value and assume that this rate (or a transformation thereof) is normally distributed. However, it is usually hard to obtain normally distributed residuals using rates or densities as dependent variables. A more natural approach to describe rates and densities is to use a Poisson model that takes the reference into account within the model. This is called an offset. To do so, \\(\\lambda\\) is multiplied by the reference \\(T\\) (e.g., time interval, area, population). 
Therefore, \\(log(T)\\) has to be added to the linear predictor. Adding \\(log(T)\\) to the linear predictor is like adding a new predictor variable (the log of \\(T\\)) to the model with its model parameter (the slope) fixed to 1. The term offset says that we add a predictor but do not estimate its effect because it is fixed to 1. \\[y_i \\sim Poisson(\\lambda_i T_i)\\] \\[ log(\\boldsymbol \\lambda \\boldsymbol T) = log(\\boldsymbol \\lambda) + log(\\boldsymbol T) = \\boldsymbol X \\boldsymbol \\beta + log(\\boldsymbol T)\\] In R, we can use the argument offset within the function glm to specify an offset. We illustrate this using a breeding bird census on wildflower fields in Switzerland in 2007 conducted by Zollinger et al. (2013). We focus on the common whitethroat Sylvia communis, a bird of field margins and fallow lands that has become rare in the intensively used agricultural landscape. Wildflower fields are an ecological compensation measure to provide food and nesting grounds for species such as the common whitethroat. Such fields are sown and then left unmanaged for several years except for the control of potentially problematic species (e.g., some thistle species, Carduus spp.). The plant composition and the vegetation structure in the field gradually change over the years, hence the interest in this study was to determine the optimal age of a wildflower field for use by the common whitethroat. We use the number of breeding pairs (bp) as the outcome variable and field size as an offset, which means that we model breeding pair density. We include the age of the field (age) as a linear and quadratic term because we expect there to be an optimal age of the field (i.e., a curvilinear relationship between the breeding pair density and age). 
We also include field size as a covariate (in addition to using it as the offset) because the size of the field may have an effect on the density; for example, small fields may have a higher density if the whitethroat can also use surrounding areas but uses the field to breed. Size (in hectares) was z-transformed before the model fit. data(wildflowerfields) # in the package blmeco dat <- wildflowerfields[wildflowerfields$year==2007,] # select data dat$size.ha <- dat$size/100 # change unit to ha dat$size.ha.z <- scale(dat$size.ha) mod <- glm(bp ~ age + I(age^2) + size.ha.z, offset=log(size.ha), data=dat, family=poisson) mod ## ## Call: glm(formula = bp ~ age + I(age^2) + size.ha.z, family = poisson, ## data = dat, offset = log(size.ha)) ## ## Coefficients: ## (Intercept) age I(age^2) size.ha.z ## -4.2294 1.5241 -0.1408 -0.5397 ## ## Degrees of Freedom: 40 Total (i.e. Null); 37 Residual ## Null Deviance: 48.5 ## Residual Deviance: 27.75 AIC: 70.2 For the residual analysis and for drawing conclusions, we can proceed in the same way we did in the Poisson model. From the model output we see that the residual deviance is smaller than the corresponding degrees of freedom, thus we have some degree of underdispersion. But the degree of underdispersion is not very extreme so we accept that the compatibility intervals will be a bit larger than necessary and proceed in this case. After residual analyses, we can produce an effect plot of the estimated whitethroat density against the age of the wildflower field (Figure 14.9). And we see that the expected whitethroat density is largest on wildflower fields of age 4 to 7 years. 
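The age at which the fitted density peaks can also be worked out directly from the printed coefficients. The sketch below is a small worked calculation, not part of the original analysis, and the object names (b_age, b_age2, peak_age) are hypothetical. Because the linear predictor is quadratic in age on the log scale, and neither the offset nor the size term depends on age, the maximum of the fitted density is at \\(-\\beta_{age} / (2\\beta_{age^2})\\).

```r
# point estimates as printed in the model output above (log scale)
b_age  <- 1.5241    # coefficient of age
b_age2 <- -0.1408   # coefficient of I(age^2)

# vertex of the parabola b_age * age + b_age2 * age^2
peak_age <- -b_age / (2 * b_age2)
round(peak_age, 1)
```

This gives a peak at about 5.4 years for the point estimates; the uncertainty around this value would have to be obtained by applying the same formula to each row of bsim@coef.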
n.sim <- 5000 bsim <- sim(mod, n.sim=n.sim) apply(bsim@coef, 2, quantile, prob=c(0.025,0.5,0.975)) ## (Intercept) age I(age^2) size.ha.z ## 2.5% -7.006715 0.3158791 -0.26708865 -1.14757192 ## 50% -4.196504 1.5118620 -0.14034083 -0.54749587 ## 97.5% -1.445242 2.7196036 -0.01837473 0.02976658 par(mar=c(4,4,1,1)) plot(jitter(dat$age,amount=0.1),jitter(dat$bp/dat$size.ha,amount=0.1), pch=16, las=1, cex.lab=1.2, cex.axis=1, cex=0.7, xlab="Age of wildflower field [yrs]", ylab="Density of Whitethroat [bp/ha]") # add credible/compatibility interval newdat <- data.frame(age=seq(1, 9, length=100), size.ha.z=0) Xmat <- model.matrix(~age + I(age^2) + size.ha.z, data=newdat) b <- coef(mod) newdat$fit <- exp(Xmat%*%b) fitmat <- matrix(ncol=n.sim, nrow=nrow(newdat)) for(i in 1:n.sim) fitmat[,i] <- exp(Xmat%*%bsim@coef[i,]) newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025) newdat$upr <- apply(fitmat, 1, quantile, prob=0.975) lines(newdat$age, newdat$fit, lwd=2) lines(newdat$age, newdat$lwr, lty=3) lines(newdat$age, newdat$upr, lty=3) Figure 14.9: Whitethroat densities are highest in wildflower fields that are around 4 to 6 years old. Dots are the raw data, the bold line gives the fitted values (with the 95% compatibility interval given with dotted lines) for wildflower fields of different ages (years). The fitted values are given for average field sizes of 1.4 ha. "],["glmm.html", "15 Generalized linear mixed models 15.1 Introduction 15.2 Summary", " 15 Generalized linear mixed models 15.1 Introduction In chapter 13 on linear mixed effect models we have introduced how to analyze metric outcome variables for which a normal error distribution can be assumed (potentially after transformation), when the data have a hierarchical structure and, as a consequence, observations are not independent. 
In chapter 14 on generalized linear models we have introduced how to analyze outcome variables for which a normal error distribution can not be assumed, as for example binary outcomes or count data. More precisely, we have extended modelling outcomes with normal error to modelling outcomes with error distributions from the exponential family (e.g., binomial or Poisson). Generalized linear mixed models (GLMM) combine the two complexities and are used to analyze outcomes with a non-normal error distribution when the data have a hierarchical structure. In this chapter, we will show how to analyze such data. Remember, a hierarchical structure of the data means that the data are collected at different levels, for example smaller and larger spatial units, or include repeated measurements in time on a specific subject. Typically, the outcome variable is measured/observed at the lowest level but other variables may be measured at different levels. A first example is introduced in the next section. 15.1.1 Binomial Mixed Model 15.1.1.1 Background To illustrate the binomial mixed model we use a subset of a data set used by Grüebler, Korner-Nievergelt, and Von Hirschheydt (2010) on barn swallow Hirundo rustica nestling survival (we selected a nonrandom sample to be able to fit a simple model; hence, the results do not add unbiased knowledge about the swallow biology!). For 63 swallow broods, we know the clutch size and the number of the nestlings that fledged. The broods came from 51 farms (larger unit), thus some of the farms had more than one brood. Note that each farm can harbor one or several broods, and the broods are nested within farms (as opposed to crossed, see chapter 13), i.e., each brood belongs to only one farm. There are three predictors measured at the level of the farm: colony size (the number of swallow broods on that farm), cow (whether there are cows on the farm or not), and dung heap (the number of dung heaps, piles of cow dung, within 500 m of the farm). 
The aim was to assess how swallows profit from insects that are attracted by livestock on the farm and by dung heaps. Broods from the same farm are not independent of each other because they belong to the same larger unit (farm), and thus share the characteristics of the farm (measured or unmeasured). Predictor variables were measured at the level of the farm, and are thus the same for all broods from a farm. In the model described and fitted below, we account for the non-independence of these clutches when building the model by including a random intercept per farm to model random variation between farms. The outcome variable is a proportion (proportion fledged from clutch) and thus consists of two values for each observation, as seen with the binomial model without random factors (Section 14.2.2): the number of chicks that fledged (successes) and the number of chicks that died (failures), i.e., the clutch size minus number that fledged. The random factor farm adds a farm-specific deviation \\(b_g\\) to the intercept in the linear predictor. These deviations are modeled as normally distributed with mean \\(0\\) and standard deviation \\(\\sigma_g\\). \\[ y_i \\sim binomial\\left(p_i, n_i\\right)\\\\ logit\\left(p_i\\right) = \\beta_0 + b_{g[i]} + \\beta_1\\;colonysize_i + \\beta_2\\;I\\left(cow_i = 1\\right) + \\beta_3\\;dungheap_i\\\\ b_g \\sim normal\\left(0, \\sigma_g\\right) \\] # Data on Barn Swallow (Hirundo rustica) nestling survival on farms # (a part of the data published in Grüebler et al. 2010, J Appl Ecol 47:1340-1347) library(blmeco) data(swallowfarms) #?swallowfarms # to see the documentation of the data set dat <- swallowfarms str(dat) ## 'data.frame': 63 obs. of 6 variables: ## $ farm : int 1001 1002 1002 1002 1004 1008 1008 1008 1010 1016 ... ## $ colsize: int 1 4 4 4 1 11 11 11 3 3 ... ## $ cow : int 1 1 1 1 1 1 1 1 0 1 ... ## $ dung : int 0 0 0 0 1 1 1 1 2 2 ... ## $ clutch : int 8 9 8 7 13 7 9 16 10 8 ... 
## $ fledge : int 8 0 6 5 9 3 7 4 9 8 ... # check number of farms in the data set length(unique(dat$farm)) ## [1] 51 15.1.1.2 Fitting a Binomial Mixed Model in R 15.1.1.2.1 Using the glmer function dat$colsize.z <- scale(dat$colsize) # z-transform values for better model convergence dat$dung.z <- scale(dat$dung) dat$die <- dat$clutch - dat$fledge dat$farm.f <- factor(dat$farm) # for clarity we define farm as a factor The glmer function uses the standard way to formulate a statistical model in R, with the outcome on the left, followed by the ~ symbol, meaning explained by, followed by the predictors, which are separated by +. The notation for the random factor with only a random intercept was introduced in chapter 13 and is (1|farm.f) here. Remember that for fitting a binomial model we have to provide the number of successful events (number of fledglings that survived) and the number of failures (those that died) within a two-column matrix that we create using the function cbind. # fit GLMM using glmer function from lme4 package library(lme4) mod.glmer <- glmer(cbind(fledge,die) ~ colsize.z + cow + dung.z + (1|farm.f) , data=dat, family=binomial) 15.1.1.2.2 Assessing Model Assumptions for the glmer fit The residuals of the model look fairly normal (top left panel of Figure 15.1) with slightly wider tails. The random intercepts for the farms look perfectly normal as they should. The plot of the residuals vs. fitted values (bottom left panel) shows a slight increase in the residuals with increasing fitted values. Positive correlations between the residuals and the fitted values are common in mixed models due to the shrinkage effect (chapter 13). For the same reason, the fitted proportions slightly underestimate the observed proportions when these are large, but overestimate them when small (bottom right panel). 
What looks like a lack of fit here can be seen as preventing an overestimation of the among-farm variance based on the assumption that the farms in the data are a random sample of farms belonging to the same population. The mean of the random effects is close to zero as it should be. We check that because sometimes the glmer function fails to correctly separate the farm-specific intercepts from the overall intercept. A non-zero mean of random effects does not mean a lack of fit, but a failure of the model fitting algorithm. In such a case, we recommend using a different fitting algorithm, e.g. brm (see below) or stan_glmer from the rstanarm package. A slight overdispersion (approximated dispersion parameter >1) seems to be present, but nothing to worry about. par(mfrow=c(2,2), mar=c(3,5,1,1)) # check normal distribution of residuals qqnorm(resid(mod.glmer), main="qq-plot residuals") qqline(resid(mod.glmer)) # check normal distribution of random intercepts qqnorm(ranef(mod.glmer)$farm.f[,1], main="qq-plot, farm") qqline(ranef(mod.glmer)$farm.f[,1]) # residuals vs fitted values to check homoscedasticity plot(fitted(mod.glmer), resid(mod.glmer)) abline(h=0) # plot data vs. predicted values dat$fitted <- fitted(mod.glmer) plot(dat$fitted,dat$fledge/dat$clutch) abline(0,1) Figure 15.1: Diagnostic plots to assess model assumptions for mod.glmer. Upper left: quantile-quantile plot of the residuals vs. theoretical quantiles of the normal distribution. Upper right: quantile-quantile plot of the random effects farm. Lower left: residuals vs. fitted values. Lower right: observed vs. fitted values. # check distribution of random effects mean(ranef(mod.glmer)$farm.f[,1]) ## [1] -0.001690303 # check for overdispersion dispersion_glmer(mod.glmer) ## [1] 1.192931 detach(package:lme4) 15.1.1.2.3 Using the brm function Now we fit the same model using the function brm from the R package brms. 
This function allows fitting Bayesian generalized (non-)linear multivariate multilevel models using Stan (Betancourt 2013) for full Bayesian inference. We briefly introduce the fitting algorithm used by Stan, Hamiltonian Monte Carlo, in chapter 18. When using the function brm there is no need to install rstan or write the model in Stan-language. A wide range of distributions and link functions are supported, and the function offers much more. Here we use it to fit the model as specified by the formula object above. Note that brm requires that a binomial outcome is specified in the format successes|trials(), which is the number of fledged nestlings out of the total clutch size in our case. In contrast, the glmer function requires specifying the number of nestlings that fledged and died (which together sum up to clutch size), in the format cbind(successes, failures). The family is also called binomial in brm, but would be bernoulli for a binary outcome, whereas glmer would use binomial in both situations (Bernoulli distribution is a special case of the binomial). However, it is slightly confusing that (at the time of writing this chapter) the documentation for brmsfamily did not mention the binomial family under Usage, where it probably went missing, but it is mentioned under Arguments for the argument family. Prior distributions are an integral part of a Bayesian model, therefore we need to specify prior distributions. We can see what default prior distributions brm is using by applying the get_prior function to the model formula. The default prior for the effect sizes is a flat prior which gives a density of 1 for any value between minus and plus infinity. Because this is not a proper probability distribution it is also called an improper distribution. The intercept gets a t-distribution with a mean of 0, a scale of 2.5 and 3 degrees of freedom. 
Transforming this t-distribution to the proportion scale (using the inverse-logit function) becomes something similar to a uniform distribution between 0 and 1 that can be seen as non-informative for a probability. For the among-farm standard deviation, it uses the same t-distribution as for the intercept. However, because variance parameters such as standard deviations only can take on positive numbers, it will use only the positive half of the t-distribution (this is not seen in the output of get_prior). When we have no prior information on any parameter, or if we would like to base the results solely on the information in the data, we specify weakly informative prior distributions that do not noticeably affect the results but they will facilitate the fitting algorithm. This is true for the priors of the intercept and among-farm standard deviation. However, for the effect sizes, we prefer specifying more narrow distributions (see chapter 10). To do so, we use the function prior. To apply MCMC sampling we need some more arguments: warmup specifies the number of iterations during which we allow the algorithm to be adapted to our specific model and to converge to the posterior distribution. These iterations should be discarded (similar to the burn-in period when using, e.g., Gibbs sampling); iter specifies the total number of iterations (including those discarded); chains specifies the number of chains; init specifies the starting values of the iterations. By default (init=NULL) or by setting init=\"random\" the initial values are randomly chosen which is recommended because then different initial values are chosen for each chain which helps to identify non-convergence. However, sometimes random initial values cause the Markov chains to behave badly. Then you can either use the maximum likelihood estimates of the parameters as starting values, or simply ask the algorithm to start with zeros. 
thin specifies the thinning of the chain, i.e., whether all iterations should be kept (thin=1) or for example every 4th only (thin=4); cores specifies the number of cores used for the algorithm; seed specifies the random seed, allowing for replication of results. library(brms) # check which parameters need a prior get_prior(fledge|trials(clutch) ~ colsize.z + cow + dung.z + (1|farm.f), data=dat, family=binomial(link="logit")) ## prior class coef group resp dpar nlpar lb ub ## (flat) b ## (flat) b colsize.z ## (flat) b cow ## (flat) b dung.z ## student_t(3, 0, 2.5) Intercept ## student_t(3, 0, 2.5) sd 0 ## student_t(3, 0, 2.5) sd farm.f 0 ## student_t(3, 0, 2.5) sd Intercept farm.f 0 ## source ## default ## (vectorized) ## (vectorized) ## (vectorized) ## default ## default ## (vectorized) ## (vectorized) # specify own priors myprior <- prior(normal(0,5), class="b") mod.brm <- brm(fledge|trials(clutch) ~ colsize.z + cow + dung.z + (1|farm.f) , data=dat, family=binomial(link="logit"), prior=myprior, warmup = 500, iter = 2000, chains = 2, init = "random", cores = 2, seed = 123) # note: thin=1 is default and we did not change this here. 15.1.1.2.4 Checking model convergence for the brm fit We first check whether we find warnings in the R console about problems of the fitting algorithm. Warnings should be taken seriously. Often, the Stan online documentation (or typing launch_shinystan(mod.brm) into the R console) helps us find out what to change when calling the brm function to get a fit that runs smoothly. Once we have gotten rid of all warnings, we need to check how well the Markov chains mixed. We can either do that by scanning through the many diagnostic plots given by launch_shinystan(mod.brm) or create the most important plots ourselves such as the traceplot (Figure 15.2). par(mar=c(2,2,2,2)) mcmc_plot(mod.brm, type = "trace") Figure 15.2: Traceplot of the Markov chains. After convergence, both Markov chains should sample from the same stationary distribution. 
Indications of non-convergence would be if the two chains diverge or vary around different means. 15.1.1.2.5 Checking model fit by posterior predictive model checking To assess how well the model fits the data we do posterior predictive model checking (Chapter 16). For binomial as well as for Poisson models, comparing the standard deviation of the data with those of replicated data from the model is particularly important. If the standard deviation of the real data were much higher than those of the replicated data from the model, overdispersion would be an issue. However, here, the model is able to capture the variance in the data correctly (Figure 15.3). The fitted vs observed plot also shows a good fit. yrep <- posterior_predict(mod.brm) sdyrep <- apply(yrep, 1, sd) par(mfrow=c(1,3), mar=c(3,4,1,1)) hist(yrep, freq=FALSE, main=NA, xlab="Number of fledglings") hist(dat$fledge, add=TRUE, col=rgb(1,0,0,0.3), freq=FALSE) legend(10, 0.15, fill=c("grey",rgb(1,0,0,0.3)), legend=c("yrep", "y")) hist(sdyrep) abline(v=sd(dat$fledge), col="red", lwd=2) plot(fitted(mod.brm)[,1], dat$fledge, pch=16, cex=0.6) abline(0,1) Figure 15.3: Posterior predictive model checking: Histogram of the number of fledglings simulated from the model together with a histogram of the real data, and a histogram of the standard deviations of replicated data from the model together with the standard deviation of the data (vertical line in red). The third plot gives the fitted vs. observed values. After checking the diagnostic plots, the posterior predictive model checking and the general model fit, we assume that the model describes the data generating process reasonably well, so that we can proceed to drawing conclusions. 15.1.1.3 Drawing Conclusions The generic summary function gives us the results for the model object containing the fitted model, and works for both the model fitted with glmer and brm. Let's start by having a look at the summary from mod.glmer. 
The summary provides the fitting method, the model formula, statistics for the model fit including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the scaled residuals, the random effects variance and information about observations and groups, a table with coefficient estimates for the fixed effects (with standard errors and a z-test for the coefficient) and correlations between fixed effects. We recommend to always check if the number of observations and groups, i.e., 63 barn swallow nests from 51 farms here, is correct. This information shows if the glmer function has correctly recognized the hierarchical structure in the data. Here, this is correct. To assess the associations between the predictor variables and the outcome analyzed, we need to look at the column Estimate in the table of fixed effects. This column contains the estimated model coefficients, and the standard error for these estimates is given in the column Std. Error, along with a z-test for the null hypothesis of a coefficient of zero. In the random effects table, the among farm variance and standard deviation (square root of the variance) are given. The function confint shows the 95% confidence intervals for the random effects (.sig01) and fixed effects estimates. In the summary output from mod.brm we see the model formula and some information on the Markov chains after the warm-up. In the group-level effects (between group standard deviations) and population-level effects (effect sizes, model coefficients) tables some summary statistics of the posterior distribution of each parameter are given. The Estimate is the mean of the posterior distribution, the Est.Error is the standard deviation of the posterior distribution (which is the standard error of the parameter estimate). Then we see the lower and upper limit of the 95% credible interval. 
Also, some statistics for measuring how well the Markov chains converged are given: the Rhat and the effective sample size (ESS). The bulk ESS tells us how many independent samples we have to describe the posterior distribution, and the tail ESS tells us on how many samples the limits of the 95% credible interval is based on. Because we used the logit link function, the coefficients are actually on the logit scale and are a bit difficult to interpret. What we can say is that positive coefficients indicate an increase and negative coefficients indicate a decrease in the proportion of nestlings fledged. For continuous predictors, as colsize.z and dung.z, this coefficient refers to the change in the logit of the outcome with a change of one in the predictor (e.g., for colsize.z an increase of one corresponds to an increase of a standard deviation of colsize). For categorical predictors, the coefficients represent a difference between one category and another (reference category is the one not shown in the table). To visualize the coefficients we could draw effect plots. # glmer summary(mod.glmer) ## Generalized linear mixed model fit by maximum likelihood (Laplace ## Approximation) [glmerMod] ## Family: binomial ( logit ) ## Formula: cbind(fledge, die) ~ colsize.z + cow + dung.z + (1 | farm.f) ## Data: dat ## ## AIC BIC logLik deviance df.resid ## 282.5 293.2 -136.3 272.5 58 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -3.2071 -0.4868 0.0812 0.6210 1.8905 ## ## Random effects: ## Groups Name Variance Std.Dev. ## farm.f (Intercept) 0.2058 0.4536 ## Number of obs: 63, groups: farm.f, 51 ## ## Fixed effects: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.09533 0.19068 -0.500 0.6171 ## colsize.z 0.05087 0.11735 0.434 0.6646 ## cow 0.39370 0.22692 1.735 0.0827 . ## dung.z -0.14236 0.10862 -1.311 0.1900 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Correlation of Fixed Effects: ## (Intr) clsz.z cow ## colsize.z 0.129 ## cow -0.828 -0.075 ## dung.z 0.033 0.139 -0.091 confint.95 <- confint(mod.glmer); confint.95 ## 2.5 % 97.5 % ## .sig01 0.16809483 0.7385238 ## (Intercept) -0.48398346 0.2863200 ## colsize.z -0.18428769 0.2950063 ## cow -0.05360035 0.8588134 ## dung.z -0.36296714 0.0733620 # brm summary(mod.brm) ## Family: binomial ## Links: mu = logit ## Formula: fledge | trials(clutch) ~ colsize.z + cow + dung.z + (1 | farm.f) ## Data: dat (Number of observations: 63) ## Draws: 2 chains, each with iter = 2000; warmup = 500; thin = 1; ## total post-warmup draws = 3000 ## ## Group-Level Effects: ## ~farm.f (Number of levels: 51) ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## sd(Intercept) 0.55 0.16 0.26 0.86 1.00 910 1284 ## ## Population-Level Effects: ## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS ## Intercept -0.10 0.21 -0.52 0.32 1.00 2863 2165 ## colsize.z 0.05 0.14 -0.21 0.34 1.00 2266 1794 ## cow 0.41 0.25 -0.06 0.90 1.00 3069 2117 ## dung.z -0.15 0.12 -0.38 0.09 1.00 3254 2241 ## ## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS ## and Tail_ESS are effective sample size measures, and Rhat is the potential ## scale reduction factor on split chains (at convergence, Rhat = 1). From the results we conclude that in farms without cows (when cow=0) and for average colony sizes (when colsize.z=0) and average number of dung heaps (when dung.z=0) the average nestling survival of Barn swallows is the inverse-logit function of the Intercept, thus, plogis(-0.1) = 0.47 with a 95% uncertainty interval of 0.37 - 0.58. We further see that colony size and number of dung heaps are less important than whether cows are present or not. Their estimated partial effect is small and their uncertainty interval includes only values close to zero. However, whether cows are present or not may be important for the survival of nestlings. 
The average nestling survival in farms with cows is plogis(-0.1+0.41) = 0.58. To get the uncertainty interval of this survival estimate, we need to do the calculation for every draw from the joint posterior distribution of both parameters.
bsim <- posterior_samples(mod.brm)
# survival of nestlings on farms with cows:
survivalest <- plogis(bsim$b_Intercept + bsim$b_cow)
quantile(survivalest, probs=c(0.025, 0.975)) # 95% uncertainty interval
## 2.5% 97.5%
## 0.5126716 0.6412675
In medical research, it is standard to report the fixed-effects coefficients from a GLMM with binomial or Bernoulli error as odds ratios by taking the exponent (R function exp for \\(e^{()}\\)) of the coefficient on the logit scale. For example, the coefficient for cow from mod.glmer, 0.39 (95% CI from -0.05 to 0.86), represents an odds ratio of exp(0.39) = 1.48 (95% CI from 0.95 to 2.36). This means that the odds for fledging (vs. not fledging) from a clutch from a farm with livestock present are about 1.5 times larger than the odds for fledging if no livestock is present (relative effect). 15.2 Summary "],["modelchecking.html", "16 Posterior predictive model checking 16.1 Introduction 16.2 Summary", " 16 Posterior predictive model checking THIS CHAPTER IS UNDER CONSTRUCTION!!! 16.1 Introduction 16.2 Summary xxx "],["model_comparison.html", "17 Model comparison and multimodel inference 17.1 Introduction 17.2 Summary", " 17 Model comparison and multimodel inference THIS CHAPTER IS UNDER CONSTRUCTION!!! 17.1 Introduction literature to refer to: Tredennick et al. (2021) 17.2 Summary xxx "],["stan.html", "18 MCMC using Stan 18.1 Background 18.2 Install rstan 18.3 Writing a Stan model 18.4 Run Stan from R Further reading", " 18 MCMC using Stan 18.1 Background Markov chain Monte Carlo (MCMC) simulation techniques were developed in the mid-1950s by physicists (Metropolis et al., 1953). 
Later, statisticians discovered MCMC (Hastings, 1970; Geman & Geman, 1984; Tanner & Wong, 1987; Gelfand et al., 1990; Gelfand & Smith, 1990). MCMC methods make it possible to obtain posterior distributions for parameters and latent variables (unobserved variables) of complex models. In parallel, personal computer capacities increased in the 1990s and user-friendly software such as the different programs based on the programming language BUGS (Spiegelhalter et al., 2003) came out. These developments boosted the use of Bayesian data analyses, particularly in genetics and ecology. 18.2 Install rstan In this book we use the program Stan to draw random samples from the joint posterior distribution of the model parameters given a model, the data, prior distributions, and initial values. To do so, it uses the no-U-turn sampler, which is a type of Hamiltonian Monte Carlo simulation (Hoffman and Gelman 2014; Betancourt 2013), and optimization-based point estimation. These algorithms are more efficient than the ones implemented in BUGS programs and they can handle larger data sets. Stan works particularly well for hierarchical models (Betancourt and Girolami 2013). Stan runs on Windows, Mac, and Linux and can be used via the R interface rstan. Stan is automatically installed when the R package rstan is installed. For installing rstan, it is advised to follow closely the system-specific instructions. 18.3 Writing a Stan model The statistical model is written in the Stan language and saved in a text file. The Stan language is rather strict, forcing the user to write unambiguous models. Stan is very well documented and the Stan Documentation contains a comprehensive Language Manual, a Wiki documentation and various tutorials. We here provide a normal regression with one predictor variable as a worked example. 
The entire Stan model is as follows (saved as linreg.stan):
data {
  int<lower=0> n;
  vector[n] y;
  vector[n] x;
}
parameters {
  vector[2] beta;
  real<lower=0> sigma;
}
model {
  // priors
  beta ~ normal(0,5);
  sigma ~ cauchy(0,5);
  // likelihood
  y ~ normal(beta[1] + beta[2] * x, sigma);
}
A Stan model consists of different named blocks. These blocks are (from first to last): data, transformed data, parameters, transformed parameters, model, and generated quantities. The blocks must appear in this order. The model block is mandatory; all other blocks are optional. In the data block, the type, dimension, and name of every variable have to be declared. Optionally, the range of possible values can be specified. For example, vector[n] y; means that y is a vector (type real) of length n, and int<lower=0> n; means that n is an integer with nonnegative values (the bound, here 0, is included). Note that the restriction to a possible range of values is not strictly necessary but this will help specifying the correct model and it will improve speed. We also see that each statement needs to be terminated by a semicolon. In the parameters block, all model parameters have to be defined. The coefficients of the linear predictor constitute a vector of length 2, vector[2] beta;. Alternatively, real beta[2]; could be used. The sigma parameter is a one-number parameter that has to be positive, therefore real<lower=0> sigma;. The model block contains the model specification. Stan functions can handle vectors and we do not have to loop over all observations as is typical for BUGS. Here, we use a Cauchy distribution as a prior distribution for sigma. This distribution can have negative values, but because we defined the lower limit of sigma to be 0 in the parameters block, the prior distribution actually used in the model is a truncated Cauchy distribution (truncated at zero). In Chapter 10.2 we explain how to choose prior distributions. 
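To make the truncation of the Cauchy prior more concrete, here is a small R sketch (not part of the original model code; the function name dhalfcauchy and all object names are our own choices for illustration). It checks numerically that a Cauchy(0, 5) density restricted to nonnegative values and renormalized is a proper density, and it draws from this half-Cauchy distribution by taking absolute values of Cauchy draws.

```r
# Sketch: prior on sigma implied by `sigma ~ cauchy(0,5)` combined with
# `real<lower=0> sigma`, i.e. a Cauchy(0, 5) truncated at zero.
prior_scale <- 5

# Probability mass of the untruncated Cauchy that lies above zero (0.5)
mass_above_zero <- 1 - pcauchy(0, location = 0, scale = prior_scale)

# Density of the truncated distribution (hypothetical helper name)
dhalfcauchy <- function(x) {
  ifelse(x >= 0,
         dcauchy(x, location = 0, scale = prior_scale) / mass_above_zero,
         0)
}

# The truncated density should integrate to 1 over [0, Inf)
total <- integrate(dhalfcauchy, lower = 0, upper = Inf)$value

# Draws from the half-Cauchy: absolute values of Cauchy draws
set.seed(1)
sigma_draws <- abs(rcauchy(10000, location = 0, scale = prior_scale))
median_draw <- median(sigma_draws)  # theoretical median equals the scale
```

The heavy right tail of this prior means that very large values of sigma remain possible, which is one reason why the median rather than the mean is used above to summarize the draws.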
Further characteristics of the Stan language that are good to know include:
- The spread of the normal distribution is specified as the standard deviation (like in R but different from BUGS, where the precision is used).
- If no prior is specified, Stan uses a uniform prior over the range of possible values as specified in the parameters block.
- Variable names must not contain periods, for example, x.z would not be allowed, but x_z is allowed.
- To comment out a line, use double forward-slashes //.
18.4 Run Stan from R We fit the model to simulated data. Stan needs a named list of the data objects, or a vector containing their names. In our case, x, y, and n are objects that exist in the R console, and we bundle them into a list. The function stan() starts Stan and returns an object containing MCMCs for every model parameter. We have to specify the name of the file that contains the model specification and the data; the number of chains and the number of iterations per chain can also be set (here we use the defaults of 4 chains with 2000 iterations each). The first half of the iterations of each chain is declared as the warm-up. During the warm-up, Stan is not simulating a Markov chain, because in every step the algorithm is adapted. After the warm-up the algorithm is fixed and Stan simulates Markov chains.
library(rstan)
# Simulate fake data
n <- 50 # sample size
sigma <- 5 # standard deviation of the residuals
b0 <- 2 # intercept
b1 <- 0.7 # slope
x <- runif(n, 10, 30) # random numbers of the covariate
simresid <- rnorm(n, 0, sd=sigma) # residuals
y <- b0 + b1*x + simresid # calculate y, i.e. the data
# Bundle data into a list
datax <- list(n=length(y), y=y, x=x)
# Run Stan
fit <- stan(file = "stanmodels/linreg.stan", data=datax, verbose = FALSE)
##
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 2.5e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
## Chain 1: Adjust your expectations accordingly! 
## Chain 1: ## Chain 1: ## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 1: ## Chain 1: Elapsed Time: 0.055 seconds (Warm-up) ## Chain 1: 0.043 seconds (Sampling) ## Chain 1: 0.098 seconds (Total) ## Chain 1: ## ## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2). ## Chain 2: ## Chain 2: Gradient evaluation took 5e-06 seconds ## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds. ## Chain 2: Adjust your expectations accordingly! ## Chain 2: ## Chain 2: ## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 2: ## Chain 2: Elapsed Time: 0.049 seconds (Warm-up) ## Chain 2: 0.043 seconds (Sampling) ## Chain 2: 0.092 seconds (Total) ## Chain 2: ## ## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3). 
## Chain 3: ## Chain 3: Gradient evaluation took 5e-06 seconds ## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds. ## Chain 3: Adjust your expectations accordingly! ## Chain 3: ## Chain 3: ## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 3: ## Chain 3: Elapsed Time: 0.049 seconds (Warm-up) ## Chain 3: 0.048 seconds (Sampling) ## Chain 3: 0.097 seconds (Total) ## Chain 3: ## ## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4). ## Chain 4: ## Chain 4: Gradient evaluation took 6e-06 seconds ## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds. ## Chain 4: Adjust your expectations accordingly! 
## Chain 4: ## Chain 4: ## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 4: ## Chain 4: Elapsed Time: 0.051 seconds (Warm-up) ## Chain 4: 0.046 seconds (Sampling) ## Chain 4: 0.097 seconds (Total) ## Chain 4: Further reading Stan homepage: it contains the documentation for Stan and a lot of tutorials. "],["ridge_regression.html", "19 Ridge Regression 19.1 Introduction", " 19 Ridge Regression THIS CHAPTER IS UNDER CONSTRUCTION!!! We should provide an example in Stan. 19.1 Introduction # Settings library(R2OpenBUGS) bugslocation <- "C:/Program Files/OpenBUGS323/OpenBugs.exe" # location of OpenBUGS bugsworkingdir <- file.path(getwd(), "BUGS") # Bugs working directory #------------------------------------------------------------------------------- # Simulate fake data #------------------------------------------------------------------------------- library(MASS) n <- 50 # sample size b0 <- 1.2 b <- rnorm(5, 0, 2) Sigma <- matrix(c(10,3,3,2,1, 3,2,3,2,1, 3,3,5,3,2, 2,2,3,10,3, 1,1,2,3,15),5,5) Sigma x <- mvrnorm(n = n, rep(0, 5), Sigma) simresid <- rnorm(n, 0, sd=3) # residuals x.z <- x for(i in 1:ncol(x)) x.z[,i] <- (x[,i]-mean(x[,i]))/sd(x[,i]) y <- b0 + x.z%*%b + simresid # calculate y, i.e. 
the data #------------------------------------------------------------------------------- # Function to generate initial values #------------------------------------------------------------------------------- inits <- function() { list(b0=runif(1, -2, 2), b=runif(5, -2, 2), sigma=runif(1, 0.1, 2)) } #------------------------------------------------------------------------------- # Run OpenBUGS #------------------------------------------------------------------------------- parameters <- c("b0", "b", "sigma") lambda <- c(1, 2, 10, 25, 50, 100, 500, 1000, 10000) bs <- matrix(ncol=length(lambda), nrow=length(b)) bse <- matrix(ncol=length(lambda), nrow=length(b)) for(j in 1:length(lambda)){ datax <- list(y=as.numeric(y), x=x, n=n, mb=rep(0, 5), lambda=lambda[j]) fit <- bugs(datax, inits, parameters, model.file="ridge_regression.txt", n.thin=1, n.chains=2, n.burnin=5000, n.iter=10000, debug=FALSE, OpenBUGS.pgm = bugslocation, working.directory=bugsworkingdir) bs[,j] <- fit$mean$b bse[,j] <- fit$sd$b } range(bs) plot(1:length(lambda), seq(-2, 1, length=length(lambda)), type="n") colkey <- rainbow(length(b)) for(j in 1:nrow(bs)){ lines(1:length(lambda), bs[j,], col=colkey[j], lwd=2) lines(1:length(lambda), bs[j,]-2*bse[j,], col=colkey[j], lty=3) lines(1:length(lambda), bs[j,]+2*bse[j,], col=colkey[j], lty=3) } abline(h=0) round(fit$summary,2) #------------------------------------------------------------------------------- # Run WinBUGS #------------------------------------------------------------------------------- library(R2WinBUGS) bugsdir <- "C:/Users/fk/WinBUGS14" # mod <- bugs(datax, inits= inits, parameters, model.file="normlinreg.txt", n.chains=2, n.iter=1000, n.burnin=500, n.thin=1, debug=TRUE, bugs.directory=bugsdir, program="WinBUGS", working.directory=bugsworkingdir) #------------------------------------------------------------------------------- # Test convergence and make inference 
#------------------------------------------------------------------------------- library(blmeco) # Make Figure 12.2 par(mfrow=c(3,1)) historyplot(fit, "beta0") historyplot(fit, "beta1") historyplot(fit, "sigmaRes") # Parameter estimates print(fit$summary, 3) # Make predictions for covariate values between 10 and 30 newdat <- data.frame(x=seq(10, 30, length=100)) Xmat <- model.matrix(~x, data=newdat) predmat <- matrix(ncol=fit$n.sim, nrow=nrow(newdat)) for(i in 1:fit$n.sim) predmat[,i] <- Xmat%*%c(fit$sims.list$beta0[i], fit$sims.list$beta1[i]) newdat$lower.bugs <- apply(predmat, 1, quantile, prob=0.025) newdat$upper.bugs <- apply(predmat, 1, quantile, prob=0.975) plot(y~x, pch=16, las=1, cex.lab=1.4, cex.axis=1.2, type="n", main="") polygon(c(newdat$x, rev(newdat$x)), c(newdat$lower.bugs, rev(newdat$upper.bugs)), col=grey(0.7), border=NA) abline(c(fit$mean$beta0, fit$mean$beta1), lwd=2) box() points(x,y) "],["SEM.html", "20 Structural equation models 20.1 Introduction", " 20 Structural equation models THIS CHAPTER IS UNDER CONSTRUCTION!!! We should provide an example in Stan. 
20.1 Introduction ------------------------------------------------------------------------------------------------------ # General settings #------------------------------------------------------------------------------------------------------ library(MASS) library(rjags) library(MCMCpack) #------------------------------------------------------------------------------------------------------ # Simulation #------------------------------------------------------------------------------------------------------ n <- 100 heffM <- 0.6 # effect of H on M heffCS <- 0.0 # effect of H on Clutch size meffCS <- 0.6 # effect of M on Clutch size SigmaM <- matrix(c(0.1,0.04,0.04,0.1),2,2) meffm1 <- 0.6 meffm2 <- 0.7 SigmaH <- matrix(c(0.1,0.04,0.04,0.1),2,2) meffh1 <- 0.6 meffh2 <- -0.7 # Latente Variablen H <- rnorm(n, 0, 1) M <- rnorm(n, heffM * H, 0.1) # Clutch size CS <- rnorm(n, heffCS * H + meffCS * M, 0.1) # Indicators eM <- cbind(meffm1 * M, meffm2 * M) datM <- matrix(NA, ncol = 2, nrow = n) eH <- cbind(meffh1 * H, meffh2 * H) datH <- matrix(NA, ncol = 2, nrow = n) for(i in 1:n) { datM[i,] <- mvrnorm(1, eM[i,], SigmaM) datH[i,] <- mvrnorm(1, eH[i,], SigmaH) } #------------------------------------------------------------------------------ # JAGS Model #------------------------------------------------------------------------------ dat <- list(datM = datM, datH = datH, n = n, CS = CS, #H = H, M = M, S3 = matrix(c(1,0,0,1),nrow=2)/1) # Function to create initial values inits <- function() { list( meffh = runif(2, 0, 0.1), meffm = runif(2, 0, 0.1), heffM = runif(1, 0, 0.1), heffCS = runif(1, 0, 0.1), meffCS = runif(1, 0, 0.1), tauCS = runif(1, 0.1, 0.3), tauMH = runif(1, 0.1, 0.3), tauH = rwish(3,matrix(c(.02,0,0,.04),nrow=2)), tauM = rwish(3,matrix(c(.02,0,0,.04),nrow=2)) # M = as.numeric(rep(0, n)) ) } t.n.thin <- 50 t.n.chains <- 2 t.n.burnin <- 20000 t.n.iter <- 50000 # Run JAGS jagres <- jags.model('JAGS/BUGSmod1.R',data = dat, n.chains = t.n.chains, inits = inits, n.adapt 
= t.n.burnin) params <- c("meffh", "meffm", "heffM", "heffCS", "meffCS") mod <- coda.samples(jagres, params, n.iter=t.n.iter, thin=t.n.thin) res <- round(data.frame(summary(mod)$quantiles[, c(3, 1, 5)]), 3) res$TRUEVALUE <- c(heffCS, heffM, meffCS, meffh1, meffh2, meffm1, meffm2) res # Traceplots post <- data.frame(rbind(mod[[1]], mod[[2]])) names(post) <- dimnames(mod[[1]])[[2]] par(mfrow = c(3,3)) param <- c("meffh[1]", "meffh[2]", "meffm[1]", "meffm[2]", "heffM", "heffCS", "meffCS") traceplot(mod[, match(param, names(post))]) "],["spatial_glmm.html", "21 Modeling spatial data using GLMM 21.1 Introduction 21.2 Summary", " 21 Modeling spatial data using GLMM THIS CHAPTER IS UNDER CONSTRUCTION!!! 21.1 Introduction 21.2 Summary xxx "],["PART-III.html", "22 Introduction to PART III 22.1 Model notations", " 22 Introduction to PART III This part is a collection of more complicated ecological models for analysing data that cannot be analysed with the traditional linear models that we covered in PART I of this book. 22.1 Model notations It is unavoidable that different authors use different notations for the same thing, or that the same notation is used for different things. We try to use, whenever possible, notation that is commonly used at the International Statistical Ecology Congress ISEC. Resulting from an earlier ISEC, Thomson et al. (2009) give guidelines on what letter should be used for which parameter in order to achieve a standard notation at least among people working with classical mark-recapture models. However, the alphabet has fewer letters than there are ecological parameters. Therefore, the same letter cannot stand for the same parameter across all papers, books and chapters. Here, we try to use the same letter for the same parameter within the same chapter. 
"],["zeroinflated-poisson-lmm.html", "23 Zero-inflated Poisson Mixed Model 23.1 Introduction 23.2 Example data 23.3 Model 23.4 Further packages and readings$", " 23 Zero-inflated Poisson Mixed Model 23.1 Introduction Usually we describe the outcome variable with a single distribution, such as the normal distribution in the case of linear (mixed) models, and Poisson or binomial distributions in the case of generalized linear (mixed) models. In life sciences, however, quite often the data are actually generated by more than one process. In such cases the distribution of the data could be the result of two or more different distributions. If we do not account for these different processes our inferences are likely to be biased. In this chapter, we introduce a mixture model that explicitly include two processes that generated the data. The zero-inflated Poisson model is a mixture of a binomial and a Poisson distribution. We belief that two (or more)-level models are very useful tools in life sciences because they can help uncover the different processes that generate the data we observe. 23.2 Example data We used the blackstork data from the blmeco-package. They contain the breeding success of Black-stork in Latvia. The data was collected and kindly provided by Maris Stradz. The data contains the number of nestlings of more then 300 Black-stork nests in different years. Counting animals or plants is a typical example of data that contain a lot of zero counts. For example, the number of nestlings produced by a breeding pair is often zero because the whole nest was depredated or because a catastrophic event occurred such as a flood. However, when the nest succeeds, the number of nestlings varies among the successful nests depending on how many eggs the female has laid, how much food the parents could bring to the nest, or other factors that affect the survival of a nestling in an intact nest. 
Thus the factors that determine how many zero counts there are in the data differ from the factors that determine how many nestlings there are, if a nest survives. Count data that are produced by two different processes (one produces the zero counts and the other the variance in the count for the ones that were not zero in the first process) are called zero-inflated data. Histograms of zero-inflated data look bimodal, with one peak at zero (Figure 23.1). Figure 23.1: Histogram of the number of nestlings counted in black stork (Ciconia nigra) nests in Latvia (n = 1130 observations of 279 nests). 23.3 Model The Poisson distribution does not fit such data well, because the data contain more zero counts than expected under the Poisson distribution. Mullahy (1986) and Lambert (1992) formulated two different types of models that combine the two processes in one model and therefore account for the zero excess in the data and allow the analysis of the two processes separately. The hurdle model (Mullahy, 1986) combines a left-truncated count data model (Poisson or negative binomial distribution that only describes the distribution of data larger than zero) with a zero-hurdle model that describes the distribution of the data that are either zero or nonzero. In other words, the hurdle model divides the data into two subsets, the zero counts and the nonzero counts, and fits a separate model to each subset. To account for this division of the data, the two models assume left truncation (all measurements below 1 are missing in the data) and right censoring (all measurements larger than 1 have the value 1), respectively, in their error distributions. A hurdle model can be fitted in R using the function hurdle from the package pscl (Jackman, 2008). See the tutorial by Zeileis et al. (2008) for an introduction. In contrast to the hurdle model, the zero-inflated models (Mullahy, 1986; Lambert, 1992) combine a Bernoulli model (zero vs. 
nonzero) with a conditional Poisson model; conditional on the Bernoulli process being nonzero. Thus this model allows for a mixture of zero counts: some zero counts are zero because the outcome of the Bernoulli process was zero (these zero counts are sometimes called structural zero values), and others are zero because their outcome from the Poisson process was zero. The function zeroinfl from the package pscl fits zero-inflated models (Zeileis et al., 2008). The zero-inflated model may seem to reflect the true process that has generated the data more closely than the hurdle model. However, sometimes the fit of zero-inflated models is impeded because of high correlation of the model parameters between the zero model and the count model. In such cases, a hurdle model may cause less trouble. Neither function (hurdle and zeroinfl) from the package pscl allows the inclusion of random factors. The functions MCMCglmm from the package MCMCglmm (Hadfield, 2010) and glmmadmb from the package glmmADMB (http://glmmadmb.r-forge.r-project.org/) provide the possibility to account for zero-inflation with a GLMM. However, these functions are not very flexible in the types of zero-inflated models they can fit; for example, glmmadmb only includes a constant proportion of zero values. A zero-inflation model using BUGS is described in Kéry and Schaub (2012). Here we use Stan to fit a zero-inflated model. Once we understand the basic model code, it is easy to add predictors and/or random effects to both the zero and the count model. The example data contain numbers of nestlings in black stork (Ciconia nigra) nests in Latvia collected by Maris Stradz and collaborators at 279 nests between 1979 and 2010. Black storks build solid and large aeries on branches of large trees. The same aerie is used for up to 17 years until it collapses. The black stork population in Latvia has drastically declined over the last decades. 
Here, we use the nestling data as presented in Figure 23.1 to describe whether the number of black stork nestlings produced in Latvia decreased over time. We use a zero-inflated Poisson model to separately estimate temporal trends for nest survival and the number of nestlings in successful nests. Since the same nests have been measured repeatedly over 1 to 17 years, we add nest ID as a random factor to both models, the Bernoulli and the Poisson model. After the first model fit, we saw that the between-nest variance in the number of nestlings for the successful nests was close to zero. Therefore, we decided to delete the random effect from the Poisson model. In our final model, \\(z_{it}\\) is a latent (unobserved) variable that takes the values 0 or 1 for each nest \\(i\\) during year \\(t\\). It indicates a structural zero, that is, if \\(z_{it} = 1\\) the number of nestlings \\(y_{it}\\) is always zero, because the expected value in the Poisson model, \\(\\lambda_{it}(1 - z_{it})\\), becomes zero. If \\(z_{it} = 0\\), the expected value in the Poisson model becomes \\(\\lambda_{it}\\). To fit this model in Stan, we first write the Stan model code and save it in a separate text file named zeroinfl.stan. Here is a handy package: https://cran.r-project.org/web/packages/GLMMadaptive/vignettes/ZeroInflated_and_TwoPart_Models.html 23.4 Further packages and readings If the model does not contain any random factor, the R functions from the package pscl can be used to fit zero-inflated binomial or Poisson models (Zeileis, Kleiber, and Jackman 2008). Zero-inflation typically occurs in count data. However, it can also occur in continuous measurements. For example, the amount of rain per day measured in mm is very often zero, and, when it is not zero, it is a number following a specific (possibly normal) continuous distribution. Such data may be analyzed using tobit models (Tobin, 1958). Several R packages provide tobit models, such as censReg (Henningsen, 2013), AER (Kleiber & Zeileis, 2008), and MCMCpack (Martin et al., 2011). 
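As a complement to the description of the zero-inflated Poisson model in Section 23.3, the following R sketch simulates counts from a latent structural-zero indicator combined with a Poisson process. The parameter values (theta, lambda) and all object names are assumptions chosen only for illustration; they are not estimates from the black stork data.

```r
# Sketch: simulate zero-inflated Poisson counts.
# theta  = probability of a structural zero (assumed value)
# lambda = expected count in nests without a structural zero (assumed value)
set.seed(42)
n_obs  <- 5000
theta  <- 0.3
lambda <- 2.5

# Latent indicator: z = 1 means structural zero (e.g. the nest failed)
z <- rbinom(n_obs, size = 1, prob = theta)

# Counts: the Poisson mean is lambda * (1 - z), so it is zero when z = 1
y <- rpois(n_obs, lambda * (1 - z))

# Observed proportion of zeros versus the proportion a plain Poisson
# distribution with the same mean would produce
prop_zero_obs     <- mean(y == 0)
prop_zero_zip     <- theta + (1 - theta) * exp(-lambda)  # expected under the mixture
prop_zero_poisson <- exp(-mean(y))                       # expected under plain Poisson
```

Comparing prop_zero_obs with prop_zero_poisson shows the excess of zeros that a single Poisson distribution cannot reproduce, and hist(y) shows the additional peak at zero.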
"],["dailynestsurv.html", "24 Daily nest survival 24.1 Background 24.2 Models for estimating daily nest survival 24.3 Known fate model 24.4 The Stan model 24.5 Prepare data and run Stan 24.6 Check convergence 24.7 Look at results 24.8 Known fate model for irregular nest controls Further reading", " 24 Daily nest survival 24.1 Background Analysis of nest survival is important for understanding the mechanisms of population dynamics. The life-span of a nest could be used as a measure of nest survival. However, this measure is very often biased towards nests that survived longer because these nests are detected by the ornithologists with higher probability (Mayfield 1975). In order not to overestimate nest survival, daily nest survival conditional on survival to the previous day can be estimated. 24.2 Models for estimating daily nest survival Which model is best depends on the type of data available. The data may come from: (1) regular (e.g. daily) nest controls, all nests monitored from their first egg onward; (2) regular nest controls, nests found during the course of the study at different stages and nestling ages; (3) irregular nest controls, all nests monitored from their first egg onward; (4) irregular nest controls, nests found during the course of the study at different stages and nestling ages. Table 24.1: Models useful for estimating daily nest survival. Data numbers correspond to the descriptions above. Each entry gives the model followed by (data; software, R-code): Binomial or Bernoulli model (1, (3); glm, glmer); Cox proportional hazard model (1, 2, 3, 4; brm, soon: stan_cox); Known fate model (1, 2; Stan code below); Known fate model (3, 4; Stan code below); Logistic exposure model (1, 2, 3, 4; glm, glmer using a link function that depends on exposure time). Shaffer (2004) explains how to adapt the link function in a Bernoulli model to account for having found the nests at different nest ages (exposure time). Ben Bolker explains how to implement the logistic exposure model in R here. 
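The logistic exposure idea can be sketched in R as a custom link function for glm. The following is a minimal sketch under stated assumptions and not necessarily the implementation referred to above: the function name logexp_link, the simulated data and the parameter values are hypothetical. The link relates the probability that a nest survives an interval of d days to the daily survival probability s through s^d, with logit(s) being the linear predictor.

```r
# A logistic exposure link: mu = plogis(eta)^exposure
logexp_link <- function(exposure = 1) {
  structure(list(
    linkfun  = function(mu) qlogis(mu^(1 / exposure)),
    linkinv  = function(eta) plogis(eta)^exposure,
    # derivative of mu with respect to eta (chain rule)
    mu.eta   = function(eta) exposure * plogis(eta)^(exposure - 1) * dlogis(eta),
    valideta = function(eta) TRUE,
    name     = 'logexp'
  ), class = 'link-glm')
}

set.seed(24)
n <- 500
d <- sample(1:5, n, replace = TRUE)   # days between two nest checks (assumed)
x <- rnorm(n)                         # hypothetical nest covariate
s_daily <- plogis(3 + 0.5 * x)        # assumed true daily survival
surv <- rbinom(n, 1, s_daily^d)       # 1 = nest survived the whole interval

# Bernoulli model whose link depends on the exposure time d
mod_le <- glm(surv ~ x, family = binomial(link = logexp_link(d)))
coef(mod_le)   # estimates on the logit scale of daily survival
```

With an exposure of 1 day for every interval the link reduces to the ordinary logit link, so the sketch can be seen as a generalization of the Bernoulli model for daily nest controls.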
24.3 Known fate model A natural model that allows estimating daily nest survival is the known-fate survival model. It is a Markov model that models the state of a nest \\(i\\) at day \\(t\\) (whether it is alive, \\(y_{it}=1\\) or not \\(y_{it}=0\\)) as a Bernoulli variable dependent on the state of the nest the day before. \\[ y_{it} \\sim Bernoulli(y_{it-1}S_{it})\\] The daily nest survival \\(S_{it}\\) can be linearly related to predictor variables that are measured on the nest or on the day level. \\[logit(S_{it}) = \\textbf{X} \\beta\\] It is also possible to add random effects if needed. 24.4 The Stan model The following Stan model code is saved as daily_nest_survival.stan. data { int<lower=0> Nnests; // number of nests int<lower=0> last[Nnests]; // day of last observation (alive or dead) int<lower=0> first[Nnests]; // day of first observation (alive or dead) int<lower=0> maxage; // maximum of last int<lower=0> y[Nnests, maxage]; // indicator of alive nests real cover[Nnests]; // a covariate of the nest real age[maxage]; // a covariate of the date } parameters { vector[3] b; // coef of linear pred for S } model { real S[Nnests, maxage-1]; // survival probability for(i in 1:Nnests){ for(t in first[i]:(last[i]-1)){ S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]); } } // priors b[1]~normal(0,5); b[2]~normal(0,3); b[3]~normal(0,3); // likelihood for (i in 1:Nnests) { for(t in (first[i]+1):last[i]){ y[i,t]~bernoulli(y[i,t-1]*S[i,t-1]); } } } 24.5 Prepare data and run Stan Data is from (Grendelmeier2018?). load("RData/nest_surv_data.rda") str(datax) ## List of 7 ## $ y : int [1:156, 1:31] 1 NA 1 NA 1 NA NA 1 1 1 ... ## $ Nnests: int 156 ## $ last : int [1:156] 26 30 31 27 31 30 31 31 31 31 ... ## $ first : int [1:156] 1 14 1 3 1 24 18 1 1 1 ... ## $ cover : num [1:156] -0.943 -0.215 0.149 0.149 -0.215 ... ## $ age : num [1:31] -1.65 -1.54 -1.43 -1.32 -1.21 ... 
## $ maxage: int 31 datax$y[is.na(datax$y)] <- 0 # Stan does not allow for NA's in the outcome # Run STAN library(rstan) mod <- stan(file = "stanmodels/daily_nest_survival.stan", data=datax, chains=5, iter=2500, control=list(adapt_delta=0.9), verbose = FALSE) 24.6 Check convergence We love exploring the performance of the Markov chains by using the function launch_shinystan from the package shinystan. 24.7 Look at results It looks like cover does not affect daily nest survival, but daily nest survival decreases with the age of the nestlings. #launch_shinystan(mod) print(mod) ## Inference for Stan model: anon_model. ## 5 chains, each with iter=2500; warmup=1250; thin=1; ## post-warmup draws per chain=1250, total post-warmup draws=6250. ## ## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat ## b[1] 4.04 0.00 0.15 3.76 3.94 4.04 4.14 4.35 3828 1 ## b[2] 0.00 0.00 0.13 -0.25 -0.09 -0.01 0.08 0.25 4524 1 ## b[3] -0.70 0.00 0.16 -1.02 -0.81 -0.69 -0.59 -0.39 3956 1 ## lp__ -298.98 0.03 1.30 -302.39 -299.52 -298.65 -298.05 -297.53 2659 1 ## ## Samples were drawn using NUTS(diag_e) at Thu Jan 19 22:33:33 2023. ## For each parameter, n_eff is a crude measure of effective sample size, ## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1). 
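The coefficients in the output above are on the logit scale. As a rough worked example, and assuming that age and cover were z-transformed so that a value of 0 corresponds to their means, the posterior means can be back-transformed to daily survival probabilities. Using posterior means as plug-in values ignores the posterior uncertainty, and the object names below are hypothetical.

```r
b1 <- 4.04    # posterior mean of the intercept b[1] (logit scale)
b3 <- -0.70   # posterior mean of the age effect b[3] (logit scale)

S_mean_age <- plogis(b1)        # daily survival at average age and average cover
S_plus_1sd <- plogis(b1 + b3)   # daily survival at an age one SD above average
round(c(S_mean_age, S_plus_1sd), 3)

# probability of surviving 10 days if daily survival stayed at S_mean_age
S_mean_age^10
```

Even a daily survival probability that looks very close to 1 translates into a noticeable failure probability once it is compounded over several days.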
# effect plot bsim <- as.data.frame(mod) nsim <- nrow(bsim) newdat <- data.frame(age=seq(1, datax$maxage, length=100)) newdat$age.z <- (newdat$age-mean(1:datax$maxage))/sd((1:datax$maxage)) Xmat <- model.matrix(~age.z, data=newdat) fitmat <- matrix(ncol=nsim, nrow=nrow(newdat)) for(i in 1:nsim) fitmat[,i] <- plogis(Xmat%*%as.numeric(bsim[i,c(1,3)])) newdat$fit <- apply(fitmat, 1, median) newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025) newdat$upr <- apply(fitmat, 1, quantile, prob=0.975) plot(newdat$age, newdat$fit, ylim=c(0.8,1), type="l", las=1, ylab="Daily nest survival", xlab="Age [d]") lines(newdat$age, newdat$lwr, lty=3) lines(newdat$age, newdat$upr, lty=3) Figure 24.1: Estimated daily nest survival probability in relation to nest age. Dotted lines are 95% uncertainty intervals of the regression line. 24.8 Known fate model for irregular nest controls When nests are controlled only irregularly, it may happen that a nest is found predated or dead after a longer break between controls. In such cases, we know that the nest was predated or died from other causes sometime between the last control at which it was still alive and the control at which it was found dead. We therefore need to tell the model that the nest could have died at any time during the interval in which we were not controlling. To do so, we create a variable that indicates the time (e.g. day since first egg) when the nest was last seen alive (lastlive). A second variable, lastcheck, indicates the time of the last check, which is either equal to lastlive when the nest survived until the last check, or larger than lastlive when a nest failure has been recorded. A last variable, gap, measures the length of the time interval in which the nest failure occurred. A gap of zero means that the nest was still alive at the last control, a gap of 1 means that the nest failure occurred during the first day after lastlive, and a gap of 2 means that the nest failure occurred on either the first or the second day after lastlive. 
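The meaning of gap can be illustrated with a small calculation. If S[1], ..., S[g] are the daily survival probabilities of the g days following lastlive, the probability that the nest fails on one of these days is the sum over the possible days of failure, which simplifies to one minus the probability of surviving all g days. The following sketch checks this identity numerically; the function names and the survival values are hypothetical.

```r
# probability that a nest fails somewhere within a gap of g days,
# written as a sum over the possible days of failure
p_fail_sum <- function(S) {
  g <- length(S)
  terms <- numeric(g)
  for (d in seq_len(g)) {
    surv_before <- if (d > 1) prod(S[1:(d - 1)]) else 1  # survived days 1 to d-1
    terms[d] <- surv_before * (1 - S[d])                 # failed on day d
  }
  sum(terms)
}

# the same probability in closed form
p_fail_closed <- function(S) 1 - prod(S)

S_gap <- c(0.98, 0.97, 0.95)   # assumed daily survival on days 1 to 3 of the gap
c(p_fail_sum(S_gap), p_fail_closed(S_gap))
```

The closed form suggests that the likelihood contribution of a failed nest could be written once for any gap length, instead of using a separate expression for each gap.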
# time when nest was last observed alive lastlive <- apply(datax$y, 1, function(x) max(c(1:length(x))[x==1])) # time when nest was last checked (alive or dead) lastcheck <- lastlive+1 # here, we turn the above data into a format that can be used for # irregular nest controls. WOULD BE NICE TO HAVE A REAL DATA EXAMPLE! # when nest was observed alive at the last check, then lastcheck equals lastlive lastcheck[lastlive==datax$last] <- datax$last[lastlive==datax$last] datax1 <- list(Nnests=datax$Nnests, lastlive = lastlive, lastcheck= lastcheck, first=datax$first, cover=datax$cover, age=datax$age, maxage=datax$maxage) # time between last seen alive and first seen dead (= lastcheck) datax1$gap <- datax1$lastcheck-datax1$lastlive In the Stan model code, we specify the likelihood for each gap separately. data { int<lower=0> Nnests; // number of nests int<lower=0> lastlive[Nnests]; // day of last observation (alive) int<lower=0> lastcheck[Nnests]; // day of observed death or, if alive, last day of study int<lower=0> first[Nnests]; // day of first observation (alive or dead) int<lower=0> maxage; // maximum of last real cover[Nnests]; // a covariate of the nest real age[maxage]; // a covariate of the date int<lower=0> gap[Nnests]; // obsdead - lastlive } parameters { vector[3] b; // coef of linear pred for S } model { real S[Nnests, maxage-1]; // survival probability for(i in 1:Nnests){ for(t in first[i]:(lastcheck[i]-1)){ S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]); } } // priors b[1]~normal(0,1.5); b[2]~normal(0,3); b[3]~normal(0,3); // likelihood for (i in 1:Nnests) { for(t in (first[i]+1):lastlive[i]){ 1~bernoulli(S[i,t-1]); } if(gap[i]==1){ target += log(1-S[i,lastlive[i]]); // } if(gap[i]==2){ target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1])); // } if(gap[i]==3){ target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) + prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2])); // } if(gap[i]==4){ target 
+= log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) + prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2]) + prod(S[i,lastlive[i]:(lastlive[i]+2)])*(1-S[i,lastlive[i]+3])); // } } } # Run STAN mod1 <- stan(file = "stanmodels/daily_nest_survival_irreg.stan", data=datax1, chains=5, iter=2500, control=list(adapt_delta=0.9), verbose = FALSE) Further reading Helpful links: https://deepai.org/publication/bayesian-survival-analysis-using-the-rstanarm-r-package (Brilleman et al. 2020) https://www.hammerlab.org/2017/06/26/introducing-survivalstan/ "],["cjs_with_mix.html", "25 Capture-mark recapture model with a mixture structure to account for missing sex-variable for parts of the individuals 25.1 Introduction 25.2 Data description 25.3 Model description 25.4 The Stan code 25.5 Call Stan from R, check convergence and look at results", " 25 Capture-mark recapture model with a mixture structure to account for missing sex-variable for parts of the individuals 25.1 Introduction In some species the identification of the sex is not possible for all individuals without sampling DNA. For example, morphological dimorphism is absent or so weak that part of the individuals cannot be assigned to one of the sexes. Particularly in ornithological long-term capture-recapture data sets, which are typically collected by volunteer bird ringers who do not normally have the means to analyse DNA, the sex identification is often missing for part of the individuals. For estimating survival, it would nevertheless be valuable to include the data of all individuals, using the information on sex-specific effects on survival wherever possible while accounting for the fact that the sex is not known for part of the individuals. We here explain how a Cormack-Jolly-Seber model can be integrated with a mixture model in order to allow for a combined analysis of individuals with and without identified sex. 
In Chapter 14.5 of the book Korner-Nievergelt et al. (2015), we gave an introduction to the Cormack-Jolly-Seber model. We here expand this model by a mixture structure that allows including individuals with a missing categorical predictor variable, such as sex. 25.2 Data description ## simulate data # true parameter values theta <- 0.6 # proportion of males nocc <- 15 # number of years in the data set b0 <- matrix(NA, ncol=nocc-1, nrow=2) b0[1,] <- rbeta((nocc-1), 3, 4) # capture probability of males b0[2,] <- rbeta((nocc-1), 2, 4) # capture probability of females a0 <- matrix(NA, ncol=2, nrow=2) a1 <- matrix(NA, ncol=2, nrow=2) a0[1,1]<- qlogis(0.7) # average annual survival for adult males a0[1,2]<- qlogis(0.3) # average annual survival for juveniles a0[2,1] <- qlogis(0.55) # average annual survival for adult females a0[2,2] <- a0[1,2] a1[1,1] <- 0 a1[1,2] <- -0.5 a1[2,1] <- -0.8 a1[2,2] <- a1[1,2] nindi <- 1000 # number of individuals with identified sex nindni <- 1500 # number of individuals with non-identified sex nind <- nindi + nindni # total number of individuals y <- matrix(ncol=nocc, nrow=nind) z <- matrix(ncol=nocc, nrow=nind) first <- sample(1:(nocc-1), nind, replace=TRUE) sex <- sample(c(1,2), nind, prob=c(theta, 1-theta), replace=TRUE) juvfirst <- sample(c(0,1), nind, prob=c(0.5, 0.5), replace=TRUE) juv <- matrix(0, nrow=nind, ncol=nocc) # age class at marking (1 = juvenile); adults (0) in all later periods for(i in 1:nind) juv[i,first[i]] <- juvfirst[i] x <- runif(nocc-1, -2, 2) # a time dependent covariate p <- b0 # recapture probability phi <- array(NA, dim=c(2, 2, nocc-1)) # for ad males phi[1,1,] <- plogis(a0[1,1]+a1[1,1]*x) # for ad females phi[2,1,] <- plogis(a0[2,1]+a1[2,1]*x) # for juvs phi[1,2,] <- phi[2,2,] <- plogis(a0[2,2]+a1[2,2]*x) for(i in 1:nind){ z[i,first[i]] <- 1 y[i, first[i]] <- 1 for(t in (first[i]+1):nocc){ z[i, t] <- rbinom(1, size=1, prob=z[i,t-1]*phi[sex[i],juv[i,t-1]+1, t-1]) y[i, t] <- rbinom(1, size=1, prob=z[i,t]*p[sex[i],t-1]) } } y[is.na(y)] <- 0 The mark-recapture data set consists 
of capture histories of 2500 individuals over 15 time periods. For each time period \\(t\\) and individual \\(i\\) the capture history matrix \\(y\\) contains \\(y_{it}=1\\) if the individual \\(i\\) is captured during time period \\(t\\), or \\(y_{it}=0\\) if the individual \\(i\\) is not captured during time period \\(t\\). The marking time period varies between individuals from 1 to 14. At the marking time period, the age of the individuals was classified either as juvenile or as adult. Juveniles turn into adults after one time period, thus age is known for all individuals during all time periods after marking. For 1000 individuals of the 2500 individuals, the sex is identified, whereas for 1500 individuals, the sex is unknown. The example data contain one covariate \\(x\\) that takes on one value for each time period. # bundle the data for Stan i <- 1:nindi ni <- (nindi+1):nind datax <- list(yi=y[i,], nindi=nindi, sex=sex[i], nocc=nocc, yni=y[ni,], nindni=nindni, firsti=first[i], firstni=first[ni], juvi=juv[i,]+1, juvni=juv[ni,]+1, year=1:nocc, x=x) 25.3 Model description The observations \\(y_{it}\\), an indicator of whether individual i was recaptured during time period \\(t\\) is modelled conditional on the latent true state of the individual birds \\(z_{it}\\) (0 = dead or permanently emigrated, 1 = alive and at the study site) as a Bernoulli variable. The probability \\(P(y_{it} = 1)\\) is the product of the probability that an alive individual is recaptured, \\(p_{it}\\), and the state of the bird \\(z_{it}\\) (alive = 1, dead = 0). 
Thus, a dead bird cannot be recaptured, whereas for a bird alive during time period \\(t\\), the recapture probability equals \\(p_{it}\\): \\[y_{it} \\sim Bernoulli(z_{it}p_{it})\\] The latent state variable \\(z_{it}\\) is a Markovian variable with the state at time \\(t\\) being dependent on the state at time \\(t-1\\) and the apparent survival probability \\(\\phi_{it}\\): \\[z_{it} \\sim Bernoulli(z_{it-1}\\phi_{it})\\] We use the term apparent survival in order to indicate that the parameter \\(\\phi\\) is a product of site fidelity and survival. Thus, individuals that permanently emigrated from the study area cannot be distinguished from dead individuals. In both models, the parameters \\(\\phi\\) and \\(p\\) were modelled as sex-specific. However, for part of the individuals, sex could not be identified, i.e. sex was missing. Ignoring these missing values would most likely lead to a bias because they were not missing at random. The probability that sex can be identified increases with age and most likely differs between the sexes. Therefore, we included a mixture model for the sex: \\[Sex_i \\sim Categorical(q_i)\\] where \\(q_i\\) is a vector of length 2, containing the probability of being a male and a female, respectively. In this way, the sex of the non-identified individuals was assumed to be male or female with probability \\(q[1]\\) and \\(q[2]=1-q[1]\\), respectively. This model corresponds to the finite mixture model introduced by Pledger, Pollock, and Norris (2003) in order to account for unknown classes of birds (heterogeneity). However, in our case, for part of the individuals the class (sex) was known. In the example model, we constrain apparent survival to be linearly dependent on a covariate x with different slopes for males, females and juveniles using the logit link function. 
\\[logit(\\phi_{it}) = a0_{sex-age-class[it]} + a1_{sex-age-class[it]}x_t\\] Annual recapture probability was modelled for each year and age and sex class independently: \\[p_{it} = b0_{t,sex-age-class[it]}\\] Uniform prior distributions were used for all parameters whose parameter space is limited to values between 0 and 1 (probabilities). Normal prior distributions with a mean of 0 were used for the intercept \\(a0\\) (standard deviation of 1.5) and for \\(a1\\) (standard deviation of 5). 25.4 The Stan code The trick for coding the CMR-mixture model in Stan is to formulate the model three times: (1) for the individuals with identified sex, (2) for the males that were not identified, and (3) for the females that were not identified. Then, for the non-identified individuals, a mixture model is formulated that assigns a probability of being a female or a male to each individual. data { int<lower=2> nocc; // number of capture events int<lower=0> nindi; // number of individuals with identified sex int<lower=0> nindni; // number of individuals with non-identified sex int<lower=0,upper=2> yi[nindi,nocc]; // CH[i,k]: individual i captured at k int<lower=0,upper=nocc-1> firsti[nindi]; // year of first capture int<lower=0,upper=2> yni[nindni,nocc]; // CH[i,k]: individual i captured at k int<lower=0,upper=nocc-1> firstni[nindni]; // year of first capture int<lower=1, upper=2> sex[nindi]; int<lower=1, upper=2> juvi[nindi, nocc]; int<lower=1, upper=2> juvni[nindni, nocc]; int<lower=1> year[nocc]; real x[nocc-1]; // a covariate } transformed data { int<lower=0,upper=nocc+1> lasti[nindi]; // last[i]: ind i last capture int<lower=0,upper=nocc+1> lastni[nindni]; // last[i]: ind i last capture lasti = rep_array(0,nindi); lastni = rep_array(0,nindni); for (i in 1:nindi) { for (k in firsti[i]:nocc) { if (yi[i,k] == 1) { if (k > lasti[i]) lasti[i] = k; } } } for (ii in 1:nindni) { for (kk in firstni[ii]:nocc) { if (yni[ii,kk] == 1) { if (kk > lastni[ii]) lastni[ii] = kk; } } } } parameters { 
real<lower=0, upper=1> theta[nindni]; // probability of being male for non-identified individuals real<lower=0, upper=1> b0[2,nocc-1]; // intercept of p real a0[2,2]; // intercept for phi real a1[2,2]; // coefficient for phi } transformed parameters { real<lower=0,upper=1>p_male[nindni,nocc]; // capture probability real<lower=0,upper=1>p_female[nindni,nocc]; // capture probability real<lower=0,upper=1>p[nindi,nocc]; // capture probability real<lower=0,upper=1>phi_male[nindni,nocc-1]; // survival probability real<lower=0,upper=1>chi_male[nindni,nocc+1]; // probability that an individual // is never recaptured after its // last capture real<lower=0,upper=1>phi_female[nindni,nocc-1]; // survival probability real<lower=0,upper=1>chi_female[nindni,nocc+1]; // probability that an individual // is never recaptured after its // last capture real<lower=0,upper=1>phi[nindi,nocc-1]; // survival probability real<lower=0,upper=1>chi[nindi,nocc+1]; // probability that an individual // is never recaptured after its // last capture { int k; int kk; for(ii in 1:nindi){ if (firsti[ii]>1) { for (z in 1:(firsti[ii]-1)){ phi[ii,z] = 1; } } for(tt in firsti[ii]:(nocc-1)) { // linear predictor for phi: phi[ii,tt] = inv_logit(a0[sex[ii], juvi[ii,tt]] + a1[sex[ii], juvi[ii,tt]]*x[tt]); } } for(ii in 1:nindni){ if (firstni[ii]>1) { for (z in 1:(firstni[ii]-1)){ phi_female[ii,z] = 1; phi_male[ii,z] = 1; } } for(tt in firstni[ii]:(nocc-1)) { // linear predictor for phi: phi_male[ii,tt] = inv_logit(a0[1, juvni[ii,tt]] + a1[1, juvni[ii,tt]]*x[tt]); phi_female[ii,tt] = inv_logit(a0[2, juvni[ii,tt]]+ a1[2, juvni[ii,tt]]*x[tt]); } } for(i in 1:nindi) { // linear predictor for p for identified individuals for(w in 1:firsti[i]){ p[i,w] = 1; } for(kkk in (firsti[i]+1):nocc) p[i,kkk] = b0[sex[i],year[kkk-1]]; chi[i,nocc+1] = 1.0; k = nocc; while (k > firsti[i]) { chi[i,k] = (1 - phi[i,k-1]) + phi[i,k-1] * (1 - p[i,k]) * chi[i,k+1]; k = k - 1; } if (firsti[i]>1) { for (u in 1:(firsti[i]-1)){ chi[i,u] = 
0; } } chi[i,firsti[i]] = (1 - p[i,firsti[i]]) * chi[i,firsti[i]+1]; }// close definition of transformed parameters for identified individuals for(i in 1:nindni) { // linear predictor for p for non-identified individuals for(w in 1:firstni[i]){ p_male[i,w] = 1; p_female[i,w] = 1; } for(kkkk in (firstni[i]+1):nocc){ p_male[i,kkkk] = b0[1,year[kkkk-1]]; p_female[i,kkkk] = b0[2,year[kkkk-1]]; } chi_male[i,nocc+1] = 1.0; chi_female[i,nocc+1] = 1.0; k = nocc; while (k > firstni[i]) { chi_male[i,k] = (1 - phi_male[i,k-1]) + phi_male[i,k-1] * (1 - p_male[i,k]) * chi_male[i,k+1]; chi_female[i,k] = (1 - phi_female[i,k-1]) + phi_female[i,k-1] * (1 - p_female[i,k]) * chi_female[i,k+1]; k = k - 1; } if (firstni[i]>1) { for (u in 1:(firstni[i]-1)){ chi_male[i,u] = 0; chi_female[i,u] = 0; } } chi_male[i,firstni[i]] = (1 - p_male[i,firstni[i]]) * chi_male[i,firstni[i]+1]; chi_female[i,firstni[i]] = (1 - p_female[i,firstni[i]]) * chi_female[i,firstni[i]+1]; } // close definition of transformed parameters for non-identified individuals } // close block of transformed parameters exclusive parameter declarations } // close transformed parameters model { // priors theta ~ beta(1, 1); for (g in 1:(nocc-1)){ b0[1,g]~beta(1,1); b0[2,g]~beta(1,1); } a0[1,1]~normal(0,1.5); a0[1,2]~normal(0,1.5); a1[1,1]~normal(0,3); a1[1,2]~normal(0,3); a0[2,1]~normal(0,1.5); a0[2,2]~normal(a0[1,2],0.01); // for juveniles, we assume that the effect of the covariate is independet of sex a1[2,1]~normal(0,3); a1[2,2]~normal(a1[1,2],0.01); // likelihood for identified individuals for (i in 1:nindi) { if (lasti[i]>0) { for (k in firsti[i]:lasti[i]) { if(k>1) target+= (log(phi[i, k-1])); if (yi[i,k] == 1) target+=(log(p[i,k])); else target+=(log1m(p[i,k])); } } target+=(log(chi[i,lasti[i]+1])); } // likelihood for non-identified individuals for (i in 1:nindni) { real log_like_male = 0; real log_like_female = 0; if (lastni[i]>0) { for (k in firstni[i]:lastni[i]) { if(k>1){ log_like_male += (log(phi_male[i, 
k-1])); log_like_female += (log(phi_female[i, k-1])); } if (yni[i,k] == 1){ log_like_male+=(log(p_male[i,k])); log_like_female+=(log(p_female[i,k])); } else{ log_like_male+=(log1m(p_male[i,k])); log_like_female+=(log1m(p_female[i,k])); } } } log_like_male += (log(chi_male[i,lastni[i]+1])); log_like_female += (log(chi_female[i,lastni[i]+1])); target += log_mix(theta[i], log_like_male, log_like_female); } } 25.5 Call Stan from R, check convergence and look at results # Run STAN library(rstan) fit <- stan(file = "stanmodels/cmr_mixture_model.stan", data=datax, verbose = FALSE) # for above simulated data (2500 individuals x 15 time periods) # computing time is around 48 hours on an intel corei7 laptop # for larger data sets, we recommend moving the transformed parameters block # to the model block in order to avoid monitoring of p_male, p_female, # phi_male and phi_female producing memory problems # launch_shinystan(fit) # diagnostic plots summary(fit) ## mean se_mean sd 2.5% 25% ## b0[1,1] 0.60132367 0.0015709423 0.06173884 0.48042366 0.55922253 ## b0[1,2] 0.70098709 0.0012519948 0.04969428 0.60382019 0.66806698 ## b0[1,3] 0.50293513 0.0010904085 0.04517398 0.41491848 0.47220346 ## b0[1,4] 0.28118209 0.0008809447 0.03577334 0.21440931 0.25697691 ## b0[1,5] 0.34938289 0.0009901335 0.03647815 0.27819918 0.32351323 ## b0[1,6] 0.13158569 0.0006914740 0.02627423 0.08664129 0.11286629 ## b0[1,7] 0.61182981 0.0010463611 0.04129602 0.53187976 0.58387839 ## b0[1,8] 0.48535193 0.0010845951 0.04155762 0.40559440 0.45750793 ## b0[1,9] 0.52531291 0.0008790063 0.03704084 0.45247132 0.50064513 ## b0[1,10] 0.87174780 0.0007565552 0.03000936 0.80818138 0.85259573 ## b0[1,11] 0.80185454 0.0009425675 0.03518166 0.73173810 0.77865187 ## b0[1,12] 0.33152443 0.0008564381 0.03628505 0.26380840 0.30697293 ## b0[1,13] 0.42132288 0.0012174784 0.04140382 0.34062688 0.39305210 ## b0[1,14] 0.65180372 0.0015151039 0.05333953 0.55349105 0.61560493 ## b0[2,1] 0.34237039 0.0041467200 0.12925217 
0.12002285 0.24717176 ## b0[2,2] 0.18534646 0.0023431250 0.07547704 0.05924694 0.12871584 ## b0[2,3] 0.61351083 0.0024140550 0.07679100 0.46647727 0.56242546 ## b0[2,4] 0.37140208 0.0024464965 0.06962399 0.24693888 0.32338093 ## b0[2,5] 0.19428215 0.0034618302 0.11214798 0.02800056 0.11146326 ## b0[2,6] 0.27371336 0.0026553769 0.09054020 0.11827243 0.20785316 ## b0[2,7] 0.18611173 0.0014387436 0.05328492 0.09122869 0.14789827 ## b0[2,8] 0.25648337 0.0018258589 0.05287800 0.16255769 0.21913271 ## b0[2,9] 0.20378754 0.0021367769 0.07380004 0.07777998 0.15215845 ## b0[2,10] 0.52679548 0.0024625568 0.08696008 0.36214334 0.46594844 ## b0[2,11] 0.47393354 0.0032593161 0.10555065 0.28843967 0.39781278 ## b0[2,12] 0.22289155 0.0017082729 0.05551514 0.12576797 0.18203335 ## b0[2,13] 0.26191486 0.0024159794 0.07016314 0.14106495 0.21234017 ## b0[2,14] 0.65111737 0.0055743944 0.18780555 0.29279480 0.50957591 ## a0[1,1] 0.95440670 0.0013771881 0.04808748 0.86301660 0.92146330 ## a0[1,2] 0.01529770 0.0469699511 1.46995922 -2.82218067 -0.95533706 ## a0[2,1] 0.16384995 0.0049928331 0.12634422 -0.06399631 0.07533962 ## a0[2,2] 0.01535679 0.0469634175 1.47006964 -2.81864060 -0.95515751 ## a1[1,1] 0.15937249 0.0028992587 0.08864790 -0.01288607 0.10017613 ## a1[1,2] 0.08055953 0.1007089857 3.02148727 -5.95525636 -1.96662599 ## a1[2,1] -0.83614134 0.0074143920 0.18655882 -1.21033848 -0.95698565 ## a1[2,2] 0.08071668 0.1006904255 3.02145647 -5.94617355 -1.96508733 ## 50% 75% 97.5% n_eff Rhat ## b0[1,1] 0.60206306 0.6431566 0.7206343 1544.5301 1.002331 ## b0[1,2] 0.70165494 0.7355204 0.7946280 1575.4617 1.001482 ## b0[1,3] 0.50367411 0.5330078 0.5898079 1716.3196 1.001183 ## b0[1,4] 0.27997512 0.3046483 0.3544592 1649.0040 1.000760 ## b0[1,5] 0.34936442 0.3751935 0.4191138 1357.3073 1.002072 ## b0[1,6] 0.12987449 0.1481661 0.1873982 1443.8040 1.003676 ## b0[1,7] 0.61203228 0.6397577 0.6933929 1557.5904 1.001458 ## b0[1,8] 0.48513822 0.5134314 0.5672066 1468.1355 1.002511 ## b0[1,9] 
0.52534212 0.5501747 0.5994060 1775.7335 1.000824 ## b0[1,10] 0.87324112 0.8934047 0.9258033 1573.3747 1.000719 ## b0[1,11] 0.80300311 0.8261868 0.8675033 1393.1817 1.001172 ## b0[1,12] 0.33044476 0.3552199 0.4052902 1794.9956 1.000566 ## b0[1,13] 0.42116690 0.4492297 0.5026942 1156.5339 1.000289 ## b0[1,14] 0.64956850 0.6864706 0.7607107 1239.4056 1.004061 ## b0[2,1] 0.33493631 0.4251416 0.6150923 971.5524 1.004049 ## b0[2,2] 0.17981663 0.2358847 0.3446097 1037.6210 1.001474 ## b0[2,3] 0.61326419 0.6644156 0.7628427 1011.8737 1.005727 ## b0[2,4] 0.36837778 0.4158585 0.5190457 809.8949 1.003803 ## b0[2,5] 0.17910449 0.2591418 0.4533117 1049.4733 1.001499 ## b0[2,6] 0.26739172 0.3299594 0.4685139 1162.6006 1.001170 ## b0[2,7] 0.18254607 0.2198969 0.3003156 1371.6455 1.000878 ## b0[2,8] 0.25280556 0.2895585 0.3704113 838.7174 1.005624 ## b0[2,9] 0.19724053 0.2501298 0.3694806 1192.8747 1.003687 ## b0[2,10] 0.52587075 0.5845730 0.7061694 1247.0027 1.002851 ## b0[2,11] 0.46874445 0.5392302 0.7046892 1048.7425 0.999473 ## b0[2,12] 0.21961656 0.2580782 0.3397127 1056.1081 1.000907 ## b0[2,13] 0.25601959 0.3056204 0.4142888 843.3960 1.003130 ## b0[2,14] 0.65824835 0.7973674 0.9698829 1135.0669 1.003838 ## a0[1,1] 0.95368445 0.9862439 1.0515747 1219.2071 1.003898 ## a0[1,2] 0.01633534 0.9911055 2.9717839 979.4231 1.003726 ## a0[2,1] 0.15519648 0.2472483 0.4230776 640.3489 1.004625 ## a0[2,2] 0.01587281 0.9898084 2.9659552 979.8429 1.003744 ## a1[1,1] 0.15647489 0.2205720 0.3354845 934.8953 1.007190 ## a1[1,2] 0.06683287 2.1568781 6.0295208 900.1297 1.003701 ## a1[2,1] -0.83503982 -0.7075691 -0.4814539 633.1119 1.010568 ## a1[2,2] 0.06586905 2.1557247 6.0239735 900.4432 1.003704 "],["samplesize.html", "26 What sample size? 26.1 Introduction", " 26 What sample size? 26.1 Introduction What sample size is needed is an important question when planning an empirical study. Some authorities even ask for a justification for the planned sample size of an animal experiment. 
"],["referenzen.html", "References", " References Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell. 2009. Statistical Modelling in R. Oxford: Oxford University Press. Almasi, B, A Roulin, S Jenni-Eiermann, C W Breuner, and L Jenni. 2009. Regulation of Free Corticosterone and CBG Capacity Under Different Environmental Conditions in Altricial Nestlings. Gen. Comp. Endocr. 164: 117–24. Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. Retire Statistical Significance. Nature 567: 305–7. Anderson, J A. 1974. Diagnosis by Logistic Discriminant Function: Further Practical Problems and Results. Journal of Applied Statistics 23: 397–404. Betancourt, M. J. 2013. Generalizing the No-U-Turn Sampler to Riemannian Manifolds. ArXiv e-Prints, April. https://arxiv.org/abs/1304.1920. Betancourt, M. J., and M. Girolami. 2013. Hamiltonian Monte Carlo for Hierarchical Models. ArXiv e-Prints. https://arxiv.org/abs/1312.0906. Brilleman, Samuel L., Eren M. Elci, Jacqueline Buros Novik, and Rory Wolfe. 2020. Bayesian Survival Analysis Using the rstanarm R Package. http://arxiv.org/pdf/2002.09633v1. Davison, A C, and E J Snell. 1991. Residuals and Diagnostics. In Statistical Theory and Modelling. In Honour of Sir David Cox, FRS, edited by D V Hinkley, N Reid, and E J Snell. London: Chapman & Hall. Efron, Bradley, and Trevor Hastie. 2016. Computer age statistical inference: Algorithms, evidence, and data science. Institute of Mathematical Statistics Monographs. Ellenberg, H. 1953. Physiologisches Und Oekologisches Verhalten Derselben Pflanzenarten. Berichte Der Deutschen Botanischen Gesellschaft 65: 350–361. Gelman, A. 2006. Prior Distributions for Variance Parameters in Hierarchical Models. Bayesian Analysis 1: 515–33. Gelman, A., John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014a. Bayesian Data Analysis. Third. New York: CRC Press. Gelman, A, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2014b. 
Bayesian Data Analysis. Third. New York: CRC Press. Gelman, A, and J Hill. 2007. Data Analysis Using Regression and Multilevel / Hierarchical Models. Cambridge: Cambridge University Press. Gelman, Andrew, and Sander Greenland. 2019. Are Confidence Intervals Better Termed Uncertainty Intervals? BMJ (Clinical Research Ed.) 366: l5381. https://doi.org/10.1136/bmj.l5381. Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel / Hierarchical Models. Cambridge University Press. Gottschalk, Thomas, Klemens Ekschmitt, and Volkmar Wolters. 2011. Efficient Placement of Nest Boxes for the Little Owl (Athene Noctua). The Journal of Raptor Research 45: 1–14. Grüebler, Martin U, Fränzi Korner-Nievergelt, and Johann Von Hirschheydt. 2010. The Reproductive Benefits of Livestock Farming in Barn Swallows Hirundo Rustica: Quality of Nest Site or Foraging Habitat? Journal of Applied Ecology 47 (6): 1340–47. Harju, S. 2016. Book review: Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan. The Journal of Wildlife Management 80: 771. Harrison, Xavier A. 2014. Using Observation-Level Random Effects to Model Overdispersion in Count Data in Ecology and Evolution. PeerJ 2: e616. https://doi.org/10.7717/peerj.616. Hastie, T, R Tibshirani, and J Friedman. 2009. The Elements of Statistical Learning, Data Mining, Inference, and Prediction. New York: Springer. Hemming, Victoria, Abbey E. Camaclang, Megan S. Adams, Mark Burgman, Katherine Carbeck, Josie Carwardine, Iadine Chadès, et al. 2022. An Introduction to Decision Science for Conservation. Conservation Biology. John Wiley & Sons Inc. https://doi.org/10.1111/cobi.13868. Hoffman, Matthew D, and Andrew Gelman. 2014. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15 (1): 1593–623. Hoyle, Rick H. 2012. Handbook of Structural Equation Modeling. New York: The Guilford Press. Jenni, L, and R Winkler. 1989. 
The Feather-Length of Small Passerines: A Measurement for Wing-Length in Live Birds and Museum Skins. Bird Study 36: 115. Korner-Nievergelt, F, T Roth, Stefanie von Felten, J Guélat, B Almasi, and P Korner-Nievergelt. 2015. Bayesian Data Analysis in Ecolog Using Linear Models with R, BUGS, and Stan. New York: Elsevier. Lemoine, Nathan P. 2019. Moving Beyond Noninformative Priors: Why and How to Choose Weakly Informative Priors in Bayesian Analyses. Oikos 128 (7): 91228. https://doi.org/10.1111/oik.05985. MacKenzie, Darryl I, James D Nichols, G B Lachman, S Droege, J A Royle, and C A Langtimm. 2002. Estimating Site Occupancy Rates When Detection Probabilities Are Less Than One. Ecology 83: 224855. Manly, Bryan F J. 1994. Multivariate Statistical Methods, A Primer. London: 2nd ed. Chapman & Hall. Mayfield, Harold F. 1975. Suggestions for Calculating Nest Success. Wilson Bulletin 87: 45666. McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. New York: Max Planck Institute for Evolutionary Anthropology; CRC Press. Nakagawa, Shinichi, and Holger Schielzeth. 2013. A General and Simple Method for Obtaining R2 from Generalized Linear Mixed-Effects Models. Methods in Ecology and Evolution 4: 13342. https://doi.org/10.1111/j.2041-210x.2012.00261.x. Pledger, S., K. H. Pollock, and James L. Norris. 2003. Open Capture-Recapture Models with Heterogeneity: I. Cormack-Jolly-Seber Model. Biometrics 59: 78694. Royle, J Andrew. 2004. N-Mixture Models for Estimating Population Size from Spatially Replicated Counts. Biometrics 60: 10815. Schano, Christian, Carole Niffenegger, Tobias Jonas, and Fränzi Korner-Nievergelt. 2021. Hatching phenology is lagging behind an advancing snowmelt pattern in a high-alpine bird. Scientific Reports 11 (1): 20130016. https://doi.org/10.1038/s41598-021-01497-8. Shaffer, Terry L. 2004. A Unified Approach to Analyzing Nest Success. The Auk 121: 52640. Shipley, Bill. 2009. 
Confirmatory path analysis in a generalized multilevel context. Ecology 90: 36368. Thomson, D L, M J Conroy, D R Anderson, K P Burnham, E G Cooch, C M Francis, J.-D. Lebreton, et al. 2009. Standardising Terminology and notation for the Analysis of Demographic Processes in Marked Populations. In Modeling Demographic Processes in Marked Populations, edited by D L Thomson, E G Cooch, and M J Conroy, 10991106. Environmental and Ecological Statistics 3. Berlin: Springer. Tredennick, Andrew T., Giles Hooker, Stephen P. Ellner, and Peter B. Adler. 2021. A practical guide to selecting models for exploration, inference, and prediction in ecology. Ecology 102 (6). https://doi.org/10.1002/ecy.3336. Walters, G. 2012. Customary Fire Regimes and Vegetation Structure in Gabons Bateke Plateaux. Human Ecology 40: 94355. Zbinden, Niklaus, Marco Salvioni, Fränzi Korner-Nievergelt, and Verena Keller. 2018. Evidence for an Additive Effect of Hunting Mortality in an Alpine Black Grouse Lyrurus Tetrix Population. Wildlife Biology 2018: xxxxx. Zeileis, Achim, Christian Kleiber, and Simon Jackman. 2008. Regression Models for Count Data in r. Journal of Statistical Software 27: 125. Zollinger, J.-L., S. Birrer, N. Zbinden, and F. Korner-Nievergelt. 2013. The Optimal Age of Sown Field Margins for Breeding Farmland Birds. Ibis 155 (4). https://doi.org/10.1111/ibi.12072. Zuur, Alain F, Elena N Ieno, Neil J Walker, Anatoly A Saveliev, and Graham M Smith. 2009. Mixed Effects Models and Extensions in Ecology with r. Springer. "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]]
+[["index.html", "Bayesian Data Analysis in Ecology with R and Stan Preface Why this book? About this book How to contribute? Acknowledgments", " Bayesian Data Analysis in Ecology with R and Stan Fränzi Korner-Nievergelt, Tobias Roth, Stefanie von Felten, Jerôme Guélat, Bettina Almasi, Pius Korner-Nievergelt 2024-09-30 Preface Why this book? In 2015, we wrote a statistics book for Master/PhD level Bayesian data analyses in ecology (Korner-Nievergelt et al. 2015). You can order it here. People seemed to like it (e.g., Harju 2016). Since then, two parallel processes have been under way. First, we keep learning and become more confident in what we do, what we do not do, and why. Second, several really clever people develop software that broadens the spectrum of ecological models that can now easily be applied by ecologists used to working with R. With this e-book, we open the possibility to add new or substantially revised material. Most of the time, it should be in a state in which it can be printed and used together with the book as a handout for our stats courses. About this book We do not copy text from the book into the e-book. Therefore, we refer to the book (Korner-Nievergelt et al. 2015) for the basic theory of Bayesian data analysis using linear models. However, Chapters 1 to 17 of this dynamic e-book correspond to the book chapters. In each chapter, we may provide updated R code and/or additional material. The following chapters contain completely new material that we think may be useful for ecologists. While we show the R code behind most of the analyses, we sometimes choose not to show all the code in the html version of the book. This is particularly the case for some of the illustrations. An interested reader can always consult the public GitHub repository with the rmarkdown files that were used to generate the book. How to contribute? It is open so that everybody with a GitHub account can make comments and suggestions for improvement. 
Readers can contribute in two ways. One way is to add an issue. The second way is to contribute content directly through the edit button at the top of the page (i.e. a symbol showing a pencil in a square). That button is linked to the rmarkdown source file of each page. You can correct typos or add new text and then submit a GitHub pull request. We try to respond to you as quickly as possible. We are looking forward to your contribution! Acknowledgments We thank Yihui Xie for providing bookdown, which makes it great fun to write open books such as ours. We thank many anonymous students and collaborators who searched for information on new software, reported updates and gave feedback on earlier versions of the book. Specifically, we thank Carole Niffenegger for looking up the difference between the bulk and tail ESS in the brm output, Martin Küblbeck for using the conditional logistic regression in rstanarm, … "],["PART-I.html", "1 Introduction to PART I 1.1 Further reading", " 1 Introduction to PART I During our courses we are sometimes asked to give an introduction to some R-related topics covering data analysis, presentation of results or rather specialist topics in ecology. In this part we present a collection of these introductions and try to keep them up to date. This is also a commented collection of R code that we documented for our own work. We hope it is also useful for other readers. 1.1 Further reading R for Data Science by Garrett Grolemund and Hadley Wickham: Introduces the tidyverse framework. It explains how to get data into R, get it into the most useful structure, transform it, visualise it and model it. 
"],["basics.html", "2 Basics of statistics 2.1 Variables and observations 2.2 Displaying and summarizing data 2.3 Inferential statistics 2.4 Bayes theorem and the common aim of frequentist and Bayesian methods 2.5 Classical frequentist tests and alternatives 2.6 Summary", " 2 Basics of statistics This chapter introduces some important terms useful for doing data analyses. It also introduces the essentials of the classical frequentist tests such as the t-test. Even though we will not use null hypothesis tests later (Amrhein, Greenland, and McShane 2019), we introduce them here because we need to understand the scientific literature. For each classical test, we provide a suggestion for how to present the statistical results without using null hypothesis tests. We further discuss some differences between Bayesian and frequentist statistics. 2.1 Variables and observations Empirical research involves data collection. Data are collected by recording measurements of variables for observational units. An observational unit may be, for example, an individual, a plot, a time interval or a combination of those. The collection of all units ideally forms a random sample of the entire population of units in which we are interested. The measurements (or observations) of the random sample are stored in a data table (sometimes also called a data set, but a data set may include several data tables. A collection of data tables belonging to the same study or system is normally bundled and stored in a database). A data table is a collection of variables (columns). Data tables normally are handled as objects of class data.frame in R. All measurements in a row of a data table belong to the same observational unit. The variables can be of different scales (Table 2.1). 
Table 2.1: Scales of measurements Scale Examples Properties Coding in R Nominal Sex, genotype, habitat Identity (values have a unique meaning) factor() Ordinal Elevational zones Identity and magnitude (values have an ordered relationship) ordered() Numeric Discrete: counts; continuous: body weight, wing length Identity, magnitude, and intervals or ratios integer() numeric() The aim of many studies is to describe how a variable of interest (\\(y\\)) is related to one or more predictor variables (\\(x\\)). How these variables are named differs between authors. The y-variable is called “outcome variable”, “response” or “dependent variable”. The x-variables are called “predictors”, “explanatory variables” or “independent variables”. The choice of terms for x and y is a matter of taste. We avoid the terms “dependent” and “independent” variables because often we do not know whether the variable \\(y\\) in fact depends on the \\(x\\) variables and also, often the x-variables are not independent of each other. In this book, we try to use “outcome” and “predictor” variables because these terms sound most neutral to us in that they refer to how the statistical model is constructed rather than to a real life relationship. 2.2 Displaying and summarizing data 2.2.1 Histogram While nominal and ordinal variables are summarized by giving the absolute number or the proportion of observations for each category, numeric variables normally are summarized by a location and a scatter statistic, such as the mean and the standard deviation or the median and some quantiles. The distribution of a numeric variable is graphically displayed in a histogram (Fig. 2.1). Figure 2.1: Histogram of the length of ell of statistics course participants. To draw a histogram, the variable is displayed on the x-axis and the \\(x_i\\)-values are assigned to classes. The edges of the classes are called ‘breaks’. They can be set with the argument breaks= within the function hist. 
The values given in the breaks= argument must at least span the values of the variable. If the argument breaks= is not specified, R chooses the break values automatically (by default based on Sturges' rule). The number of observations falling in each class is given on the y-axis. The y-axis can be re-scaled so that the area of the histogram equals 1 by setting the argument freq=FALSE. In that case, the values on the y-axis correspond to the density values of a probability distribution (Chapter 4). 2.2.2 Location and scatter Location statistics are mean, median or mode. A common mean is the arithmetic mean: \\(\\hat{\\mu} = \\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i\\) (R function mean), where \\(n\\) is the sample size. The parameter \\(\\mu\\) is the (unknown) true mean of the entire population of which the \\(1,...,n\\) measurements are a random sample. \\(\\bar{x}\\) is called the sample mean and used as an estimate for \\(\\mu\\). The hat (\\(\\hat{}\\)) above any parameter indicates that the parameter value is obtained from a sample and, therefore, it may be different from the true value. The median is the 50% quantile. We find 50% of the measurements below and 50% above the median. If \\(x_1,..., x_n\\) are the ordered measurements of a variable, then the median is: median \\(= x_{(n+1)/2}\\) for odd \\(n\\), and median \\(= \\frac{1}{2}(x_{n/2} + x_{n/2+1})\\) for even \\(n\\) (R function median). The mode is the value that occurs with the highest frequency or that has the highest density. Scatter is also called spread, scale or variance. Variance parameters describe how far away from the location parameter single observations can be found, or how the measurements are scattered around their mean. The variance is defined as the average squared difference between the observations and the mean: variance \\(\\hat{\\sigma^2} = s^2 = \\frac{1}{n-1}\\sum_{i=1}^{n}(x_i-\\bar{x})^2\\) The term \\((n-1)\\) is called the degrees of freedom. 
It is used in the denominator of the variance formula instead of \\(n\\) to prevent underestimating the variance. Because \\(\\bar{x}\\) is on average closer to \\(x_i\\) than the unknown true mean \\(\\mu\\) would be, the variance would be underestimated if \\(n\\) were used in the denominator. The variance is used to compare the degree of scatter among different groups. However, its values are difficult to interpret because of the squared unit. Therefore, the square root of the variance, the standard deviation, is normally reported: standard deviation \\(\\hat{\\sigma} = s = \\sqrt{s^2}\\) (R function sd) The standard deviation is approximately the average deviation of an observation from the sample mean. In the case of a normal distribution, about two thirds (68%) of the data are expected within one standard deviation around the mean. The variance and standard deviation each describe the scatter with a single value. Thus, we have to assume that the observations are scattered symmetrically around their mean in order to get a picture of the distribution of the measurements. When the measurements are spread asymmetrically (skewed distribution), then it may be more precise to describe the scatter with more than one value. Such statistics could be quantiles from the lower and upper tail of the data. Quantiles inform us about both location and spread of a distribution. The \\(p\\)th-quantile is the value with the property that a proportion \\(p\\) of all values are less than or equal to the value of the quantile. The median is the 50% quantile. The 25% quantile and the 75% quantile are also called the lower and upper quartiles, respectively. The range between the lower and the upper quartile is called the interquartile range. This range includes 50% of the distribution and is also used as a measure of scatter. The R function quantile extracts sample quantiles. 
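The location and scatter statistics described above can be sketched in a few lines of R. The vector x holds hypothetical measurements invented for this illustration (it is not the course data set), and the object names s2 and iqr are our own choices.

```r
# Hypothetical measurements (invented for illustration only)
x <- c(41.2, 43.5, 44.0, 44.8, 45.1, 46.3, 47.0, 52.9)
n <- length(x)

# Location
mean(x)     # arithmetic mean
median(x)   # 50% quantile

# Scatter: variance by hand, with n - 1 in the denominator
s2 <- sum((x - mean(x))^2) / (n - 1)
all.equal(s2, var(x))   # agrees with the built-in var()
sd(x)                   # standard deviation, the square root of the variance

# Quartiles and interquartile range
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])
all.equal(iqr, IQR(x))  # agrees with the built-in IQR()
```

Because the largest value lies far from the rest, comparing mean(x) with median(x) and sd(x) with iqr gives a first impression of how strongly a single observation can pull the non-robust statistics.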
The median, the quartiles, and the interquartile range can be graphically displayed using box-and-whisker plots (boxplots in short, R function boxplot). The horizontal fat bars are the medians (Fig. 2.2). The boxes mark the interquartile range. The whiskers reach out to the last observation within 1.5 times the interquartile range from the quartile. Circles mark observations beyond 1.5 times the interquartile range from the quartile. par(mar=c(5,4,1,1)) boxplot(ell~car, data=dat, las=1, ylab="Length of ell [cm]", col="tomato", xaxt="n") axis(1, at=c(1,2), labels=c("Not owning a car", "Car owner")) n <- table(dat$car) axis(1, at=c(1,2), labels=paste("n=", n, sep=""), mgp=c(3,2, 0)) Figure 2.2: Boxplot of the length of ell of statistics course participants who are or are not owners of a car. The boxplot is an appealing tool for comparing location, variance and distribution of measurements among groups. 2.2.3 Correlations A correlation measures the strength with which characteristics of two variables are associated with each other (co-occur). If both variables are numeric, we can visualize the correlation using a scatterplot. par(mar=c(5,4,1,1)) plot(temp~ell, data=dat, las=1, xlab="Length of ell [cm]", ylab="Comfort temperature [°C]", pch=16) Figure 2.3: Scatterplot of the length of ell and the comfort temperature of statistics course participants. The covariance between variable \\(x\\) and \\(y\\) is defined as: covariance \\(q = \\frac{1}{n-1}\\sum_{i=1}^{n}((x_i-\\bar{x})*(y_i-\\bar{y}))\\) (R function cov) Similar to the variance, the covariance is expressed in the product of the units of \\(x\\) and \\(y\\), and therefore covariance values are difficult to interpret. A standardized covariance is the Pearson correlation coefficient: Pearson correlation coefficient: \\(r=\\frac{\\sum_{i=1}^{n}(x_i-\\bar{x})(y_i-\\bar{y})}{\\sqrt{\\sum_{i=1}^{n}(x_i-\\bar{x})^2\\sum_{i=1}^{n}(y_i-\\bar{y})^2}}\\) (R function cor) Means, variances, standard deviations, covariances and correlations are sensitive to outliers. 
Single observations containing extreme values normally have a disproportionate influence on these statistics. When outliers are present in the data, we may prefer a more robust correlation measure such as the Spearman correlation or Kendall’s tau. Both are based on the ranks of the measurements instead of the measurements themselves. Spearman correlation coefficient: correlation between rank(x) and rank(y) (R function cor(x,y, method=\"spearman\")) Kendall’s tau: \\(\\tau = 1-\\frac{4I}{(n(n-1))}\\), where \\(I\\) = number of pairs \\((i,k)\\) for which \\((x_i < x_k)\\) & \\((y_i > y_k)\\) or vice versa. (R function cor(x,y, method=\"kendall\")) 2.2.4 Principal components analyses PCA The principal components analysis (PCA) is a multivariate correlation analysis. A multidimensional data set with \\(k\\) variables can be seen as a cloud of points (observations) in a \\(k\\)-dimensional space. Imagine we could move around in the space and look at the cloud from different locations. From some locations, the data look highly correlated, whereas from others, we cannot see the correlation. That is what PCA is doing. It rotates the coordinate system (defined by the original variables) of the data cloud so that the correlations are no longer visible. The axes of the new coordinate system are linear combinations of the original variables. They are called principal components. There are as many principal components as there are original variables, i.e. \\(k\\), \\(p_1, ..., p_k\\). 
The principal components meet further requirements: the first component explains most variance; the second component explains most of the remaining variance and is perpendicular (= uncorrelated) to the first one; the third component explains most of the remaining variance and is perpendicular to the first two; … For example, in a two-dimensional data set \\((x_1, x_2)\\) the principal components become \\(pc_{1i} = b_{11}x_{1i} + b_{12}x_{2i}\\) \\(pc_{2i} = b_{21}x_{1i} + b_{22}x_{2i}\\) with \\(b_{jk}\\) being loadings of principal component \\(j\\) and original variable \\(k\\). Fig. 2.4 shows the two principal components for a two-dimensional data set. They can be calculated using matrix algebra: principal components are eigenvectors of the covariance or correlation matrix. Figure 2.4: Principal components of a two dimensional data set based on the covariance matrix (green) and the correlation matrix (brown). The choice between correlation and covariance matrix is essential. The covariance matrix is an unstandardized correlation matrix. Therefore, if the PCA is based on the covariance matrix, the units influence the results: they change when we change the units of one variable, e.g., from cm to m. Basing the PCA on the covariance matrix only makes sense when the variances are comparable among the variables, i.e., if all variables are measured in the same unit and we would like to weight each variable according to its variance. If this is not the case, the PCA must be based on the correlation matrix. 
pca <- princomp(cbind(x1,x2)) # PCA based on covariance matrix pca <- princomp(cbind(x1,x2), cor=TRUE) # PCA based on correlation matrix loadings(pca) ## ## Loadings: ## Comp.1 Comp.2 ## x1 0.707 0.707 ## x2 0.707 -0.707 ## ## Comp.1 Comp.2 ## SS loadings 1.0 1.0 ## Proportion Var 0.5 0.5 ## Cumulative Var 0.5 1.0 The loadings measure the correlation of each variable with the principal components. They inform about what aspects of the data each component is measuring. The signs of the loadings are arbitrary, thus we can multiply them by -1 without changing the PCA. Sometimes this can be handy for describing the meaning of the principal component in a paper. For example, Zbinden et al. (2018) combined the number of hunting licenses, the duration of the hunting period and the number of black grouse cocks that were allowed to be hunted per hunter in a principal component in order to measure hunting pressure. All three variables had a negative loading in the first component, so that high values of the component meant low hunting pressure. Before the subsequent analyses, for which a measure of hunting pressure was of interest, the authors changed the signs of the loadings so that this component measured hunting pressure. Besides the loadings, the proportion of variance explained by each component is important information. If the first few components explain the main part of the variance, it means that maybe not all \\(k\\) variables are necessary to describe the data, or, in other words, the original \\(k\\) variables contain a lot of redundant information. # extract the variance captured by each component summary(pca) ## Importance of components: ## Comp.1 Comp.2 ## Standard deviation 1.2679406 0.6263598 ## Proportion of Variance 0.8038367 0.1961633 ## Cumulative Proportion 0.8038367 1.0000000 Ridge regression is similar to doing a PCA within a linear model in which components with low variance are shrunk more strongly than components with high variance. 
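Returning to the PCA itself, the statement that principal components are eigenvectors of the correlation (or covariance) matrix can be checked directly. The following is a minimal sketch with simulated data; the values of x1 and x2 are hypothetical and only reuse the variable names from the code above.

```r
set.seed(1)
# Hypothetical correlated data (simulated for illustration only)
x1 <- rnorm(100)
x2 <- 0.6 * x1 + rnorm(100, sd = 0.8)
X <- cbind(x1, x2)

pca <- princomp(X, cor = TRUE)   # PCA based on the correlation matrix
eig <- eigen(cor(X))             # eigen decomposition of the same matrix

# The variances of the components equal the eigenvalues
all.equal(unname(pca$sdev^2), eig$values)

# The loadings equal the eigenvectors up to an arbitrary sign
all.equal(abs(unclass(pca$loadings)), abs(eig$vectors),
          check.attributes = FALSE)

# Proportion of variance explained by each component
prop_var <- eig$values / sum(eig$values)
prop_var
```

The absolute values are compared because, as noted above, the signs of the loadings are arbitrary.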
2.3 Inferential statistics 2.3.1 Uncertainty there is never a “yes-or-no” answer; there will always be uncertainty (Amrhein 2017, https://peerj.com/preprints/26857) The decision whether an effect is important or not cannot be made based on data alone. For making a decision we should, besides the data, carefully consider the consequences of each decision, the aims we would like to achieve, and the risk, i.e. how bad it is to make the wrong decision. Structured decision making or decision analyses provide methods to combine consequences of decisions, objectives of different stakeholders and risk attitudes of decision makers to make optimal decisions (Hemming et al. 2022, Runge2020). In most data analyses, particularly in basic research and when working on case studies, we normally do not consider consequences of decisions. However, the results will be more useful when presented in a way that other scientists can use them for a meta-analysis, or stakeholders and politicians can use them for making better decisions. Useful results always include information on the size of a parameter of interest, e.g. an effect of a drug or an average survival, together with an uncertainty measure. Therefore, statistics describes patterns of the process that presumably has generated the data, and it quantifies the uncertainty of the described patterns that arises because the data are just a random sample from the larger population whose patterns we would like to know. Quantification of uncertainty is only possible if: 1. the mechanisms that generated the data are known; 2. the observations are a random sample from the population of interest. Most studies aim at understanding the mechanisms that generated the data, thus they are most likely not known beforehand. To overcome that problem, we construct models, e.g. statistical models, that are (strong) abstractions of the data generating process. And we report the model assumptions. 
All uncertainty measures are conditional on the model we used to analyze the data, i.e., they are only reliable if the model describes the data generating process realistically. Because most statistical models do not describe the data generating process well, the true uncertainty almost always is much higher than the one we report. In order to obtain a random sample from the population under study, a good study design is a prerequisite. To illustrate how inference about a big population is drawn from a small sample, we here use simulated data. The advantage of using simulated data is that both the mechanism that generated the data and the big population are known. Imagine there are 300000 PhD students in the world and we would like to know how many statistics courses they have taken on average before they started their PhD (Fig. 2.5). We use random number generators (rpois and rgamma) to simulate a number for each of the 300000 virtual students. We here use these 300000 numbers as the big population that in real life we almost never can sample in total. Normally, we know the number of courses students have taken just for a small sample of students. To simulate that situation we draw 12 numbers at random from the 300000 (R function sample). Then, we estimate the average number of statistics courses students take before they start a PhD from the sample of 12 students and we compare that mean to the true mean of the 300000 students. # simulate the virtual true population set.seed(235325) # set seed for random number generator # simulate fake data of the whole population # using an overdispersed Poisson distribution, # i.e. 
a Poisson distribution of which the mean # has a gamma distribution statscourses <- rpois(300000, rgamma(300000, 2, 3)) # draw a random sample from the population n <- 12 # sample size y <- sample(statscourses, n, replace=FALSE) Figure 2.5: Histogram of the number of statistics courses that 300000 virtual PhD students have taken before their PhD started. The rugs on the x-axis indicate the random sample of 12 out of the 300000 students. The black vertical line indicates the mean of the 300000 students (true mean) and the blue line indicates the mean of the sample (sample mean). We observe the sample mean; what do we know about the population mean? There are two different approaches to answer this question. 1) We could ask ourselves how much the sample mean would scatter if we repeated the study many times. This approach is called frequentist statistics. 2) We could ask ourselves, for any possible value, what the probability is that it is the true population mean. To do so, we use probability theory, and that approach is called Bayesian statistics. Both approaches use (essentially similar) models. Only the mathematical techniques to calculate uncertainty measures differ between the two approaches. When no information besides the data is used to construct the model, the results are approximately identical (at least for large enough sample sizes). A frequentist 95% confidence interval (blue horizontal segment in Fig. 2.6) is constructed such that, if you were to (hypothetically) repeat the experiment or sampling many many times, 95% of the intervals constructed would contain the true value of the parameter (here the mean number of courses). From the Bayesian posterior distribution (pink in Fig. 2.6) we could construct a 95% interval (e.g., by using the 2.5% and 97.5% quantiles). This interval has traditionally been called credible interval. It can be interpreted that we are 95% sure that the true mean is inside that interval. 
Both confidence interval and posterior distribution correspond to the statistical uncertainty of the sample mean, i.e., they measure how far away the sample mean could be from the true mean. In this virtual example, we know the true mean is 0.66, thus somewhere in the lower part of the 95% CI or in the lower quantiles of the posterior distribution. In real life, we do not know the true mean. The grey histogram in Fig. 2.6 shows how means of many different virtual samples of 12 students scatter around the true mean. The 95% interval of these virtual means corresponds to the 95% CI, and the variance of these virtual means corresponds to the variance of the posterior distribution. This virtual example shows that posterior distribution and 95% CI correctly measure the statistical uncertainty (variance, width of the interval); however, we never know exactly how far the sample mean is from the true mean. Figure 2.6: Histogram of means of repeated samples from the true population. The scatter of these means visualizes the true uncertainty of the mean in this example. The blue vertical line indicates the mean of the original sample. The blue segment shows the 95% confidence interval (obtained by frequentist methods) and the violet line shows the posterior distribution of the mean (obtained by Bayesian methods). Uncertainty intervals are only reliable if the model is a realistic abstraction of the data generating process (or if the model assumptions are realistic). Because both terms, confidence and credible interval, suggest that the interval indicates confidence or credibility but the intervals actually show uncertainty, it has been suggested to rename the interval to compatibility or uncertainty interval (Andrew Gelman and Greenland 2019). 2.3.2 Standard error The standard error SE is, besides the uncertainty interval, an alternative way to measure uncertainty. It measures an average deviation of the sample mean from the (unknown) true population mean. 
The frequentist method for obtaining the SE is based on the central limit theorem. According to the central limit theorem, the sum of independent, not necessarily normally distributed random numbers is approximately normally distributed when the sample size is large enough (Chapter 4). Because the mean is a sum (divided by a constant, the sample size) it can be assumed that the distribution of many means of samples is normal. The standard deviation SD of the many means is called the standard error SE. It can be mathematically shown that the standard error SE equals the standard deviation SD of the sample divided by the square root of the sample size: frequentist SE = SD/sqrt(n) = \\(\\frac{\\hat{\\sigma}}{\\sqrt{n}}\\) Bayesian SE: Using Bayesian methods, the SE is the SD of the posterior distribution. It is very important to keep the difference between SE and SD in mind! SD measures the scatter of the data, whereas SE measures the statistical uncertainty of the mean (or of another estimated parameter, Fig. 2.7). SD is a descriptive statistic describing a characteristic of the data, whereas SE is an inferential statistic showing us how far away the sample mean possibly is from the true mean. When sample size increases, SE becomes smaller, whereas SD does not change systematically (given the added observations are drawn at random from the same big population as the ones already in the sample). Figure 2.7: Illustration of the difference between SD and SE. The SD measures the scatter in the data (sample, tickmarks on the x-axis). The SD is an estimate for the scatter in the big population (grey histogram, normally not known). The SE measures the uncertainty of the sample mean (in blue). The SE measures approximately how far, on average, the sample mean (blue) is from the true mean (brown). 
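The difference between SD and SE can also be made visible with a small simulation. This is a sketch under assumed values: the population (normal with mean 10 and SD 2), the sample sizes and the helper function se_formula are hypothetical choices made for this illustration.

```r
set.seed(42)
# Hypothetical population: normal with mean 10 and SD 2 (assumed values)
pop_mean <- 10
pop_sd <- 2

# Frequentist SE of the mean: SD of the sample divided by sqrt(n)
se_formula <- function(y) sd(y) / sqrt(length(y))

y_small <- rnorm(20, pop_mean, pop_sd)     # small sample
y_large <- rnorm(2000, pop_mean, pop_sd)   # large sample from the same population

# SD stays near the population SD, SE shrinks with increasing sample size
c(SD = sd(y_small), SE = se_formula(y_small))
c(SD = sd(y_large), SE = se_formula(y_large))

# The SD of many sample means approximates the SE for that sample size
means <- replicate(5000, mean(rnorm(20, pop_mean, pop_sd)))
sd(means)
pop_sd / sqrt(20)
```

The last two lines compare the scatter of 5000 simulated sample means with the theoretical value, which is what the formula SD/sqrt(n) estimates from a single sample.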
2.4 Bayes theorem and the common aim of frequentist and Bayesian methods 2.4.1 Bayes theorem for discrete events The Bayes theorem describes the probability of event A conditional on event B (the probability of A after B has already occurred) from the probability of B conditional on A and the two probabilities of the events A and B: \\(P(A|B) = \\frac{P(B|A)P(A)}{P(B)}\\) Imagine event A is “The person likes wine as a birthday present.” and event B is “The person has no car.”. The conditional probability of A given B is the probability that a person not owning a car likes wine. Answers from students on whether they have a car and what they like as a birthday present are summarized in Table 2.2. Table 2.2: Cross table of the students’ birthday preference and car ownership. car/birthday flowers wine sum no car 6 9 15 car 1 6 7 sum 7 15 22 We can apply the Bayes theorem to obtain the probability that a student likes wine given that they have no car, \\(P(A|B)\\). Let’s assume that only the ones who prefer wine go together for a glass of wine at the bar after the statistics course. While they drink wine, they tell each other about their cars and they obtain the probability that a student who likes wine has no car, \\(P(B|A) = 9/15 = 0.6\\). During the statistics class the teacher asked the students about their car ownership and birthday preference. Therefore, they know that \\(P(A) =\\) P(likes wine) \\(= 15/22 = 0.68\\) and \\(P(B) =\\) P(no car) \\(= 15/22 = 0.68\\). With this information, they can obtain the probability that a student likes wine given that they have no car, even if not all students without cars went to the bar: \\(P(A|B) = \\frac{0.6*0.68}{0.68} = 0.6\\). 2.4.2 Bayes theorem for continuous parameters When we use the Bayes theorem for analyzing data, then the aim is to make probability statements for parameters. Because most parameters are measured on a continuous scale, we use probability density functions to describe what we know about them. 
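Returning briefly to the discrete example, the calculation for Table 2.2 can be reproduced in R. The counts are taken from the table; the object names (tab, p_A, p_B and so on) are our own choices for this sketch.

```r
# Counts from Table 2.2 (rows: car ownership, columns: birthday preference)
tab <- matrix(c(6, 9,
                1, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(car = c('no car', 'car'),
                              birthday = c('flowers', 'wine')))
n <- sum(tab)

p_A <- sum(tab[, 'wine']) / n                                # P(A): likes wine
p_B <- sum(tab['no car', ]) / n                              # P(B): has no car
p_B_given_A <- tab['no car', 'wine'] / sum(tab[, 'wine'])    # P(B|A)

# Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B <- p_B_given_A * p_A / p_B

# Direct check from the table: wine lovers among students without a car
direct <- tab['no car', 'wine'] / sum(tab['no car', ])
all.equal(p_A_given_B, direct)
```

In this particular table P(A) and P(B) happen to be equal, so P(A|B) equals P(B|A); with other counts the two conditional probabilities generally differ.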
The Bayes theorem can be formulated for probability density functions denoted with \\(p(\\theta)\\), e.g. for a parameter \\(\\theta\\) (for examples of probability density functions, see Chapter 4). What we are interested in is the probability of the parameter \\(\\theta\\) given the data, i.e., \\(p(\\theta|y)\\). This probability density function is called the posterior distribution of the parameter \\(\\theta\\). Here is the Bayes theorem formulated for obtaining the posterior distribution of a parameter from the data \\(y\\), the prior distribution of the parameter \\(p(\\theta)\\) and assuming a model for the data generating process. The data model is defined by the likelihood, which specifies how the data \\(y\\) are distributed given the parameters, \\(p(y|\\theta)\\): \\(p(\\theta|y) = \\frac{p(y|\\theta)p(\\theta)}{p(y)} = \\frac{p(y|\\theta)p(\\theta)}{\\int p(y|\\theta)p(\\theta) d\\theta}\\) The probability of the data \\(p(y)\\) is also called the scaling constant, because it does not depend on \\(\\theta\\). It is the integral of the likelihood times the prior over all possible values of the parameter(s) of the model. 2.4.3 Estimating a mean assuming that the variance is known For illustration, we first describe a simple (unrealistic) example for which it is almost possible to follow the mathematical steps for solving the Bayes theorem even for non-mathematicians. Even if we cannot follow all steps, this example will illustrate the principle of how the Bayes theorem works for continuous parameters. The example is unrealistic because we assume that the variance \\(\\sigma^2\\) in the data \\(y\\) is known. We construct a data model by assuming that \\(y\\) is normally distributed: \\(p(y|\\theta) = normal(\\theta, \\sigma)\\), with \\(\\sigma\\) known. The function \\(normal\\) defines the probability density function of the normal distribution (Chapter 4). The parameter for which we would like to get the posterior distribution is \\(\\theta\\), the mean. 
We specify a prior distribution for \\(\\theta\\). Because the normal distribution is a conjugate prior for a normal data model with known variance, we use the normal distribution. Conjugate priors have nice mathematical properties (see Chapter 10) and are therefore preferred when the posterior distribution is obtained algebraically. This is the prior: \\(p(\\theta) = normal(\\mu_0, \\tau_0)\\) With the above data, data model and prior, the posterior distribution of the mean \\(\\theta\\) is defined by: \\(p(\\theta|y) = normal(\\mu_n, \\tau_n)\\), where \\(\\mu_n= \\frac{\\frac{1}{\\tau_0^2}\\mu_0 + \\frac{n}{\\sigma^2}\\bar{y}}{\\frac{1}{\\tau_0^2}+\\frac{n}{\\sigma^2}}\\) and \\(\\frac{1}{\\tau_n^2} = \\frac{1}{\\tau_0^2} + \\frac{n}{\\sigma^2}\\) \\(\\bar{y}\\) is the arithmetic mean of the data. Because only this value is needed from the data in order to obtain the posterior distribution, it is called a sufficient statistic. From the mathematical formulas above and also from Fig. 2.8 we see that the mean of the posterior distribution is a weighted average of the prior mean and \\(\\bar{y}\\) with weights equal to the precisions (\\(\\frac{1}{\\tau_0^2}\\) and \\(\\frac{n}{\\sigma^2}\\)). Figure 2.8: Hypothetical example showing the result of applying the Bayes theorem for obtaining a posterior distribution of a continuous parameter. The likelihood is defined by the data and the model, the prior expresses the knowledge about the parameter before looking at the data. Combining the two distributions using the Bayes theorem results in the posterior distribution. 2.4.4 Estimating the mean and the variance We now move to a more realistic example, which is estimating the mean and the variance of a sample of weights of Snowfinches Montifringilla nivalis (Fig. 2.9). To analyze those data, a model with two parameters (the mean and the variance or standard deviation) is needed. 
The data model (or likelihood) is specified as \\(p(y|\\theta, \\sigma) = normal(\\theta, \\sigma)\\). Figure 2.9: Snowfinches stay above the treeline for winter. They come to feeders. # weight (g) y <- c(47.5, 43, 43, 44, 48.5, 37.5, 41.5, 45.5) n <- length(y) Because there are two parameters, we need to specify a two-dimensional prior distribution. We looked up in A. Gelman et al. (2014b) that the conjugate prior distribution in our case is a Normal-Inverse-Chisquare distribution: \\(p(\\theta, \\sigma) = N-Inv-\\chi^2(\\mu_0, \\sigma_0^2/\\kappa_0; v_0, \\sigma_0^2)\\) From the same reference we looked up what the posterior distribution looks like in our case: \\(p(\\theta,\\sigma|y) = \\frac{p(y|\\theta, \\sigma)p(\\theta, \\sigma)}{p(y)} = N-Inv-\\chi^2(\\mu_n, \\sigma_n^2/\\kappa_n; v_n, \\sigma_n^2)\\), with \\(\\mu_n= \\frac{\\kappa_0}{\\kappa_0+n}\\mu_0 + \\frac{n}{\\kappa_0+n}\\bar{y}\\), \\(\\kappa_n = \\kappa_0+n\\), \\(v_n = v_0 +n\\), \\(v_n\\sigma_n^2=v_0\\sigma_0^2+(n-1)s^2+\\frac{\\kappa_0n}{\\kappa_0+n}(\\bar{y}-\\mu_0)^2\\) For this example, we need the arithmetic mean \\(\\bar{y}\\) and the variance \\(s^2\\) of the sample for obtaining the posterior distribution. Therefore, these two statistics are the sufficient statistics. The above formulas look intimidating, but we never really do these calculations by hand. In most cases, we let R do that for us by simulating many numbers from the posterior distribution, e.g., using the function sim from the package arm (Andrew Gelman and Hill 2007). In the end, we can visualize the distribution of these many numbers to have a look at the posterior distribution. In Fig. 2.10 the two-dimensional \\((\\theta, \\sigma)\\) posterior distribution is visualized by using simulated values. The two-dimensional distribution is called the joint posterior distribution. The mountain of dots in Fig. 2.10 visualizes the Normal-Inverse-Chisquare distribution that we calculated above. 
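The formulas for the Normal-Inverse-Chisquare posterior can also be evaluated directly and used to simulate from the joint posterior distribution. The sketch below is only an illustration: the prior parameters `mu0`, `kappa0`, `v0` and `sigma0sq` are hypothetical values that do not come from the text, and the simulation is done with base R rather than with the function sim.

```r
# Snowfinch weights (g), as given in the text
y <- c(47.5, 43, 43, 44, 48.5, 37.5, 41.5, 45.5)
n <- length(y)

# Hypothetical prior parameters (assumptions for illustration only)
mu0 <- 40        # prior mean
kappa0 <- 1      # weight of the prior mean, in units of observations
v0 <- 1          # prior degrees of freedom for the variance
sigma0sq <- 10   # prior guess of the variance

# Posterior parameters, following the formulas in the text
kappan <- kappa0 + n
vn <- v0 + n
mun <- kappa0 / kappan * mu0 + n / kappan * mean(y)
sigmansq <- (v0 * sigma0sq + (n - 1) * var(y) +
               kappa0 * n / kappan * (mean(y) - mu0)^2) / vn

# Simulate from the joint posterior distribution:
# sigma^2 follows a scaled inverse chi-square distribution (vn, sigmansq),
# theta given sigma^2 is normal with mean mun and variance sigma^2/kappan
set.seed(1)      # arbitrary seed, only for reproducibility
nsim <- 5000
sigma2 <- vn * sigmansq / rchisq(nsim, df = vn)
theta <- rnorm(nsim, mean = mun, sd = sqrt(sigma2 / kappan))

# Marginal posterior distributions, summarized by quantiles
quantile(theta, probs = c(0.025, 0.5, 0.975))
quantile(sqrt(sigma2), probs = c(0.025, 0.5, 0.975))
```

Plotting `theta` against `sqrt(sigma2)` would give a cloud of points of the kind described for the joint posterior distribution; with other prior parameters the cloud shifts accordingly.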
When all values of one parameter are displayed in a histogram ignoring the values of the other parameter, the result is called the marginal posterior distribution. Algebraically, the marginal distribution is obtained by integrating one of the two parameters out of the joint posterior distribution. This step is definitely much easier when simulated values from the posterior distribution are available! Figure 2.10: Visualization of the joint posterior distribution for the mean and standard deviation of Snowfinch weights. The lower left panel shows the two-dimensional joint posterior distribution, whereas the upper and right panel show the marginal posterior distributions of each parameter separately. The marginal posterior distribution of every parameter is what we normally report in a paper to describe statistical uncertainty. In our example, the marginal distribution of the mean is a t-distribution (Chapter 4). Frequentist statistical methods also use a t-distribution to describe the uncertainty of an estimated mean for the case when the variance is not known. Thus, frequentist methods came to the same solution using a completely different approach and different techniques. Doesn’t that dramatically increase our trust in statistical methods? 2.5 Classical frequentist tests and alternatives 2.5.1 Nullhypothesis testing Null hypothesis testing means constructing a model that describes what the data would look like if what we expect to be true were not true. Then, the data are compared to what the model predicts the data should look like. If the data do not look like the model predicts, we reject that model and accept that our expectation may be true. To decide whether the data look like the null model predicts, the p-value is used. The p-value is the probability of observing the data or more extreme data given the null hypothesis is true. 
Small p-values indicate that it is rather unlikely to observe the data or more extreme data given the null hypothesis \\(H_0\\) is true. Null hypothesis testing is problematic. We discuss some of the problems after having introduced the most commonly used classical tests. 2.5.2 Comparison of a sample with a fixed value (one-sample t-test) In some studies, we would like to compare the data to a theoretical value. The theoretical value is a fixed value, e.g. calculated based on physical, biochemical, ecological or any other theory. The statistical task is then to compare the mean of the data including its uncertainty with the theoretical value. The result of such a comparison may be an estimate of the mean of the data with its uncertainty or an estimate of the difference of the mean of the data to the theoretical value with the uncertainty of this difference. For example, a null hypothesis could be \\(H_0:\\)“The mean of Snowfinch weights is exactly 40g.” A normal distribution with a mean of \\(\\mu_0=40\\) and a variance equal to the estimated variance in the data \\(s^2\\) is then assumed to describe how we would expect the data to look given the null hypothesis was true. From that model it is possible to calculate the distribution of hypothetical means of many different hypothetical samples of sample size \\(n\\). The result is a t-distribution (Fig. 2.11). In classical tests, the distribution is standardized so that its scale is one. Then the sample mean, or in classical tests a standardized difference between the mean and the hypothetical mean of the null hypothesis (here 40g), called the test statistic \\(t = \\frac{\\bar{y}-\\mu_0}{\\frac{s}{\\sqrt{n}}}\\), is compared to that (standardized) t-distribution. If the test statistic falls well within the expected distribution, the null hypothesis is not rejected. Then, the data are well compatible with the null hypothesis. 
However, if the test statistic falls in the tails or outside the distribution, then the null hypothesis is rejected and we could write that the mean weight of Snowfinches is statistically significantly different from 40g. Unfortunately, we cannot infer the probability of the null hypothesis or how relevant the result is. Figure 2.11: Illustration of a one-sample t-test. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the t-distribution that shows how hypothetical sample means are expected to be distributed if the big population of Snowfinches has a mean weight of 40g (i.e., if the null hypothesis was true). The orange area shows the area of the t-distribution that lies as far or farther away from 40g than the sample mean. The orange area is the p-value. We can use the R function t.test to calculate the p-value of a one-sample t-test. t.test(y, mu=40) ## ## One Sample t-test ## ## data: y ## t = 3.0951, df = 7, p-value = 0.01744 ## alternative hypothesis: true mean is not equal to 40 ## 95 percent confidence interval: ## 40.89979 46.72521 ## sample estimates: ## mean of x ## 43.8125 The output of the R function t.test also includes the mean and the 95% confidence interval (or compatibility or uncertainty interval) of the mean. The CI could, alternatively, be obtained as the 2.5% and 97.5% quantiles of a t-distribution with a mean equal to the sample mean, degrees of freedom equal to the sample size minus one and a scale equal to the standard error of the mean. 
# lower limit of 95% CI mean(y) + qt(0.025, df=length(y)-1)*sd(y)/sqrt(n) ## [1] 40.89979 # upper limit of 95% CI mean(y) + qt(0.975, df=length(y)-1)*sd(y)/sqrt(n) ## [1] 46.72521 When applying the Bayes theorem for obtaining the posterior distribution of the mean, we end up with the same t-distribution as described above, provided we use flat prior distributions for the mean and the standard deviation. Thus, the two different approaches end up with the same result! par(mar=c(4.5, 5, 2, 2)) hist(y, col="blue", xlim=c(30,52), las=1, freq=FALSE, main=NA, ylim=c(0, 0.3)) abline(v=mean(y), lwd=2, col="lightblue") abline(v=40, lwd=2) lines(density(bsim@coef)) text(45, 0.3, "posterior distribution\\nof the mean of y", cex=0.8, adj=c(0,1), xpd=NA) Figure 2.12: Illustration of the posterior distribution of the mean. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the posterior distribution that shows what we know about the mean after having looked at the data. The area under the posterior density function that is larger than 40 is the posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g. The posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g, \\(P(H:\\mu>40)\\), is equal to the proportion of simulated random values from the posterior distribution, saved in the vector bsim@coef, that are larger than 40. # Two ways of calculating the proportion of values # larger than a specific value within a vector of values round(sum(bsim@coef[,1]>40)/nrow(bsim@coef),2) ## [1] 0.99 round(mean(bsim@coef[,1]>40),2) ## [1] 0.99 # Note: logical values TRUE and FALSE become # the numeric values 1 and 0 within the functions sum() and mean() We, thus, can be pretty sure that the mean Snowfinch weight (in the big world population) is larger than 40g. 
Such a conclusion is not very informative, because it does not tell us how much larger we can expect the mean Snowfinch weight to be. Therefore, we prefer reporting a credible interval (or compatibility interval or uncertainty interval) that tells us what values for the mean Snowfinch weight are compatible with the data (given the data model we used realistically reflects the data generating process). Based on such an interval, we can conclude that we are pretty sure that the mean Snowfinch weight is between 40 and 48g. # 80% credible interval, compatibility interval, uncertainty interval quantile(bsim@coef[,1], probs=c(0.1, 0.9)) ## 10% 90% ## 42.07725 45.54080 # 95% credible interval, compatibility interval, uncertainty interval quantile(bsim@coef[,1], probs=c(0.025, 0.975)) ## 2.5% 97.5% ## 40.90717 46.69152 # 99% credible interval, compatibility interval, uncertainty interval quantile(bsim@coef[,1], probs=c(0.005, 0.995)) ## 0.5% 99.5% ## 39.66181 48.10269 2.5.3 Comparison of the locations between two groups (two-sample t-test) Many research questions aim at measuring differences between groups. For example, we could be curious to know how much car owners differ in size from people not owning a car. A boxplot is a nice way to visualize the ell length measurements of two (or more) groups (Fig. 2.13). From the boxplot, we do not see how many observations are in the two samples. We can add that information to the plot. The boxplot visualizes the samples but it does not show what we know about the big (unmeasured) population mean. To show that, we need to add a compatibility interval (or uncertainty interval, credible interval, confidence interval, in brown in Fig. 2.13). Figure 2.13: Ell length of car owners (Y) and people not owning a car (N). 
Horizontal bar = median, box = interquartile range, whiskers = most extreme observation within 1.5 times the interquartile range from the quartile, circles = observations farther than 1.5 times the interquartile range from the quartile. Filled brown circles = means, vertical brown bars = 95% compatibility interval. When we add the two means with a compatibility interval, we see what we know about the two means, but we still do not see what we know about the difference between the two means. The uncertainties of the means do not show the uncertainty of the difference between the means. To do so, we need to extract the difference between the two means from a model that describes (abstractly) how the data have been generated. Such a model is a linear model that we will introduce in Chapter 11. The second parameter measures the difference in the means of the two groups. From the simulated posterior distribution we can extract a 95% compatibility interval. Thus, we can conclude that the average ell length of car owners is with high probability between 0.5 cm smaller and 2.5 cm larger than the average ell length of people not having a car. mod <- lm(ell~car, data=dat) mod ## ## Call: ## lm(formula = ell ~ car, data = dat) ## ## Coefficients: ## (Intercept) carY ## 43.267 1.019 bsim <- sim(mod, n.sim=nsim) quantile(bsim@coef[,"carY"], prob=c(0.025, 0.5, 0.975)) ## 2.5% 50% 97.5% ## -0.501348 1.014478 2.494324 The corresponding two-sample t-test gives a p-value for the null hypothesis: “The difference between the two means equals zero.”, a confidence interval for the difference and the two means. While the function lm gives the difference Y minus N, the function t.test gives the difference N minus Y. Luckily the two means are also given in the output, so we know which group mean is the larger one. 
t.test(ell~car, data=dat, var.equal=TRUE) ## ## Two Sample t-test ## ## data: ell by car ## t = -1.4317, df = 20, p-value = 0.1677 ## alternative hypothesis: true difference in means between group N and group Y is not equal to 0 ## 95 percent confidence interval: ## -2.5038207 0.4657255 ## sample estimates: ## mean in group N mean in group Y ## 43.26667 44.28571 In both approaches that we used to compare the two means, the Bayesian posterior distribution of the difference and the t-test or the confidence interval of the difference, we used a data model. We thus assumed that the observations are normally distributed. In some cases, such an assumption is not reasonable. Then the result is not reliable. In such cases, we can either search for a more realistic model or use non-parametric (also called distribution free) methods. Nowadays, we have almost infinite possibilities to construct data models (e.g. generalized linear models and beyond). Therefore, we normally start by looking for a model that fits the data better. However, in former days, all these possibilities did not exist (or were not easily available for non-mathematicians). Therefore, we here introduce two such non-parametric methods, the Wilcoxon-test (or Mann-Whitney-U-test) and the randomisation test. Some of the distribution free statistical methods are based on the rank instead of the value of the observations. The principle of the Wilcoxon-test is to rank the observations and sum the ranks per group. It is not completely true that the non-parametric methods do not have a model. The model of the Wilcoxon-test “knows” how the difference in the sum of the ranks between two groups is distributed given the means of the two groups do not differ (null hypothesis). Therefore, it is possible to get a p-value, e.g. by the function wilcox.test. 
wilcox.test(ell~car, data=dat) ## ## Wilcoxon rank sum test with continuity correction ## ## data: ell by car ## W = 34.5, p-value = 0.2075 ## alternative hypothesis: true location shift is not equal to 0 The note in the output tells us that ranking is ambiguous when some values are equal. Equal values are called ties when they should be ranked. The result of the Wilcoxon-test tells us how probable it is to observe the difference in the rank sum between the two samples or a more extreme difference given the means of the two groups are equal. That is at least something. A similar result is obtained by using a randomisation test. This test is not based on ranks but on the original values. The aim of the randomisation is to simulate a distribution of the difference in the arithmetic mean between the two groups assuming this difference would be zero. To do so, the observed values are randomly distributed among the two groups. Because of the random distribution among the two groups, we expect that, if we repeat that virtual experiment many times, the average difference between the group means would be zero (both virtual samples are drawn from the same big population). We can use a loop in R for repeating the random re-assignment to the two groups and, each time, extracting the difference between the group means. As a result, we have a vector of many (nsim) values that all are possible differences between group means given the two samples were drawn from the same population. The proportion of these values that have an absolute value equal to or larger than the observed difference gives the probability that the observed or a larger difference between the group means is observed given the null hypothesis is true; thus, it is a p-value. 
diffH0 <- numeric(nsim) for(i in 1:nsim){ randomcars <- sample(dat$car) rmod <- lm(ell~randomcars, data=dat) diffH0[i] <- coef(rmod)["randomcarsY"] } mean(abs(diffH0)>abs(coef(mod)["carY"])) # p-value ## [1] 0.1858 Visualizing the possible differences between the group means given the null hypothesis was true shows that the observed difference is well within what is expected if the two groups did not differ in their means (Fig. 2.14). Figure 2.14: Histogram of differences between the means of randomly assigned groups (grey) and the difference between the means of the two observed groups (red) The randomization test results in a p-value, and we could also report the observed difference between the group means. However, it does not tell us which values of the difference would be compatible with the data. We do not get an uncertainty measurement for the difference. In order to get a compatibility interval without assuming a distribution for the data (thus non-parametric) we could bootstrap the samples. Bootstrapping is sampling observations from the data with replacement. For example, if we have a sample of 8 observations, we draw 8 times a random observation from the 8 observations. Each time, we assume that all 8 observations are available. Thus a bootstrapped sample could include some observations several times, whereas others are missing. In this way, we simulate the variance in the data that is due to the fact that our data do not contain the whole big population. Bootstrapping can also be programmed in R using a loop. 
diffboot <- numeric(nsim) for(i in 1:nsim){ ngroups <- 1 while(ngroups==1){ bootrows <- sample(1:nrow(dat), replace=TRUE) ngroups <- length(unique(dat$car[bootrows])) } rmod <- lm(ell~car, data=dat[bootrows,]) diffboot[i] <- coef(rmod)[2] } quantile(diffboot, prob=c(0.025, 0.975)) ## 2.5% 97.5% ## -0.3395643 2.4273810 The resulting values for the difference between the two group means can be interpreted as the distribution of those differences, if we had repeated the study many times (Fig. 2.15). A 95% interval of the distribution corresponds to a 95% compatibility interval (or confidence interval or uncertainty interval). hist(diffboot); abline(v=coef(mod)[2], lwd=2, col="red") Figure 2.15: Histogram of the bootstrapped differences between the group means (grey) and the observed difference. For both methods, randomisation test and bootstrapping, we have to assume that all observations are independent. Randomization and bootstrapping become complicated or even unfeasible when data are structured. 2.6 Summary Bayesian data analysis is applying the Bayes theorem for summarizing knowledge based on data, priors and the model assumptions. Frequentist statistics is quantifying uncertainty by hypothetical repetitions. "],["analyses_steps.html", "3 Data analysis step by step 3.1 Plausibility of Data 3.2 Relationships 3.3 Data Distribution 3.4 Preparation of Explanatory Variables 3.5 Data Structure 3.6 Define Prior Distributions 3.7 Fit the Model 3.8 Check Model 3.9 Model Uncertainty 3.10 Draw Conclusions Further reading", " 3 Data analysis step by step In this chapter we provide a checklist with some guidance for data analysis. However, do not expect the list to be complete, and for different studies, a different order of the steps may make more sense. We usually repeat steps 3.2 to 3.8 until we find a model that fits the data well and that is realistic enough to be useful for the intended purpose. 
Data analysis is always a lot of work and, often, the following steps have to be repeated many times until we find a useful model. There is a chance and danger at the same time: we may find interesting results that answer different questions than we asked originally. They may be very exciting and important, however they may be biased. We can report such findings, but we should state that they appeared (more or less by chance) during the data exploration and model fitting phase, and we have to be aware that the estimates may be biased because the study was not optimally designed with respect to these findings. It is important to always keep the original aim of the study in mind. Do not adjust the study question according to the data. We also recommend reporting what the model started with at the first iteration and describing the strategy and reasoning behind the model development process. 3.1 Plausibility of Data Prepare the data and check graphically, or via summary statistics, whether all the data are plausible. Prepare the data so that errors (typos, etc.) are minimal, for example, by double-checking the entries. See chapter 5 for useful R-code that can be used for data preparation and to make plausibility controls. 3.2 Relationships Think about the direct and indirect relationships among the variables of the study. We normally start a data analysis by drawing a sketch of the model including all explanatory variables and interactions that may be biologically meaningful. We will most likely repeat this step after having looked at the model fit. To make the data analysis transparent we should report every model that was considered. A short note about why a specific model was considered and why it was discarded helps make the modeling process reproducible. 3.3 Data Distribution What is the nature of the variable of interest (outcome, dependent variable)? 
At this stage, there is no use in formally comparing the distribution of the outcome variable to a statistical distribution, because the raw data are not required to follow a specific distribution. The models assume that, conditional on the explanatory variables and the model structure, the outcome variable follows a specific distribution. Therefore, checking how well the chosen distribution fits the data is done after the model fit (step 3.8). This first choice is made solely based on the nature of the data. Normally, our first choice is one of the classical distributions for which robust software for model fitting is available. Here is a rough guideline for this first choice: continuous measurements \\(\\Longrightarrow\\) normal distribution (exception: time-to-event data \\(\\Longrightarrow\\) see survival analysis); count \\(\\Longrightarrow\\) Poisson or negative-binomial distribution; count with upper bound (proportion) \\(\\Longrightarrow\\) binomial distribution; binary \\(\\Longrightarrow\\) Bernoulli distribution; rate (count by a reference) \\(\\Longrightarrow\\) Poisson including an offset; nominal \\(\\Longrightarrow\\) multinomial distribution. Chapter 4 gives an overview of the distributions that are most relevant for ecologists. 3.4 Preparation of Explanatory Variables Look at the distribution (histogram) of every explanatory variable: Linear models do not assume that the explanatory variables have any specific distribution. Thus there is no need to check for a normal distribution! However, very skewed distributions result in unequal weighting of the observations in the model. In extreme cases, the slope of a regression line is defined by one or a few observations only. We also need to check whether the variance is large enough, and to think about the shape of the expected effect. The following four questions may help with this step: Is the variance (of the explanatory variable) big enough so that an effect of the variable can be measured? 
Is the distribution skewed? If an explanatory variable is highly skewed, it may make sense to transform the variable (e.g., log, square-root). Does it show a bimodal distribution? Consider making the variable binary. Is it expected that a change of 1 at lower values for x has the same biological effect as a change of 1 at higher values of x? If not, a transformation (e.g., log) could linearize the relationship between x and y. Centering: Centering (\\(x.c = x-mean(x)\\)) is a transformation that produces a variable with a mean of 0. Centering is optional. We have two reasons to center a predictor variable. First, it helps the model fitting algorithm to converge better because it reduces correlations among model parameters. Second, with centered predictors, the intercept and main effects in the linear model are easier to interpret (they are measured at the center of the data instead of at the covariate value of 0, which may be far off). Scaling: Scaling (\\(x.s = x/c\\), where \\(c\\) is a constant) is a transformation that changes the unit of the variable. Scaling is also optional. We have three reasons to scale a predictor variable. First, to make the effect sizes easier to understand. For example, a population change from one year to the next may be very small and hard to interpret. When we give the change for a 10-year period, its ecological meaning is easier to understand. Second, to make the estimates of the effect sizes comparable between variables, we may use \\(x.s = x/sd(x)\\). The resulting variable has a unit of one standard deviation. A standard deviation may be comparable between variables that originally are measured in different units (meters, seconds etc). A. Gelman and Hill (2007) (p. 55 f) propose to scale the variables by two times the standard deviation (\\(x.s = x/(2*sd(x))\\)) to make effect sizes comparable between numeric and binary variables. 
Third, scaling can be important for model convergence, especially when polynomials are included. Also, consider the use of orthogonal polynomials, see Chapter 4.2.9 in Korner-Nievergelt et al. (2015). Collinearity: Look at the correlation among the explanatory variables (pairs plot or correlation matrix). If the explanatory variables are correlated, go back to step 2. Also, Chapter 4.2.7 in Korner-Nievergelt et al. (2015) discusses collinearity. Are interactions and polynomial terms needed in the model? If not already done in step 2, think about the relationship between each explanatory variable and the dependent variable. Is it linear or do polynomial terms have to be included in the model? If the relationship cannot be described appropriately by polynomial terms, think of a nonlinear model or a generalized additive model (GAM). May the effect of one explanatory variable depend on the value of another explanatory variable (interaction)? 3.5 Data Structure After having taken into account all of the (fixed effect) terms from step 4: are the observations independent or grouped/structured? What random factors are needed in the model? Are the data obviously temporally or spatially correlated? Or, are other correlation structures present, such as phylogenetic relationships? Our strategy is to start with a rather simple model that may not account for all correlation structures that in fact are present in the data. At first, we include only those that are known a priori to be important. Only when residual analyses reveal important additional correlation structures do we include them in the model. 3.6 Define Prior Distributions Decide whether we would like to use informative prior distributions or whether we would like to use priors that only have a negligible effect on the results. When the results are later used for informing authorities or for making a decision (as usual in applied sciences), then we would like to base the results on all information available. 
Information from the literature is then used to construct informative prior distributions. In contrast to applied sciences, in basic research we often would like to show only the information in the data, which should not be influenced by earlier results. Therefore, in basic research we look for priors that do not influence the results. 3.7 Fit the Model Fit the model. 3.8 Check Model We assess model fit by graphical analyses of the residuals (Chapter 6 in Korner-Nievergelt et al. (2015)), by predictive model checking (Section 10.1 in Korner-Nievergelt et al. (2015)), or by sensitivity analysis (Chapter 15 in Korner-Nievergelt et al. (2015)). For non-Gaussian models it is often easier to assess model fit using posterior predictive checks (Chapter 10 in Korner-Nievergelt et al. (2015)) rather than residual analyses. Posterior predictive checks usually show clearly in which aspect the model failed so we can go back to step 2 of the analysis. Recognizing in what aspect a model does not fit the data based on residual plots improves with experience. Therefore, we list in Chapter 16 of Korner-Nievergelt et al. (2015) some patterns that can appear in residual plots together with what these patterns possibly indicate. We also indicate what could be done in the specific cases. 3.9 Model Uncertainty If, while working through steps 1 to 8, possibly repeatedly, we came up with more than one model that fits the data reasonably well, we then turn to the methods presented in Chapter 11 (Korner-Nievergelt et al. (2015)) to draw inference from more than one model. If we have only one model, we proceed to 3.10. 3.10 Draw Conclusions Simulate values from the joint posterior distribution of the model parameters (sim or Stan). Use these samples to present parameter uncertainty, to obtain posterior distributions for predictions, probabilities of specific hypotheses, and derived quantities. 
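As an illustration of step 3.10, the following sketch simulates values from the joint posterior distribution of a linear model in base R and uses them for an interval, a hypothesis probability and a derived quantity. It rests on several assumptions that are not part of the checklist: the built-in data set `cars` stands in for real data, flat prior distributions are assumed, and the simulation is written out by hand to show the principle (to our understanding, this is similar in spirit to what the function sim does for lm objects, but it is not its implementation).

```r
# Hypothetical example: built-in data set 'cars' (stopping distance vs. speed)
mod <- lm(dist ~ speed, data = cars)

set.seed(1)                         # arbitrary seed, only for reproducibility
nsim <- 5000
bhat <- coef(mod)                   # least-squares estimates
V <- summary(mod)$cov.unscaled      # unscaled covariance matrix (X'X)^-1
df <- df.residual(mod)              # residual degrees of freedom
s2 <- summary(mod)$sigma^2          # residual variance estimate

# Assuming flat priors:
# sigma^2 follows a scaled inverse chi-square distribution (df, s2),
# beta given sigma^2 is multivariate normal (bhat, sigma^2 * V)
sigma2 <- df * s2 / rchisq(nsim, df = df)
L <- t(chol(V))                     # lower-triangular factor, L %*% t(L) equals V
z <- matrix(rnorm(nsim * length(bhat)), nrow = length(bhat))
# column j of (L %*% z) is scaled by sqrt(sigma2[j]) and shifted by bhat
beta <- t(bhat + (L %*% z) * rep(sqrt(sigma2), each = length(bhat)))
colnames(beta) <- names(bhat)

# parameter uncertainty: 95% compatibility interval of the slope
quantile(beta[, "speed"], probs = c(0.025, 0.5, 0.975))

# probability of a specific hypothesis: slope larger than 3
mean(beta[, "speed"] > 3)

# derived quantity: expected stopping distance at a speed of 15
fit15 <- beta[, "(Intercept)"] + beta[, "speed"] * 15
quantile(fit15, probs = c(0.025, 0.975))
```

Every row of `beta` is one draw from the joint posterior distribution, so any quantity computed row by row (here `fit15`) inherits the parameter uncertainty without further algebra.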
Further reading R for Data Science by Garrett Grolemund and Hadley Wickham: Introduces the tidyverse framework. It explains how to get data into R, get it into the most useful structure, transform it, visualise it and model it. "],["distributions.html", "4 Probability distributions 4.1 Introduction 4.2 Discrete distributions 4.3 Continuous distributions", " 4 Probability distributions 4.1 Introduction In Bayesian statistics, probability distributions are used for two fundamentally different purposes. First, they are used to describe distributions of data. These distributions are also called data distributions. Second, probability distributions are used to express information or knowledge about parameters. Such distributions are called prior or posterior distributions. The data distributions are part of descriptive statistics, whereas prior and posterior distributions are part of inferential statistics. The usage of probability distributions for describing data does not differ between frequentist and Bayesian statistics. Classically, the data distribution is known as “model assumption”. Specific to Bayesian statistics is the formal expression of statistical uncertainty (or “information” or “knowledge”) by prior and posterior distributions. We here introduce some of the most often used probability distributions and present how they are used in statistics. Probability distributions are grouped into discrete and continuous distributions. Discrete distributions define for any discrete value the probability that exactly this value occurs. They are usually used as data distributions for discrete data such as counts. The function that describes a discrete distribution is called a probability function (its values are probabilities, i.e., numbers between 0 and 1). Continuous distributions describe how continuous values are distributed. 
They are used as data distributions for continuous measurements such as body size and also as prior or posterior distributions for parameters such as the mean body size. Most parameters are measured on a continuous scale. The function that describes continuous distributions is called density function. Its values are non-negative and the area under the density function equals one. The area under a density function corresponds to probabilities. For example, the area under the density function above the value 2 corresponds to the proportion of data with values above 2 if the density function describes data, or it corresponds to the probability that the parameter takes on a value bigger than 2 if the density function is a posterior distribution. 4.2 Discrete distributions 4.2.1 Bernoulli distribution Bernoulli distributed data take on the exact values 0 or 1. The value 1 occurs with probability \\(p\\). \\(x \\sim Bernoulli(p)\\) The probability function is \\(p(x) = p^x(1-p)^{1-x}\\). The expected value is \\(E(x) = p\\) and the variance is \\(Var(x) = p(1-p)\\). The flipping experiment of a fair coin produces Bernoulli distributed data with \\(p=0.5\\) if head is taken as one and tail is taken as zero. The Bernoulli distribution is usually used as a data model for binary data such as whether a nest box is used or not, whether a seed germinated or not, whether a species occurs or not in a plot etc. 4.2.2 Binomial distribution The binomial distribution describes the number of ones among a predefined number of Bernoulli trials. For example, the number of heads among 20 coin flips, the number of used nest boxes among the 50 nest boxes of the study area, or the number of seeds that germinated among the 10 seeds in the pot. Binomially distributed data are counts with an upper limit (\\(n\\)). \\(x \\sim binomial(p,n)\\) The probability function is \\(p(x) = {n\\choose x} p^x(1-p)^{(n-x)}\\). The expected value is \\(E(x) = np\\) and the variance is \\(Var(x) = np(1-p)\\). 
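The binomial formulas above can be checked numerically. The following is a minimal sketch, assuming 10 seeds per pot and an illustrative germination probability of 0.3 (both values are chosen here only for demonstration). It compares the probability function written out by hand with R's `dbinom` and recovers the expected value and the variance from the probabilities.

```r
# Binomial example: number of germinated seeds among n = 10 seeds per pot
n <- 10    # number of trials (called `size` in the R functions)
p <- 0.3   # assumed germination probability (illustrative value)

x <- 0:n                                           # all possible outcomes
px <- dbinom(x, size = n, prob = p)                # probability function in R
px_manual <- choose(n, x) * p^x * (1 - p)^(n - x)  # formula from the text

all.equal(px, px_manual)   # both versions agree
sum(px)                    # the probabilities sum to one

# expected value and variance computed from the probability function
Ex <- sum(x * px)
Varx <- sum((x - Ex)^2 * px)
c(Ex, n * p)               # E(x) = np
c(Varx, n * p * (1 - p))   # Var(x) = np(1-p)
```

Calculating the moments directly from the probability function, rather than from simulated data, avoids simulation error, so the two values in each pair agree up to numerical precision.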
Figure 4.1: Two examples of a binomial distribution. size: number of trials (the argument in the corresponding R function, for example in rbinom, is called size). p: success probability. 4.2.3 Poisson distribution The Poisson distribution describes the distribution of counts without upper boundary, i.e., when we know how many times something happened but we do not know how many times it did not happen. A typical Poisson distributed variable is the number of raindrops in equally-sized grid cells on the floor, if we can assume that every raindrop falls down completely independent of the other raindrops and at a completely random point (Figure 4.2). \\(x \\sim Poisson(\\lambda)\\) The probability function is \\(p(x) = \\frac{1}{x!}\\lambda^xexp(-\\lambda)\\). It is implemented in the R-function dpois. The expected value is \\(E(x) = \\lambda\\) and the variance is \\(Var(x) = \\lambda\\). set.seed(1338) n <- 500 # simulate 500 raindrops x <- runif(n) # they fall at some random point (x,y) in space y <- runif(n) par(mfrow=c(1,2)) par(mar=c(1,1,1,1)) plot(c(0,1), c(0,1), type="n", xaxs="i", yaxs="i", xlab="", ylab="", axes=FALSE) box() points(x,y, pch=16) # add a grid grid(10, 10, col=1) # number of points per grid-cell xcell <- cut(x, breaks=seq(0,1, by=0.1)) ycell <- cut(y, breaks=seq(0,1, by=0.1)) counts <- as.numeric(table(xcell, ycell)) par(mar=c(4,4,1,1)) hist(counts, col="blue", cex.lab=1.4, las=1, cex.axis=1.2, main="") Figure 4.2: A natural process that produces Poisson distributed data is the number of raindrops falling (at random) into equally sized cells of a grid. Left: spatial distribution of raindrops, right: corresponding distribution of the number of raindrops per cell. An important property of the Poisson distribution is that it has only one parameter \\(\\lambda\\). As a consequence, it does not allow for any combination of means and variances. In fact, they are assumed to be the same. 
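That the mean and the variance of a Poisson distribution coincide can be illustrated with simulated counts. The sketch below is only an illustration under assumed values: the rate of 5 and the two-rate mixture used to mimic clustered counts are our own choices and do not come from a real data set.

```r
set.seed(1338)
lambda <- 5                   # assumed mean count per cell
nsim <- 10000

# counts from a Poisson distribution: variance is close to the mean
x_pois <- rpois(nsim, lambda)
ratio_pois <- var(x_pois) / mean(x_pois)
ratio_pois                    # close to 1

# the probability function written out by hand agrees with dpois
k <- 0:20
all.equal(dpois(k, lambda), lambda^k * exp(-lambda) / factorial(k))

# hypothetical clustered counts: half of the cells have a low rate (1),
# the other half a high rate (9); the overall mean is still 5
rates <- rep(c(1, 9), each = nsim / 2)
x_clustered <- rpois(nsim, rates)
ratio_clustered <- var(x_clustered) / mean(x_clustered)
ratio_clustered               # clearly larger than 1
```

A variance-to-mean ratio well above one, as in the mixture, is the kind of pattern that a single Poisson distribution cannot reproduce.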
In the real world, most count data do not behave like raindrops, that means variances of count data are in most real world examples not equal to the mean as assumed by the Poisson distribution. Therefore, when using the Poisson distribution as a data model, it is important to check for overdispersion. The property that in a Poisson distribution the mean equals the variance can be used to quickly assess whether the spatial distribution of observations, for example, nest locations, is clustered, random, or equally spaced. Animal locations could be clustered due to coloniality, social, or other attraction. More equally spaced locations may be due to territoriality. Let \\(x\\) be the number of observations per grid cell; if \\(var(x)/mean(x)>>1\\) the observations are clustered, whereas if \\(var(x)/mean(x)<<1\\), the observations are more equally spaced than expected by chance. Clustering will lead to overdispersion in the counts whereas more equally spaced locations will lead to underdispersion. Further, note that not all variables measured as an integer number are count data! For example, the number of days an animal spends in a specific area before moving away looks like a count. However, it is a continuous measurement. The duration an animal spends in a specific area could also be measured in hours or minutes. The Poisson model assumes that the counts are all events that happened. However, an emigration of an animal is just one event, independent of how long it stayed. 4.2.4 Negative-binomial distribution The negative-binomial distribution represents the number of zeros which occur in a sequence of Bernoulli trials before a target number of ones is reached. It is hard to see this situation in, e.g., the number of individuals counted on plots. Therefore, we were reluctant to introduce this distribution in our old book (Korner-Nievergelt et al. 2015). 
However, the negative-binomial distribution often fits much better to count data than the Poisson model because it has two parameters and therefore allows for fitting both the mean and the variance to the data. Therefore, we started using the negative-binomial distribution as a data model more often. \\(x \\sim negative-binomial(p,n)\\) Its probability function is rather complex: \\(p(x) = \\frac{\\Gamma(x+n)}{\\Gamma(n) x!} p^n (1-p)^x\\) with \\(\\Gamma\\) being the Gamma-function. Luckily, the negative-binomial probability function is implemented in the R-function dnbinom. The expected value of the negative-binomial distribution is \\(E(x) = n\\frac{(1-p)}{p}\\) and the variance is \\(Var(x) = n\\frac{(1-p)}{p^2}\\). We like to specify the distribution using the mean and the scale parameter \\(x \\sim negative-binomial(\\mu,\\theta)\\), because in practice we often specify a linear predictor for the logarithm of the mean \\(\\mu\\). 4.3 Continuous distributions 4.3.1 Beta distribution The beta distribution is restricted to the range [0,1]. It describes the knowledge about a probability parameter. Therefore, it is usually used as a prior or posterior distribution for probabilities. The beta distribution sometimes is used as a data model for continuous probabilities. However, it is difficult to get a good fit of such models, because measured proportions often take on values of zero and one, which are not allowed in most (but not all) beta distributions, thus this distribution does not describe the variance of measured proportions correctly. However, for describing knowledge of a proportion parameter, it is a very convenient distribution with two parameters. \\(x \\sim beta(a,b)\\) Its density function is \\(p(x) = \\frac{\\Gamma(a+b)}{\\Gamma(a)\\Gamma(b)}x^{a-1}(1-x)^{b-1}\\). The R-function dbeta does the rather complicated calculations for us. 
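To see what the R-function for the beta density computes, the density formula above can be written out by hand and compared with `dbeta`. This is a small sketch with arbitrarily chosen parameter values (a = 2, b = 5); it also checks that the area under the density equals one.

```r
a <- 2   # first parameter (illustrative value)
b <- 5   # second parameter (illustrative value)
x <- seq(0.01, 0.99, by = 0.01)

# density formula from the text, written out by hand
dens_manual <- gamma(a + b) / (gamma(a) * gamma(b)) *
  x^(a - 1) * (1 - x)^(b - 1)

all.equal(dbeta(x, shape1 = a, shape2 = b), dens_manual)  # both agree

# the area under the density function over [0,1] equals one
area <- integrate(dbeta, lower = 0, upper = 1, shape1 = a, shape2 = b)$value
area
```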
The expected value of a beta distribution is \\(E(x) = \\frac{a}{(a+b)}\\) and the variance is \\(Var(x) = \\frac{ab}{(a+b)^2(a+b+1)}\\). The \\(beta(1,1)\\) distribution is equal to the \\(uniform(0,1)\\) distribution. The higher the sum of \\(a\\) and \\(b\\), the narrower is the distribution (Figure 4.3). Figure 4.3: Beta distributions with different parameter values. 4.3.2 Normal distribution The normal, or Gaussian, distribution has long been widely used in statistics. It describes the distribution of measurements that vary because of a sum of random errors. Based on the central limit theorem, sample averages are approximately normally distributed (2). \\(x \\sim normal(\\mu, \\sigma^2)\\) The density function is \\(p(x) = \\frac{1}{\\sqrt{2\\pi}\\sigma}exp(-\\frac{1}{2\\sigma^2}(x -\\mu)^2)\\) and it is implemented in the R-function dnorm. The expected value is \\(E(x) = \\mu\\) and the variance is \\(Var(x) = \\sigma^2\\). The variance parameter can be specified to be a variance, a standard deviation or a precision. Different software (or authors) have different habits, e.g., R and Stan use the standard deviation \\(\\sigma\\), whereas BUGS (WinBugs, OpenBUGS or jags) use the precision, which is the inverse of the variance \\(\\tau = \\frac{1}{\\sigma^2}\\). The normal distribution is used as a data model for measurements that scatter symmetrically around a mean, such as body size (in m), food consumption (in g), or body temperature (°C). The normal distribution also serves as prior distribution for parameters that can take on negative or positive values. The larger the variance, the flatter (less informative) is the distribution. The standard normal distribution is a normal distribution with a mean of zero and a variance of one, \\(z \\sim normal(0, 1)\\). The standard normal distribution is also called the z-distribution. Or, a z-variable is a variable with a mean of zero and a standard deviation of one. 
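Because the spread of a normal distribution may be given as a variance, a standard deviation, or a precision, it helps to convert between the three explicitly. The following sketch uses made-up values (a variance of 4, and simulated measurements with mean 20 and standard deviation 3). It also checks the density formula against `dnorm` and shows how a variable is transformed to a z-variable.

```r
mu <- 0
sigma2 <- 4            # variance (illustrative value)
sigma <- sqrt(sigma2)  # standard deviation, as used by R and Stan
tau <- 1 / sigma2      # precision, as used by BUGS and jags

# density formula from the text compared with dnorm (which expects the sd)
x <- seq(-6, 6, length = 7)
dens_manual <- 1 / (sqrt(2 * pi) * sigma) * exp(-1 / (2 * sigma2) * (x - mu)^2)
all.equal(dnorm(x, mean = mu, sd = sigma), dens_manual)

# z-transformation: subtract the mean and divide by the standard deviation
set.seed(1)
y <- rnorm(1000, mean = 20, sd = 3)   # simulated measurements
z <- (y - mean(y)) / sd(y)
c(mean(z), sd(z))                     # mean of zero, standard deviation of one
```

Passing a variance or a precision to the `sd` argument of `dnorm` is an easy mistake to make when translating a model between R and BUGS-type software.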
x <- seq(-3, 3, length=100) y <- dnorm(x) # density function of a standard normal distribution dat <- tibble(x=x,y=y) plot(x,y, type="l", lwd=2, col="#d95f0e", las=1, ylab="normal density of x") segments(0, dnorm(1), 1, dnorm(1), lwd=2) segments(0, dnorm(0), 0, 0) text(0.5, 0.23, expression(sigma)) Figure 4.4: Standard normal distribution Plus minus one times the standard deviation (\\(\\sigma\\)) from the mean includes around 68% of the area under the curve (corresponding to around 68% of the data points in case the normal distribution is used as a data model, or 68% of the prior or posterior mass if the normal distribution is used to describe the knowledge about a parameter). Plus minus two times the standard deviation includes around 95% of the area under the curve. 4.3.3 Gamma distribution The gamma distribution is a continuous probability distribution for strictly positive values (zero is not included). The shape of the gamma distribution is right skewed with a long upper tail, whereas most of the mass is centered around a usually small value. It has two parameters, the shape \\(\\alpha\\) and the inverse scale \\(\\beta\\). \\(x \\sim gamma(\\alpha,\\beta)\\) Its density function is \\(p(x) = \\frac{\\beta^{\\alpha}}{\\Gamma(\\alpha)} x^{(\\alpha-1)} exp(-\\beta x)\\), or dgamma in R. The expected value is \\(E(x) = \\frac{\\alpha}{\\beta}\\) and the variance is \\(Var(x) = \\frac{\\alpha}{\\beta^2}\\). The gamma distribution is becoming more and more popular as a data model for durations (time to event) or other highly right skewed continuous measurements that do not have values of zero. The gamma distribution is a conjugate prior distribution for the mean of a Poisson distribution and for the precision parameter of a normal distribution. However, in hierarchical models with normally distributed random effects, it is not recommended to use the gamma distribution as a prior distribution for the among-group variance (A. Gelman 2006). 
The Cauchy or folded t-distributions seem to have less influence on the posterior distributions of the variance parameters. 4.3.4 Cauchy distribution The Cauchy distribution is a symmetric distribution with much heavier tails compared to the normal distribution. \\(x \\sim Cauchy(a,b)\\) Its probability density function is \\(p(x) = \\frac{1}{\\pi b[1+(\\frac{x-a}{b})^2]}\\). The mean and the variance of the Cauchy distribution are not defined. The median is \\(a\\). The part of the Cauchy distribution for positive values, i.e., half of the Cauchy distribution, is often used as a prior distribution for variance parameters. 4.3.5 t-distribution The t-distribution is the marginal posterior distribution of the mean of a sample with unknown variance when conjugate prior distributions are used to obtain the posterior distribution. The t-distribution has three parameters, the degrees of freedom \\(v\\), the location \\(\\mu\\) and the scale \\(\\sigma\\). \\(x \\sim t(v, \\mu, \\sigma)\\) Its density function is \\(p(x) = \\frac{\\Gamma((v+1)/2)}{\\Gamma(v/2)\\sqrt{v\\pi}\\sigma}(1+\\frac{1}{v}(\\frac{x-\\mu}{\\sigma})^2)^{-(v+1)/2}\\). Its expected value is \\(E(x) = \\mu\\) for \\(v>1\\) and the variance is \\(Var(x) = \\frac{v}{v-2}\\sigma ^2\\) for \\(v>2\\). The t-distribution is sometimes used as data model. Because of its heavier tails compared to the normal model, the model parameters are less influenced by measurement errors when a t-distribution is used instead of a normal distribution. This is called “robust statistics”. Similar to the Cauchy distribution, the folded t-distribution, i.e., the positive part of the t-distribution, can serve as a prior distribution for variance parameters. 4.3.6 F-distribution The F-distribution is not important in Bayesian statistics. Ratios of sample variances drawn from populations with equal variances follow an F-distribution. The density function of the F-distribution is even more complicated than the one of the t-distribution! 
We do not copy it here. Further, we have not yet met any Bayesian example where the F-distribution is used (that does not mean that there is none). It is used in frequentist analyses in order to compare variances, e.g. within ANOVAs. If two variances only differ because of natural variance in the data (null hypothesis) then \\(\\frac{Var(X_1)}{Var(X_2)}\\sim F_{df_1,df_2}\\). Figure 4.5: Different density functions of the F statistics "],["rfunctions.html", "5 Important R-functions 5.1 Data preparation 5.2 Figures 5.3 Summary", " 5 Important R-functions THIS CHAPTER IS UNDER CONSTRUCTION!!! 5.1 Data preparation 5.2 Figures 5.3 Summary "],["reproducibleresearch.html", "6 Reproducible research 6.1 Summary 6.2 Further reading", " 6 Reproducible research THIS CHAPTER IS UNDER CONSTRUCTION!!! 6.1 Summary 6.2 Further reading Rmarkdown: The first official book authored by the core R Markdown developers that provides a comprehensive and accurate reference to the R Markdown ecosystem. With R Markdown, you can easily create reproducible data analysis reports, presentations, dashboards, interactive applications, books, dissertations, websites, and journal articles, while enjoying the simplicity of Markdown and the great power of R and other languages. Bookdown by Yihui Xie: A guide to authoring books with R Markdown, including how to generate figures and tables, and insert cross-references, citations, HTML widgets, and Shiny apps in R Markdown. The book can be exported to HTML, PDF, and e-books (e.g. EPUB). The book style is customizable. You can easily write and preview the book in RStudio IDE or other editors, and host the book wherever you want (e.g. bookdown.org). Our book is written using bookdown. "],["furthertopics.html", "7 Further topics 7.1 Bioacoustic analyse 7.2 Python", " 7 Further topics This is a collection of short introductions or links with commented R code that cover other topics that might be useful for ecologists. 
7.1 Bioacoustic analyse Bioacoustic analyses are nicely covered in a blog by Marcelo Araya-Salas. 7.2 Python Like R, Python is a high-level programming language that is used by many ecologists. The reticulate package provides a comprehensive set of tools for interoperability between Python and R. "],["PART-II.html", "8 Introduction to PART II Further reading", " 8 Introduction to PART II Further reading A really good introductory book to Bayesian data analyses is McElreath (2016). This book starts with a thorough introduction to applying the Bayes theorem for drawing inference from data. In addition, it carefully discusses what can and what cannot be concluded from statistical results. We like this very much. We like looking up statistical methods in papers and books written by Andrew Gelman (e.g. A. Gelman et al. 2014b) and Trevor Hastie (e.g. Efron and Hastie (2016)) because both explain complicated things in a concise and understandable way. "],["bayesian_paradigm.html", "9 The Bayesian paradigm 9.1 Introduction 9.2 Summary", " 9 The Bayesian paradigm THIS CHAPTER IS UNDER CONSTRUCTION!!! 9.1 Introduction 9.2 Summary xxx "],["priors.html", "10 Prior distributions and prior sensitivity analyses 10.1 Introduction 10.2 How to choose a prior 10.3 Prior sensitivity", " 10 Prior distributions and prior sensitivity analyses 10.1 Introduction The prior is an integral part of a Bayesian model. We must specify one. When to use informative priors: In practice (management, politics etc.) we would like to base our decisions on all information available. Therefore, we consider it responsible to include informative priors in applied research whenever possible. Priors allow combining information from the literature with information in data or combining information from different data sets. 
When to use non-informative, flat or weakly informative priors: in basic research when results should only report the information in the current data set it may be reasonable to use non-informative priors. Results from a case study may later be used in a meta-analysis that assumes independence across the different studies included. However, flat priors are not always non-informative, may lead to overconfidence in spuriously large effects (similar to frequentist methods) and may be accompanied by computational difficulties. Therefore, weakly informative priors are recommended (Lemoine 2019). 10.2 How to choose a prior The Stan development team gives a profound and up-to-date prior choice recommendation. We are not yet sure what we can further add here that may be useful, as we normally check the prior choice recommendation by the Stan development team. Further references: Lemoine (2019) A. Gelman (2006) 10.2.1 Priors for variance parameters A. Gelman (2006) discusses advantages of using folded t-distributions or Cauchy distributions as prior distributions for variance parameters in hierarchical models. When specifying t-distributions, we find it hard to imagine what the distributions look like for given parameter values. Therefore, we simulate values from different distributions and look at the histograms. Because the parameterisation of the t-distribution differs among software languages, it is important to use the software in which the model is finally fitted. In Figure 10.1 we give some examples of folded t-distributions specified in jags using different values for the precision (second parameter) and degrees of freedom (third parameter). Figure 10.1: Folded t-distributions with different precisions and degrees of freedom. The panel titles give the jags code of the distribution. Dark blue vertical lines indicate 90% quantiles, light-blue lines indicate 98% quantiles. 
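The idea of simulating values from a candidate prior and inspecting their distribution can be sketched in plain R. This is not the jags code behind Figure 10.1; it is an illustration under our own assumptions. The helper `rfoldedt` is hypothetical (it is not part of any package), and we assume that a precision translates into a scale of 1/sqrt(precision), so results should be checked against the software in which the model is finally fitted.

```r
set.seed(123)
nsim <- 10000

# hypothetical helper: draws from a folded t-distribution with location zero
rfoldedt <- function(n, df, scale) {
  abs(scale * rt(n, df = df))
}

precision <- 1                  # illustrative value
scale <- 1 / sqrt(precision)    # assumed conversion from precision to scale
x <- rfoldedt(nsim, df = 2, scale = scale)

# 90% and 98% quantiles of the simulated values
q <- quantile(x, probs = c(0.9, 0.98))
q

# histogram of the bulk of the distribution (the long tail is cut at 20)
hist(x[x < 20], breaks = 50, main = "folded t, df = 2, scale = 1",
     xlab = "simulated value")
abline(v = q, col = c("darkblue", "lightblue"), lwd = 2)
```

Repeating this for several combinations of degrees of freedom and scale gives a feeling for how much prior mass each specification places on large values of the variance parameter.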
Todo: give examples for Stan 10.3 Prior sensitivity Todo: it may be helpful to present a worked-through example of a prior sensitivity analysis? "],["lm.html", "11 Normal Linear Models 11.1 Linear regression 11.2 Linear model with one categorical predictor (one-way ANOVA) 11.3 Other variants of normal linear models: Two-way anova, analysis of covariance and multiple regression 11.4 Partial coefficients and some comments on collinearity 11.5 Ordered Factors and Contrasts 11.6 Quadratic and Higher Polynomial Terms", " 11 Normal Linear Models 11.1 Linear regression 11.1.1 Background Linear regression is the basis of a large part of applied statistical analysis. Analysis of variance (ANOVA) and analysis of covariance (ANCOVA) can be considered special cases of linear regression, and generalized linear models are extensions of linear regression. Typical questions that can be answered using linear regression are: How does \\(y\\) change with changes in \\(x\\)? How is y predicted from \\(x\\)? An ordinary linear regression (i.e., one numeric \\(x\\) and one numeric \\(y\\) variable) can be represented by a scatterplot of \\(y\\) against \\(x\\). We search for the line that fits best and describe how the observations scatter around this regression line (see Fig. 11.2 for an example). The model formula of a simple linear regression with one continuous predictor variable \\(x_i\\) (the subscript \\(i\\) denotes the \\(i=1,\\dots,n\\) data points) is: \\[\\begin{align} \\mu_i &=\\beta_0 + \\beta_1 x_i \\\\ y_i &\\sim normal(\\mu_i, \\sigma^2) \\tag{11.1} \\end{align}\\] While the first part of Equation (11.1) describes the regression line, the second part describes how the data points, also called observations, are distributed around the regression line (Figure 11.1). In other words: the observation \\(y_i\\) stems from a normal distribution with mean \\(\\mu_i\\) and variance \\(\\sigma^2\\). 
The mean of the normal distribution, \\(\\mu_i\\) , equals the sum of the intercept (\\(\\beta_0\\) ) and the product of the slope (\\(\\beta_1\\)) and the continuous predictor value, \\(x_i\\). Equation (11.1) is called the data model, because it describes mathematically the process that has (or, better, that we think has) produced the data. This nomenclature also helps to distinguish data models from models for parameters such as prior or posterior distributions. The differences between the observations \\(y_i\\) and the predicted values \\(\\mu_i\\) are the residuals (i.e., \\(\\epsilon_i=y_i-\\mu_i\\)). Equivalently to Equation (11.1), the regression could thus be written as: \\[\\begin{align} y_i &= \\beta_0 + \\beta_1 x_i + \\epsilon_i\\\\ \\epsilon_i &\\sim normal(0, \\sigma^2) \\tag{11.2} \\end{align}\\] We prefer the notation in Equation (11.1) because, in this formula, the stochastic part (second row) is nicely separated from the deterministic part (first row) of the model, whereas, in the second notation (11.2) the first row contains both stochastic and deterministic parts. For illustration, we here simulate a data set and below fit a linear regression to these simulated data. The advantage of simulating data is that the following analyses can be reproduced without having to read data into R. Further, for simulating data, we need to translate the algebraic model formula into R language which helps us understand the model structure. 
set.seed(34) # set a seed for the random number generator # define the data structure n <- 50 # sample size x <- runif(n, 10, 30) # sample values of the predictor variable # define values for each model parameter sigma <- 5 # standard deviation of the residuals b0 <- 2 # intercept b1 <- 0.7 # slope # simulate y-values from the model mu <- b0 + b1 * x # define the regression line (deterministic part) y <- rnorm(n, mu, sd = sigma) # simulate y-values # save data in a data.frame dat <- tibble(x = x, y = y) Figure 11.1: Illustration of a linear regression. The blue line represents the deterministic part of the model, i.e., here the regression line. The stochastic part is represented by a probability distribution, here the normal distribution. The normal distribution changes its mean but not the variance along the x-axis, and it describes how the data are distributed. The blue line and the orange distribution together are a statistical model, i.e., an abstract representation of the data which is given in black. Using matrix notation equation (11.1) can also be written in one row: \\[\\boldsymbol{y} \\sim normal(\\boldsymbol{X} \\boldsymbol{\\beta}, \\sigma^2\\boldsymbol{I})\\] where \\(\\boldsymbol{ I}\\) is the \\(n \\times n\\) identity matrix (it transforms the variance parameter to a \\(n \\times n\\) matrix with its diagonal elements equal to \\(\\sigma^2\\) ; \\(n\\) is the sample size). The multiplication by \\(\\boldsymbol{ I}\\) is necessary because we use vector notation, \\(\\boldsymbol{y}\\) instead of \\(y_{i}\\) . Here, \\(\\boldsymbol{y}\\) is the vector of all observations, whereas \\(y_{i}\\) is a single observation, \\(i\\). 
When using vector notation, we can write the linear predictor of the model, \\(\\beta_0 + \\beta_1 x_i\\) , as a multiplication of the vector of the model coefficients \\[\\boldsymbol{\\beta} = \\begin{pmatrix} \\beta_0 \\\\ \\beta_1 \\end{pmatrix}\\] times the model matrix \\[\\boldsymbol{X} = \\begin{pmatrix} 1 & x_1 \\\\ \\dots & \\dots \\\\ 1 & x_n \\end{pmatrix}\\] where \\(x_1 , \\dots, x_n\\) are the observed values for the predictor variable, \\(x\\). The first column of \\(\\boldsymbol{X}\\) contains only ones because the values in this column are multiplied with the intercept, \\(\\beta_0\\) . To the intercept, the product of the second element of \\(\\boldsymbol{\\beta}\\), \\(\\beta_1\\) , with each element in the second column of \\(\\boldsymbol{X}\\) is added to obtain the predicted value for each observation, \\(\\boldsymbol{\\mu}\\): \\[\\begin{align} \\boldsymbol{X \\beta}= \\begin{pmatrix} 1 & x_1 \\\\ \\dots & \\dots \\\\ 1 & x_n \\end{pmatrix} \\times \\begin{pmatrix} \\beta_0 \\\\ \\beta_1 \\end{pmatrix} = \\begin{pmatrix} \\beta_0 + \\beta_1x_1 \\\\ \\dots \\\\ \\beta_0 + \\beta_1x_n \\end{pmatrix}= \\begin{pmatrix} \\hat{y}_1 \\\\ \\dots \\\\ \\hat{y}_n \\end{pmatrix} = \\boldsymbol{\\mu} \\tag{11.3} \\end{align}\\] 11.1.2 Fitting a Linear Regression in R In Equation (11.1), the fitted values \\(\\mu_i\\) are directly defined by the model coefficients, \\(\\beta_{0}\\) and \\(\\beta_{1}\\) . Therefore, when we can estimate \\(\\beta_{0}\\), \\(\\beta_{1}\\) , and \\(\\sigma^2\\), the model is fully defined. The last parameter \\(\\sigma^2\\) describes how the observations scatter around the regression line and relies on the assumption that the residuals are normally distributed. The estimates for the model parameters of a linear regression are obtained by searching for the best fitting regression line. To do so, we search for the regression line that minimizes the sum of the squared residuals. 
This model fitting method is called the least-squares method, abbreviated as LS. It has a very simple solution using matrix algebra (see e.g., Aitkin et al. 2009). The least-squares estimates for the model parameters of a linear regression are obtained in R using the function lm. mod <- lm(y ~ x, data = dat) coef(mod) ## (Intercept) x ## 2.0049517 0.6880415 summary(mod)$sigma ## [1] 5.04918 The object “mod” produced by lm contains the estimates for the intercept, \\(\\beta_0\\) , and the slope, \\(\\beta_1\\). The residual standard deviation \\(\\sigma\\) is extracted using the function summary. We can show the result of the linear regression as a line in a scatter plot with the covariate (x) on the x-axis and the observations (y) on the y-axis (Fig. 11.2). Figure 11.2: Linear regression. Black dots = observations, blue solid line = regression line, orange dotted lines = residuals. The fitted values lie where the orange dotted lines touch the blue regression line. Conclusions drawn from a model depend on the model assumptions. When model assumptions are violated, estimates usually are biased and inappropriate conclusions can be drawn. We devote Chapter 12 to the assessment of model assumptions, given its importance. 11.1.3 Drawing Conclusions To answer the question about how strongly \\(y\\) is related to \\(x\\) taking into account statistical uncertainty we look at the joint posterior distribution of \\(\\boldsymbol{\\beta}\\) (vector that contains \\(\\beta_{0}\\) and \\(\\beta_{1}\\) ) and \\(\\sigma^2\\) , the residual variance. The function sim calculates the joint posterior distribution and returns simulated values from this distribution. What does sim do? It simulates parameter values from the joint posterior distribution of a model assuming flat prior distributions. 
For a normal linear regression, it first draws a random value, \\(\\sigma^*\\), from the marginal posterior distribution of \\(\\sigma\\), and then draws random values from the conditional posterior distribution for \\(\\boldsymbol{\\beta}\\) given \\(\\sigma^*\\) (A. Gelman et al. 2014a). The conditional posterior distribution of the parameter vector \\(\\boldsymbol{\\beta}\\), \\(p(\\boldsymbol{\\beta}|\\sigma^*,\\boldsymbol{y,X})\\) can be analytically derived. With flat prior distributions, it is a uni- or multivariate normal distribution \\(p(\\boldsymbol{\\beta}|\\sigma^*,\\boldsymbol{y,X})=normal(\\boldsymbol{\\hat{\\beta}},V_\\beta(\\sigma^*)^2)\\) with: \\[\\begin{align} \\boldsymbol{\\hat{\\beta}=(\\boldsymbol{X^TX})^{-1}X^Ty} \\tag{11.4} \\end{align}\\] and \\(V_\\beta = (\\boldsymbol{X^T X})^{-1}\\). The marginal posterior distribution of \\(\\sigma^2\\) is independent of specific values of \\(\\boldsymbol{\\beta}\\). It is, for flat prior distributions, an inverse chi-square distribution \\(p(\\sigma^2|\\boldsymbol{y,X})=Inv-\\chi^2(n-k,\\hat{\\sigma}^2)\\), where \\(\\hat{\\sigma}^2 = \\frac{1}{n-k}(\\boldsymbol{y}-\\boldsymbol{X\\hat{\\beta}})^T(\\boldsymbol{y}-\\boldsymbol{X\\hat{\\beta}})\\) is the estimated residual variance, and \\(k\\) is the number of model coefficients (the elements of \\(\\boldsymbol{\\beta}\\)). The marginal posterior distribution of \\(\\boldsymbol{\\beta}\\) can be obtained by integrating the conditional posterior distribution \\(p(\\boldsymbol{\\beta}|\\sigma^2,\\boldsymbol{y,X})=normal(\\boldsymbol{\\hat{\\beta}},V_\\beta\\sigma^2)\\) over the distribution of \\(\\sigma^2\\) . This results in a uni- or multivariate \\(t\\)-distribution. Because sim simulates values \\(\\beta_0^*\\) and \\(\\beta_1^*\\) always conditional on \\(\\sigma^*\\), a triplet of values (\\(\\beta_0^*\\), \\(\\beta_1^*\\), \\(\\sigma^*\\)) is one draw of the joint posterior distribution. 
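The two simulation steps described above can be written out in a few lines of base R. This is a sketch of the procedure as described in the text, not the source code of sim. The data are simulated as in the example above, and the object names (`V_beta`, `s2`, `sigma2_star`, `beta_star`) are our own choices, with `s2` denoting the estimated residual variance that serves as the scale of the inverse chi-square distribution.

```r
set.seed(34)
n <- 50
x <- runif(n, 10, 30)
y <- rnorm(n, 2 + 0.7 * x, sd = 5)      # simulated data as in the example

X <- cbind(1, x)                        # model matrix: column of ones and x
k <- ncol(X)                            # number of model coefficients
V_beta <- solve(t(X) %*% X)             # (X^T X)^-1
beta_hat <- V_beta %*% t(X) %*% y       # least-squares estimates
s2 <- sum((y - X %*% beta_hat)^2) / (n - k)   # estimated residual variance

nsim <- 1000
# step 1: draw sigma^2 from its marginal posterior, a scaled
# inverse chi-square distribution with n-k degrees of freedom and scale s2
sigma2_star <- (n - k) * s2 / rchisq(nsim, df = n - k)

# step 2: for each sigma^2 draw beta from normal(beta_hat, V_beta * sigma^2)
L <- t(chol(V_beta))                    # L %*% t(L) equals V_beta
z <- matrix(rnorm(nsim * k), nrow = k)  # independent standard normal values
beta_star <- t(as.vector(beta_hat) +
                 (L %*% z) * rep(sqrt(sigma2_star), each = k))
colnames(beta_star) <- c("(Intercept)", "x")

# each row of beta_star together with sigma2_star is one joint draw
apply(beta_star, 2, quantile, probs = c(0.025, 0.975))
quantile(sqrt(sigma2_star), probs = c(0.025, 0.975))
```

Using the Cholesky factor of `V_beta` avoids the need for an additional package to draw from a multivariate normal distribution. The least-squares estimates in `beta_hat` agree with `coef(lm(y ~ x))`.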
When we visualize the distribution of the simulated values for one parameter only, ignoring the values for the others, we display the marginal posterior distribution of that parameter. Thus, the distribution of all simulated values for the parameter \\(\\beta_0\\) is a \\(t\\)-distribution even if a normal distribution has been used for simulating the values. The \\(t\\)-distribution is a consequence of using a different \\(\\sigma^2\\)-value for every draw of \\(\\beta_0\\). Using the function sim from the package arm, we can draw values from the joint posterior distribution of the model parameters and describe the marginal posterior distribution of each model parameter using these simulated values. library(arm) nsim <- 1000 bsim <- sim(mod, n.sim = nsim) The function sim simulates (in our example) 1000 values from the joint posterior distribution of the three model parameters \\(\\beta_0\\) , \\(\\beta_1\\), and \\(\\sigma\\). These simulated values are shown in Figure 11.3. Figure 11.3: Joint (scatterplots) and marginal (histograms) posterior distribution of the model parameters. The six scatterplots show, using different axes, the three-dimensional cloud of 1000 simulations from the joint posterior distribution of the three parameters. The posterior distribution describes, given the data and the model, which values relative to each other are more likely to correspond to the parameter value we aim at measuring. It expresses the uncertainty of the parameter estimate. It shows what we know about the model parameter after having looked at the data and given the model is realistic. The 2.5% and 97.5% quantiles of the marginal posterior distributions can be used as 95% uncertainty intervals of the model parameters. The function coef extracts the simulated values for the beta coefficients, returning a matrix with nsim rows and the number of columns corresponding to the number of parameters. 
In our example, the first column contains the simulated values from the posterior distribution of the intercept and the second column contains values from the posterior distribution of the slope. The “2” in the second argument of the apply-function (see Chapter ??) indicates that the quantile function is applied columnwise. apply(X = coef(bsim), MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)) %>% round(2) ## (Intercept) x ## 2.5% -2.95 0.44 ## 97.5% 7.17 0.92 We also can calculate an uncertainty interval of the estimated residual standard deviation, \\(\\hat{\\sigma}\\). quantile(bsim@sigma, probs = c(0.025, 0.975)) %>% round(1) ## 2.5% 97.5% ## 4.2 6.3 We can further get a posterior probability for specific hypotheses, such as “The slope parameter is larger than 1” or “The slope parameter is larger than 0.5”. These probabilities are the proportion of simulated values from the posterior distribution that are larger than 1 and 0.5, respectively. sum(coef(bsim)[,2] > 1) / nsim # alternatively: mean(coef(bsim)[,2]>1) ## [1] 0.008 sum(coef(bsim)[,2] > 0.5) / nsim ## [1] 0.936 From this, there is very little evidence in the data that the slope is larger than 1, but we are quite confident that the slope is larger than 0.5 (assuming that our model is realistic). We often want to show the effect of \\(x\\) on \\(y\\) graphically, with information about the uncertainty of the parameter estimates included in the graph. To draw such effect plots, we use the simulated values from the posterior distribution of the model parameters. From the deterministic part of the model, we know the regression line \\(\\mu = \\beta_0 + \\beta_1 x_i\\). The simulation from the joint posterior distribution of \\(\\beta_0\\) and \\(\\beta_1\\) gives 1000 pairs of intercepts and slopes that describe 1000 different regression lines. We can draw these regression lines in an x-y plot (scatter plot) to show the uncertainty in the regression line estimation (Fig. 11.4, left). 
Note that in this case it is not advisable to use ggplot, because drawing many lines in one plot makes ggplot rather slow.

```r
par(mar = c(4, 4, 0, 0))
plot(x, y, pch = 16, las = 1, xlab = "Predictor (x)", ylab = "Outcome (y)")
for(i in 1:nsim) {
  abline(coef(bsim)[i,1], coef(bsim)[i,2], col = rgb(0, 0, 0, 0.05))
}
```

Figure 11.4: Regression with 1000 lines based on draws from the joint posterior distribution for the intercept and slope parameters to visualize the uncertainty of the estimated regression line.

A more convenient way to show uncertainty is to draw the 95% credible interval (CrI) of the regression line. To this end, we first define new x-values for which we would like to have the fitted values (about 100 points across the range of x will produce smooth-looking lines when connected by line segments). We save these new x-values within the new tibble newdat. Then, we create a new model matrix that contains these new x-values (newmodmat) using the function model.matrix. We then calculate the 1000 fitted values for each element of the new x (one value for each of the 1000 simulated regressions, Fig. 11.4), using matrix multiplication (%*%). We save these values in the matrix “fitmat”. Finally, we extract the 2.5% and 97.5% quantiles for each x-value from fitmat, and draw the lines for the lower and upper limits of the credible interval (Fig. 11.5).
# Calculate 95% credible interval newdat <- tibble(x = seq(10, 30, by = 0.1)) newmodmat <- model.matrix( ~ x, data = newdat) fitmat <- matrix(ncol = nsim, nrow = nrow(newdat)) for(i in 1:nsim) {fitmat[,i] <- newmodmat %*% coef(bsim)[i,]} newdat$CrI_lo <- apply(fitmat, 1, quantile, probs = 0.025) newdat$CrI_up <- apply(fitmat, 1, quantile, probs = 0.975) # Make plot regplot <- ggplot(dat, aes(x = x, y = y)) + geom_point() + geom_smooth(method = lm, se = FALSE) + geom_line(data = newdat, aes(x = x, y = CrI_lo), lty = 3) + geom_line(data = newdat, aes(x = x, y = CrI_up), lty = 3) + labs(x = "Predictor (x)", y = "Outcome (y)") regplot Figure 11.5: Regression with 95% credible interval of the posterior distribution of the fitted values. The interpretation of the 95% uncertainty interval is straightforward: We are 95% sure that the true regression line is within the credible interval (given the data and the model). As with all statistical results, this interpretation is only valid in the model world (if the world would look like the model). The larger the sample size, the narrower the interval, because each additional data point increases information about the true regression line. The uncertainty interval measures statistical uncertainty of the regression line, but it does not describe how new observations would scatter around the regression line. If we want to describe where future observations will be, we have to report the posterior predictive distribution. We can get a sample of random draws from the posterior predictive distribution \\(\\hat{y}|\\boldsymbol{\\beta},\\sigma^2,\\boldsymbol{X}\\sim normal( \\boldsymbol{X \\beta, \\sigma^2})\\) using the simulated joint posterior distributions of the model parameters, thus taking the uncertainty of the parameter estimates into account. We draw a new \\(\\hat{y}\\)-value from \\(normal( \\boldsymbol{X \\beta, \\sigma^2})\\) for each simulated set of model parameters. 
Then, we can visualize the 2.5% and 97.5% quantiles of this distribution for each new x-value. # increase number of simulation to produce smooth lines of the posterior # predictive distribution set.seed(34) nsim <- 50000 bsim <- sim(mod, n.sim=nsim) fitmat <- matrix(ncol=nsim, nrow=nrow(newdat)) for(i in 1:nsim) fitmat[,i] <- newmodmat%*%coef(bsim)[i,] # prepare matrix for simulated new data newy <- matrix(ncol=nsim, nrow=nrow(newdat)) # for each simulated fitted value, simulate one new y-value for(i in 1:nsim) { newy[,i] <- rnorm(nrow(newdat), mean = fitmat[,i], sd = bsim@sigma[i]) } # Calculate 2.5% and 97.5% quantiles newdat$pred_lo <- apply(newy, 1, quantile, probs = 0.025) newdat$pred_up <- apply(newy, 1, quantile, probs = 0.975) # Add the posterior predictive distribution to plot regplot + geom_line(data = newdat, aes(x = x, y = pred_lo), lty = 2) + geom_line(data = newdat, aes(x = x, y = pred_up), lty = 2) Figure 11.6: Regression line with 95% uncertainty interval (dotted lines) and the 95% interval of the simulated predictive distribution (broken lines). Note that we increased the number of simulations to 50,000 to produce smooth lines. Of future observations, 95% are expected to be within the interval defined by the broken lines in Fig. 11.6. Increasing sample size will not give a narrower predictive distribution because the predictive distribution primarily depends on the residual variance \\(\\sigma^2\\) which is a property of the data that is independent of sample size. The way we produced Fig. 11.6 is somewhat tedious compared to how easy we could have obtained the same figure using frequentist methods: predict(mod, newdata = newdat, interval = \"prediction\") would have produced the y-values for the lower and upper lines in Fig. 11.6 in one R-code line. However, once we have a simulated sample of the posterior predictive distribution, we have much more information than is contained in the frequentist prediction interval. 
For example, we could give an estimate for the proportion of observations greater than 20, given \\(x = 25\\). sum(newy[newdat$x == 25, ] > 20) / nsim ## [1] 0.44504 Thus, we expect 44% of future observations with \\(x = 25\\) to be higher than 20. We can extract similar information for any relevant threshold value. Another reason to learn the more complicated R code we presented here, compared to the frequentist methods, is that, for more complicated models such as mixed models, the frequentist methods to obtain confidence intervals of fitted values are much more complicated than the Bayesian method just presented. The latter can be used with only slight adaptations for mixed models and also for generalized linear mixed models. 11.1.4 Interpretation of the R summary output The solution for \\(\\boldsymbol{\\beta}\\) is the Equation (11.3). Most statistical software, including R, return an estimated frequentist standard error for each \\(\\beta_k\\). We extract these standard errors together with the estimates for the model parameters using the summary function. summary(mod) ## ## Call: ## lm(formula = y ~ x, data = dat) ## ## Residuals: ## Min 1Q Median 3Q Max ## -11.5777 -3.6280 -0.0532 3.9873 12.1374 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.0050 2.5349 0.791 0.433 ## x 0.6880 0.1186 5.800 0.000000507 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5.049 on 48 degrees of freedom ## Multiple R-squared: 0.412, Adjusted R-squared: 0.3998 ## F-statistic: 33.63 on 1 and 48 DF, p-value: 0.0000005067 The summary output first gives a rough summary of the residual distribution. However, we will do more rigorous residual analyses in Chapter 12. The estimates of the model coefficients follow. The column “Estimate” contains the estimates for the intercept \\(\\beta_0\\) and the slope \\(\\beta_1\\) . The column “Std. 
Error” contains the estimated (frequentist) standard errors of the estimates. The last two columns contain the t-value and the p-value of the classical t-test for the null hypothesis that the coefficient equals zero. The last part of the summary output gives the parameter \\(\\sigma\\) of the model, named “residual standard error”, and the residual degrees of freedom.

We think the name “residual standard error” for “sigma” is confusing, because \\(\\sigma\\) is not a measurement of uncertainty of a parameter estimate like the standard errors of the model coefficients are. \\(\\sigma\\) is a model parameter that describes how the observations scatter around the fitted values, that is, it is a standard deviation. It is independent of sample size, whereas the standard errors of the estimates for the model parameters will decrease with increasing sample size. A standard error for the estimate of \\(\\sigma\\), however, is not given in the summary output. Note that, by using Bayesian methods, we could easily obtain the standard error of the estimated \\(\\sigma\\) by calculating the standard deviation of the posterior distribution of \\(\\sigma\\).

The \\(R^2\\) and the adjusted \\(R^2\\) measure the proportion of variance in the outcome variable \\(y\\) that is explained by the predictors in the model. \\(R^2\\) is calculated from the sum of squared residuals, \\(SSR = \\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2\\), and the “total sum of squares”, \\(SST = \\sum_{i=1}^{n}(y_i - \\bar{y})^2\\), where \\(\\bar{y}\\) is the mean of \\(y\\). \\(SST\\) is a measure of total variance in \\(y\\) and \\(SSR\\) is a measure of variance that cannot be explained by the model, thus \\(R^2 = 1- \\frac{SSR}{SST}\\) is a measure of variance that can be explained by the model. If \\(SSR\\) is close to \\(SST\\), \\(R^2\\) is close to zero and the model cannot explain a lot of variance. The smaller \\(SSR\\), the closer \\(R^2\\) is to 1.
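As a small check of this definition, the following sketch computes \\(R^2\\) by hand from the sum of squared residuals and the total sum of squares and compares it with the value reported by summary. The data are simulated here for illustration only (hypothetical values), and the names `mod_r2`, `SSR`, `SST` and `R2` are introduced for this sketch.

```r
# Hypothetical data, simulated only for this sketch
set.seed(2)
x <- runif(50, 10, 30)
y <- 2 + 0.7 * x + rnorm(50, sd = 5)
mod_r2 <- lm(y ~ x)

SSR <- sum(resid(mod_r2)^2)     # squared deviations from the fitted values
SST <- sum((y - mean(y))^2)     # squared deviations from the overall mean
R2 <- 1 - SSR / SST

# Both numbers should agree
c(by_hand = R2, from_summary = summary(mod_r2)$r.squared)
```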
This version of \\(R^2\\) approaches 1 as the number of model parameters approaches the sample size, even if none of the predictor variables correlates with the outcome. It is exactly 1 when the number of model parameters equals sample size, because \\(n\\) measurements can be exactly described by \\(n\\) parameters. The adjusted \\(R^2\\), \\(R^2_{adj} = \\frac{var(y)-\\hat\\sigma^2}{var(y)}\\), takes sample size \\(n\\) and the number of model parameters \\(k\\) into account, because \\(var(y)\\) is \\(SST\\) divided by \\(n-1\\) and \\(\\hat\\sigma^2\\) is \\(SSR\\) divided by \\(n-k\\) (see explanation to variance in chapter 2). Therefore, the adjusted \\(R^2\\) is recommended as a measurement of the proportion of explained variance.

11.2 Linear model with one categorical predictor (one-way ANOVA)

The aim of analysis of variance (ANOVA) is to compare means of an outcome variable \\(y\\) between different groups. To do so in the frequentist framework, variances between and within the groups are compared using F-tests (hence the name “analysis of variance”). When doing an ANOVA in a Bayesian way, inference is based on the posterior distributions of the group means and the differences between the group means.

One-way ANOVA means that we only have one predictor variable, specifically a categorical predictor variable (in R defined as a “factor”). We illustrate the one-way ANOVA based on an example of simulated data (Fig. 11.7). We have measured weights of 30 virtual individuals for each of 3 groups. Possible research questions could be: How big are the differences between the group means? Are individuals from group 2 heavier than the ones from group 1? Which group mean is higher than 7.5 g?
```r
# settings for the simulation
set.seed(626436)
b0 <- 12      # mean of group 1 (reference group)
sigma <- 2    # residual standard deviation
b1 <- 3       # difference between group 1 and group 2
b2 <- -5      # difference between group 1 and group 3
n <- 90       # sample size

# generate data
group <- factor(rep(c("group 1","group 2", "group 3"), each=30))
simresid <- rnorm(n, mean=0, sd=sigma) # simulate residuals
y <- b0 +
  as.numeric(group=="group 2") * b1 +
  as.numeric(group=="group 3") * b2 +
  simresid
dat <- tibble(y, group)

# make figure
dat %>%
  ggplot(aes(x = group, y = y)) +
  geom_boxplot(fill = "orange") +
  labs(y = "Weight (g)", x = "") +
  ylim(0, NA)
```

Figure 11.7: Weights (g) of the 30 individuals in each group. The dark horizontal line is the median, the box contains 50% of the observations (i.e., the interquartile range), the whiskers mark the range of all observations that are less than 1.5 times the interquartile range away from the edge of the box.

An ANOVA is a linear regression with a categorical predictor variable instead of a continuous one. The categorical predictor variable with \\(k\\) levels is (as a default in R) transformed to \\(k-1\\) indicator variables. An indicator variable is a binary variable containing 0 and 1 where 1 indicates a specific level (a category of the predictor variable). Often, one indicator variable is constructed for every level except for the reference level. In our example, the categorical variable is “group” with the three levels “group 1”, “group 2”, and “group 3” (\\(k = 3\\)). Group 1 is taken as the reference level (by default, R takes the level that comes first alphabetically), and for each of the other two groups an indicator variable is constructed, \\(I(group_i = 2)\\) and \\(I(group_i = 3)\\). The function \\(I()\\) returns 1 if the expression is true and 0 otherwise.
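How R builds such indicator variables can be sketched on a small hypothetical factor with five observations (not the simulated data set above). The names `I_g2` and `I_g3` are introduced here for illustration only; the sketch assumes R's default treatment coding, where the first factor level is the reference.

```r
# Hypothetical factor with three levels, used only for this sketch
group <- factor(c("group 1", "group 1", "group 2", "group 3", "group 3"))

# Indicator variables built by hand (group 1 is the reference level)
I_g2 <- as.numeric(group == "group 2")
I_g3 <- as.numeric(group == "group 3")

# The columns that R constructs itself for the formula ~ group
X <- model.matrix(~ group)
X

# The hand-made indicators are identical to columns 2 and 3 of X
cbind(I_g2, I_g3)
```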
We can write the model as a formula: \\[\\begin{align} \\mu_i &=\\beta_0 + \\beta_1 I(group_i=2) + \\beta_2 I(group_i=3) \\\\ y_i &\\sim normal(\\mu_i, \\sigma^2) \\tag{11.5} \\end{align}\\] where \\(y_i\\) is the \\(i\\)-th observation (weight measurement for individual \\(i\\) in our example), and \\(\\beta_{0,1,2}\\) are the model coefficients. The residual variance is \\(\\sigma^2\\). The model coefficients \\(\\beta_{0,1,2}\\) constitute the deterministic part of the model. From the model formula it follows that the group means, \\(m_g\\), are: \\[\\begin{align} m_1 &=\\beta_0 \\\\ m_2 &=\\beta_0 + \\beta_1 \\\\ m_3 &=\\beta_0 + \\beta_2 \\\\ \\tag{11.6} \\end{align}\\] There are other possibilities to describe three group means with three parameters, for example: \\[\\begin{align} m_1 &=\\beta_1 \\\\ m_2 &=\\beta_2 \\\\ m_3 &=\\beta_3 \\\\ \\tag{11.7} \\end{align}\\] In this case, the model formula would be: \\[\\begin{align} \\mu_i &= \\beta_1 I(group_i=1) + \\beta_2 I(group_i=2) + \\beta_3 I(group_i=3) \\\\ y_i &\\sim normal(\\mu_i, \\sigma^2) \\tag{11.8} \\end{align}\\]

The way the group means are calculated within a model is called the parameterization of the model. Different statistical software packages use different parameterizations. The parameterization used by R by default is the one shown in Equation (11.5). R automatically takes the first level as the reference (the first level is the first one alphabetically unless the user defines a different order for the levels). The mean of the first group (i.e., of the first factor level) is the intercept, \\(\\beta_0\\), of the model. The mean of another factor level is obtained by adding, to the intercept, the estimate of the corresponding parameter (which is the difference from the reference group mean). The parameterization of the model is defined by the model matrix.
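That both parameterizations describe the same group means can be checked with a short sketch. The data are simulated here for illustration only (hypothetical values), and the names `g`, `w`, `m_ref` and `m_means` are introduced for this sketch. The formula `w ~ g` corresponds to Equations (11.5) and (11.6), the formula `w ~ g - 1` to Equations (11.7) and (11.8).

```r
# Hypothetical data, simulated only for this sketch
set.seed(3)
g <- factor(rep(c("group 1", "group 2", "group 3"), each = 10))
w <- c(12, 15, 7)[as.integer(g)] + rnorm(30, sd = 2)

m_ref   <- lm(w ~ g)       # intercept = mean of group 1, others = differences
m_means <- lm(w ~ g - 1)   # one coefficient per group mean

# Group means reconstructed from the reference-level parameterization
b <- coef(m_ref)
means_from_ref <- c(b[1], b[1] + b[2], b[1] + b[3])

# All three rows should contain the same numbers
rbind(from_reference_coding = unname(means_from_ref),
      from_means_coding     = unname(coef(m_means)),
      arithmetic_means      = as.numeric(tapply(w, g, mean)))
```

The two fits differ only in how the three numbers are packaged into coefficients; the fitted values and the residual standard deviation are the same.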
In the case of a one-way ANOVA, there are as many columns in the model matrix as there are factor levels (i.e., groups); thus there are k factor levels and k model coefficients. Recall from Equation (11.3) that for each observation, the entry in the \\(j\\)-th column of the model matrix is multiplied by the \\(j\\)-th element of the model coefficients and the \\(k\\) products are summed to obtain the fitted values. For a data set with \\(n = 5\\) observations of which the first two are from group 1, the third from group 2, and the last two from group 3, the model matrix used for the parameterization described in Equation (11.6) and defined in R by the formula ~ group is \\[\\begin{align} \\boldsymbol{X}= \\begin{pmatrix} 1 & 0 & 0 \\\\ 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 0 & 1 \\\\ 1 & 0 & 1 \\\\ \\end{pmatrix} \\end{align}\\] If parameterization of Equation (11.7) (corresponding R formula: ~ group - 1) were used, \\[\\begin{align} \\boldsymbol{X}= \\begin{pmatrix} 1 & 0 & 0 \\\\ 1 & 0 & 0 \\\\ 0 & 1 & 0 \\\\ 0 & 0 & 1 \\\\ 0 & 0 & 1 \\\\ \\end{pmatrix} \\end{align}\\] To obtain the parameter estimates for model parameterized according to Equation (11.6) we fit the model in R: # fit the model mod <- lm(y~group, data=dat) # parameter estimates mod ## ## Call: ## lm(formula = y ~ group, data = dat) ## ## Coefficients: ## (Intercept) groupgroup 2 groupgroup 3 ## 12.367 2.215 -5.430 summary(mod)$sigma ## [1] 1.684949 The “Intercept” is \\(\\beta_0\\). The other coefficients are named with the factor name (“group”) and the factor level (either “group 2” or “group 3”). These are \\(\\beta_1\\) and \\(\\beta_2\\) , respectively. Before drawing conclusions from an R output we need to examine whether the model assumptions are met, that is, we need to do a residual analysis as described in Chapter 12. Different questions can be answered using the above ANOVA: What are the group means? What is the difference in the means between group 1 and group 2? 
What is the difference between the means of the heaviest and lightest group? In a Bayesian framework we can directly assess how strongly the data support the hypothesis that the mean of the group 2 is larger than the mean of group 1. We first simulate from the posterior distribution of the model parameters. library(arm) nsim <- 1000 bsim <- sim(mod, n.sim=nsim) Then we obtain the posterior distributions for the group means according to the parameterization of the model formula (Equation (11.6)). m.g1 <- coef(bsim)[,1] m.g2 <- coef(bsim)[,1] + coef(bsim)[,2] m.g3 <- coef(bsim)[,1] + coef(bsim)[,3] The histograms of the simulated values from the posterior distributions of the three means are given in Fig. 11.8. The three means are well separated and, based on our data, we are confident that the group means differ. From these simulated posterior distributions we obtain the means and use the 2.5% and 97.5% quantiles as limits of the 95% uncertainty intervals (Fig. 11.8, right). # save simulated values from posterior distribution in tibble post <- tibble(`group 1` = m.g1, `group 2` = m.g2, `group 3` = m.g3) %>% gather("groups", "Group means") # histograms per group leftplot <- ggplot(post, aes(x = `Group means`, fill = groups)) + geom_histogram(aes(y=..density..), binwidth = 0.5, col = "black") + labs(y = "Density") + theme(legend.position = "top", legend.title = element_blank()) # plot mean and 95%-CrI rightplot <- post %>% group_by(groups) %>% dplyr::summarise( mean = mean(`Group means`), CrI_lo = quantile(`Group means`, probs = 0.025), CrI_up = quantile(`Group means`, probs = 0.975)) %>% ggplot(aes(x = groups, y = mean)) + geom_point() + geom_errorbar(aes(ymin = CrI_lo, ymax = CrI_up), width = 0.1) + ylim(0, NA) + labs(y = "Weight (g)", x ="") multiplot(leftplot, rightplot, cols = 2) Figure 11.8: Distribution of the simulated values from the posterior distributions of the group means (left); group means with 95% uncertainty intervals obtained from the simulated 
distributions (right).

To obtain the posterior distribution of the difference between the means of group 1 and group 2, we simply calculate this difference for each draw from the joint posterior distribution of the group means.

```r
d.g1.2 <- m.g1 - m.g2
mean(d.g1.2)
## [1] -2.209551
quantile(d.g1.2, probs = c(0.025, 0.975))
##      2.5%     97.5% 
## -3.128721 -1.342693
```

The estimated difference is -2.21. In the small model world, we are 95% sure that the difference between the means of group 1 and 2 is between -3.13 and -1.34.

How strongly do the data support the hypothesis that the mean of group 2 is larger than the mean of group 1? To answer this question we calculate the proportion of the draws from the joint posterior distribution for which the mean of group 2 is larger than the mean of group 1.

```r
sum(m.g2 > m.g1) / nsim
## [1] 1
```

This means that in all of the 1000 simulations from the joint posterior distribution, the mean of group 2 was larger than the mean of group 1. Therefore, there is a very high probability (i.e., it is close to 1; because probabilities are never exactly 1, we write >0.999) that the mean of group 2 is larger than the mean of group 1.

11.3 Other variants of normal linear models: Two-way ANOVA, analysis of covariance and multiple regression

Up to now, we introduced normal linear models with one predictor only. We can add more predictors to the model and these can be numerical or categorical ones. Traditionally, models with 2 or 3 categorical predictors are called two-way or three-way ANOVA, respectively. Models with a mixture of categorical and numerical predictors are called ANCOVA. Models containing only numerical predictors are called multiple regressions. Nowadays, we only use the term “normal linear model” as an umbrella term for all these types of models. While it is easy to add additional predictors in the R formula of the model, it becomes more difficult to interpret the coefficients of such multi-dimensional models.
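The following sketch illustrates that the three traditional model names differ only in the types of predictors on the right-hand side of the formula. The data frame `d` and all of its columns are hypothetical and simulated only for this purpose; the object names are introduced for illustration.

```r
# Hypothetical data, simulated only for this sketch
set.seed(4)
d <- data.frame(f1 = factor(rep(c("a", "b"), each = 20)),
                f2 = factor(rep(c("u", "v"), times = 20)),
                x1 = rnorm(40),
                x2 = rnorm(40))
d$y <- 10 + 2 * (d$f1 == "b") - 1 * (d$f2 == "v") + 0.5 * d$x1 + rnorm(40)

two_way_anova <- lm(y ~ f1 + f2, data = d)   # two categorical predictors
ancova        <- lm(y ~ f1 + x1, data = d)   # one factor and one covariate
multiple_reg  <- lm(y ~ x1 + x2, data = d)   # numeric predictors only

# All three are normal linear models; here each has three coefficients
n_coef <- sapply(list(two_way_anova, ancova, multiple_reg),
                 function(m) length(coef(m)))
n_coef
```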
Two important topics arise with multi-dimensional models: interactions and partial effects. We dedicate the full next chapter to partial effects and introduce interactions in this chapter using two examples. The first is a model including two categorical predictors and the second is a model with one categorical and one numeric predictor.

11.3.1 Linear model with two categorical predictors (two-way ANOVA)

In the first example, we ask how large the differences in wing length are between age and sex classes of the Coal tit Periparus ater. Wing lengths were measured on 19 coal tit museum skins with known sex and age class.

```r
data(periparusater)
dat <- tibble(periparusater) # give the data a short handy name
dat$age <- recode_factor(dat$age, "4"="adult", "3"="juvenile") # replace EURING code
dat$sex <- recode_factor(dat$sex, "2"="female", "1"="male") # replace EURING code
```

To describe differences in wing length between the age classes or between the sexes, a normal linear model with two categorical predictors is fitted to the data. The two predictors are specified on the right side of the model formula separated by the “+” sign, which means that the model is an additive combination of the two effects (as opposed to an interaction, see following).

```r
mod <- lm(wing ~ sex + age, data=dat)
```

After having seen that the residual distribution does not appear to violate the model assumptions (as assessed with diagnostic residual plots, see Chapter 12), we can draw inferences. We first have a look at the model parameter estimates:

```r
mod
## 
## Call:
## lm(formula = wing ~ sex + age, data = dat)
## 
## Coefficients:
## (Intercept)      sexmale  agejuvenile  
##     61.3784       3.3423      -0.8829
summary(mod)$sigma
## [1] 2.134682
```

R has taken the first level of the factors age and sex (as defined in the data.frame dat) as the reference levels. The intercept is the expected wing length for individuals having the reference level in age and sex, thus adult female.
The other two parameters provide estimates of what is to be added to the intercept to get the expected wing length for the other levels. The parameter sexmale is the average difference between females and males. We can conclude that males have, on average, a wing that is 3.3 mm longer than that of females. Similarly, the parameter agejuvenile measures the difference between the age classes and we can conclude that, on average, juveniles have a wing that is 0.9 mm shorter than that of adults.

When we insert the parameter estimates into the model formula, we get the recipe to calculate expected values for each age and sex combination: \\(\\hat{y_i} = \\hat{\\beta_0} + \\hat{\\beta_1}I(sex=male) + \\hat{\\beta_2}I(age=juvenile)\\), which yields \\(\\hat{y_i} = 61.4 + 3.3 I(sex=male) - 0.9 I(age=juvenile)\\).

Alternatively, we could use matrix notation. We construct a new data set that contains one virtual individual for each age and sex class.

```r
newdat <- tibble(expand.grid(sex=factor(levels(dat$sex)),
                             age=factor(levels(dat$age))))
# expand.grid creates a data frame with all combinations of the values given
newdat
## # A tibble: 4 × 2
##   sex    age     
##   <fct>  <fct>   
## 1 female adult   
## 2 male   adult   
## 3 female juvenile
## 4 male   juvenile
newdat$fit <- predict(mod, newdata=newdat) # fast way of getting fitted values
# or
Xmat <- model.matrix(~sex+age, data=newdat) # creates a model matrix
newdat$fit <- Xmat %*% coef(mod)
```

For this new data set the model matrix contains four rows (one for each combination of age class and sex) and three columns. The first column contains only ones because the values of this column are multiplied by the intercept (\\(\\beta_0\\)) in the matrix multiplication. The second column contains an indicator variable for males (so only the rows corresponding to males contain a one) and the third column has ones for juveniles.
\\[\\begin{align} \\hat{y} = \\boldsymbol{X \\hat{\\beta}} = \\begin{pmatrix} 1 & 0 & 0 \\\\ 1 & 1 & 0 \\\\ 1 & 0 & 1 \\\\ 1 & 1 & 1 \\\\ \\end{pmatrix} \\times \\begin{pmatrix} 61.4 \\\\ 3.3 \\\\ -0.9 \\end{pmatrix} = \\begin{pmatrix} 61.4 \\\\ 64.7 \\\\ 60.5 \\\\ 63.8 \\end{pmatrix} = \\boldsymbol{\\mu} \\tag{11.3} \\end{align}\\] The result of the matrix multiplication is a vector containing the expected wing length for adult and juvenile females and adult and juvenile males. When creating the model matrix with model.matrix care has to be taken that the columns in the model matrix match the parameters in the vector of model coefficients. To achieve that, it is required that the model formula is identical to the model formula of the model (same order of terms!), and that the factors in newdat are identical in their levels and their order as in the data the model was fitted to. To describe the uncertainty of the fitted values, we use 2000 sets of parameter values of the joint posterior distribution to obtain 2000 values for each of the four fitted values. These are stored in the object “fitmat”. In the end, we extract for every fitted value, i.e., for every row in fitmat, the 2.5% and 97.5% quantiles as the lower and upper limits of the 95% uncertainty interval. 
```r
nsim <- 2000
bsim <- sim(mod, n.sim=nsim)
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim) fitmat[,i] <- Xmat %*% coef(bsim)[i,]
newdat$lwr <- apply(fitmat, 1, quantile, probs=0.025)
newdat$upr <- apply(fitmat, 1, quantile, probs=0.975)

dat$sexage <- factor(paste(dat$sex, dat$age))
newdat$sexage <- factor(paste(newdat$sex, newdat$age))
dat$pch <- 21
dat$pch[dat$sex=="male"] <- 22
dat$col <- "blue"
dat$col[dat$age=="adult"] <- "orange"
par(mar=c(4,4,0.5,0.5))
plot(wing~jitter(as.numeric(sexage), amount=0.05), data=dat, las=1,
     ylab="Wing length (mm)", xlab="Sex and age", xaxt="n", pch=dat$pch,
     bg=dat$col, cex.lab=1.2, cex=1, cex.axis=1, xlim=c(0.5, 4.5))
axis(1, at=c(1:4), labels=levels(dat$sexage), cex.axis=1)
segments(as.numeric(newdat$sexage), newdat$lwr,
         as.numeric(newdat$sexage), newdat$upr, lwd=2, lend="butt")
points(as.numeric(newdat$sexage), newdat$fit, pch=17)
```

Figure 11.9: Wing length measurements on 19 museum skins of coal tits per age class and sex. Fitted values are from the additive model (black triangles) and from the model including an interaction (black dots). Vertical bars = 95% uncertainty intervals.

We can see that the fitted values are not equal to the arithmetic means of the groups; this is especially clear for juvenile males. The fitted values are constrained because only three parameters were used to estimate four means. In other words, this model assumes that the age difference is equal in both sexes and, vice versa, that the difference between the sexes does not change with age. If the effect of sex changes with age, we would include an interaction between sex and age in the model. Including an interaction adds a fourth parameter enabling us to estimate the group means exactly. In R, an interaction is indicated with the : sign.
```r
mod2 <- lm(wing ~ sex + age + sex:age, data=dat)
# alternative formulations of the same model:
# mod2 <- lm(wing ~ sex * age, data=dat)
# mod2 <- lm(wing ~ (sex + age)^2, data=dat)
```

The formula for this model is \\(\\hat{y_i} = \\hat{\\beta_0} + \\hat{\\beta_1}I(sex=male) + \\hat{\\beta_2}I(age=juvenile) + \\hat{\\beta_3}I(age=juvenile)I(sex=male)\\). From this formula we get the following expected values for the sexes and age classes:

for adult females: \\(\\hat{y} = \\beta_0\\)

for adult males: \\(\\hat{y} = \\beta_0 + \\beta_1\\)

for juvenile females: \\(\\hat{y} = \\beta_0 + \\beta_2\\)

for juvenile males: \\(\\hat{y} = \\beta_0 + \\beta_1 + \\beta_2 + \\beta_3\\)

The interaction parameter measures how much the difference between the sexes differs between the age classes. To obtain the fitted values, the R code above can be recycled with two adaptations. First, the model name needs to be changed to “mod2”. Second, and importantly, the model matrix needs to be adapted to the new model formula.

```r
newdat$fit2 <- predict(mod2, newdata=newdat)
bsim <- sim(mod2, n.sim=nsim)
Xmat <- model.matrix(~ sex + age + sex:age, data=newdat)
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim) fitmat[,i] <- Xmat %*% coef(bsim)[i,]
newdat$lwr2 <- apply(fitmat, 1, quantile, probs=0.025)
newdat$upr2 <- apply(fitmat, 1, quantile, probs=0.975)
print(newdat[,c(1:5,7:9)], digits=3)
## # A tibble: 4 × 8
##   sex    age      fit[,1]   lwr   upr  fit2  lwr2  upr2
##   <fct>  <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female adult       61.4  59.3  63.3  61.1  58.8  63.5
## 2 male   adult       64.7  63.3  66.2  64.8  63.3  66.4
## 3 female juvenile    60.5  58.4  62.6  60.8  58.2  63.4
## 4 male   juvenile    63.8  61.7  66.0  63.5  60.7  66.2
```

The fitted values of the model with the interaction (fit2) are now exactly equal to the arithmetic means of each group.
tapply(dat$wing, list(dat$age, dat$sex), mean) # arithmetic mean per group ## female male ## adult 61.12500 64.83333 ## juvenile 60.83333 63.50000 We can also see that the uncertainty of the fitted values is larger for the model with an interaction than for the additive model. This is because, in the model including the interaction, an additional parameter has to be estimated based on the same amount of data. Therefore, the information available per parameter is smaller than in the additive model. In the additive model, some information is pooled between the groups by making the assumption that the difference between the sexes does not depend on age. The degree to which a difference in wing length is ‘important’ depends on the context of the study. Here, for example, we could consider effects of wing length on flight energetics and maneuverability or methodological aspects like measurement error. Mean between-observer difference in wing length measurement is around 0.3 mm (Jenni and Winkler 1989). Therefore, we may consider that the interaction is important because its point estimate is larger than 0.3 mm. mod2 ## ## Call: ## lm(formula = wing ~ sex + age + sex:age, data = dat) ## ## Coefficients: ## (Intercept) sexmale agejuvenile ## 61.1250 3.7083 -0.2917 ## sexmale:agejuvenile ## -1.0417 summary(mod2)$sigma ## [1] 2.18867 Further, we think a difference of 1 mm in wing length may be relevant compared to the among-individual variation of which the standard deviation is around 2 mm. Therefore, we report the parameter estimates of the model including the interaction together with their uncertainty intervals. Table 11.1: Parameter estimates of the model for wing length of Coal tits with 95% uncertainty interval. 

| Parameter | Estimate | lwr | upr |
|---|---|---|---|
| (Intercept) | 61.12 | 58.85 | 63.53 |
| sexmale | 3.71 | 0.93 | 6.59 |
| agejuvenile | -0.29 | -3.93 | 3.36 |
| sexmale:agejuvenile | -1.04 | -5.96 | 3.90 |

From these parameters we obtain the estimated difference in wing length between the sexes for adults of 3.7 mm, and the posterior probability of the hypothesis that males have an average wing length that is at least 1 mm larger compared to females is mean(bsim@coef[,2]>1) which is 0.97. Thus, there is some evidence that adult Coal tit males have substantially larger wings than adult females in these data. However, we do not draw further conclusions on other differences from these data because statistical uncertainty is large due to the low sample size. 11.3.2 A linear model with a categorical and a numeric predictor (ANCOVA) An analysis of covariance, ANCOVA, is a normal linear model that contains at least one factor and one continuous variable as predictor variables. The continuous variable is also called a covariate, hence the name analysis of covariance. An ANCOVA can be used, for example, when we are interested in how the biomass of grass depends on the distance from the surface of the soil to the ground water in two different species (Alopecurus pratensis, Dactylis glomerata). The two species were grown by Ellenberg (1953) in tanks that showed a gradient in distance from the soil surface to the ground water. The distance from the soil surface to the ground water is used as a covariate (‘water’). We further assume that the species react differently to the water conditions. Therefore, we include an interaction between species and water. The model formula is then \\(\\hat{y_i} = \\beta_0 + \\beta_1I(species=Dg) + \\beta_2water_i + \\beta_3I(species=Dg)water_i\\) \\(y_i \\sim normal(\\hat{y_i}, \\sigma^2)\\) To fit the model, it is important to first check whether the factor is indeed defined as a factor and the continuous variable contains numbers (i.e., numeric or integer values) in the data frame. 
data(ellenberg) index <- is.element(ellenberg$Species, c("Ap", "Dg")) & complete.cases(ellenberg$Yi.g) dat <- ellenberg[index,c("Water", "Species", "Yi.g")] # select two species dat <- droplevels(dat) str(dat) ## 'data.frame': 84 obs. of 3 variables: ## $ Water : int 5 20 35 50 65 80 95 110 125 140 ... ## $ Species: Factor w/ 2 levels "Ap","Dg": 1 1 1 1 1 1 1 1 1 1 ... ## $ Yi.g : num 34.8 28 44.5 24.8 37.5 ... Species is a factor with two levels and Water is an integer variable, so we are fine and we can fit the model mod <- lm(log(Yi.g) ~ Species + Water + Species:Water, data=dat) # plot(mod) # 4 standard residual plots We log-transform the biomass to make the residuals closer to normally distributed. So, the normal distribution assumption is met well. However, a slight banana shaped relationship exists between the residuals and the fitted values indicating a slight non-linear relationship between biomass and water. Further, residuals showed substantial autocorrelation because the grass biomass was measured in different tanks. Measurements from the same tank were more similar than measurements from different tanks after correcting for the distance to water. Thus, the analysis we have done here suffers from pseudoreplication. We will re-analyze the example data in a more appropriate way in Chapter 13. Let’s have a look at the model matrix (first and last six rows only). head(model.matrix(mod)) # print the first 6 rows of the matrix ## (Intercept) SpeciesDg Water SpeciesDg:Water ## 24 1 0 5 0 ## 25 1 0 20 0 ## 26 1 0 35 0 ## 27 1 0 50 0 ## 28 1 0 65 0 ## 29 1 0 80 0 tail(model.matrix(mod)) # print the last 6 rows of the matrix ## (Intercept) SpeciesDg Water SpeciesDg:Water ## 193 1 1 65 65 ## 194 1 1 80 80 ## 195 1 1 95 95 ## 196 1 1 110 110 ## 197 1 1 125 125 ## 198 1 1 140 140 The first column of the model matrix contains only 1s. These are multiplied by the intercept in the matrix multiplication that yields the fitted values. 
The second column contains the indicator variable for species Dactylis glomerata (Dg). Species Alopecurus pratensis (Ap) is the reference level. The third column contains the values for the covariate. The last column contains the product of the indicator for species Dg and water. This column specifies the interaction between species and water. The parameters are the intercept, the difference between the species, a slope for water and the interaction parameter. mod ## ## Call: ## lm(formula = log(Yi.g) ~ Species + Water + Species:Water, data = dat) ## ## Coefficients: ## (Intercept) SpeciesDg Water SpeciesDg:Water ## 4.33041 -0.23700 -0.01791 0.01894 summary(mod)$sigma ## [1] 0.9001547 These four parameters define two regression lines, one for each species (Figure 11.10 Left). For Ap, it is \\(\\hat{y_i} = \\beta_0 + \\beta_2water_i\\), and for Dg it is \\(\\hat{y_i} = (\\beta_0 + \\beta_1) + (\\beta_2 + \\beta_3)water_i\\). Thus, \\(\\beta_1\\) is the difference in the intercept between the species and \\(\\beta_3\\) is the difference in the slope. Figure 11.10: Aboveground biomass (g, log-transformed) in relation to distance to ground water and species (two grass species). Fitted values from a model including an interaction species x water (left) and a model without interaction (right) are added. The dotted line indicates water=0. As a consequence of including an interaction in the model, the interpretation of the main effects becomes difficult. From the above model output, we read that the intercept of the species Dg is lower than the intercept of the species Ap. However, from a graphical inspection of the data, we would expect that the average biomass of species Dg is higher than the one of species Ap. The estimated main effect of species is counter-intuitive because it is measured where water is zero (i.e., it is the difference in the intercepts and not between the mean biomasses of the species). 
Therefore, the main effect of species in the above model does not have a biologically meaningful interpretation. We have two possibilities to get a meaningful species effect. First, we could delete the interaction from the model (Figure 11.10 Right). Then the difference in the intercept reflects an average difference between the species. However, the fit for such an additive model is much worse compared to the model with interaction, and an average difference between the species may not make much sense because this difference depends so much on water. Therefore, we prefer to use a model including the interaction and may opt for the second possibility. Second, we could move the location where water equals 0 to the center of the data by transforming, specifically centering, the variable water: \\(water.c = water - mean(water)\\). When the predictor variable (water) is centered, then the main effect of species (the difference in the intercepts) corresponds to the difference in fitted values between the species measured in the center of the data. For drawing biological conclusions from these data, we refer to Chapter 13, where we use a more appropriate model. 11.4 Partial coefficients and some comments on collinearity Many biologists think that it is forbidden to include correlated predictor variables in a model. They use variance inflating factors (VIF) to omit some of the variables. However, omitting important variables from the model just because a correlation coefficient exceeds a threshold value can have undesirable effects. Here, we explain why and we present the usefulness and limits of partial coefficients (also called partial correlation or partial effects). We start with an example illustrating the usefulness of partial coefficients and then give some guidelines on how to deal with collinearity. As an example, we look at hatching dates of Snowfinches and how these dates relate to the date when snow melt started (first date in the season when a minimum of 5% ground is snow free). 
A thorough analysis of the data is presented by Schano et al. (2021). An important question is how well Snowfinches can adjust their hatching dates to the snow conditions. For Snowfinches, it is important to raise their nestlings during snow melt. Their nestlings grow faster when they are reared during the snow melt compared to after snow has completely melted, because their parents find nutrient rich insect larvae in the edges of melting snow patches. load("RData/snowfinch_hatching_date.rda") # Pearson's correlation coefficient cor(datsf$elevation, datsf$meltstart, use = "pairwise.complete") ## [1] 0.3274635 mod <- lm(meltstart~elevation, data=datsf) 100*coef(mod)[2] # change in meltstart with 100m change in elevation ## elevation ## 2.97768 Hatching dates of Snowfinch broods were inferred from citizen science data from the Alps, where snow melt starts later at higher elevations compared to lower elevations. Thus, the start of snow melt is correlated with elevation (Pearson’s correlation coefficient 0.33). On average, snow starts melting 3 days later with every 100 m increase in elevation. mod1 <- lm(hatchday.mean~meltstart, data=datsf) mod1 ## ## Call: ## lm(formula = hatchday.mean ~ meltstart, data = datsf) ## ## Coefficients: ## (Intercept) meltstart ## 167.99457 0.06325 From a normal linear regression of hatching date on the snow melt date, we obtain an estimate of 0.06 days delay in hatching date with one day later snow melt. This effect size describes the relationship in the data that were collected along an elevational gradient. Along the elevational gradient there are many factors that change such as average temperature, air pressure or sun radiation. All these factors may have an influence on the birds’ decision to start breeding. 
Consequently, from the raw correlation between hatching dates and start of snow melt we cannot conclude how Snowfinches react to changes in the start of snow melt because the correlation seen in the data may be caused by other factors changing with elevation (such a correlation is called “pseudocorrelation”). However, we are interested in the correlation between hatching date and date of snow melt independent of other factors changing with elevation. In other words, we would like to measure how much, on average, hatching date is delayed when snow melt starts one day later while all other factors are kept constant. This is called the partial effect of snow melt date. Therefore, we include elevation as a covariate in the model. library(arm) mod <- lm(hatchday.mean~elevation + meltstart, data=datsf) mod ## ## Call: ## lm(formula = hatchday.mean ~ elevation + meltstart, data = datsf) ## ## Coefficients: ## (Intercept) elevation meltstart ## 154.383936 0.007079 0.037757 From this model, we obtain an estimate of 0.04 days delay in hatching date with one day later snow melt at a given elevation. That gives a difference in hatching date between early and late years (around one month difference in snow melt date) at a given elevation of 1.13 days (Figure 11.11). We further get an estimate of 0.71 days later hatching date for each 100 m shift in elevation. Thus, snow melt starting 18.75 days later corresponds to the same delay in average hatching date as an increase in elevation of 100 m. When we estimate the coefficient within a constant elevation (coloured regression lines in Figure 11.11), it is lower than the raw correlation and closer to a causal relationship, because it is corrected for elevation. However, in observational studies, we can never be sure whether the partial coefficients can be interpreted as a causal relationship unless we include all factors that influence hatching date. 
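The derived quantities quoted above follow directly from the printed coefficients. As a minimal sketch, the arithmetic can be retraced in R; the coefficient values are copied from the model output, and the 30 days used for "around one month" is our assumption.

```r
# Coefficients copied from the printed output of
# lm(hatchday.mean ~ elevation + meltstart, data = datsf)
b_elevation <- 0.007079   # days later hatching per metre of elevation
b_meltstart <- 0.037757   # days later hatching per day of later snow melt

# Difference in hatching date between an early and a late year,
# assuming that "around one month" means 30 days (our assumption)
delay_30d <- 30 * b_meltstart

# Difference in hatching date for a 100 m increase in elevation
delay_100m <- 100 * b_elevation

# Number of days of later snow melt that gives the same delay as 100 m
equiv_days <- delay_100m / b_meltstart

round(c(delay_30d = delay_30d, delay_100m = delay_100m,
        equiv_days = equiv_days), 2)
```

The three values correspond to the 1.13 days, 0.71 days and 18.75 days given in the text.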
Nevertheless, partial effects give much more insight into a system compared to univariate analyses because we can separate effects of simultaneously acting variables (that we have measured). The result indicates that Snowfinches may not react very sensitively to varying timing of snow melt, whereas at higher elevations they clearly breed later compared to lower elevations. Figure 11.11: Illustration of the partial coefficient of snow melt date in a model of hatching date. Panel A shows the entire raw data together with the regression lines drawn for three different elevations. The regression lines span the range of snow melt dates occurring at the respective elevation (shown in panel C). Panel B is the same as panel A, but zoomed in to better see the regression lines and with an additional regression line (in black) from the model that does not take elevation into account. We have seen that it can be very useful to include more than one predictor variable in a model even if they are correlated with each other. In fact, there is nothing wrong with that. However, correlated predictors (collinearity) make things more complicated. For example, partial regression lines should not be drawn across the whole range of values of a variable, to avoid extrapolating out of data. At 2800 m asl snow melt never starts in the beginning of March. Therefore, the blue regression line would not make sense for snow melt dates in March. Further, sometimes correlations among predictors indicate that these predictors measure the same underlying aspect and we are actually interested in the effect of this underlying aspect on our response. For example, we could include also the date of the end of snow melt. Both variables, the start and the end of the snow melt, measure the timing of snow melt. Including both as predictors in the model would result in partial coefficients that measure how much hatching date changes when the snow melt starts one day later, while the end date is constant. 
That interpretation is a mixture of the effect of timing and duration rather than of snow melt timing alone. Similarly, the coefficient of the end of snow melt measures a mixture of duration and timing. Thus, if we include two variables that are correlated because they measure the same aspect (just a little bit differently), we get coefficients that are hard to interpret and may not measure what we actually are interested in. In such cases, we get model coefficients that are easier to interpret if we include just one variable of each aspect that we are interested in, e.g. we could include one timing variable (e.g. start of snow melt) and the duration of snow melt that may or may not be correlated with the start of snow melt. To summarize, the decision of what to do with correlated predictors primarily relies on the question we are interested in, i.e., what exactly should the partial coefficients be an estimate for. A further drawback of collinearity is that model fitting can become difficult. When strong correlations are present, model fitting algorithms may fail. If they do not fail, the statistical uncertainty of the estimates often becomes large. This is because the partial coefficient of one variable needs to be estimated for constant values of the other predictors in the model which means that a reduced range of values is available as illustrated in Figure 11.11 C. However, if uncertainty intervals (confidence, credible or compatibility intervals) are reported alongside the estimates, then using correlated predictors in the same model is absolutely fine, if the fitting algorithm was successful. The correlations per se can be interesting. Further readings on how to visualize and analyse data with complex correlation structures: principal component analysis (Manly 1994), path analyses, e.g. 
Shipley (2009), and structural equation models (Hoyle 2012). 11.5 Ordered Factors and Contrasts In this chapter, we have seen that the model matrix is an \\(n \\times k\\) matrix (with \\(n\\) = sample size and \\(k\\) = number of model coefficients) that is multiplied by the vector of the \\(k\\) model coefficients to obtain the fitted values of a normal linear model. The first column of the model matrix normally contains only ones. This column is multiplied by the intercept. The other columns contain the observed values of the predictor variables if these are numeric variables, or indicator variables (= dummy variables) for factor levels if the predictors are categorical variables (= factors). For categorical variables the model matrix can be constructed in a number of ways. How it is constructed determines how the model coefficients can be interpreted. For example, coefficients could represent differences between means of specific factor levels to the mean of the reference level. That is what we have introduced above. However, they could also represent a linear, quadratic or cubic effect of an ordered factor. Here, we show how this works. An ordered factor is a categorical variable with levels that have a natural order, for example, ‘low’, ‘medium’ and ‘high’. How do we tell R that a factor is ordered? The swallow data contain a factor ‘nesting_aid’ that contains the type of aid provided in a barn for the nesting swallows. The natural order of the levels is none < support (e.g., a wooden stick in the wall that helps support a nest built by the swallow) < artificial_nest < both (support and artificial nest). However, when we read in the data R orders these levels alphabetically rather than according to the logical order. data(swallows) levels(swallows$nesting_aid) ## [1] "artif_nest" "both" "none" "support" And with the function contrasts we see how R will construct the model matrix. 
contrasts(swallows$nesting_aid) ## both none support ## artif_nest 0 0 0 ## both 1 0 0 ## none 0 1 0 ## support 0 0 1 R will construct three dummy variables and call them ‘both’, ‘none’, and ‘support’. The variable ‘both’ will have an entry of one when the observation is ‘both’ and zero otherwise. Similarly, the other two dummy variables are indicator variables of the other two levels and ‘artif_nest’ is the reference level. The model coefficients can then be interpreted as the difference between ‘artif_nest’ and each of the other levels. The instruction how to transform a factor into columns of a model matrix is called the contrasts. Now, let’s bring the levels into their natural order and define the factor as an ordered factor. swallows$nesting_aid <- factor(swallows$nesting_aid, levels=c("none", "support", "artif_nest", "both"), ordered=TRUE) levels(swallows$nesting_aid) ## [1] "none" "support" "artif_nest" "both" The levels are now in the natural order. R will, from now on, use this order for analyses, tables, and plots, and because we defined the factor to be an ordered factor, R will use polynomial contrasts: contrasts(swallows$nesting_aid) ## .L .Q .C ## [1,] -0.6708204 0.5 -0.2236068 ## [2,] -0.2236068 -0.5 0.6708204 ## [3,] 0.2236068 -0.5 -0.6708204 ## [4,] 0.6708204 0.5 0.2236068 When using polynomial contrasts, R will construct three (= number of levels minus one) variables that are called ‘.L’, ‘.Q’, and ‘.C’ for linear, quadratic and cubic effects. The contrast matrix defines which numeric value will be inserted in each of the three corresponding columns in the model matrix for each observation, for example, an observation with ‘support’ in the factor ‘nesting_aid’ will get the values -0.224, -0.5 and 0.671 in the columns L, Q and C of the model matrix. 
These contrasts define yet another way to get 4 different group means: \\(m_1 = \\beta_0 - 0.671\\beta_1 + 0.5\\beta_2 - 0.224\\beta_3\\) \\(m_2 = \\beta_0 - 0.224\\beta_1 - 0.5\\beta_2 + 0.671\\beta_3\\) \\(m_3 = \\beta_0 + 0.224\\beta_1 - 0.5\\beta_2 - 0.671\\beta_3\\) \\(m_4 = \\beta_0 + 0.671\\beta_1 + 0.5\\beta_2 + 0.224\\beta_3\\) The group means are the same, independent of whether a factor is defined as ordered or not. The ordering also has no effect on the variance that is explained by the factor ‘nesting_aid’ or the overall model fit. Only the model coefficients and their interpretation depend on whether a factor is defined as ordered or not. When we define a factor as ordered, the coefficients can be interpreted as linear, quadratic, cubic, or higher order polynomial effects. The number of the polynomials will always be the number of factor levels minus one (unless the intercept is omitted from the model in which case it is the number of factor levels). Linear, quadratic, and further polynomial effects normally are more interesting for ordered factors than single differences from a reference level because linear and polynomial trends tell us something about consistent changes in the outcome along the ordered factor levels. Therefore, an ordered factor with k levels is treated like a covariate consisting of the centered level numbers (-1.5, -0.5, 0.5, 1.5 in our case with four levels) and k-1 orthogonal polynomials of this covariate are included in the model. Thus, if we have an ordered factor A with three levels, y~A is equivalent to y~x+I(x^2), with x=-1 for the lowest, x=0 for the middle and x=1 for the highest level. Note that it is also possible to define our own contrasts if we are interested in specific differences or trends. However, it is not trivial to find meaningful and orthogonal (= uncorrelated) contrasts. 
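The statement that the group means do not depend on the contrasts can be checked with a small simulation. The following is a sketch only: the data are simulated and the object names (aid, aid_ord, mod_treat, mod_poly) are hypothetical; it does not use the swallow data.

```r
# Polynomial contrast matrix for a factor with four levels
cp <- contr.poly(4)
round(cp, 3)

# The three contrast columns are orthogonal and have unit length
round(crossprod(cp), 10)

# Simulated outcome for four ordered groups (hypothetical data)
set.seed(1)
lev <- c("none", "support", "artif_nest", "both")
aid <- factor(rep(lev, each = 10), levels = lev)
y <- rnorm(40, mean = rep(c(2, 3, 3.5, 5), each = 10))

# The same factor, defined as ordered
aid_ord <- factor(aid, levels = lev, ordered = TRUE)

mod_treat <- lm(y ~ aid)      # treatment contrasts (unordered factor)
mod_poly  <- lm(y ~ aid_ord)  # polynomial contrasts (ordered factor)

# Different coefficients ...
coef(mod_treat)
coef(mod_poly)

# ... but identical fitted values
all.equal(unname(fitted(mod_treat)), unname(fitted(mod_poly)))

# Group means reconstructed from the polynomial coefficients,
# compared with the arithmetic group means
m_poly <- drop(cbind(1, cp) %*% coef(mod_poly))
m_raw <- tapply(y, aid, mean)
cbind(m_poly, m_raw)
```

Multiplying the contrast matrix (with a leading column of ones for the intercept) by the coefficients is exactly the calculation written out in the four equations for the group means above.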
11.6 Quadratic and Higher Polynomial Terms The straight regression line for the biomass of grass species Alopecurus pratensis (Ap) dependent on the distance to the ground water does not fit well (Figure 11.10). The residuals at low and high values of water tend to be positive and intermediate water levels are associated with negative residuals. This points to a possible violation of the model assumptions. The problem is that the relationship between distance to water and biomass of species Ap is not linear. In real life, we often find non-linear relationships, but if the shape of the relationship is quadratic (plus, potentially, a few more polynomials) we can still use ‘linear modeling’ (the term ‘linear’ refers to the linear function used to describe the relationship between the outcome and the predictor variables: \\(f(x) = \\beta_0 + \\beta_1x + \\beta_2x^2\\) is a linear function compared to, e.g., \\(f(x) = \\beta^x\\), which is not a linear function). We simply add the quadratic term of the predictor variable, that is, water in our example, as a further predictor in the linear predictor: \\(\\hat{y_i} = \\beta_0+\\beta_1water_i+\\beta_2water_i^2\\). A quadratic term can be fitted in R using the function I() which tells R that we want the squared values of distance to water. If we do not use I() the ^2 indicates a two-way interaction. The model specification is then lm(log(Yi.g) ~ Water + I(Water^2), data=...). The cubic term would be added by +I(Water^3). As with interactions, a polynomial term changes the interpretation of lower level polynomials. Therefore, we normally include all polynomials up to a specific degree. Furthermore, polynomials are normally correlated (if no special transformation is used, see below) which could cause problems when fitting the model such as non-convergence. To avoid collinearity among polynomials, so called orthogonal polynomials can be used. These are polynomials that are uncorrelated. 
To that end, we can use the function poly which creates as many orthogonal polynomials of the variable as we want: poly(dat$Water, 2) creates two columns, the first one can be used to model the linear effect of water, the second one to model the quadratic term of water: t.poly <- poly(dat$Water, 2) dat$Water.l <- t.poly[,1] # linear term for water dat$Water.q <- t.poly[,2] # quadratic term for water mod <- lm(log(Yi.g) ~ Water.l + Water.q, data=dat) When orthogonal polynomials are used, the estimated linear and quadratic effects can be interpreted as purely linear and purely quadratic influences of the predictor on the outcome. The function poly applies a specific transformation to the original variables. To reproduce the transformation (e.g. for getting the corresponding orthogonal polynomials for new data used to draw an effect plot), the function predict can be used with the poly-object created based on the original data. newdat <- data.frame(Water = seq(0,130)) # transformation analogous to the one used to fit the model: newdat$Water.l <- predict(t.poly, newdat$Water)[,1] newdat$Water.q <- predict(t.poly, newdat$Water)[,2] These transformed variables can then be used to calculate fitted values that correspond to the water values specified in the new data. 12 Assessing Model Assumptions 12.1 Model Assumptions Every statistical model makes assumptions. We try to build models that reflect the data-generating process as realistically as possible. However, a model never is the truth. Yet, all inferences drawn from a model, such as estimates of effect size or derived quantities with credible intervals, are based on the assumption that the model is true. 
However, if a model captures the data-generating process poorly, for example, because it misses important structures (predictors, interactions, polynomials), inferences drawn from the model are probably biased and results become unreliable. In a (hypothetical) model that captures all important structures of the data-generating process, the stochastic part, the difference between the observation and the fitted value (the residuals), should only show random variation. Analyzing residuals is a very important part of the data analysis process. Residual analysis can be very exciting, because the residuals show what remains unexplained by the present model. Residuals can sometimes show surprising patterns and, thereby, provide deeper insight into the system. However, at this step of the analysis it is important not to forget the original research questions that motivated the study. Because these questions have been asked without knowledge of the data, they protect against data dredging. Of course, residual analysis may raise interesting new questions. Nonetheless, these new questions have emerged from patterns in the data, which might just be random, not systematic, patterns. The search for a model with good fit should be guided by thinking about the process that generated the data, not by trial and error (i.e., do not try all possible variable combinations until the residuals look good; that is data dredging). All changes done to the model should be scientifically justified. Usually, model complexity increases, rather than decreases, during the analysis. 12.2 Independent and Identically Distributed Usually, we model an outcome variable as independent and identically distributed (iid) given the model parameters. This means that all observations with the same predictor values behave like independent random numbers from the identical distribution. As a consequence, residuals should look iid. 
Independent means that:

- The residuals do not correlate with other variables (those that are included in the model as well as any other variable not included in the model).
- The residuals are not grouped (i.e., the means of any set of residuals should all be equal).
- The residuals are not autocorrelated (i.e., no temporal or spatial autocorrelation exists; Sections 12.4 and 12.5).

Identically distributed means that:

- All residuals come from the same distribution. In the case of a linear model with normal error distribution (Chapter 11) the residuals are assumed to come from the same normal distribution. Particularly:
  - The residual variance is homogeneous (homoscedasticity), that is, it does not depend on any predictor variable, and it does not change with the fitted value.
  - The mean of the residuals is zero over the whole range of predictor values. When numeric predictors (covariates) are present, this implies that the relationship between x and y can be adequately described by a straight line.

Residual analysis is mainly done graphically. R makes it easy to plot residuals to look at the different aspects just listed. As an example, we use a linear regression for the biomass of the grass species Dactylis glomerata in relation to water conditions in the soil. The first panel in Fig. 12.1 shows the residuals against the fitted values together with a smoother (red line). This plot is called the Tukey-Anscombe plot. The mean of the residuals should be around zero along the whole range of fitted values. Note that smoothers are very sensitive to random structures in the data, especially for low sample sizes and toward the edges of the data range. Often, curves at the edges of the data do not worry us because the edges of smoothers are based on small sample sizes. The second panel is a normal quantile-quantile (QQ) plot of the residuals. When the residuals are normally distributed, the points lie along the diagonal line. This plot is explained in more detail below. 
The third panel shows the square root of the absolute values of the standardized residuals, a measure of residual variance, versus the fitted values, together with a smoother. When the residual variance is homogeneous along the range of fitted values, the smoother is horizontal. The fourth panel shows the residuals against the leverage. An observation with a measurement of a predictor variable far from the others has a large leverage. When all predictors are factors, observations with a rare combination of factor levels have higher leverages than observations with a common combination of factor levels. Such observations have the potential to have a large influence on the regression line. A high leverage does not necessarily mean that this observation has a big influence on the model. If that observation fits well to the pattern of all other data points, the observation does not have an unduly large influence on the model estimates, despite its large leverage. However, if it does not fit into the picture, this observation has a strong influence on the parameter estimates. The influence of one observation on the parameter estimates is measured by the Cook’s distance. Observations with large Cook’s distances lie beyond the red dashed lines in the fourth of the residual plots (the 0.5 and 1 iso lines for Cook’s distances are given as dashed lines). Observations with a Cook’s distance larger than 1 are usually considered to be overly influential and should be checked. The diagnostic plots (Fig. 12.1) of the residuals of the model fitted to the data of the species Dactylis glomerata look quite acceptable. 1. The average residual value is around zero along the range of fitted values, 2. the points are aligned diagonally in the QQ-plot, 3. the variance does not noticeably change along the fitted values, and 4. no observation has a large Cook’s distance. 
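The distinction between leverage and influence described above can be made concrete with a small simulation. This is a sketch with simulated data and hypothetical object names (mod_fit, mod_off); it only uses the base R functions hatvalues and cooks.distance.

```r
# Simulated data (hypothetical): 30 regular observations plus one
# observation with a predictor value far from all others
set.seed(2)
x <- c(runif(30, 0, 10), 25)
y <- 1 + 0.5 * x + rnorm(31, sd = 1)

# Case 1: the far-out observation follows the pattern of the other data
mod_fit <- lm(y ~ x)

# Case 2: the same observation is moved 10 units away from the pattern
y_off <- y
y_off[31] <- y[31] - 10
mod_off <- lm(y_off ~ x)

# Leverage depends only on the predictor values,
# so it is the same in both models
hatvalues(mod_fit)[31]
hatvalues(mod_off)[31]

# Cook's distance also depends on how well the observation fits
cooks.distance(mod_fit)[31]
cooks.distance(mod_off)[31]
```

Observation 31 has the highest leverage in both models, but its Cook’s distance becomes large only when it does not fit the pattern of the other observations.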
data(ellenberg) mod <- lm(Yi.g~Water, data=ellenberg[ellenberg$Species=="Dg",]) par(mfrow=c(2,2)) plot(mod) Figure 12.1: Standard diagnostic residual plots of a linear regression for the biomass data of D. glomerata. However, when the same model is fitted to data of Alopecurus pratensis, the model assumptions may not be met well (Fig. 12.2). The average of the residuals decreases with increasing fitted values (panel 1). A few observations, in particular observation 133, do not fit to a normal distribution (panel 2). The residual variance increases with increasing fitted values (panel 3). Observation 133 has too high a Cook’s distance (panel 4). mod <- lm(Yi.g~Water, data=ellenberg[ellenberg$Species=="Ap",]) par(mfrow=c(2,2)) plot(mod) Figure 12.2: Standard diagnostic residual plots of a linear regression for the biomass data of A. pratensis. An increasing variance with increasing fitted values is a widespread case. The logarithm or square-root transformation of the response variable often is a quick and simple solution. Also, in this case, the log transformation improved the diagnostic plots (Fig. 12.3). mod <- lm(log(Yi.g)~Water, data=ellenberg[ellenberg$Species=="Ap",]) par(mfrow=c(2,2)) plot(mod) Figure 12.3: Standard diagnostic residual plots of a linear regression for the logarithm of the biomass data of A. pratensis. The four plots produced by plot(mod) show the most important aspects of the model fit. However, often these four plots are not sufficient. In addition, we recommend plotting the residuals against all variables in the data set (including those not used in the current model). It is further recommended to think about the data structure. Can we assume that all observations are independent of each other? May there be spatial or temporal correlation? 12.3 The QQ-Plot Each residual represents a quantile of the sample of \\(n\\) residuals. These quantiles are defined by the sample size \\(n\\). A useful choice is the \\(((1,...,n)-0.5)/n\\)-th quantiles. 
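The quantile positions just described can be computed by hand. The following sketch uses simulated values standing in for model residuals (hypothetical data) and pairs the sorted values with the corresponding quantiles of a standard normal distribution.

```r
# Simulated values standing in for model residuals (hypothetical data)
set.seed(3)
res <- rnorm(50, mean = 0, sd = 2)
n <- length(res)

# Positions ((1, ..., n) - 0.5) / n and the matching normal quantiles
p <- ((1:n) - 0.5) / n
theo <- qnorm(p)

# Sorted residuals plotted against the theoretical quantiles
plot(theo, sort(res), xlab = "Theoretical quantiles",
     ylab = "Sorted residuals")

# For n > 10, qqnorm() uses the same positions
qq <- qqnorm(res, plot.it = FALSE)
all.equal(sort(qq$x), theo)
```

For samples of 10 or fewer values, qqnorm uses slightly different positions (see the help page of ppoints), so the comparison in the last line is meant for larger samples only.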
A QQ-plot shows the residuals on the y-axis and the values of the \\(((1,...,n)-0.5)/n\\)-th quantiles of a theoretical normal distribution on the x-axis. A QQ-plot could also be used to compare the distribution of any variable with any theoretical distribution, but we want to use the normal distribution here because that is the assumed distribution of the residuals in the model. If the residuals are normally distributed, the points are expected to lie along the diagonal line in the QQ-plot. It is often rather difficult to decide whether a deviation from the line is tolerable or not. The function compareqqnorm may help. It draws, eight times, a random sample of \\(n\\) values from a normal distribution with a mean of zero and a standard deviation equal to the residual standard deviation of the model. It then creates a QQ-plot for all eight random samples and for the residuals in a random order. If the QQ-plot of the residuals can easily be identified among the nine QQ-plots, there is reason to think the distribution of the residuals deviates from normal. Otherwise, there is no indication to suspect violation of the normality assumption. The position of the residual plot of the model in the nine panels is printed to the R console. 12.4 Temporal Autocorrelation 12.5 Spatial Autocorrelation 12.6 Heteroscedasticity "],["lmer.html", "13 Linear Mixed Effect Models 13.1 Background 13.2 Fitting a normal linear mixed model in R 13.3 Restricted maximum likelihood estimation (REML)", " 13 Linear Mixed Effect Models 13.1 Background 13.1.1 Why Mixed Effects Models? Mixed effects models (or hierarchical models; see A. Gelman and Hill (2007) for a discussion of the terminology) are used to analyze nonindependent, grouped, or hierarchical data. 
For example, when we measure growth rates of nestlings in different nests by taking mass measurements of each nestling several times during the nestling phase, the measurements are grouped within nestlings (because there are repeated measurements of each) and the nestlings are grouped within nests. Measurements from the same individual are likely to be more similar than measurements from different individuals, and individuals from the same nest are likely to be more similar than nestlings from different nests. Measurements of the same group (here, the “groups” are individuals or nests) are not independent. If the grouping structure of the data is ignored in the model, the residuals do not fulfill the independence assumption. Further, predictor variables can be measured on different hierarchical levels. For example, in each nest some nestlings were treated with a hormone implant whereas others received a placebo. Thus, the treatment is measured at the level of the individual, while clutch size is measured at the level of the nest. Clutch size was measured only once per nest but entered in the data file more than once (namely for each individual from the same nest). Repeated measures result in pseudoreplication if we do not account for the hierarchical data structure in the model. Mixed models allow modeling of the hierarchical structure of the data and, therefore, account for pseudoreplication. Mixed models are further used to analyze variance components. For example, when the nestlings were cross-fostered so that they were not raised by their genetic parents, we would like to estimate the proportions of the variance (in a measurement, e.g., wing length) that can be assigned to genetic versus environmental differences. The three problems (grouped data, repeated measures, and interest in variances) are solved by adding further variance parameters to the model. 
As a result, the linear predictor of such models contains parameters that are fixed and parameters that vary among levels of a grouping variable. The latter are called “random effects”. Thus, a mixed model contains fixed and random effects. Often the grouping variable, which is a categorical variable, i.e., a factor, is called the random effect, even though it is not the factor that is random. The levels of the factor are seen as a random sample from a bigger population of levels, and a distribution, usually the normal distribution, is fitted to the level-specific parameter values. Thus, a random effect in a model can be seen as a model (for a parameter) that is nested within the model for the data. Predictors that are defined as fixed effects are either numeric or, if they are categorical, they have a finite (“fixed”) number of levels. For example, the factor “treatment” in the Barn owl study below has exactly two levels “placebo” and “corticosterone” and nothing more. In contrast, random effects have a theoretically infinite number of levels of which we have measured a random sample. For example, we have measured 10 nests, but there are many more nests in the world that we have not measured. Normally, fixed effects have a low number of levels whereas random effects have a large number of levels (at least 3!). For fixed effects we are interested in the specific differences between levels (e.g., between males and females), whereas for random effects we are only interested in the between-level (between-group, e.g., between-nest) variance rather than in differences between specific levels (e.g., nest A versus nest B). Typical fixed effects are: treatment, sex, age classes, or season. Typical random effects are: nest, individual, field, school, or study plot. Whether a factor should be treated as fixed or random sometimes depends on the aim of the study. 
When we would like to compare the average size of a corn cob between specific regions, then we include region as a fixed factor. However, when we would like to know how the size of a corn cob is related to the irrigation system and we have several measurements within each of a sample of regions, then we treat region as a random factor. 13.1.2 Random Factors and Partial Pooling In a model with fixed factors, the differences of the group means to the mean of the reference group are separately estimated as model parameters. This produces \\(k-1\\) (independent) model parameters, where \\(k\\) is the number of groups (or number of factor levels). In contrast, for a random factor, the between-group variance is estimated and the \\(k\\) group-specific means are assumed to be normally distributed around the population mean. These \\(k\\) means are thus not independent. We usually call the differences between the specific mean of group \\(g\\) and the mean of all groups \\(b_g\\). They are assumed to be realizations of the same (in most cases normal) distribution with a mean of zero. They are like residuals. The variance of the \\(b_g\\) values is the among-group variance. Treating a factor as a random factor is equivalent to partial pooling of the data. There are three different ways to obtain means for grouped data. First, the grouping structure of the data can be ignored. This is called complete pooling (left panel in Figure 13.1). Second, group means may be estimated separately for each group. In this case, the data from all other groups are ignored when estimating a group mean. No pooling occurs in this case (right panel in Figure 13.1). Third, the data of the different groups can be partially pooled (i.e., treated as a random effect). Thereby, the group means are weighted averages of the population mean and the unpooled group means. The weights are proportional to sample size and the inverse of the variance (see A. Gelman and Hill (2007), p. 252). 
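The weighted average behind partial pooling can be sketched in a few lines of R. This is only an approximation for a model with a single grouping factor, following the form given in the cited passage of Gelman and Hill (2007). The function name partial_pool and all numbers are hypothetical and only serve to illustrate how the weights behave.

```r
# Approximate partially pooled group mean as a precision-weighted average of
# the unpooled group mean (ybar_g) and the population mean (mu).
# sigma   = within-group (residual) standard deviation
# sigma_b = among-group standard deviation
partial_pool <- function(ybar_g, n_g, mu, sigma, sigma_b) {
  w_group <- n_g / sigma^2     # weight of the group's own data
  w_pop   <- 1 / sigma_b^2     # weight of the population mean
  (w_group * ybar_g + w_pop * mu) / (w_group + w_pop)
}

# hypothetical group with an unpooled mean of 12 in a population with mean 20
est_n1  <- partial_pool(ybar_g = 12, n_g = 1,  mu = 20, sigma = 3, sigma_b = 2)
est_n30 <- partial_pool(ybar_g = 12, n_g = 30, mu = 20, sigma = 3, sigma_b = 2)
c(n1 = est_n1, n30 = est_n30)
```

With only one observation, the estimate is pulled strongly towards the population mean, whereas with 30 observations it stays close to the unpooled group mean of 12.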
Further, the estimated mean of all groups equals the mean of the group-specific means, thus, every group is weighted similarly for calculating the overall mean. In contrast, in the complete pooling case, the groups get weights proportional to their sample sizes. Complete pooling: \\(\\hat{y_i} = \\beta_0\\), \\(y_i \\sim normal(\\hat{y_i}, \\sigma^2)\\). Partial pooling: \\(\\hat{y_i} = \\beta_0 + b_{g[i]}\\), \\(b_g \\sim normal(0, \\sigma_b^2)\\), \\(y_i \\sim normal(\\hat{y_i}, \\sigma^2)\\). No pooling: \\(\\hat{y_i} = \\beta_{0[g[i]]}\\), \\(y_i \\sim normal(\\hat{y_i}, \\sigma_g^2)\\). Figure 13.1: Three possibilities to obtain group means for grouped data: complete pooling, partial pooling, and no pooling. Open symbols = data, orange dots with vertical bars = group means with 95% uncertainty intervals, horizontal black line with shaded interval = population mean with 95% uncertainty interval. What is the advantage of analyses using partial pooling (i.e., mixed, hierarchical, or multilevel modelling) compared to the complete or no pooling analyses? Complete pooling ignores the grouping structure of the data. As a result, the uncertainty interval of the population mean may be too narrow. We are too confident in the result because we assume that all observations are independent when they are not. This is a typical case of pseudoreplication. On the other hand, the no pooling method (which is equivalent to treating the factor as fixed) has the danger of overestimation of the among-group variance because the group means are estimated independently of each other. The danger of overestimating the among-group variance is particularly large when sample sizes per group are low and within-group variance large. In contrast, the partial pooling method assumes that the group means are a random sample from a common distribution. Therefore, information is exchanged between groups. 
Estimated means for groups with low sample sizes, large variances, and means far away from the population mean are shrunk towards the population mean. In other words, group means that are estimated with a lot of imprecision (because of low sample size and high variance) are pulled most strongly towards the population mean. How strongly they are shrunk depends on the precision of the estimates for the group specific means and the population mean. An example will help make this clear. Imagine that we measured 60 nestling birds from 10 nests (6 nestlings per nest) and found that the average nestling mass at day 10 was around 20 g with an among-nest standard deviation of 2 g. Then, we measure only one nestling from one additional nest (from the same population) whose mass was 12 g. What do we know about the average mass of this new nest? The mean of the measurements for this nest is 12 g, but with n = 1 uncertainty is high. Because we know that the average mass of the other nests was 20 g, and because the new nest belonged to the same population, a value higher than 12 g is a better estimate for an average nestling mass of the new nest than the 12 g measurement of one single nestling, which could, by chance, have been an exceptionally light individual. This is the shrinkage that partial pooling allows in a mixed model. Because of this shrinkage, the estimates for group means from a mixed model are sometimes called shrinkage estimators. A consequence of the shrinkage is that the residuals are positively correlated with the fitted values. To summarize, mixed models are used to appropriately estimate among-group variance, and to account for non-independence among data points. 13.2 Fitting a normal linear mixed model in R To introduce the linear mixed model, we use repeated hormone measurements of nestling Barn Owls Tyto alba. 
The cortbowl data set contains stress hormone data (corticosterone, variable ‘totCort’) of nestling Barn owls which were either treated with a corticosterone-implant, or with a placebo-implant as the control group. The aim of the study was to quantify the corticosterone increase due to the corticosterone implants (Almasi et al. 2009). In each brood, one or two nestlings were implanted with a corticosterone-implant and one or two nestlings with a placebo-implant (variable ‘Implant’). Blood samples were taken just before implantation, and at days 2 and 20 after implantation. data(cortbowl) dat <- cortbowl dat$days <- factor(dat$days, levels=c("before", "2", "20")) str(dat) # the data were sampled in 2004, 2005, and 2005 by the Swiss Ornithological Institute ## 'data.frame': 287 obs. of 6 variables: ## $ Brood : Factor w/ 54 levels "231","232","233",..: 7 7 7 7 8 8 9 9 10 10 ... ## $ Ring : Factor w/ 151 levels "898054","898055",..: 44 45 45 46 31 32 9 9 18 19 ... ## $ Implant: Factor w/ 2 levels "C","P": 2 2 2 1 2 1 1 1 2 1 ... ## $ Age : int 49 29 47 25 57 28 35 53 35 31 ... ## $ days : Factor w/ 3 levels "before","2","20": 3 2 3 2 3 1 2 3 2 2 ... ## $ totCort: num 5.76 8.42 8.05 25.74 8.04 ... In total, there are 287 measurements of 151 individuals (variable ‘Ring’) of 54 broods. Because the measurements from the same individual are non-independent, we use a mixed model to analyze these data. Two additional arguments for a mixed model are: a) the mixed model allows prediction of corticosterone levels for an ‘average’ individual, whereas the fixed effect model allows prediction of corticosterone levels only for the 151 individuals that were sampled; and b) fewer parameters are needed. If we include individual as a fixed factor, we would use 150 parameters, while the random factor needs a much lower number of parameters. 
We first create a graphic to show the development for each individual, separately for owls receiving corticosterone versus owls receiving a placebo (Figure 13.2). Figure 13.2: Total corticosterone before and at day 2 and 20 after implantation of a corticosterone or a placebo implant. Lines connect measurements of the same individual. We fit a normal linear model with ‘Ring’ as a random factor, and ‘Implant’, ‘days’ and the interaction of ‘Implant’ \\(\\times\\) ‘days’ as fixed effects. Note that both ‘Implant’ and ‘days’ are defined as factors, thus R creates indicator variables for all levels except the reference level. Later, we will also include ‘Brood’ as a grouping level; for now, we ignore this level and start with a simpler (less perfect) model for illustrative purposes. \\(\\hat{y_i} = \\beta_0 + b_{Ring[i]} + \\beta_1I(Implant=P) + \\beta_2I(days=2) + \\beta_3I(days=20) + \\beta_4I(Implant=P)I(days=2) + \\beta_5I(Implant=P)I(days=20)\\) \\(b_{Ring} \\sim normal(0, \\sigma_b)\\) \\(y_i \\sim normal(\\hat{y_i}, \\sigma)\\) Several different functions to fit a mixed model have been written in R: lme, gls, and gee were among the first ones. Then lmer followed, and now, stan_lmer and brm allow fitting a large variety of hierarchical models. We start here with lmer from the package lme4 (which is automatically loaded to the R-console when loading arm), because it is a kind of basis function also for stan_lmer and brm. Further, sim can treat lmer-objects but none of the earlier ones. The function lmer is used similarly to the function lm. The only difference is that the random factors are added in the model formula within parentheses. The ’1’ stands for the intercept and the ‘|’ means ‘grouped by’. ‘(1|Ring)’, therefore, adds the random deviations for each individual to the average intercept. These deviations are the \\(b_{Ring}\\) in the model formula above. Corticosterone data are log transformed to achieve normally distributed residuals. 
After having fitted the model, in real life, we always first inspect the residuals, before we look at the model output. However, that is a dilemma for this textbook. Here, we would like to explain how the model is constructed just after having shown the model code. Therefore, we do the residual analyses later, but in real life, we would do it now. mod <- lmer(log(totCort) ~ Implant + days + Implant:days + (1|Ring), data=dat, REML=TRUE) mod ## Linear mixed model fit by REML ['lmerMod'] ## Formula: log(totCort) ~ Implant + days + Implant:days + (1 | Ring) ## Data: dat ## REML criterion at convergence: 611.9053 ## Random effects: ## Groups Name Std.Dev. ## Ring (Intercept) 0.3384 ## Residual 0.6134 ## Number of obs: 287, groups: Ring, 151 ## Fixed Effects: ## (Intercept) ImplantP days2 days20 ## 1.91446 -0.08523 1.65307 0.26278 ## ImplantP:days2 ImplantP:days20 ## -1.71999 -0.09514 The output of the lmer-object tells us that the model was fitted using the REML-method, which is the restricted maximum likelihood method. The ‘REML criterion’ is the statistic describing the model fit for a model fitted by REML. The model output further contains the parameter estimates. These are grouped into a random effects and fixed effects section. The random effects section gives the estimates for the among-individual standard deviation of the intercept (\\(\\sigma_b =\\) 0.34) and the residual standard deviation (\\(\\sigma =\\) 0.61). The fixed effects section gives the estimates for the intercept (\\(\\beta_0 =\\) 1.91), which is the mean logarithm of corticosterone for an ‘average’ individual of the corticosterone-implant group before implantation (the reference levels of ‘Implant’ and ‘days’). 
The other model coefficients are defined as follows: the difference in the logarithm of corticosterone between placebo- and corticosterone-treated individuals before implantation (\\(\\beta_1 =\\) -0.09), the difference between day 2 and before implantation for the corticosterone-treated individuals (\\(\\beta_2 =\\) 1.65), the difference between day 20 and before implantation for the corticosterone-treated individuals (\\(\\beta_3 =\\) 0.26), and the interaction parameters which tell us how the differences between day 2 and before implantation (\\(\\beta_4 =\\) -1.72), and day 20 and before implantation (\\(\\beta_5 =\\) -0.1), differ for the placebo-treated individuals compared to the corticosterone treated individuals. Neither the model output shown above nor the summary function (not shown) give any information about the proportion of variance explained by the model such as an \\(R^2\\). The reason is that it is not straightforward to obtain a measure of model fit in a mixed model, and different definitions of \\(R^2\\) exist (Nakagawa and Schielzeth 2013). The function fixef extracts the estimates for the fixed effects, the function ranef extracts the estimates for the random deviations from the population intercept for each individual. The ranef-object is a list with one element for each random factor in the model. We can extract the random effects for each ring using the $Ring notation. round(fixef(mod), 3) ## (Intercept) ImplantP days2 days20 ImplantP:days2 ## 1.914 -0.085 1.653 0.263 -1.720 ## ImplantP:days20 ## -0.095 head(ranef(mod)$Ring) # print the first 6 Ring effects ## (Intercept) ## 898054 0.24884979 ## 898055 0.11845863 ## 898057 -0.10788277 ## 898058 0.06998959 ## 898059 -0.08086498 ## 898061 -0.08396839 13.3 Restricted maximum likelihood estimation (REML) For a mixed model the restricted maximum likelihood method is used by default instead of the maximum likelihood (ML) method. 
The reason is that the ML-method underestimates the variance parameters because this method assumes that the fixed parameters are known without uncertainty when estimating the variance parameters. However, the estimates of the fixed effects have uncertainty. The REML method uses a mathematical trick to make the estimates for the variance parameters independent of the estimates for the fixed effects. We recommend reading the very understandable description of the REML method in Zuur et al. (2009). For our purposes, the relevant difference between the two methods is that the ML-estimates are unbiased for the fixed effects but biased for the random effects, whereas the REML-estimates are biased for the fixed effects and unbiased for the random effects. However, when sample size is large compared to the number of model parameters, the differences between the ML- and REML-estimates become negligible. As a guideline, use REML if the interest is in the random effects (variance parameters), and ML if the interest is in the fixed effects. The estimation method can be chosen by setting the argument ‘REML’ to ‘FALSE’ (default is ‘TRUE’). mod <- lmer(log(totCort) ~ Implant + days + Implant:days + (1|Ring), data=dat, REML=FALSE) # using ML When we fit the model by stan_lmer from the rstanarm-package or brm from the brms-package, i.e., using the Bayes theorem instead of ML or REML, we do not have to care about this choice (of course!). The result from a Bayesian analysis is unbiased for all parameters (at least from a mathematical point of view - also parameters from a Bayesian model can be biased if the model violates assumptions or is confounded). "],["glm.html", "14 Generalized linear models 14.1 Introduction 14.2 Bernoulli model 14.3 Binomial model 14.4 Poisson model", " 14 Generalized linear models 14.1 Introduction Up to now, we have dealt with models that assume normally distributed residuals. 
Sometimes the nature of the outcome variable makes it impossible to fulfill this assumption as might occur with binary variables (e.g., alive/dead, a specific behavior occurred/did not occur), proportions (which are confined to be between 0 and 1), or counts that cannot have negative values. For such cases, models for distributions other than the normal distribution are needed; such models are called generalized linear models (GLM). They consist of three elements: (1) the linear predictor \\(\\bf X \\boldsymbol \\beta\\), (2) the link function \\(g()\\), and (3) the data distribution. The linear predictor is exactly the same as in normal linear models. It is a linear function that defines the relationship between the dependent and the explanatory variables. The link function transforms the expected values of the outcome variable into the range of the linear predictor, which ranges from \\(-\\infty\\) to \\(+\\infty\\). Or, perhaps more intuitively, the inverse link function transforms the values of the linear predictor into the range of the outcome variable. Table 14.1 gives a list of possible link functions that work with different data distributions. Then, a specific data distribution, for example, binomial or Poisson, is used to describe how the observations scatter around the expected values. A general model formula for a generalized linear model is: \\[\\bf y \\sim ExpDist(\\bf\\hat y, \\boldsymbol\\theta)\\] \\[g(\\bf\\hat y) = \\bf X\\boldsymbol \\beta \\] where ExpDist is a distribution of the exponential family and \\(g()\\) is the link function. The vector \\(\\bf y\\) contains the observed values of the outcome variable, \\(\\boldsymbol \\beta\\) contains the model parameters in the linear predictor (also called the model coefficients), and \\(\\bf X\\) is the model matrix containing the values of the predictor variables. 
\\(\\boldsymbol \\theta\\) is an optional vector of additional parameters needed to define the data distribution (e.g., the number of trials in the binomial distribution or the variance in the normal distribution). The normal linear model is a specific case of a generalized linear model, namely when ExpDist equals the normal distribution and \\(g()\\) is the identity function (\\(g(x) = x\\)). Statistical distributions of the exponential family are normal, Bernoulli, binomial, Poisson, inverse-normal, gamma, negative binomial, among others. The normal, Bernoulli, binomial, Poisson or negative binomial distributions are by far the most often used distributions. Most, but not all, data we gather in the life sciences can be analyzed assuming one of these few distributions. Table 14.1: Frequently used distributions for the glm function with their default (D) link functions and other link functions that are possible. link Gaussian Binomial Gamma Inv_Gauss Poisson Negative_binomial logit D probit x cloglog x identity D x x x inverse D log x D D 1/mu^2 D sqrt x cauchit x x exponent (mu^a) x Paul Buerkner has implemented many different distributions and link functions in the package brms, see here. 14.2 Bernoulli model 14.2.1 Background If the outcome variable can only take one of two values (e.g., a species is present or absent, or the individual survived or died; coded as 1 or 0) we use a Bernoulli model, also called logistic regression. The Bernoulli distribution only allows for the values zero and one and it has only one parameter \\(p\\), which defines the probability that the value is 1. When fitting a Bernoulli model to data, we have to estimate \\(p\\). Often we are interested in correlations between \\(p\\) and one or several explanatory variables. Therefore, we model \\(p\\) as linearly dependent on the explanatory variables. 
Because the values of \\(p\\) are squeezed between 0 and 1 (because it is a probability), \\(p\\) is transformed by the link-function before the linear relationship is modeled. \\[g(p_i) = \\bf X\\boldsymbol \\beta \\] Functions that can transform a probability into the scale of the linear predictor (\\(-\\infty\\) to \\(+\\infty\\)) are, for example, logit, probit, cloglog, or cauchit. These link functions differ slightly in the way they link the outcome variable to the explanatory variables (Figure 14.1). The logit link function is the most often used link function in binomial models. However, sometimes another link function might fit the data better. Kevin S. Van Horn gives useful tips on when to use which link function. Figure 14.1: Left panel: Shape of different link functions commonly used for modelling probabilities. Right panel: The relationship between the predictor x (x-axis) and p on the scale of the link function (y-axis) is assumed to be linear. 14.2.2 Fitting a Bernoulli model in R Functions to fit a Bernoulli model are glm, stan_glm, brm, and there are many more that we do not know as well as the three we focus on in this book. We start by using the function glm. It uses the “iteratively reweighted least-squares method” which is an adaptation of the least-squares (LS) method for fitting generalized linear models. The argument family allows choosing a data distribution. For fitting a Bernoulli model, we need to specify binomial. That is because the Bernoulli distribution is equal to the binomial distribution with only one trial (size parameter = 1). Note, if we forget the family argument, we fit a normal linear model, and there is no warning by R! With the specification of the distribution, we also choose the link-function. The default link function for the binomial or Bernoulli model is the logit-function. To change the link-function, use e.g. family=binomial(link=cloglog). 
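The role of the family argument can be tried out on simulated data. The following is a minimal sketch in base R; the variables x and y and the parameter values are made up for illustration and are not part of the data sets used in this book. It fits the same binary outcome with the default logit link, with the cloglog link, and with the family argument forgotten.

```r
set.seed(42)
x <- runif(200, -2, 2)                                     # hypothetical predictor
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * x))   # simulated 0/1 outcome

mod_logit   <- glm(y ~ x, family = binomial)                   # default link: logit
mod_cloglog <- glm(y ~ x, family = binomial(link = "cloglog")) # alternative link
mod_forgot  <- glm(y ~ x)                                      # family omitted

family(mod_logit)$link      # "logit"
family(mod_cloglog)$link    # "cloglog"
family(mod_forgot)$family   # "gaussian": a normal linear model, without warning

# the link function maps probabilities to the scale of the linear predictor,
# the inverse link function maps them back
linkfun <- binomial()$linkfun
linkinv <- binomial()$linkinv
linkinv(linkfun(0.3))       # returns 0.3 again
```

Checking family(mod) after fitting is a cheap way to make sure that the intended distribution and link function were actually used.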
As an example, we use presence-absence data of little owls Athene noctua in nest boxes during the breeding season. The original data are published in Gottschalk, Ekschmitt, and Wolters (2011); here we use only parts of these data. The variable PA contains the presence of a little owl: 1 indicates a nestbox used by little owls, whereas 0 stands for an empty nestbox. The variable elevation has the elevation in meters above sea level. We are interested in how the presence of the little owl is associated with elevation within the study area, that is, how the probability of presence changes with elevation. Our primary interest, therefore, is the slope \\(\\beta_1\\) of the regression line. \\[ y_i \\sim Bernoulli(p_i) \\] \\[ logit(p_i) = \\beta_0 + \\beta_1 elevation_i\\] where \\(logit(p_i) = log(p_i/(1-p_i))\\). data(anoctua) # Athene noctua data in the blmeco package mod <- glm(PA~elevation, data=anoctua, family=binomial) mod ## ## Call: glm(formula = PA ~ elevation, family = binomial, data = anoctua) ## ## Coefficients: ## (Intercept) elevation ## 0.579449 -0.006106 ## ## Degrees of Freedom: 360 Total (i.e. Null); 359 Residual ## Null Deviance: 465.8 ## Residual Deviance: 445.6 AIC: 449.6 14.2.3 Assessing model assumptions in a Bernoulli model As for the normal linear model, the Bernoulli model (and any other statistical model) assumes that the residuals are independent and identically distributed (iid). Independent means that every observation \\(i\\) is independent of the other observations. Particularly, there are no groups in the data and no temporal or spatial correlation. For generalised linear models, different types of residuals exist. The standard residual plots obtained by plot(mod) produce the same four plots as for an lm object, but it uses the deviance residuals for the first three plots (residuals versus fitted values, QQ plot, and residual variance versus fitted values) and the Pearson’s residuals for the last (residuals versus leverage). 
The deviance residuals are the contribution of each observation to the deviance of the model. This is the default type when the residuals are extracted from the model using the function resid. The Pearson’s residual for observation \\(i\\) is the difference between the observed and the fitted number of successes divided by the standard deviation given the number of trials and the fitted success probability: \\(\\epsilon_i = \\frac{y_i-n_i \\hat{p_i}}{\\sqrt{n_i \\hat{p_i}(1-\\hat{p_i})}}\\). Other types of residuals are “working,” “response,” or “partial” (see Davison and Snell (1991)). For the residual plots, R chooses the type of residuals so that each plot should look roughly like the analogous plot for the normal linear model. However, in most cases the plots look awkward due to the discreteness of the data, especially when success probabilities are close to 0 or 1. We recommend thinking about why they do not look perfect; with experience, serious violations of model assumptions can be recognized. But often posterior predictive model checking or graphical comparison of fitted values to the data are better suited to assess model fit in GLMs. For Bernoulli models, the residual plots normally look quite awful because the residual distribution very often has two peaks, a negative and a positive one resulting from the binary nature of the outcome variable. However, it is still good to have a look at these plots using plot(mod). At least the average should roughly be around zero and not show a trend. An often more informative plot to judge model fit for a binary logistic regression is to compare the fitted values with the data. To better see the observations, we slightly jitter them in the vertical direction. If the model would fit the data well, the data would be, on average, equal to the fitted values. Thus, we add the \\(y = x\\)-line to the plot using the abline function with intercept 0 and slope 1. 
Of course, binary data cannot lie on this line because they can only take on the two discrete values 0 or 1. However, the mean of the 0 and 1 values should lie on the line if the model fits well. Therefore, we calculate the mean for suitably selected classes of fitted values. In our example, we choose a class width of 0.1. Then, we calculate means per class and add these to the plot, together with a classical standard error that tells us how reliable the means are. This can be an indication whether our arbitrarily chosen class width is reasonable. plot(fitted(mod), jitter(anoctua$PA, amount=0.05), xlab="Fitted values", ylab="Probability of presence", las=1, cex.lab=1.2, cex=0.8) abline(0,1, lty=3) t.breaks <- cut(fitted(mod), seq(0,1, by=0.1)) means <- tapply(anoctua$PA, t.breaks, mean) semean <- function(x) sd(x)/sqrt(length(x)) means.se <- tapply(anoctua$PA, t.breaks, semean) points(seq(0.05, 0.95, by=0.1), means, pch=16, col="orange") segments(seq(0.05, 0.95, by=0.1), means-2*means.se, seq(0.05, 0.95,by=0.1), means+2*means.se,lwd=2, col="orange") mod <- glm(PA ~ elevation + I(elevation^2) + I(elevation^3) + I(elevation^4), data=anoctua, family=binomial) t.breaks <- cut(fitted(mod), seq(0,1, by=0.1)) means <- tapply(anoctua$PA, t.breaks, mean) semean <- function(x) sd(x)/sqrt(length(x)) means.se <- tapply(anoctua$PA, t.breaks, semean) points(seq(0.05, 0.95, by=0.1)+0.01, means, pch=16, col="lightblue", cex=0.7) segments(seq(0.05, 0.95, by=0.1)+0.01, means-2*means.se, seq(0.05, 0.95,by=0.1)+0.01, means+2*means.se,lwd=2, col="lightblue") Figure 14.2: Goodness of fit plot for the Bernoulli model fitted to little owl presence-absence data. Open circles = observed presence (1) or absence (0) jittered in the vertical direction; orange dots = mean (and 95% compatibility intervals given as vertical bars) of the observations within classes of width 0.1 along the x-axis. The dotted line indicates perfect coincidence between observation and fitted values. 
Orange larger points are from the model assuming a linear effect of elevation, whereas the smaller light blue points are from a model assuming a non-linear effect. The means of the observed data (orange dots) do not match the fitted values well (Figure 14.2). For low presence probabilities, the model overestimates presence probabilities whereas, for medium presence probabilities, the model underestimates presence probability. This indicates that the relationship between little owl presence and elevation may not be linear. After including polynomials up to the fourth degree, we obtained a reasonable fit (light blue dots in Figure 14.2). Further aspects of model fit that may be checked in Bernoulli models: Are all observations independent? May spatial or temporal correlation be an issue? Are all parameters well informed by the data? Some parameters may not be identifiable due to complete separation, i.e., when there is no overlap between the 0s and 1s regarding one of the predictor variables. In such cases glm may fail to fit the model. Bayesian methods (stan_glm or brm) do not fail, but the result may be highly influenced by the prior distributions, so a prior sensitivity analysis is recommended. Note that we do not have to worry about overdispersion when the outcome variable is binary, even though the variance of the Bernoulli distribution is defined by p and no separate variance parameter exists: because the data can only take the values 0 and 1, they cannot show a higher variance than the one assumed by the Bernoulli distribution. 14.2.4 Visualising the results When we are ready to report and visualise the results (i.e., after assessing the model fit, when we think the model describes the data generating process reasonably well), we can simulate the posterior distribution of \\(\\beta_1\\) and obtain the 95% compatibility interval. 
library(arm) nsim <- 5000 bsim <- sim(mod, n.sim=nsim) # sim from package arm apply(bsim@coef, 2, quantile, prob=c(0.5, 0.025, 0.975)) ## (Intercept) elevation I(elevation^2) I(elevation^3) I(elevation^4) ## 50% -24.39396 0.3967360 -0.0022011353 0.000004884348 -0.000000003824508 ## 2.5% -35.61303 0.1984131 -0.0035191339 0.000001354667 -0.000000007254023 ## 97.5% -13.45759 0.6020802 -0.0009058836 0.000008491873 -0.000000000448624 To interpret this polynomial function, an effect plot is helpful. To that end, and as we have done before, we calculate fitted values over the range of the covariate, together with compatibility intervals. newdat <- data.frame(elevation = seq(80,600,by=1)) Xmat <- model.matrix(~elevation+I(elevation^2)+I(elevation^3)+ I(elevation^4), data=newdat) # the model matrix fitmat <- matrix(nrow=nrow(newdat), ncol=nsim) for(i in 1:nsim) fitmat[,i] <- plogis(Xmat %*% bsim@coef[i,]) newdat$lwr <- apply(fitmat,1,quantile,probs=0.025) newdat$fit <- plogis(Xmat %*% coef(mod)) newdat$upr <- apply(fitmat,1,quantile,probs=0.975) We can now plot the data together with the estimate and its compatibility interval. Again, we use the function jitter to slightly scatter the points along the y-axis to make overlaying points visible. plot(anoctua$elevation, jitter(anoctua$PA, amount=0.05), las=1, cex.lab=1.4, cex.axis=1.2, xlab="Elevation", ylab="Probability of presence") lines(newdat$elevation, newdat$fit, lwd=2) lines(newdat$elevation, newdat$lwr, lty=3) lines(newdat$elevation, newdat$upr, lty=3) Figure 14.3: Little owl presence data versus elevation with regression line and 95% compatibility interval (dotted lines). Open circles = observed presence (1) or absence (0) jittered in the vertical direction. 14.2.5 Some remarks Binary data do not contain a lot of information. Therefore, large sample sizes are needed to obtain robust results. 
Often presence/absence data are obtained by visiting plots several times during a distinct period, for example, a breeding period, and then it is reported whether a species has been seen or not. If it has been seen and if there is no misidentification in the data, it is present. However, if it has not been seen, we are usually not sure whether we have not detected it or whether it is absent. In the case of repeated visits to the same plot, it is possible to estimate the detection probability using occupancy models (MacKenzie et al. 2002) or point count models (Royle 2004). Finally, logistic regression can be used in the sense of a discriminant function analysis that aims to find predictors that discriminate members of two groups (Anderson 1974). However, if one wants to use the fitted value from such an analysis to assign group membership of a new subject, one has to take the prevalence of the two groups in the data into account. 14.3 Binomial model 14.3.1 Background The binomial model is used when the response variable is a count with an upper limit, e.g., the number of seeds that germinated among a total number of seeds in a pot, or the number of chicks hatching from the total number of eggs. Thus, we can use the binomial model whenever the response is the sum of a predefined number of Bernoulli trials. Whether a seed germinates or not is a Bernoulli trial. If we have more than one seed, the number of germinated seeds follows a binomial distribution. As an example, we use data from a study on the effects of anthropogenic fire regimes traditionally applied to savanna habitat in Gabon, Central Africa (Walters 2012). Young trees survive fires better or worse depending, among other factors, on the fuel load, which, in turn, depends heavily on the time since the last fire happened. Thus, plots were burned after different lengths of time since the previous fire (4, 9, or 12 months ago). 
Trees that resprouted after the previous (first) fire were counted before and after the experimental (second) fire to estimate their survival of the experimental fire depending on the time since the previous fire. The outcome variable \\(y_i\\) is the number of surviving trees among the total number of trees per plot. The explanatory variable is the time since the previous fire, a factor with three levels: “4m”, “9m”, and “12m”. Assuming that the data follow a binomial distribution, the following model can be fitted to the data: \\[ y_i \\sim binomial(p_i, n_i) \\] \\[ logit(p_i) = \\beta_0 + \\beta_1 I(treatment_i=9m) + \\beta_2 I(treatment_i=12m)\\] where \\(p_i\\) is the survival probability and \\(n_i\\) the total number of tree sprouts on plot \\(i\\). Note that \\(n_i\\) should not be confused with the sample size of the data set, i.e., the number of rows in the data table. 14.3.2 Fitting a binomial model in R We normally use glm, stan_glm or brm for fitting a binomial model, depending on the complexity of the predictors and correlation structure. Here, we again start with the glm function. A peculiarity of binomial models is that the outcome is not just one number; it is the number of trees still alive \\(y_i\\) out of \\(n_i\\) trees that were alive before the experimental fire. Therefore, the outcome variable has to be given as a matrix with two columns. The first column contains the number of successes (number of survivors \\(y_i\\)) and the second column contains the number of failures (number of trees killed by the fire, \\(n_i - y_i\\)). We build this matrix using cbind (“column bind”). 
data(resprouts) # example data from package blmeco resprouts$succ <- resprouts$post resprouts$fail <- resprouts$pre - resprouts$post mod <- glm(cbind(succ, fail) ~ treatment, data=resprouts, family=binomial) mod ## ## Call: glm(formula = cbind(succ, fail) ~ treatment, family = binomial, ## data = resprouts) ## ## Coefficients: ## (Intercept) treatment9m treatment12m ## -1.241 1.159 -2.300 ## ## Degrees of Freedom: 40 Total (i.e. Null); 38 Residual ## Null Deviance: 845.8 ## Residual Deviance: 395 AIC: 514.4 Experienced readers will be alarmed because the residual deviance is much larger than the residual degrees of freedom, which indicates overdispersion. We will soon discuss overdispersion, but, for now, we continue with the analysis for the sake of illustration. The estimated model parameters are \\(\\hat{\\beta}_0 =\\) -1.24, \\(\\hat{\\beta}_1 =\\) 1.16, and \\(\\hat{\\beta}_2 =\\) -2.30. These estimates tell us that tree survival was higher for the 9-month fire lag treatment compared to the 4-month treatment (which is the reference level), but lowest in the 12-month treatment. To obtain the mean survival probabilities per treatment, some math is needed because we have to back-transform the linear predictor to the scale of the outcome variable. The mean survival probability for the 4-month treatment is \\(logit^{-1}(-1.24) = 0.22\\), for the 9-month treatment it is \\(logit^{-1}(-1.24 + 1.16) = 0.48\\), and for the 12-month treatment it is \\(logit^{-1}(-1.24 - 2.30) = 0.03\\). The function plogis gives the inverse of the logit function and can be used to estimate the survival probabilities, for example: plogis(coef(mod)[1]+ coef(mod)[2]) # for the 9month treatment ## (Intercept) ## 0.4795799 The direct interpretation of the model coefficients \\(\\beta_1\\) and \\(\\beta_2\\) is that they are the log of the ratio of the odds of two treatment levels (i.e., the log odds ratio). 
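The back-transformation and the log odds ratio interpretation can be reproduced with a few lines of arithmetic. This sketch uses the rounded coefficients from the printed model output rather than the fitted model object, so the object names b0, b1, b2, p and odds are illustrative assumptions.

```r
# Coefficients as printed in the model output (rounded to three decimals)
b0 <- -1.241   # intercept: 4-month treatment (reference level)
b1 <-  1.159   # difference 9-month vs. 4-month on the logit scale
b2 <- -2.300   # difference 12-month vs. 4-month on the logit scale

# Survival probabilities per treatment via the inverse logit (plogis)
p <- c("4m" = plogis(b0), "9m" = plogis(b0 + b1), "12m" = plogis(b0 + b2))
round(p, 2)

# Odds per treatment: ratio of survived to killed trees
odds <- p / (1 - p)
round(odds, 2)

# The log of the odds ratio 9m vs. 4m recovers the coefficient b1
log.or <- unname(log(odds["9m"] / odds["4m"]))
all.equal(log.or, b1)
```

Exponentiating a coefficient therefore gives the odds ratio directly, for example exp(b1) for the 9-month versus the 4-month treatment.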
The odds for treatment “4 months” are 0.22/(1-0.22)=0.29 (calculated using non-rounded values), which is the estimated ratio of survived to killed trees in this treatment. For treatment “9 months,” the odds are 0.48/(1-0.48) = 0.92, and the log odds ratio is log(0.92/0.29) = 1.16 = \\(\\beta_1\\). The model output includes the null deviance and the residual deviance. Deviance is a measure of the difference between the data and a model. It corresponds to the sum of squares in the normal linear model. The smaller the residual deviance, the better the model fits the data. Adding a predictor reduces the deviance, even if the predictor does not have any relation to the outcome variable. The Akaike information criterion (AIC) value in the model output (last line) is a deviance measure that is penalized for the number of model parameters. It can be used for model comparison. The residual deviance is defined as two times the difference between the log-likelihood of the saturated model and the log-likelihood of our model. The saturated model is a model that uses the observed proportion of successes as the success probability for each observation \\(y_i \\sim binomial(y_i/n_i, n_i)\\). The saturated model has the highest possible likelihood (given the data set and the binomial model). This highest possible likelihood is compared to the likelihood of the model at hand, \\(y_i \\sim binomial(p_i, n_i)\\) with \\(p_i\\) dependent on some predictor variables. The null deviance is two times the difference between the log-likelihood of the saturated model and the log-likelihood of a model that contains only one overall mean success probability, the null model \\(y_i \\sim binomial(p, n_i)\\). The null deviance corresponds to the total sum of squares, that is, it is a measure of the total variance in the data. 
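To make the definitions of residual and null deviance concrete, the following sketch computes both from binomial log-likelihoods and compares them with the values stored in a glm object. It uses simulated data with two hypothetical groups; all object names are assumptions made for illustration.

```r
# Simulated binomial data with two groups (illustrative only)
set.seed(2)
n.obs <- 40
n.trials <- sample(5:20, n.obs, replace = TRUE)
group <- factor(rep(c("a", "b"), each = n.obs / 2))
y <- rbinom(n.obs, size = n.trials, prob = ifelse(group == "a", 0.3, 0.6))

mod.sim <- glm(cbind(y, n.trials - y) ~ group, family = binomial)

# Log-likelihood of the saturated model: p_i = y_i / n_i
ll.sat <- sum(dbinom(y, n.trials, prob = y / n.trials, log = TRUE))
# Log-likelihood of the fitted model: p_i from the linear predictor
ll.mod <- sum(dbinom(y, n.trials, prob = fitted(mod.sim), log = TRUE))
# Log-likelihood of the null model: one overall success probability
ll.null <- sum(dbinom(y, n.trials, prob = sum(y) / sum(n.trials), log = TRUE))

# Deviance = 2 * (saturated log-likelihood - model log-likelihood)
dev.resid <- 2 * (ll.sat - ll.mod)
dev.null <- 2 * (ll.sat - ll.null)

all.equal(dev.resid, deviance(mod.sim))
all.equal(dev.null, mod.sim$null.deviance)
```

Because the saturated model has the highest possible likelihood, both deviances are nonnegative, and the residual deviance can never exceed the null deviance.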
14.3.3 Assessing assumptions in a binomial model In the standard residual plots, we see that in our example data there are obviously a number of influential points (especially the data points with row numbers 7, 20, and 26; Figure 14.4). The corresponding data points may be inspected for errors, or additional predictors may be identified that help to explain why these points are extreme (Are they close/far from the village? Were they grazed? etc.). par(mfrow=c(2,2)) plot(mod) Figure 14.4: The four standard residual plots obtained by using the plot-function. For whatever reason, the variance in the data is larger than assumed by the binomial distribution. We detect this higher variance in the mean of the absolute values of the standardized residuals that is clearly larger than one (lower left panel in Figure 14.4). This is called overdispersion, which we mentioned earlier and deal with next. The variance of a binomial model is defined by \\(n\\) and \\(p\\), that is, there is no separate variance parameter. In our example \\(p\\) is fully defined by \\(\\beta_0\\), \\(\\beta_1\\), and \\(\\beta_2\\): \\(p_i = logit^{-1}(\\beta_0 + \\beta_1 I(treatment_i = 9m) + \\beta_2 I(treatment_i = 12m))\\), and \\(n_i\\) is part of the data. Similarly, in a Poisson model (which we will introduce in the next chapter) the variance is defined by the mean. Unfortunately, real data, as in our example, often show higher and sometimes lower variance than expected by a binomial (or a Poisson) distribution (Figure 14.5). When the variance in the data is higher than expected by the binomial (or the Poisson) distribution we have overdispersion. The uncertainties for the parameter estimates will be underestimated if we do not take overdispersion into account. Overdispersion is indicated when the residual deviance is substantially larger than the residual degrees of freedom. This always has to be checked in the output of a binomial or a Poisson model. 
In our example, the residual deviance is about 10 times larger than the residual degrees of freedom; thus, we have strong overdispersion. Figure 14.5: Histogram of a binomial distribution without overdispersion (orange) and one with the same total number of trials and average success probability, but with overdispersion (blue). What can we do when we have overdispersion? The best way to deal with overdispersion is to find the reason for it. Overdispersion is common in biological data because animals do not behave like random objects but their behavior is sensitive to many factors that we cannot always measure such as social relationships, weather, habitat, experience, and genetics. In most cases, overdispersion is caused by influential factors that were not included in the model. If we find them and can include them in the model (as fixed or as random variables) overdispersion may disappear. If we do not find such predictor variables, we have at least three options: (1) use a quasi-binomial model, (2) add an observation-level random factor, or (3) use a beta-binomial model (in the case of an overdispersed Poisson model, the negative binomial model may be a good option). Fit a quasibinomial or quasi-Poisson model by specifying “quasibinomial” or “quasipoisson” in the family-argument. mod <- glm(cbind(succ,fail) ~ treatment, data=resprouts, family=quasibinomial) This will fit a binomial model that estimates, in addition to the other model parameters, a dispersion parameter, \\(u\\), that is multiplied by the binomial or Poisson variance to obtain the residual variance: \\(var(y_i) = u n_i p_i(1 - p_i)\\), or \\(var(y_i)= u\\lambda_i\\), respectively. This inflated variance is then used to obtain the standard errors of the parameter estimates. However, the quasi-distributions are unnatural distributions (there is no physical justification for these distributions, such as number of coin flips that are tails among a defined number of coin flips). 
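A minimal sketch of what the dispersion parameter does in practice, using simulated overdispersed binomial data (not the resprouts data; all object names are illustrative assumptions). In R, the dispersion reported for a quasi-binomial fit is the sum of squared Pearson residuals divided by the residual degrees of freedom, and the standard errors are those of the binomial fit multiplied by the square root of that value.

```r
# Simulated overdispersed binomial data (illustrative only)
set.seed(3)
n.obs <- 60
n.trials <- rep(20, n.obs)
x <- rnorm(n.obs)
# Extra observation-level noise on the logit scale creates overdispersion
p <- plogis(-0.5 + 0.7 * x + rnorm(n.obs, sd = 1))
y <- rbinom(n.obs, size = n.trials, prob = p)

mod.bin <- glm(cbind(y, n.trials - y) ~ x, family = binomial)
mod.quasi <- glm(cbind(y, n.trials - y) ~ x, family = quasibinomial)

# Dispersion: sum of squared Pearson residuals / residual degrees of freedom
u.manual <- sum(resid(mod.bin, type = "pearson")^2) / df.residual(mod.bin)
u.quasi <- summary(mod.quasi)$dispersion
c(manual = u.manual, quasi = u.quasi)

# Point estimates are identical; only the standard errors are inflated
all.equal(coef(mod.bin), coef(mod.quasi))
se.bin <- summary(mod.bin)$coefficients[, "Std. Error"]
se.quasi <- summary(mod.quasi)$coefficients[, "Std. Error"]
all.equal(se.quasi, se.bin * sqrt(u.quasi))
```

The comparison shows why the quasi-model is a post-hoc correction: the fitted values do not change, only the uncertainty attached to them.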
Quasi-models do not differ from the binomial or the Poisson model in any parameter except that the variance is stretched so that it fits the variance in the data. We can see quasi-models as a kind of post-hoc correction for overdispersion. Thus, it is better to use the quasi-model instead of an overdispersed model to draw inference. However, the point estimates may be highly influenced by a few extreme observations. Therefore, we prefer to use options that explicitly model the additional variance. Adding an observation-level random factor (i.e., a factor with the levels 1 to \\(n\\), the sample size) models the additional variance as a normal distribution (on the scale of the link function). Adding such an additional variance parameter to the model allows and accounts for extra variance in the data (Harrison 2014). To do that, we have to fit a generalized linear mixed model (GLMM) instead of a GLM. What do we have to do when the residual deviance is smaller than the residual degrees of freedom, that is, when we have “underdispersion”? Some statisticians do not bother about underdispersion, because, when the variance in the data is smaller than assumed by the model, uncertainty is overestimated. This means that conclusions will be conservative (i.e., on the “safe” side). However, we think that underdispersion should bother us as biologists (or other applied scientists). In most cases, underdispersion means that the variance in the data is smaller than expected by a random process, that is, the variance may be constrained by something. Thus, we should be interested in thinking about the factors that constrain the variance in the data. An example is the number of surviving young in some raptor species (e.g., in the lesser spotted eagle Aquila pomarina). Most of the time two eggs are laid, but the first hatched young will usually kill the second (which was only a “backup” in case the first egg does not yield a healthy young). 
Because of this behavior, the number of survivors among the number of eggs laid will show much less variance than expected from \\(n_i\\) and \\(p_i\\), leading to underdispersion. Clutch size is another example of data that often produces underdispersion (but it is a Poisson rather than a binomial process, because there is no \\(n_i\\)). Sometimes, apparent under- or overdispersion can be caused by more 0s in the data than assumed by the binomial or Poisson model. For example, the number of black stork Ciconia nigra nestlings that survived the nestling phase is very often 0, because the whole nest was depredated or fell from the tree (black storks nest in trees). If the nest survives, the number of survivors varies between 0 and 5 depending on other factors such as food availability or weather conditions. A histogram of these data shows a bimodal distribution with one peak at 0 and another peak around 2.5. It looks like a Poisson distribution, but with a lot of additional 0 values. This is called zero-inflation. Zero-inflation is often the result of two different processes being involved in producing the data. The process that determines whether a nest survives differs from the process that determines how many nestlings survive, given the nest survives. When we analyze such data using a model that assumes only one single process, it will be very hard to understand the system and the results are likely to be biased because the distributional assumptions are violated. In such cases, we will be more successful when our model explicitly models the two different processes. Such models are zero-inflated binomial or zero-inflated Poisson models. We normally check whether zero-inflation may be an issue by posterior predictive model checking. If we find zero-inflation in binomial data, we try using a zero-inflated binomial model as provided by Paul Buerkner in the package brms. 
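The check for zero-inflation can be sketched as follows. For simplicity, and as an explicit assumption, this sketch simulates replicated data from the fitted values with the base R function simulate, which ignores the uncertainty in the parameter estimates; a full posterior predictive check would additionally draw the parameters from their posterior distribution. The data are simulated from two hypothetical processes that loosely mimic the nest survival example.

```r
# Simulated zero-inflated counts (illustrative only)
set.seed(4)
n.obs <- 200
nest.survived <- rbinom(n.obs, size = 1, prob = 0.6)  # process 1: nest survives or not
y <- nest.survived * rpois(n.obs, lambda = 3)         # process 2: number of survivors

# Poisson model that wrongly assumes one single process
mod.pois <- glm(y ~ 1, family = poisson)

# Replicated data sets simulated from the fitted model
n.rep <- 1000
y.rep <- simulate(mod.pois, nsim = n.rep)   # data frame, one column per replicate

# Proportion of zeros in each replicated data set and in the observed data
prop0.rep <- colMeans(y.rep == 0)
prop0.obs <- mean(y == 0)

prop0.obs
quantile(prop0.rep, probs = c(0.025, 0.5, 0.975))

# Share of replicates with at least as many zeros as observed;
# a value close to 0 indicates zero-inflation
mean(prop0.rep >= prop0.obs)
```

If the observed proportion of zeros lies far outside the range of the replicated proportions, a model with an explicit zero-generating process is worth considering.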
14.3.4 Visualising the results For the moment, we use the binomial GLM to analyze the tree sprout data. This model suffers from overdispersion and thus, the uncertainty intervals will be too small. We will provide a more appropriate analysis in a later chapter. We simulate 2000 values from the joint posterior distribution of the model parameters. mod <- glm(cbind(succ,fail) ~ treatment, data=resprouts, family=binomial) nsim <- 2000 bsim <- sim(mod, n.sim=nsim) # simulate from the posterior distr. For each set of simulated model parameters, we derive the linear predictor by multiplying the model matrix with the corresponding set of model parameters. Then, the inverse logit function (\\(logit^{-1}(x) = \\frac{e^x}{(1+ e^x)}\\); R function plogis) is used to obtain the fitted value for each fire lag treatment. Lastly, we extract, for each treatment level, the 2.5% and 97.5% quantiles of the posterior distribution of the fitted values and plot them together with the estimates (the fitted values) per treatment and the raw data. 
newdat <- data.frame(treatment=factor(c("4m","9m","12m"),levels=c("4m","9m","12m"))) Xmat <- model.matrix(~treatment, newdat) fitmat <- matrix(nrow=nrow(newdat), ncol=nsim) for(i in 1:nsim) fitmat[,i] <- plogis(Xmat %*% bsim@coef[i,]) newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025) newdat$upr <- apply(fitmat, 1, quantile, prob=0.975) newdat$fit <- plogis(Xmat%*%coef(mod)) newdat$lag <- c(4,9,12) # used for plotting resprouts$lag <- c(4,9,12)[match(resprouts$treatment,c("4m","9m","12m"))] # used for plotting plot(newdat$lag, newdat$fit, type="n", xlab="Fire lag [months]", ylab="Tree survival", las=1, cex.lab=1.4, cex.axis=1, xaxt="n", xlim=c(0, 13), ylim=c(0,0.6)) axis(1, at=c(0,4,9,12), labels=c("0","4","9","12")) segments(newdat$lag, newdat$lwr, newdat$lag, newdat$upr, lwd=2) points(newdat$lag, newdat$fit, pch=21, bg="gray") points(resprouts$lag+0.3,resprouts$succ/resprouts$pre, cex=0.7) # adds the raw data to the plot Figure 14.6: Proportion of surviving trees (circles) for three fire lag treatments with estimated mean proportion of survivors using an inappropriate binomial model. Because of overdispersion, the 95% compatibility intervals are way too small. Gray dots = fitted values. Vertical bars = 95% compatibility intervals. 14.4 Poisson model 14.4.1 Background The Poisson distribution is a discrete probability distribution that naturally describes the distribution of count data. If we know how many times something happened, but we do not know how many times it did not happen (in contrast to the binomial model, where we know the number of trials), such counts usually follow a Poisson distribution. Count data are positive integers ranging from 0 to \\(+\\infty\\). A Poisson distribution is positive-skewed (long tail to the right) if the mean \\(\\lambda\\) is small and it approximates a normal distribution for large \\(\\lambda\\). The Poisson distribution constitutes the stochastic part of a Poisson model. 
The deterministic part describes how \\(\\lambda\\) is related to predictors. \\(\\lambda\\) can only take on positive values. Therefore, we need a link function that transforms \\(\\lambda\\) into the scale of the linear predictor (or, alternatively, an inverse link function that transforms the value from the linear predictor to nonnegative values). The most often used link function is the natural logarithm (log-link function). This link function transforms all \\(\\lambda\\)-values between 0 and 1 to the interval \\(-\\infty\\) to 0, and all \\(\\lambda\\)-values higher than 1 are projected into the interval 0 to \\(+\\infty\\). Sometimes, the identity link function is used instead of the log-link function, particularly when the predictor variable only contains positive values and the effect of the predictor is additive rather than multiplicative, that is, when a change in the predictor produces an addition of a specific value in the outcome rather than a multiplication by a specific value. Further, the square-root function can also be used as a link function for Poisson models. 14.4.2 Fitting a Poisson model in R The same R functions that fit binomial models also fit Poisson models. As an example, we fit a Poisson model with log-link function to a simulated data set containing the number of (virtual) aphids on a square centimeter (\\(y\\)) and a numeric predictor variable representing, for example, an aridity index (\\(x\\)). Real ecological data without overdispersion or zero-inflation and with no random structure are rather rare. Therefore, we illustrate this model, which is the basis for more complex models, with simulated data. The model is: \\[y_i \\sim Poisson(\\lambda_i)\\] \\[log(\\lambda_i) = \\boldsymbol X_i \\boldsymbol \\beta\\] We use, similar to the R function log, the notation \\(log\\) for the natural logarithm. 
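The multiplicative nature of the log-link can be illustrated with a few lines of arithmetic. The values of b0 and b1 below are hypothetical and chosen only for illustration.

```r
# Hypothetical intercept and slope on the scale of the linear predictor
b0 <- 1
b1 <- 0.5
x <- 0:4

eta <- b0 + b1 * x      # linear predictor: can take any real value
lambda <- exp(eta)      # inverse log-link: always positive

# With a log-link, increasing x by one unit multiplies lambda by exp(b1)
ratio <- lambda[-1] / lambda[-length(lambda)]
all.equal(ratio, rep(exp(b1), length(ratio)))

# The log-link maps values below 1 to negative and values above 1 to positive numbers
log(c(0.1, 0.5, 1, 2, 10))
```

With an identity link, the same one-unit increase in x would instead add the constant b1 to the expected count.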
We fit the model in R using the function glm and use the argument “family” to specify that we assume a Poisson distribution as the error distribution. The log-link is used as the default link function. Then we add the regression line to the plot using the function curve. Further add the compatibility interval to the plot (of course only after having checked the model assumptions). set.seed(196855) n <- 50 # simulate 50 sampling sites, where we count aphids x <- rnorm(n) # the number of aphids depends, among others, on the aridity index x b0 <- 1 # intercept and b1 <- 0.5 # slope of the linear predictor y <- rpois(n, lambda=exp(b0+b1*x)) mod <- glm(y~x, family="poisson") n.sim <- 2000 bsim <- sim(mod, n.sim=n.sim) par(mar=c(4,4,1,1)) plot(x,y, pch=16, las=1, cex.lab=1.4, cex.axis=1.2) curve(exp(coef(mod)[1] + coef(mod)[2]*x), add=TRUE, lwd=2) newdat <- data.frame(x=seq(-3, 2.5, length=100)) Xmat <- model.matrix(~x, data=newdat) b <- coef(mod) newdat$fit <- exp(Xmat%*%b) fitmat <- matrix(ncol=n.sim, nrow=nrow(newdat)) for(i in 1:n.sim) fitmat[,i] <- exp(Xmat%*%bsim@coef[i,]) newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025) newdat$upr <- apply(fitmat, 1, quantile, prob=0.975) lines(newdat$x, newdat$fit, lwd=2) lines(newdat$x, newdat$lwr, lty=3) lines(newdat$x, newdat$upr, lty=3) Figure 14.7: Simulated data (dots) with a Poisson regression line (solid) and the lower and upper bound of the 95% compatibility interval. 14.4.3 Assessing model assumptions Because the residual variance in the Poisson model is defined by \\(\\lambda\\) (the fitted value), it is not estimated as a separate parameter from the data. Therefore, we always have to check whether overdispersion is present. Ecological data are often overdispersed because not all influencing factors can be measured and included in the model. As with the binomial model, in a Poisson model overdispersion is present when the residual deviance is larger than the residual degrees of freedom. 
This is because if we add one independent observation to the data, the deviance increases, on average, by one if the variance equals \\(\\lambda\\). If the variance is larger, the contribution of each observation to the deviance is, on average, larger than one. We can check this in the model output: mod ## ## Call: glm(formula = y ~ x, family = "poisson") ## ## Coefficients: ## (Intercept) x ## 1.1329 0.4574 ## ## Degrees of Freedom: 49 Total (i.e. Null); 48 Residual ## Null Deviance: 85.35 ## Residual Deviance: 52.12 AIC: 198.9 The residual deviance is 52 compared to 48 degrees of freedom. This is perfect (of course, because the model is fit to simulated data). If we are not sure, we could do a posterior predictive model checking and compare the variance in the data with the variance in data that were simulated from the model. If there is substantial overdispersion, we could fit a quasi-Poisson model that includes a dispersion parameter. However, as explained previously, we prefer to explicitly model the variance. A good alternative for overdispersed count data that we now like very much (in contrast to what we wrote in the first printed edition of this book) is the negative binomial model. The standard residual plots (Figure 14.8) are obtained in the usual way. par(mfrow=c(2,2)) plot(mod) Figure 14.8: Standard residual plots for the Poisson model fitted to simulated data, thus they fit perfectly. Of course, again, they look perfect because we used simulated data. In a Poisson model, as for the binomial model, it is easier to detect lack of model fit using posterior predictive model checking. For example, data could be simulated from the model and the proportion of 0 values in the simulated data could be compared to the proportion of 0 values in the observations to assess whether zero-inflation is present or not. 14.4.4 Visualising results We can look at the posterior distributions of the model parameters. 
apply(bsim@coef, 2, quantile, prob=c(0.5, 0.025, 0.975)) ## (Intercept) x ## 50% 1.1370430 0.4569485 ## 2.5% 0.9692098 0.2974446 ## 97.5% 1.3000149 0.6143244 The 95% compatibility interval of \\(\\beta_1\\) is 0.3-0.6. Given that an effect of 0.2 or larger on the aridity scale would be considered biologically relevant, we can be quite confident that aridity has a relevant effect on aphid abundance given our data and our model. With the simulations from the posterior distributions of the model parameters (stored in the object bsim) we obtained samples of the posterior distributions of fitted values for each of 100 x-values along the x-axis and we have drawn the 95% compatibility interval of the regression line in Figure 14.7. 14.4.5 Modeling rates and densities: Poisson model with an offset Many count data are measured in relation to a reference, such as an area or a time period or a population. For example, when we count animals on plots of different sizes, the most important predictor variable will likely be the size of the plot. Or, in other words, the absolute counts do not make much sense when they are not corrected for plot size: the relevant measure is animal density. Similarly, when we count how many times a specific behavior occurs and we follow the focal animals during time periods of different lengths, then the interest is in the rate of occurrence rather than in the absolute number counted. One way to analyze rates and densities is to divide the counts by the reference value and assume that this rate (or a transformation thereof) is normally distributed. However, it is usually hard to obtain normally distributed residuals using rates or densities as dependent variables. A more natural approach to describe rates and densities is to use a Poisson model that takes the reference into account within the model. This is called an offset. To do so, \\(\\lambda\\) is multiplied by the reference \\(T\\) (e.g., time interval, area, population). 
Therefore, \\(log(T)\\) has to be added to the linear predictor. Adding \\(log(T)\\) to the linear predictor is like adding a new predictor variable (the log of \\(T\\)) to the model with its model parameter (the slope) fixed to 1. The term “offset” says that we add a predictor but do not estimate its effect because it is fixed to 1. \\[y_i \\sim Poisson(\\lambda_i T_i)\\] \\[ log(\\boldsymbol \\lambda \\boldsymbol T) = log(\\boldsymbol \\lambda) + log(\\boldsymbol T) = \\boldsymbol X \\boldsymbol \\beta + log(\\boldsymbol T)\\] In R, we can use the argument “offset” within the function glm to specify an offset. We illustrate this using a breeding bird census on wildflower fields in Switzerland in 2007 conducted by Zollinger et al. (2013). We focus on the common whitethroat Sylvia communis, a bird of field margins and fallow lands that has become rare in the intensively used agricultural landscape. Wildflower fields are an ecological compensation measure to provide food and nesting grounds for species such as the common whitethroat. Such fields are sown and then left unmanaged for several years except for the control of potentially problematic species (e.g., some thistle species, Carduus spp.). The plant composition and the vegetation structure in the field gradually change over the years; hence, the interest in this study was to determine the optimal age of a wildflower field for use by the common whitethroat. We use the number of breeding pairs (bp) as the outcome variable and field size as an offset, which means that we model breeding pair density. We include the age of the field (age) as a linear and quadratic term because we expect there to be an optimal age of the field (i.e., a curvilinear relationship between the breeding pair density and age). 
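How an offset works can be checked on simulated data before turning to the real example. In this sketch, area plays the role of the reference \\(T\\); the data and all object names are hypothetical. The two ways of specifying an offset in glm (the offset argument and an offset() term in the formula) give the same estimates, and estimating the slope of log(area) freely shows what fixing it to 1 means.

```r
# Simulated counts on plots of different size (illustrative only)
set.seed(5)
n.obs <- 100
area <- runif(n.obs, min = 0.5, max = 5)   # reference T, e.g. plot size
x <- rnorm(n.obs)                          # hypothetical predictor
dens <- exp(0.2 + 0.4 * x)                 # expected count per unit area
y <- rpois(n.obs, lambda = dens * area)    # expected count = density * area

# Two equivalent ways to specify the offset
mod.a <- glm(y ~ x, offset = log(area), family = poisson)
mod.b <- glm(y ~ x + offset(log(area)), family = poisson)
all.equal(coef(mod.a), coef(mod.b))

# Coefficients describe the density (count per unit area)
coef(mod.a)

# For comparison: slope of log(area) estimated instead of fixed to 1
mod.free <- glm(y ~ x + log(area), family = poisson)
coef(mod.free)
```

When predictions are made from a model with an offset, the offset has to be supplied for the new data as well, or, as in the effect plot of this section, the fitted values are calculated from the model matrix so that they are expressed per unit of the reference.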
We also include field size as a covariate (in addition to using it as the offset) because the size of the field may have an effect on the density; for example, small fields may have a higher density if the whitethroat can also use surrounding areas but uses the field to breed. Size (in hectares) was z-transformed before the model fit. data(wildflowerfields) # in the package blmeco dat <- wildflowerfields[wildflowerfields$year==2007,] # select data dat$size.ha <- dat$size/100 # change unit to ha dat$size.ha.z <- scale(dat$size.ha) mod <- glm(bp ~ age + I(age^2) + size.ha.z, offset=log(size.ha), data=dat, family=poisson) mod ## ## Call: glm(formula = bp ~ age + I(age^2) + size.ha.z, family = poisson, ## data = dat, offset = log(size.ha)) ## ## Coefficients: ## (Intercept) age I(age^2) size.ha.z ## -4.2294 1.5241 -0.1408 -0.5397 ## ## Degrees of Freedom: 40 Total (i.e. Null); 37 Residual ## Null Deviance: 48.5 ## Residual Deviance: 27.75 AIC: 70.2 For the residual analysis and for drawing conclusions, we can proceed in the same way we did in the Poisson model. From the model output we see that the residual deviance is smaller than the corresponding degrees of freedom, thus we have some degree of underdispersion. But the degree of underdispersion is not very extreme so we accept that the compatibility intervals will be a bit larger than “necessary” and proceed in this case. After residual analyses, we can produce an effect plot of the estimated whitethroat density against the age of the wildflower field (Figure 14.9). And we see that the expected whitethroat density is largest on wildflower fields of age 4 to 7 years. 
library(arm) # provides sim()
n.sim <- 5000
bsim <- sim(mod, n.sim=n.sim)
apply(bsim@coef, 2, quantile, prob=c(0.025,0.5,0.975))
##       (Intercept)       age    I(age^2)   size.ha.z
## 2.5%    -7.006715 0.3158791 -0.26708865 -1.14757192
## 50%     -4.196504 1.5118620 -0.14034083 -0.54749587
## 97.5%   -1.445242 2.7196036 -0.01837473  0.02976658

par(mar=c(4,4,1,1))
plot(jitter(dat$age,amount=0.1),jitter(dat$bp/dat$size.ha,amount=0.1), pch=16, las=1, cex.lab=1.2, cex.axis=1, cex=0.7, xlab="Age of wildflower field [yrs]", ylab="Density of Whitethroat [bp/ha]")
# add credible/compatibility interval
newdat <- data.frame(age=seq(1, 9, length=100), size.ha.z=0)
Xmat <- model.matrix(~age + I(age^2) + size.ha.z, data=newdat)
b <- coef(mod)
newdat$fit <- exp(Xmat%*%b)
fitmat <- matrix(ncol=n.sim, nrow=nrow(newdat))
for(i in 1:n.sim) fitmat[,i] <- exp(Xmat%*%bsim@coef[i,])
newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025)
newdat$upr <- apply(fitmat, 1, quantile, prob=0.975)
lines(newdat$age, newdat$fit, lwd=2)
lines(newdat$age, newdat$lwr, lty=3)
lines(newdat$age, newdat$upr, lty=3)

Figure 14.9: Whitethroat densities are highest in wildflower fields that are around 4 to 6 years old. Dots are the raw data, the bold line gives the fitted values (with the 95% compatibility interval given as dotted lines) for wildflower fields of different ages (years). The fitted values are given for average field sizes of 1.4 ha.

"],["glmm.html", "15 Generalized linear mixed models 15.1 Introduction 15.2 Summary", "
15 Generalized linear mixed models

15.1 Introduction

In chapter 13 on linear mixed effect models we introduced how to analyze metric outcome variables for which a normal error distribution can be assumed (potentially after transformation), when the data have a hierarchical structure and, as a consequence, observations are not independent.
In chapter 14 on generalized linear models we have introduced how to analyze outcome variables for which a normal error distribution can not be assumed, as for example binary outcomes or count data. More precisely, we have extended modelling outcomes with normal error to modelling outcomes with error distributions from the exponential family (e.g., binomial or Poisson). Generalized linear mixed models (GLMM) combine the two complexities and are used to analyze outcomes with a non-normal error distribution when the data have a hierarchical structure. In this chapter, we will show how to analyze such data. Remember, a hierarchical structure of the data means that the data are collected at different levels, for example smaller and larger spatial units, or include repeated measurements in time on a specific subject. Typically, the outcome variable is measured/observed at the lowest level but other variables may be measured at different levels. A first example is introduced in the next section. 15.1.1 Binomial Mixed Model 15.1.1.1 Background To illustrate the binomial mixed model we use a subset of a data set used by Grüebler, Korner-Nievergelt, and Von Hirschheydt (2010) on barn swallow Hirundo rustica nestling survival (we selected a nonrandom sample to be able to fit a simple model; hence, the results do not add unbiased knowledge about the swallow biology!). For 63 swallow broods, we know the clutch size and the number of the nestlings that fledged. The broods came from 51 farms (larger unit), thus some of the farms had more than one brood. Note that each farm can harbor one or several broods, and the broods are nested within farms (as opposed to crossed, see chapter 13), i.e., each brood belongs to only one farm. There are three predictors measured at the level of the farm: colony size (the number of swallow broods on that farm), cow (whether there are cows on the farm or not), and dung heap (the number of dung heaps, piles of cow dung, within 500 m of the farm). 
The aim was to assess how swallows profit from insects that are attracted by livestock on the farm and by dung heaps. Broods from the same farm are not independent of each other because they belong to the same larger unit (farm), and thus share the characteristics of the farm (measured or unmeasured). Predictor variables were measured at the level of the farm, and are thus the same for all broods from a farm. In the model described and fitted below, we account for the non-independence of these clutches when building the model by including a random intercept per farm to model random variation between farms. The outcome variable is a proportion (proportion fledged from clutch) and thus consists of two values for each observation, as seen with the binomial model without random factors (Section 14.2.2): the number of chicks that fledged (successes) and the number of chicks that died (failures), i.e., the clutch size minus number that fledged. The random factor “farm” adds a farm-specific deviation \\(b_g\\) to the intercept in the linear predictor. These deviations are modeled as normally distributed with mean \\(0\\) and standard deviation \\(\\sigma_g\\). \\[ y_i \\sim binomial\\left(p_i, n_i\\right)\\\\ logit\\left(p_i\\right) = \\beta_0 + b_{g[i]} + \\beta_1\\;colonysize_i + \\beta_2\\;I\\left(cow_i = 1\\right) + \\beta_3\\;dungheap_i\\\\ b_g \\sim normal\\left(0, \\sigma_g\\right) \\] # Data on Barn Swallow (Hirundo rustica) nestling survival on farms # (a part of the data published in Grüebler et al. 2010, J Appl Ecol 47:1340-1347) library(blmeco) data(swallowfarms) #?swallowfarms # to see the documentation of the data set dat <- swallowfarms str(dat) ## 'data.frame': 63 obs. of 6 variables: ## $ farm : int 1001 1002 1002 1002 1004 1008 1008 1008 1010 1016 ... ## $ colsize: int 1 4 4 4 1 11 11 11 3 3 ... ## $ cow : int 1 1 1 1 1 1 1 1 0 1 ... ## $ dung : int 0 0 0 0 1 1 1 1 2 2 ... ## $ clutch : int 8 9 8 7 13 7 9 16 10 8 ... 
## $ fledge : int 8 0 6 5 9 3 7 4 9 8 ...
# check number of farms in the data set
length(unique(dat$farm))
## [1] 51

15.1.1.2 Fitting a Binomial Mixed Model in R

15.1.1.2.1 Using the glmer function

dat$colsize.z <- scale(dat$colsize) # z-transform values for better model convergence
dat$dung.z <- scale(dat$dung)
dat$die <- dat$clutch - dat$fledge
dat$farm.f <- factor(dat$farm) # for clarity we define farm as a factor

The glmer function uses the standard way to formulate a statistical model in R, with the outcome on the left, followed by the ~ symbol, meaning “explained by”, followed by the predictors, which are separated by +. The notation for the random factor with only a random intercept was introduced in chapter 13 and is (1|farm.f) here. Remember that for fitting a binomial model we have to provide the number of successful events (number of nestlings that fledged) and the number of failures (those that died) within a two-column matrix that we create using the function cbind.

# fit GLMM using glmer function from lme4 package
library(lme4)
mod.glmer <- glmer(cbind(fledge,die) ~ colsize.z + cow + dung.z + (1|farm.f), data=dat, family=binomial)

15.1.1.2.2 Assessing Model Assumptions for the glmer fit

The residuals of the model look fairly normal (top left panel of Figure 15.1), with slightly wider tails. The random intercepts for the farms look perfectly normal, as they should. The plot of the residuals vs. fitted values (bottom left panel) shows a slight increase in the residuals with increasing fitted values. Positive correlations between the residuals and the fitted values are common in mixed models due to the shrinkage effect (chapter 13). Due to the same reason the fitted proportions slightly overestimate the observed proportions when these are large, but underestimate them when small (bottom right panel).
What looks like a lack of fit here can be seen as preventing an overestimation of the among-farm variance, based on the assumption that the farms in the data are a random sample of farms belonging to the same population. The mean of the random effects is close to zero, as it should be. We check this because the glmer function sometimes fails to correctly separate the farm-specific intercepts from the overall intercept. A non-zero mean of the random effects does not indicate a lack of fit, but a failure of the model fitting algorithm. In such a case, we recommend using a different fitting algorithm, e.g. brm (see below) or stan_glmer from the rstanarm package. A slight overdispersion (approximated dispersion parameter >1) seems to be present, but it is nothing to worry about.

par(mfrow=c(2,2), mar=c(3,5,1,1))
# check normal distribution of residuals
qqnorm(resid(mod.glmer), main="qq-plot residuals")
qqline(resid(mod.glmer))
# check normal distribution of random intercepts
qqnorm(ranef(mod.glmer)$farm.f[,1], main="qq-plot, farm")
qqline(ranef(mod.glmer)$farm.f[,1])
# residuals vs fitted values to check homoscedasticity
plot(fitted(mod.glmer), resid(mod.glmer))
abline(h=0)
# plot data vs. predicted values
dat$fitted <- fitted(mod.glmer)
plot(dat$fitted,dat$fledge/dat$clutch)
abline(0,1)

Figure 15.1: Diagnostic plots to assess model assumptions for mod.glmer. Upper left: quantile-quantile plot of the residuals vs. theoretical quantiles of the normal distribution. Upper right: quantile-quantile plot of the random effects “farm”. Lower left: residuals vs. fitted values. Lower right: observed vs. fitted values.

# check distribution of random effects
mean(ranef(mod.glmer)$farm.f[,1])
## [1] -0.001690303
# check for overdispersion
dispersion_glmer(mod.glmer)
## [1] 1.192931
detach(package:lme4)

15.1.1.2.3 Using the brm function

Now we fit the same model using the function brm from the R package brms.
This function allows fitting Bayesian generalized (non-)linear multivariate multilevel models using Stan (Betancourt 2013) for full Bayesian inference. We briefly introduce the fitting algorithm used by Stan, Hamiltonian Monte Carlo, in chapter 18. When using the function brm there is no need to install rstan or to write the model in the Stan language. A wide range of distributions and link functions are supported, and the function offers many more features. Here we use it to fit the model as specified by the formula object above. Note that brm requires a binomial outcome to be specified in the format successes|trials(), which is the number of fledged nestlings out of the total clutch size in our case. In contrast, the glmer function requires the number of nestlings that fledged and the number that died (which together sum to the clutch size), in the format cbind(successes, failures). The family is also called binomial in brm, but would be bernoulli for a binary outcome, whereas glmer would use binomial in both situations (the Bernoulli distribution is a special case of the binomial). However, it is slightly confusing that (at the time of writing this chapter) the documentation for brmsfamily did not mention the binomial family under Usage, where it probably went missing, but it is mentioned under Arguments for the argument family. Prior distributions are an integral part of a Bayesian model, therefore we need to specify prior distributions. We can see which default prior distributions brm uses by applying the get_prior function to the model formula. The default prior for the effect sizes is a flat prior, which gives a density of 1 for any value between minus and plus infinity. Because this is not a proper probability distribution, it is also called an improper distribution. The intercept gets a t-distribution with 3 degrees of freedom, a location of 0 and a scale of 2.5.
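To see what this default intercept prior implies on the probability scale, one can draw values from it and apply the inverse-logit function. The following is a minimal base-R sketch, assuming the prior is student_t(3, 0, 2.5) as reported by get_prior; the number of draws and the seed are arbitrary choices for illustration.

```r
set.seed(42)
# student_t(3, 0, 2.5): 3 degrees of freedom, location 0, scale 2.5
draws <- 0 + 2.5 * rt(1e5, df = 3)
# back-transform from the logit scale to the probability scale
p <- plogis(draws)

hist(p, breaks = seq(0, 1, by = 0.05), freq = FALSE,
     main = NA, xlab = "Implied prior for the probability")
abline(h = 1, lty = 2)   # density of a uniform(0, 1) distribution for comparison
quantile(p, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
```

The dashed line marks the density of a uniform distribution, so the histogram lets the reader judge how closely the implied prior resembles it.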
Transforming this t-distribution to the proportion scale (using the inverse-logit function) becomes something similar to a uniform distribution between 0 and 1 that can be seen as non-informative for a probability. For the among-farm standard deviation, it uses the same t-distribution as for the intercept. However, because variance parameters such as standard deviations only can take on positive numbers, it will use only the positive half of the t-distribution (this is not seen in the output of get_prior). When we have no prior information on any parameter, or if we would like to base the results solely on the information in the data, we specify weakly informative prior distributions that do not noticeably affect the results but they will facilitate the fitting algorithm. This is true for the priors of the intercept and among-farm standard deviation. However, for the effect sizes, we prefer specifying more narrow distributions (see chapter 10). To do so, we use the function prior. To apply MCMC sampling we need some more arguments: warmup specifies the number of iterations during which we allow the algorithm to be adapted to our specific model and to converge to the posterior distribution. These iterations should be discarded (similar to the burn-in period when using, e.g., Gibbs sampling); iter specifies the total number of iterations (including those discarded); chains specifies the number of chains; init specifies the starting values of the iterations. By default (init=NULL) or by setting init=\"random\" the initial values are randomly chosen which is recommended because then different initial values are chosen for each chain which helps to identify non-convergence. However, sometimes random initial values cause the Markov chains to behave badly. Then you can either use the maximum likelihood estimates of the parameters as starting values, or simply ask the algorithm to start with zeros. 
thin specifies the thinning of the chain, i.e., whether all iterations should be kept (thin=1) or, for example, only every 4th (thin=4); cores specifies the number of cores used for the algorithm; seed specifies the random seed, allowing for replication of results.

library(brms)
# check which parameters need a prior
get_prior(fledge|trials(clutch) ~ colsize.z + cow + dung.z + (1|farm.f), data=dat, family=binomial(link="logit"))
##                 prior     class      coef  group resp dpar nlpar lb ub       source
##                (flat)         b                                             default
##                (flat)         b colsize.z                              (vectorized)
##                (flat)         b       cow                              (vectorized)
##                (flat)         b    dung.z                              (vectorized)
##  student_t(3, 0, 2.5) Intercept                                             default
##  student_t(3, 0, 2.5)        sd                               0             default
##  student_t(3, 0, 2.5)        sd           farm.f              0        (vectorized)
##  student_t(3, 0, 2.5)        sd Intercept farm.f              0        (vectorized)

# specify own priors
myprior <- prior(normal(0,5), class="b")
mod.brm <- brm(fledge|trials(clutch) ~ colsize.z + cow + dung.z + (1|farm.f),
               data=dat, family=binomial(link="logit"),
               prior=myprior,
               warmup = 500, iter = 2000, chains = 2,
               init = "random", cores = 2, seed = 123)
# note: thin=1 is default and we did not change this here.

15.1.1.2.4 Checking model convergence for the brm fit

We first check whether there are warnings in the R console about problems of the fitting algorithm. Warnings should be taken seriously. Often, the Stan online documentation (or typing launch_shinystan(mod.brm) into the R console) helps us find out what to change when calling the brm function to get a fit that runs smoothly. Once we have got rid of all warnings, we need to check how well the Markov chains mixed. We can either do that by scanning through the many diagnostic plots given by launch_shinystan(mod.brm) or create the most important plots ourselves, such as the traceplot (Figure 15.2).

par(mar=c(2,2,2,2))
mcmc_plot(mod.brm, type = "trace")

Figure 15.2: Traceplot of the Markov chains. After convergence, both Markov chains should sample from the same stationary distribution.
Indications of non-convergence would be, if the two chains diverge or vary around different means. 15.1.1.2.5 Checking model fit by posterior predictive model checking To assess how well the model fits to the data we do posterior predictive model checking (Chapter 16). For binomial as well as for Poisson models comparing the standard deviation of the data with those of replicated data from the model is particularly important. If the standard deviation of the real data would be much higher compared to the ones of the replicated data from the model, overdispersion would be an issue. However, here, the model is able to capture the variance in the data correctly (Figure 15.3). The fitted vs observed plot also shows a good fit. yrep <- posterior_predict(mod.brm) sdyrep <- apply(yrep, 1, sd) par(mfrow=c(1,3), mar=c(3,4,1,1)) hist(yrep, freq=FALSE, main=NA, xlab="Number of fledglings") hist(dat$fledge, add=TRUE, col=rgb(1,0,0,0.3), freq=FALSE) legend(10, 0.15, fill=c("grey",rgb(1,0,0,0.3)), legend=c("yrep", "y")) hist(sdyrep) abline(v=sd(dat$fledge), col="red", lwd=2) plot(fitted(mod.brm)[,1], dat$fledge, pch=16, cex=0.6) abline(0,1) Figure 15.3: Posterior predictive model checking: Histogram of the number of fledglings simulated from the model together with a histogram of the real data, and a histogram of the standard deviations of replicated data from the model together with the standard deviation of the data (vertical line in red). The third plot gives the fitted vs. observed values. After checking the diagnostic plots, the posterior predictive model checking and the general model fit, we assume that the model describes the data generating process reasonably well, so that we can proceed to drawing conclusions. 15.1.1.3 Drawing Conclusions The generic summary function gives us the results for the model object containing the fitted model, and works for both the model fitted with glmer and brm. Let’s start having a look at the summary from mod.glmer. 
The summary provides the fitting method, the model formula, statistics for the model fit including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the scaled residuals, the random effects variance and information about observations and groups, a table with coefficient estimates for the fixed effects (with standard errors and a z-test for the coefficient) and correlations between fixed effects. We recommend to always check if the number of observations and groups, i.e., 63 barn swallow nests from 51 farms here, is correct. This information shows if the glmer function has correctly recognized the hierarchical structure in the data. Here, this is correct. To assess the associations between the predictor variables and the outcome analyzed, we need to look at the column “Estimate” in the table of fixed effects. This column contains the estimated model coefficients, and the standard error for these estimates is given in the column “Std. Error”, along with a z-test for the null hypothesis of a coefficient of zero. In the random effects table, the among farm variance and standard deviation (square root of the variance) are given. The function confint shows the 95% confidence intervals for the random effects (.sig01) and fixed effects estimates. In the summary output from mod.brm we see the model formula and some information on the Markov chains after the warm-up. In the group-level effects (between group standard deviations) and population-level effects (effect sizes, model coefficients) tables some summary statistics of the posterior distribution of each parameter are given. The “Estimate” is the mean of the posterior distribution, the “Est.Error” is the standard deviation of the posterior distribution (which is the standard error of the parameter estimate). Then we see the lower and upper limit of the 95% credible interval. 
Also, some statistics for measuring how well the Markov chains converged are given: the “Rhat” and the effective sample size (ESS). The bulk ESS tells us how many independent samples we have to describe the posterior distribution, and the tail ESS tells us how many samples the limits of the 95% credible interval are based on. Because we used the logit link function, the coefficients are on the logit scale and are a bit difficult to interpret. What we can say is that positive coefficients indicate an increase and negative coefficients indicate a decrease in the proportion of nestlings fledged. For continuous predictors, such as colsize.z and dung.z, the coefficient refers to the change in the logit of the outcome with a change of one in the predictor (e.g., for colsize.z an increase of one corresponds to an increase of one standard deviation of colsize). For categorical predictors, the coefficients represent a difference between one category and another (the reference category is the one not shown in the table). To visualize the coefficients we could draw effect plots.

# glmer
summary(mod.glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(fledge, die) ~ colsize.z + cow + dung.z + (1 | farm.f)
##    Data: dat
##
##      AIC      BIC   logLik deviance df.resid
##    282.5    293.2   -136.3    272.5       58
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -3.2071 -0.4868  0.0812  0.6210  1.8905
##
## Random effects:
##  Groups Name        Variance Std.Dev.
##  farm.f (Intercept) 0.2058   0.4536
## Number of obs: 63, groups:  farm.f, 51
##
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.09533    0.19068  -0.500   0.6171
## colsize.z    0.05087    0.11735   0.434   0.6646
## cow          0.39370    0.22692   1.735   0.0827 .
## dung.z      -0.14236    0.10862  -1.311   0.1900
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
##           (Intr) clsz.z cow
## colsize.z  0.129
## cow       -0.828 -0.075
## dung.z     0.033  0.139 -0.091

confint.95 <- confint(mod.glmer); confint.95
##                   2.5 %    97.5 %
## .sig01       0.16809483 0.7385238
## (Intercept) -0.48398346 0.2863200
## colsize.z   -0.18428769 0.2950063
## cow         -0.05360035 0.8588134
## dung.z      -0.36296714 0.0733620

# brm
summary(mod.brm)
##  Family: binomial
##   Links: mu = logit
## Formula: fledge | trials(clutch) ~ colsize.z + cow + dung.z + (1 | farm.f)
##    Data: dat (Number of observations: 63)
##   Draws: 2 chains, each with iter = 2000; warmup = 500; thin = 1;
##          total post-warmup draws = 3000
##
## Group-Level Effects:
## ~farm.f (Number of levels: 51)
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.55      0.16     0.26     0.86 1.00      910     1284
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.10      0.21    -0.52     0.32 1.00     2863     2165
## colsize.z     0.05      0.14    -0.21     0.34 1.00     2266     1794
## cow           0.41      0.25    -0.06     0.90 1.00     3069     2117
## dung.z       -0.15      0.12    -0.38     0.09 1.00     3254     2241
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

From the results we conclude that on farms without cows (when cow=0) and for average colony sizes (when colsize.z=0) and an average number of dung heaps (when dung.z=0) the average nestling survival of barn swallows is the inverse-logit function of the Intercept, thus, plogis(-0.1) = 0.47 with a 95% uncertainty interval of 0.37 - 0.58. We further see that colony size and number of dung heaps are less important than whether cows are present or not. Their estimated partial effect is small and their uncertainty interval includes only values close to zero. However, whether cows are present or not may be important for the survival of nestlings.
The average nestling survival on farms with cows is plogis(-0.1+0.41) = 0.58. To get the uncertainty interval of this survival estimate, we need to do the calculation for every draw from the joint posterior distribution of both parameters.

bsim <- as_draws_df(mod.brm) # posterior draws; replaces the deprecated posterior_samples()
# survival of nestlings on farms with cows:
survivalest <- plogis(bsim$b_Intercept + bsim$b_cow)
quantile(survivalest, probs=c(0.025, 0.975)) # 95% uncertainty interval
##      2.5%     97.5%
## 0.5126716 0.6412675

In medical research, it is standard to report the fixed-effects coefficients from a GLMM with binomial or Bernoulli error as odds ratios by taking the exponent (R function exp for \\(e^{()}\\)) of the coefficient on the logit scale. For example, the coefficient for cow from mod.glmer, 0.39 (95% CI from -0.05 to 0.86), represents an odds ratio of exp(0.39)=1.48 (95% CI from 0.95 to 2.36). This means that the odds for fledging (vs. not fledging) from a clutch on a farm with livestock present are about 1.5 times larger than the odds for fledging if no livestock is present (relative effect).

15.2 Summary

"],["modelchecking.html", "16 Posterior predictive model checking", "
16 Posterior predictive model checking

Only if the model describes the data-generating process sufficiently accurately can we draw relevant conclusions from the model. It is therefore essential to assess model fit: our goal is to describe how well the model fits the data with respect to different aspects of the model. In this book, we present three ways to assess how well a model reproduces the data-generating process: (1) residual analysis, (2) posterior predictive model checking (this chapter) and (3) prior sensitivity analysis. Posterior predictive model checking is the comparison of replicated data generated under the model with the observed data. The aim of posterior predictive model checking is similar to the aim of a residual analysis, that is, to look at what data structures the model does not explain.
However, the possibilities of residual analyses are limited, particularly in the case of non-normal data distributions. For example, in a logistic regression, positive residuals are always associated with \\(y_i = 1\\) and negative residuals with \\(y_i = 0\\). As a consequence, temporal and spatial patterns in the residuals will always look similar to these patterns in the observations and it is difficult to judge whether the model captures these processes adequately. In such cases, simulating data from the posterior predictive distribution of a model and comparing these data with the observations (i.e., predictive model checking) gives a clearer insight into the performance of a model. We follow the notation of A. Gelman et al. (2014b) in that we use “replicated data”, \\(y^{rep}\\) for a set of \\(n\\) new observations drawn from the posterior predictive distribution for the specific predictor variables \\(x\\) of the \\(n\\) observations in our data set. When we simulate new observations for new values of the predictor variables, for example, to show the prediction interval in an effect plot, we use \\(y^{new}\\). The first step in posterior predictive model checking is to simulate a replicated data set for each set of simulated values of the joint posterior distribution of the model parameters. Thus, we produce, for example, 2000 replicated data sets. These replicated data sets are then compared graphically, or more formally by test statistics, with the observed data. The Bayesian p-value offers a way for formalized testing. It is defined as the probability that the replicated data from the model are more extreme than the observed data, as measured by a test statistic. In case of a perfect fit, we expect that the test statistic from the observed data is well in the middle of the ones from the replicated data. 
In other words, around 50% of the test statistics from the replicated data are higher than the one from the observed data, resulting in a Bayesian p-value close to 0.5. Bayesian p-values close to 0 or close to 1, on the contrary, indicate that the aspect of the model measured by the specific test statistic is not well represented by the model. Test statistics have to be chosen such that they describe important data structures that are not directly measured as a model parameter. Because model parameters are chosen so that they fit the data well, it is not surprising to find p-values close to 0.5 when using model parameters as test statistics. For example, extreme values or quantiles of \\(y\\) are often better suited than the mean as test statistics, because they are less redundant with the model parameter that is fitted to the data. Similarly, the number of switches from 0 to 1 in binary data is suited to check for autocorrelation whereas the proportion of 1s among all the data may not give so much insight into the model fit. Other test statistics could be a measure for asymmetry, such as the relative difference between the 10 and 90% quantiles, or the proportion of zero values in a Poisson model. We like predictive model checking because it allows us to look at different, specific aspects of the model. It helps us to judge which conclusions from the model are reliable and to identify the limitation of a model. Predictive model checking also helps to understand the process that has generated the data. We use an analysis of the whitethroat breeding density in wildflower fields of different ages for illustration. The aim of this analysis was to identify an optimal age of wildflower fields that serves as good habitat for the whitethroat. Because the Stan developers have written highly convenient user friendly functions to do posterior predictive model checks, we fit the model with Stan using the function stan_glmer from the package rstanarm. 
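Before fitting the whitethroat model, the logic of the Bayesian p-value described above can be sketched with a toy example that needs no Stan. The sketch below is an illustration under stated assumptions, not the analysis of this chapter: the counts are simulated with extra zeros, the model is a simple Poisson model with a conjugate Gamma(0.01, 0.01) prior (chosen only because its posterior can be sampled directly), and all object names are made up.

```r
set.seed(123)
n <- 100
# hypothetical counts with more zeros than a Poisson model expects
y <- ifelse(runif(n) < 0.3, 0, rpois(n, lambda = 3))

# posterior of the Poisson mean under a Gamma(0.01, 0.01) prior (conjugate update)
nsim <- 2000
lambda <- rgamma(nsim, shape = 0.01 + sum(y), rate = 0.01 + n)

# one replicated data set per posterior draw:
# rows = posterior draws, columns = observations
yrep <- t(sapply(lambda, function(l) rpois(n, l)))

# test statistic: proportion of zero counts
propzeros <- function(x) mean(x == 0)
pz_obs <- propzeros(y)
pz_rep <- apply(yrep, 1, propzeros)   # MARGIN = 1: one value per replicated data set

# Bayesian p-value: share of replicated data sets with at least as many zeros as observed
pB <- mean(pz_rep >= pz_obs)
pB
```

Because the simulated counts contain more zeros than a Poisson distribution produces, the p-value in this toy example should come out close to 0, which flags the misfit. A model that captures the zeros would give a value nearer 0.5.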
data("wildflowerfields")
dat <- wildflowerfields
dat$size.ha <- dat$size/100 # change unit to ha
dat$size.z <- scale(dat$size) # z-transform size
dat$year.z <- scale(dat$year)
age.poly <- poly(dat$age, 3) # create orthogonal polynomials
dat$age.l <- age.poly[,1] # to ease convergence of the model fit
dat$age.q <- age.poly[,2]
dat$age.c <- age.poly[,3]
library(rstanarm)
mod <- stan_glmer(bp ~ year.z + age.l + age.q + age.c + size.z + (1|field) + offset(log(size.ha)), family=poisson, data=dat)

The R-package shinystan (Gabry 2017) provides an easy way to do model checking. Therefore, there is no excuse not to do posterior predictive model checking. The R-code launch_shinystan(mod) opens an HTML page that contains all kinds of diagnostics of a model. Besides many statistics and diagnostic plots to assess how well the MCMC worked, we also find a menu “PPcheck”. There, we can click through many of the plots that we produce in R below. The function posterior_predict simulates many different data sets from a model fit (exactly as many as there are draws from the posterior distributions of the model parameters, thus 4000 if the default number of iterations has been used in Stan). Specifically, for each single set of parameter values of the joint posterior distribution it simulates one replicated data set. We can look at histograms of the observed data and the replicated data (Figure 16.1). The real data (bp) look similar to the replicated data.

set.seed(2352) # to make sure that the ylim and breaks of the histograms below can be used
yrep <- posterior_predict(mod)
par(mfrow=c(3,3), mar=c(2,1,2,1))
for(i in 1:8) hist(yrep[i,], col="blue", breaks=seq(-0.5, 18.5, by=1), ylim=c(0,85))
hist(dat$bp, col="blue", breaks=seq(-0.5, 18.5, by=1), ylim=c(0,85))

Figure 16.1: Histograms of 8 out of 4000 replicated data sets and of the observed data (dat$bp). The arguments breaks and ylim have been used in the function hist to produce the same scale of the x- and y-axis in all plots.
This makes comparison among the plots easier.

Let’s look at specific aspects of the data. The proportion of zero counts could be a sensitive test statistic for this data set. First, we define a function “propzeros” that extracts the proportion of zero counts from a vector of count data. Then we apply this function to the observed data and to each of the 4000 replicated data sets. Because each row of yrep holds one replicated data set, the function has to be applied over the rows (second argument of apply equal to 1). Finally, we extract the 1% and 99% quantiles of the proportion of zero values in the replicated data.

propzeros <- function(x) sum(x==0)/length(x)
propzeros(dat$bp) # prop. zero values in observed data
## [1] 0.4705882
pzeroyrep <- apply(yrep, 1, propzeros) # prop. zero values in each replicated data set
quantile(pzeroyrep, prob=c(0.01, 0.99))

The observed data contain 47% zero values. If this value lies within the 98% range of the proportions in the replicated data sets, the model reproduces the frequency of zero counts adequately. The Bayesian p-value is the proportion of replicated data sets that contain at least as many zero values as the observed data.

mean(pzeroyrep>=propzeros(dat$bp))

What about the upper tail of the data? Let’s look at the 90% quantile.

quantile(dat$bp, prob=0.9) # for observed data
## 90%
##   2
q90yrep <- apply(yrep, 1, quantile, prob=0.9) # for each replicated data set
table(q90yrep)

The table shows how often each value of the 90% quantile occurs among the replicated data sets; the observed value of 2 should lie well within this distribution. We can also look at the spatial distribution of the data and the replicated data. The variables X and Y are the coordinates of the wildflower fields. We can use them to draw transparent gray dots sized according to the number of breeding pairs.

par(mfrow=c(3,3), mar=c(1,1,1,1))
plot(dat$X, dat$Y, pch=16, cex=dat$bp+0.2, col=rgb(0,0,0,0.5), axes=FALSE)
box()
r <- sample(nrow(yrep), 8) # draw 8 replicated data sets at random
for(i in r){
  plot(dat$X, dat$Y, pch=16, cex=yrep[i,]+0.2, col=rgb(0,0,0,0.5), axes=FALSE)
  box()
}

Figure 16.2: Spatial distribution of the whitethroat breeding pair counts and of 8 randomly chosen replicated data sets with data simulated based on the model.
The smallest dot corresponds to a count of 0, the largest to a count of 20 breeding pairs. The panel in the upper left corner shows the data, the other panels are replicated data from the model.

The spatial distribution of the replicated data sets seems to be similar to the observed one at first look (Figure 16.2). At a second look, we may detect that, in the middle of the study area, the model tends to predict slightly larger numbers than observed. This pattern may motivate us to find the reason for the imperfect fit if the main interest is whitethroat density estimates. Are there important elements in the landscape that influence whitethroat densities and that we have not yet taken into account in the model?

However, our main interest is finding the optimal age of wildflower fields for the whitethroat. Therefore, we look at the mean age of the 10% of the fields with the highest breeding densities. To do so, we first define a function that extracts the mean field age of the 10% largest whitethroat density values, and then we apply this function to the observed data and to the 4000 replicated data sets.

    # mean age of the fields whose density is at or above the 90% quantile
    magehighest <- function(x) {
      q90 <- quantile(x/dat$size.ha, prob=0.90)
      index <- (x/dat$size.ha)>=q90
      mage <- mean(dat$age[index])
      return(mage)
    }
    magehighest(dat$bp)                      # observed data
    ## [1] 4.4
    mageyrep <- apply(yrep, 1, magehighest)  # one value per replicated data set (rows of yrep)
    quantile(mageyrep, prob=c(0.01, 0.5,0.99))
    ##       1%      50%      99%
    ## 3.733333 4.714286 5.785714

The mean age of the 10% of the fields with the highest whitethroat densities is 4.4 years in the observed data set. In 98% of the replicated data sets it lies between 3.73 and 5.79 years. The Bayesian p-value is 0.79. Thus, in around 79% of the replicated data sets the mean age of the 10% fields with the highest whitethroat densities was higher than the observed one (Figure 16.3).

    hist(mageyrep)
    abline(v=magehighest(dat$bp), col="orange", lwd=2)

Figure 16.3: Histogram of the average age of the 10% wildflower fields with the highest breeding densities in the replicated data sets.
The orange line indicates the average age for the 10% fields with the highest observed whitethroat densities.

In a publication, we could summarize the results of the posterior predictive model checking in a table or give the plots in an appendix. Here, we conclude that the model fits well in the most important aspects. However, the model may predict too high whitethroat densities in the central part of the study area.

17 Model comparison and multimodel inference

THIS CHAPTER IS UNDER CONSTRUCTION!!!

17.1 Introduction

literature to refer to: Tredennick et al. (2021)

17.2 Summary

xxx

18 MCMC using Stan

18.1 Background

Markov chain Monte Carlo (MCMC) simulation techniques were developed in the mid-1950s by physicists (Metropolis et al., 1953). Later, statisticians discovered MCMC (Hastings, 1970; Geman & Geman, 1984; Tanner & Wong, 1987; Gelfand et al., 1990; Gelfand & Smith, 1990). MCMC methods make it possible to obtain posterior distributions for parameters and latent variables (unobserved variables) of complex models. In parallel, personal computer capacities increased in the 1990s and user-friendly software such as the different programs based on the programming language BUGS (Spiegelhalter et al., 2003) came out. These developments boosted the use of Bayesian data analyses, particularly in genetics and ecology.

18.2 Install rstan

In this book we use the program Stan to draw random samples from the joint posterior distribution of the model parameters given a model, the data, prior distributions, and initial values. To do so, it uses the “no-U-turn sampler”, which is a type of Hamiltonian Monte Carlo simulation (Hoffman and Gelman 2014; Betancourt 2013). Stan also provides optimization-based point estimation.
These algorithms are more efficient than the ones implemented in BUGS programs and they can handle larger data sets. Stan works particularly well for hierarchical models (Betancourt and Girolami 2013). Stan runs on Windows, Mac, and Linux and can be used via the R interface rstan. Stan is automatically installed when the R package rstan is installed. For installing rstan, it is advised to follow the system-specific instructions closely.

18.3 Writing a Stan model

The statistical model is written in the Stan language and saved in a text file. The Stan language is rather strict, forcing the user to write unambiguous models. Stan is very well documented and the Stan Documentation contains a comprehensive Language Manual, a Wiki documentation and various tutorials.

We here provide a normal regression with one predictor variable as a worked example. The entire Stan model is as follows (saved as linreg.stan):

    data {
      int<lower=0> n;        // sample size
      vector[n] y;           // outcome
      vector[n] x;           // predictor
    }

    parameters {
      vector[2] beta;        // intercept and slope
      real<lower=0> sigma;   // residual standard deviation
    }

    model {
      // priors
      beta ~ normal(0,5);
      sigma ~ cauchy(0,5);
      // likelihood
      y ~ normal(beta[1] + beta[2] * x, sigma);
    }

A Stan model consists of different named blocks. These blocks are (from first to last): data, transformed data, parameters, transformed parameters, model, and generated quantities. The blocks must appear in this order. The model block is mandatory; all other blocks are optional.

In the data block, the type, dimension, and name of every variable has to be declared. Optionally, the range of possible values can be specified. For example, vector[n] y; means that y is a vector (type real) of length n, and int<lower=0> n; means that n is an integer with nonnegative values (the bounds, here 0, are included). Note that the restriction to a possible range of values is not strictly necessary, but it helps to specify the correct model and it improves speed. We also see that each statement needs to be terminated by a semicolon.
In the parameters block, all model parameters have to be defined. The coefficients of the linear predictor constitute a vector of length 2, vector[2] beta;. Alternatively, real beta[2]; could be used. The sigma parameter is a scalar that has to be positive, therefore real<lower=0> sigma;.

The model block contains the model specification. Stan functions can handle vectors and we do not have to loop over all observations as is typical for BUGS. Here, we use a Cauchy distribution as a prior distribution for sigma. This distribution can have negative values, but because we defined the lower limit of sigma to be 0 in the parameters block, the prior distribution actually used in the model is a truncated Cauchy distribution (truncated at zero). In Chapter 10.2 we explain how to choose prior distributions.

Further characteristics of the Stan language that are good to know include:

- The scale parameter of the normal distribution is specified as the standard deviation (like in R but different from BUGS, where the precision is used).
- If no prior is specified, Stan uses a uniform prior over the range of possible values as specified in the parameters block.
- Variable names must not contain periods, for example, x.z would not be allowed, but x_z is allowed.
- To comment out a line, use double forward-slashes //.

18.4 Run Stan from R

We fit the model to simulated data. Stan needs the data either as a named list or as a character vector containing the names of the data objects. In our case, x, y, and n are objects that exist in the R console. The function stan() starts Stan and returns an object containing the Markov chains for every model parameter. We have to specify the name of the file that contains the model specification, the data, the number of chains, and the number of iterations per chain we would like to have. The first half of the iterations of each chain is declared as the warm-up. During the warm-up, Stan is not simulating a Markov chain, because in every step the algorithm is adapted.
After the warm-up the algorithm is fixed and Stan simulates Markov chains. library(rstan) # Simulate fake data n <- 50 # sample size sigma <- 5 # standard deviation of the residuals b0 <- 2 # intercept b1 <- 0.7 # slope x <- runif(n, 10, 30) # random numbers of the covariate simresid <- rnorm(n, 0, sd=sigma) # residuals y <- b0 + b1*x + simresid # calculate y, i.e. the data # Bundle data into a list datax <- list(n=length(y), y=y, x=x) # Run STAN fit <- stan(file = "stanmodels/linreg.stan", data=datax, verbose = FALSE) ## ## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1). ## Chain 1: ## Chain 1: Gradient evaluation took 2.5e-05 seconds ## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds. ## Chain 1: Adjust your expectations accordingly! ## Chain 1: ## Chain 1: ## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 1: ## Chain 1: Elapsed Time: 0.055 seconds (Warm-up) ## Chain 1: 0.043 seconds (Sampling) ## Chain 1: 0.098 seconds (Total) ## Chain 1: ## ## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2). ## Chain 2: ## Chain 2: Gradient evaluation took 5e-06 seconds ## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds. ## Chain 2: Adjust your expectations accordingly! 
## Chain 2: ## Chain 2: ## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 2: ## Chain 2: Elapsed Time: 0.049 seconds (Warm-up) ## Chain 2: 0.043 seconds (Sampling) ## Chain 2: 0.092 seconds (Total) ## Chain 2: ## ## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3). ## Chain 3: ## Chain 3: Gradient evaluation took 5e-06 seconds ## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds. ## Chain 3: Adjust your expectations accordingly! ## Chain 3: ## Chain 3: ## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup) ## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup) ## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup) ## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup) ## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup) ## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup) ## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling) ## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling) ## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling) ## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling) ## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling) ## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling) ## Chain 3: ## Chain 3: Elapsed Time: 0.049 seconds (Warm-up) ## Chain 3: 0.048 seconds (Sampling) ## Chain 3: 0.097 seconds (Total) ## Chain 3: ## ## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4). 
    ## Chain 4: 
    ## Chain 4: Gradient evaluation took 6e-06 seconds
    ## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
    ## Chain 4: Adjust your expectations accordingly!
    ## Chain 4: 
    ## Chain 4: 
    ## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
    ## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
    ## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
    ## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
    ## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
    ## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
    ## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
    ## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
    ## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
    ## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
    ## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
    ## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
    ## Chain 4: 
    ## Chain 4:  Elapsed Time: 0.051 seconds (Warm-up)
    ## Chain 4:                0.046 seconds (Sampling)
    ## Chain 4:                0.097 seconds (Total)
    ## Chain 4: 

Further reading

Stan-Homepage: It contains the documentation for Stan and a lot of tutorials.

19 Ridge Regression

THIS CHAPTER IS UNDER CONSTRUCTION!!! We should provide an example in Stan.

19.1 Introduction

    # Settings
    library(R2OpenBUGS)
    bugslocation <- "C:/Program Files/OpenBUGS323/OpenBugs.exe" # location of OpenBUGS
    bugsworkingdir <- file.path(getwd(), "BUGS") # Bugs working directory
    #-------------------------------------------------------------------------------
    # Simulate fake data
    #-------------------------------------------------------------------------------
    library(MASS)
    n <- 50 # sample size
    b0 <- 1.2
    b <- rnorm(5, 0, 2)
    Sigma <- matrix(c(10,3,3,2,1, 3,2,3,2,1, 3,3,5,3,2, 2,2,3,10,3, 1,1,2,3,15),5,5)
    Sigma
    x <- mvrnorm(n = n, rep(0, 5), Sigma)
    simresid <- rnorm(n, 0, sd=3) # residuals
    x.z <- x
    for(i in 1:ncol(x)) x.z[,i] <- (x[,i]-mean(x[,i]))/sd(x[,i])
    y <- b0 + x.z%*%b + simresid # calculate y, i.e.
the data #------------------------------------------------------------------------------- # Function to generate initial values #------------------------------------------------------------------------------- inits <- function() { list(b0=runif(1, -2, 2), b=runif(5, -2, 2), sigma=runif(1, 0.1, 2)) } #------------------------------------------------------------------------------- # Run OpenBUGS #------------------------------------------------------------------------------- parameters <- c("b0", "b", "sigma") lambda <- c(1, 2, 10, 25, 50, 100, 500, 1000, 10000) bs <- matrix(ncol=length(lambda), nrow=length(b)) bse <- matrix(ncol=length(lambda), nrow=length(b)) for(j in 1:length(lambda)){ datax <- list(y=as.numeric(y), x=x, n=n, mb=rep(0, 5), lambda=lambda[j]) fit <- bugs(datax, inits, parameters, model.file="ridge_regression.txt", n.thin=1, n.chains=2, n.burnin=5000, n.iter=10000, debug=FALSE, OpenBUGS.pgm = bugslocation, working.directory=bugsworkingdir) bs[,j] <- fit$mean$b bse[,j] <- fit$sd$b } range(bs) plot(1:length(lambda), seq(-2, 1, length=length(lambda)), type="n") colkey <- rainbow(length(b)) for(j in 1:nrow(bs)){ lines(1:length(lambda), bs[j,], col=colkey[j], lwd=2) lines(1:length(lambda), bs[j,]-2*bse[j,], col=colkey[j], lty=3) lines(1:length(lambda), bs[j,]+2*bse[j,], col=colkey[j], lty=3) } abline(h=0) round(fit$summary,2) #------------------------------------------------------------------------------- # Run WinBUGS #------------------------------------------------------------------------------- library(R2WinBUGS) bugsdir <- "C:/Users/fk/WinBUGS14" # mod <- bugs(datax, inits= inits, parameters, model.file="normlinreg.txt", n.chains=2, n.iter=1000, n.burnin=500, n.thin=1, debug=TRUE, bugs.directory=bugsdir, program="WinBUGS", working.directory=bugsworkingdir) #------------------------------------------------------------------------------- # Test convergence and make inference 
#------------------------------------------------------------------------------- library(blmeco) # Make Figure 12.2 par(mfrow=c(3,1)) historyplot(fit, "beta0") historyplot(fit, "beta1") historyplot(fit, "sigmaRes") # Parameter estimates print(fit$summary, 3) # Make predictions for covariate values between 10 and 30 newdat <- data.frame(x=seq(10, 30, length=100)) Xmat <- model.matrix(~x, data=newdat) predmat <- matrix(ncol=fit$n.sim, nrow=nrow(newdat)) for(i in 1:fit$n.sim) predmat[,i] <- Xmat%*%c(fit$sims.list$beta0[i], fit$sims.list$beta1[i]) newdat$lower.bugs <- apply(predmat, 1, quantile, prob=0.025) newdat$upper.bugs <- apply(predmat, 1, quantile, prob=0.975) plot(y~x, pch=16, las=1, cex.lab=1.4, cex.axis=1.2, type="n", main="") polygon(c(newdat$x, rev(newdat$x)), c(newdat$lower.bugs, rev(newdat$upper.bugs)), col=grey(0.7), border=NA) abline(c(fit$mean$beta0, fit$mean$beta1), lwd=2) box() points(x,y) "],["SEM.html", "20 Structural equation models 20.1 Introduction", " 20 Structural equation models THIS CHAPTER IS UNDER CONSTRUCTION!!! We should provide an example in Stan. 
20.1 Introduction ------------------------------------------------------------------------------------------------------ # General settings #------------------------------------------------------------------------------------------------------ library(MASS) library(rjags) library(MCMCpack) #------------------------------------------------------------------------------------------------------ # Simulation #------------------------------------------------------------------------------------------------------ n <- 100 heffM <- 0.6 # effect of H on M heffCS <- 0.0 # effect of H on Clutch size meffCS <- 0.6 # effect of M on Clutch size SigmaM <- matrix(c(0.1,0.04,0.04,0.1),2,2) meffm1 <- 0.6 meffm2 <- 0.7 SigmaH <- matrix(c(0.1,0.04,0.04,0.1),2,2) meffh1 <- 0.6 meffh2 <- -0.7 # Latente Variablen H <- rnorm(n, 0, 1) M <- rnorm(n, heffM * H, 0.1) # Clutch size CS <- rnorm(n, heffCS * H + meffCS * M, 0.1) # Indicators eM <- cbind(meffm1 * M, meffm2 * M) datM <- matrix(NA, ncol = 2, nrow = n) eH <- cbind(meffh1 * H, meffh2 * H) datH <- matrix(NA, ncol = 2, nrow = n) for(i in 1:n) { datM[i,] <- mvrnorm(1, eM[i,], SigmaM) datH[i,] <- mvrnorm(1, eH[i,], SigmaH) } #------------------------------------------------------------------------------ # JAGS Model #------------------------------------------------------------------------------ dat <- list(datM = datM, datH = datH, n = n, CS = CS, #H = H, M = M, S3 = matrix(c(1,0,0,1),nrow=2)/1) # Function to create initial values inits <- function() { list( meffh = runif(2, 0, 0.1), meffm = runif(2, 0, 0.1), heffM = runif(1, 0, 0.1), heffCS = runif(1, 0, 0.1), meffCS = runif(1, 0, 0.1), tauCS = runif(1, 0.1, 0.3), tauMH = runif(1, 0.1, 0.3), tauH = rwish(3,matrix(c(.02,0,0,.04),nrow=2)), tauM = rwish(3,matrix(c(.02,0,0,.04),nrow=2)) # M = as.numeric(rep(0, n)) ) } t.n.thin <- 50 t.n.chains <- 2 t.n.burnin <- 20000 t.n.iter <- 50000 # Run JAGS jagres <- jags.model('JAGS/BUGSmod1.R',data = dat, n.chains = t.n.chains, inits = inits, n.adapt 
= t.n.burnin)

    params <- c("meffh", "meffm", "heffM", "heffCS", "meffCS")
    mod <- coda.samples(jagres, params, n.iter=t.n.iter, thin=t.n.thin)
    res <- round(data.frame(summary(mod)$quantiles[, c(3, 1, 5)]), 3)
    res$TRUEVALUE <- c(heffCS, heffM, meffCS, meffh1, meffh2, meffm1, meffm2)
    res
    # Traceplots
    post <- data.frame(rbind(mod[[1]], mod[[2]]))
    names(post) <- dimnames(mod[[1]])[[2]]
    par(mfrow = c(3,3))
    param <- c("meffh[1]", "meffh[2]", "meffm[1]", "meffm[2]", "heffM", "heffCS", "meffCS")
    traceplot(mod[, match(param, names(post))])

21 Modeling spatial data using GLMM

THIS CHAPTER IS UNDER CONSTRUCTION!!!

21.1 Introduction

21.2 Summary

xxx

22 Introduction to PART III

This part is a collection of more complicated ecological models to analyse data that may not be analysed with the traditional linear models that we covered in PART I of this book.

22.1 Model notations

It is unavoidable that different authors use different notations for the same thing, or that the same notation is used for different things. We try to use, whenever possible, notation that is commonly used at the International Statistical Ecology Congress ISEC. Resulting from an earlier ISEC, Thomson et al. (2009) give guidelines on what letter should be used for which parameter in order to achieve a standard notation at least among people working with classical mark-recapture models. However, the alphabet has fewer letters than there are ecological parameters. Therefore, the same letter cannot stand for the same parameter across all papers, books and chapters. Here, we try to use the same letter for the same parameter within the same chapter.

23 Zero-inflated Poisson Mixed Model

23.1 Introduction

Usually we describe the outcome variable with a single distribution, such as the normal distribution in the case of linear (mixed) models, and Poisson or binomial distributions in the case of generalized linear (mixed) models. In life sciences, however, quite often the data are actually generated by more than one process. In such cases the distribution of the data could be the result of two or more different distributions. If we do not account for these different processes our inferences are likely to be biased. In this chapter, we introduce a mixture model that explicitly includes two processes that generated the data. The zero-inflated Poisson model is a mixture of a binomial and a Poisson distribution. We believe that two (or more)-level models are very useful tools in life sciences because they can help uncover the different processes that generate the data we observe.

23.2 Example data

We use the blackstork data from the blmeco package. They contain the breeding success of black storks in Latvia. The data were collected and kindly provided by Maris Stradz. The data contain the number of nestlings of more than 300 black stork nests in different years.

Counting animals or plants is a typical example of data that contain a lot of zero counts. For example, the number of nestlings produced by a breeding pair is often zero because the whole nest was depredated or because a catastrophic event occurred such as a flood. However, when the nest succeeds, the number of nestlings varies among the successful nests depending on how many eggs the female has laid, how much food the parents could bring to the nest, or other factors that affect the survival of a nestling in an intact nest.
Thus the factors that determine how many zero counts there are in the data differ from the factors that determine how many nestlings there are, if a nest survives. Count data that are produced by two different processes–one produces the zero counts and the other the variance in the count for the ones that were not zero in the first process–are called zero-inflated data. Histograms of zero-inflated data look bimodal, with one peak at zero (Figure 23.1). Figure 23.1: Histogram of the number of nestlings counted in black stork nests Ciconia nigra in Latvia (n = 1130 observations of 279 nests). 23.3 Model The Poisson distribution does not fit well to such data, because the data contain more zero counts than expected under the Poisson distribution. Mullahy (1986) and Lambert (1992) formulated two different types of models that combine the two processes in one model and therefore account for the zero excess in the data and allow the analysis of the two processes separately. The hurdle model (Mullahy, 1986) combines a left-truncated count data model (Poisson or negative binomial distribution that only describes the distribution of data larger than zero) with a zero-hurdle model that describes the distribution of the data that are either zero or nonzero. In other words, the hurdle model divides the data into two data subsets, the zero counts and the nonzero counts, and fits two separate models to each subset of the data. To account for this division of the data, the two models assume left truncation (all measurements below 1 are missing in the data) and right censoring (all measurements larger than 1 have the value 1), respectively, in their error distributions. A hurdle model can be fitted in R using the function hurdle from the package pscl (Jackman, 2008). See the tutorial by Zeileis et al. (2008) for an introduction. In contrast to the hurdle model, the zero-inflated models (Mullahy, 1986; Lambert, 1992) combine a Bernoulli model (zero vs. 
nonzero) with a conditional Poisson model that is conditional on the Bernoulli process being nonzero. Thus this model allows for a mixture of zero counts: some zero counts are zero because the outcome of the Bernoulli process was zero (these zero counts are sometimes called structural zero values), and others are zero because their outcome from the Poisson process was zero. The function zeroinfl from the package pscl fits zero-inflated models (Zeileis et al., 2008).

The zero-inflated model may seem to reflect the true process that has generated the data more closely than the hurdle model. However, sometimes the fit of zero-inflated models is impeded because of high correlation of the model parameters between the zero model and the count model. In such cases, a hurdle model may cause less trouble.

Neither function (hurdle nor zeroinfl) from the package pscl allows the inclusion of random factors. The functions MCMCglmm from the package MCMCglmm (Hadfield, 2010) and glmmadmb from the package glmmADMB (http://glmmadmb.r-forge.r-project.org/) provide the possibility to account for zero-inflation with a GLMM. However, these functions are not very flexible in the types of zero-inflated models they can fit; for example, glmmadmb only includes a constant proportion of zero values. A zero-inflation model using BUGS is described in Kéry and Schaub (2012). Here we use Stan to fit a zero-inflated model. Once we understand the basic model code, it is easy to add predictors and/or random effects to both the zero and the count model.

The example data contain numbers of nestlings in black stork Ciconia nigra nests in Latvia collected by Maris Stradz and collaborators at 279 nests between 1979 and 2010. Black storks build solid and large aeries on branches of large trees. The same aerie is used for up to 17 years until it collapses. The black stork population in Latvia has drastically declined over the last decades.
Here, we use the nestling data as presented in Figure 23.1 to describe whether the number of black stork nestlings produced in Latvia decreased over time. We use a zero-inflated Poisson model to separately estimate temporal trends for nest survival and the number of nestlings in successful nests. Since the same nests have been measured repeatedly over 1 to 17 years, we add nest ID as a random factor to both models, the Bernoulli and the Poisson model. After the first model fit, we saw that the between-nest variance in the number of nestlings for the successful nests was close to zero. Therefore, we decided to delete the random effect from the Poisson model.

In our final model, \\(z_{it}\\) is a latent (unobserved) variable that takes the values 0 or 1 for each nest \\(i\\) during year \\(t\\). It indicates a “structural zero”, that is, if \\(z_{it} = 1\\) the number of nestlings \\(y_{it}\\) is always zero, because the expected value in the Poisson model, \\(\\lambda_{it}(1 - z_{it})\\), becomes zero. If \\(z_{it} = 0\\), the expected value in the Poisson model becomes \\(\\lambda_{it}\\).

To fit this model in Stan, we first write the Stan model code and save it in a separate text file named “zeroinfl.stan”.

A handy package for such models is GLMMadaptive: https://cran.r-project.org/web/packages/GLMMadaptive/vignettes/ZeroInflated_and_TwoPart_Models.html

23.4 Further packages and readings

If the model does not contain any random factor, the R functions from the package pscl can be used to fit zero-inflated Poisson or negative binomial models (Zeileis, Kleiber, and Jackman 2008).

Zero-inflation typically occurs in count data. However, it can also occur in continuous measurements. For example, the amount of rain per day measured in mm is very often zero, and, when it is not zero, it is a number following a specific (possibly normal) continuous distribution. Such data may be analyzed using tobit models (Tobin, 1958). Several R packages provide tobit models, such as censReg (Henningsen, 2013), AER (Kleiber & Zeileis, 2008), and MCMCpack (Martin et al., 2011).
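To make the structure of the zero-inflated Poisson model of Section 23.3 more concrete, the following base-R sketch simulates data from such a model. It is not the Stan model of this chapter: the symbol theta for the probability of a structural zero, the coefficient names (a0, a1, b0, b1, sigma_nest), all parameter values and the simulated design are assumptions chosen only for illustration.

```r
set.seed(1)

# Hypothetical design: 100 nests, each observed in 10 years
n_nests <- 100
n_years <- 10
nest <- rep(seq_len(n_nests), each = n_years)
year <- rep(seq_len(n_years), times = n_nests)
year_z <- (year - mean(year)) / sd(year)   # z-transformed year

# Assumed parameter values (illustration only)
a0 <- -0.5; a1 <- 0.3     # Bernoulli part (structural zeros), logit scale
b0 <- 1.0;  b1 <- -0.1    # Poisson part, log scale
sigma_nest <- 0.8         # between-nest sd, Bernoulli part only

# Linear predictors of the two processes
nest_eff <- rnorm(n_nests, 0, sigma_nest)
theta  <- plogis(a0 + a1 * year_z + nest_eff[nest])  # P(structural zero)
lambda <- exp(b0 + b1 * year_z)                      # expected count if z = 0

# Latent indicator z (1 = structural zero) and observed counts y
z <- rbinom(length(theta), 1, theta)
y <- rpois(length(lambda), lambda * (1 - z))

# Proportion of zeros in the simulated data versus the proportion
# expected under the Poisson part alone
prop_zero_obs  <- mean(y == 0)
prop_zero_pois <- mean(dpois(0, lambda))
c(observed = prop_zero_obs, poisson_only = prop_zero_pois)
```

In the simulated data, every observation with z = 1 is a zero, and additional zeros arise from the Poisson process itself. This is the mixture of two kinds of zero counts that the zero-inflated model tries to separate.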
24 Daily nest survival

24.1 Background

Analysis of nest survival is important for understanding the mechanisms of population dynamics. The life-span of a nest could be used as a measure of nest survival. However, this measure is very often biased towards nests that survived longer, because these nests are detected by the ornithologists with higher probability (Mayfield 1975). In order not to overestimate nest survival, daily nest survival conditional on survival to the previous day can be estimated.

24.2 Models for estimating daily nest survival

Which model is best used depends on the type of data available. The data may come from:

1. Regular (e.g. daily) nest controls, all nests monitored from their first egg onward
2. Regular nest controls, nests found during the course of the study at different stages and nestling ages
3. Irregular nest controls, all nests monitored from their first egg onward
4. Irregular nest controls, nests found during the course of the study at different stages and nestling ages

Table 24.1: Models useful for estimating daily nest survival. Data numbers correspond to the descriptions above.

| Model | Data | Software, R-code |
|---|---|---|
| Binomial or Bernoulli model | 1, (3) | glm, glmer,… |
| Cox proportional hazard model | 1, 2, 3, 4 | brm, soon: stan_cox |
| Known fate model | 1, 2 | Stan code below |
| Known fate model | 3, 4 | Stan code below |
| Logistic exposure model | 1, 2, 3, 4 | glm, glmer using a link function that depends on exposure time |

Shaffer (2004) explains how to adapt the link function in a Bernoulli model to account for having found the nests at different nest ages (exposure time). Ben Bolker explains how to implement the logistic exposure model in R here.
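A link function that depends on exposure time, as mentioned for the logistic exposure model, can be sketched as a custom link object for glm. The following is one possible implementation and not necessarily the one referred to above: the function name logexp, the simulated data and the parameter values are assumptions for illustration. The idea is that the probability of surviving an interval of length `exposure` is the daily survival probability raised to the power of `exposure`.

```r
# Sketch of a logistic exposure link: interval survival = (daily survival)^exposure
logexp <- function(exposure = 1) {
  structure(list(
    # from interval survival mu to the linear predictor (logit of daily survival)
    linkfun  = function(mu) qlogis(mu^(1 / exposure)),
    # from the linear predictor back to interval survival
    linkinv  = function(eta) plogis(eta)^exposure,
    # derivative of linkinv with respect to eta
    mu.eta   = function(eta) exposure * plogis(eta)^(exposure - 1) * dlogis(eta),
    valideta = function(eta) TRUE,
    name     = "logexp"),
    class = "link-glm")
}

# Simulated example (all values are assumptions)
set.seed(42)
n <- 500
exposure <- sample(1:5, n, replace = TRUE)   # days between two nest controls
cover <- rnorm(n)                            # a nest-level covariate
S_daily <- plogis(2 + 0.5 * cover)           # true daily survival
survived <- rbinom(n, 1, S_daily^exposure)   # survival over the whole interval

# The coefficients are on the logit scale of daily survival
fit <- glm(survived ~ cover, family = binomial(link = logexp(exposure)))
coef(fit)
```

Note that the exposure vector is passed to the link directly, so it has to be in the same order as the rows used in the model fit; missing values or subsetting inside glm would break this alignment.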
24.3 Known fate model A natural model that allows estimating daily nest survival is the known-fate survival model. It is a Markov model that models the state of a nest \\(i\\) at day \\(t\\) (whether it is alive, \\(y_{it}=1\\) or not \\(y_{it}=0\\)) as a Bernoulli variable dependent on the state of the nest the day before. \\[ y_{it} \\sim Bernoulli(y_{it-1}S_{it})\\] The daily nest survival \\(S_{it}\\) can be linearly related to predictor variables that are measured on the nest or on the day level. \\[logit(S_{it}) = \\textbf{X} \\beta\\] It is also possible to add random effects if needed. 24.4 The Stan model The following Stan model code is saved as daily_nest_survival.stan. data { int<lower=0> Nnests; // number of nests int<lower=0> last[Nnests]; // day of last observation (alive or dead) int<lower=0> first[Nnests]; // day of first observation (alive or dead) int<lower=0> maxage; // maximum of last int<lower=0> y[Nnests, maxage]; // indicator of alive nests real cover[Nnests]; // a covariate of the nest real age[maxage]; // a covariate of the date } parameters { vector[3] b; // coef of linear pred for S } model { real S[Nnests, maxage-1]; // survival probability for(i in 1:Nnests){ for(t in first[i]:(last[i]-1)){ S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]); } } // priors b[1]~normal(0,5); b[2]~normal(0,3); b[3]~normal(0,3); // likelihood for (i in 1:Nnests) { for(t in (first[i]+1):last[i]){ y[i,t]~bernoulli(y[i,t-1]*S[i,t-1]); } } } 24.5 Prepare data and run Stan Data is from (Grendelmeier2018?). load("RData/nest_surv_data.rda") str(datax) ## List of 7 ## $ y : int [1:156, 1:31] 1 NA 1 NA 1 NA NA 1 1 1 ... ## $ Nnests: int 156 ## $ last : int [1:156] 26 30 31 27 31 30 31 31 31 31 ... ## $ first : int [1:156] 1 14 1 3 1 24 18 1 1 1 ... ## $ cover : num [1:156] -0.943 -0.215 0.149 0.149 -0.215 ... ## $ age : num [1:31] -1.65 -1.54 -1.43 -1.32 -1.21 ... 
## $ maxage: int 31 datax$y[is.na(datax$y)] <- 0 # Stan does not allow for NA's in the outcome # Run STAN library(rstan) mod <- stan(file = "stanmodels/daily_nest_survival.stan", data=datax, chains=5, iter=2500, control=list(adapt_delta=0.9), verbose = FALSE) 24.6 Check convergence We love exploring the performance of the Markov chains by using the function launch_shinystan from the package shinystan. 24.7 Look at results It looks like cover does not affect daily nest survival, but daily nest survival decreases with the age of the nestlings. #launch_shinystan(mod) print(mod) ## Inference for Stan model: anon_model. ## 5 chains, each with iter=2500; warmup=1250; thin=1; ## post-warmup draws per chain=1250, total post-warmup draws=6250. ## ## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat ## b[1] 4.04 0.00 0.15 3.76 3.94 4.04 4.14 4.35 3828 1 ## b[2] 0.00 0.00 0.13 -0.25 -0.09 -0.01 0.08 0.25 4524 1 ## b[3] -0.70 0.00 0.16 -1.02 -0.81 -0.69 -0.59 -0.39 3956 1 ## lp__ -298.98 0.03 1.30 -302.39 -299.52 -298.65 -298.05 -297.53 2659 1 ## ## Samples were drawn using NUTS(diag_e) at Thu Jan 19 22:33:33 2023. ## For each parameter, n_eff is a crude measure of effective sample size, ## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1). 
# effect plot bsim <- as.data.frame(mod) nsim <- nrow(bsim) newdat <- data.frame(age=seq(1, datax$maxage, length=100)) newdat$age.z <- (newdat$age-mean(1:datax$maxage))/sd((1:datax$maxage)) Xmat <- model.matrix(~age.z, data=newdat) fitmat <- matrix(ncol=nsim, nrow=nrow(newdat)) for(i in 1:nsim) fitmat[,i] <- plogis(Xmat%*%as.numeric(bsim[i,c(1,3)])) newdat$fit <- apply(fitmat, 1, median) newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025) newdat$upr <- apply(fitmat, 1, quantile, prob=0.975) plot(newdat$age, newdat$fit, ylim=c(0.8,1), type="l", las=1, ylab="Daily nest survival", xlab="Age [d]") lines(newdat$age, newdat$lwr, lty=3) lines(newdat$age, newdat$upr, lty=3) Figure 24.1: Estimated daily nest survival probability in relation to nest age. Dotted lines are 95% uncertainty intervals of the regression line. 24.8 Known fate model for irregular nest controls When nests are controlled only irregularly, it may happen that a nest is found predated or dead after a longer break in controlling. In such cases, we know that the nest was predated or died due to other causes at some point between the last control when the nest was still alive and the control when it was found dead. We therefore need to tell the model that the nest could have died at any time during the interval when we were not controlling. To do so, we create a variable that indicates the time (e.g. day since first egg) when the nest was last seen alive (lastlive). A second variable, lastcheck, indicates the time of the last check, which either equals lastlive when the nest survived until the last check, or is larger than lastlive when the nest failure has been recorded. A last variable, gap, measures the time interval in which the nest failure occurred. A gap of zero means that the nest was still alive at the last control, a gap of 1 means that the nest failure occurred during the first day after lastlive, and a gap of 2 means that the nest failure occurred on either the first or the second day after lastlive. 
# time when nest was last observed alive lastlive <- apply(datax$y, 1, function(x) max(c(1:length(x))[x==1])) # time when nest was last checked (alive or dead) lastcheck <- lastlive+1 # here, we turn the above data into a format that can be used for # irregular nest controls. WOULD BE NICE TO HAVE A REAL DATA EXAMPLE! # when nest was observed alive at the last check, then lastcheck equals lastlive lastcheck[lastlive==datax$last] <- datax$last[lastlive==datax$last] datax1 <- list(Nnests=datax$Nnests, lastlive = lastlive, lastcheck= lastcheck, first=datax$first, cover=datax$cover, age=datax$age, maxage=datax$maxage) # time between last seen alive and first seen dead (= lastcheck) datax1$gap <- datax1$lastcheck-datax1$lastlive In the Stan model code, we specify the likelihood for each gap separately. data { int<lower=0> Nnests; // number of nests int<lower=0> lastlive[Nnests]; // day of last observation (alive) int<lower=0> lastcheck[Nnests]; // day of observed death or, if alive, last day of study int<lower=0> first[Nnests]; // day of first observation (alive or dead) int<lower=0> maxage; // maximum of last real cover[Nnests]; // a covariate of the nest real age[maxage]; // a covariate of the date int<lower=0> gap[Nnests]; // obsdead - lastlive } parameters { vector[3] b; // coef of linear pred for S } model { real S[Nnests, maxage-1]; // survival probability for(i in 1:Nnests){ for(t in first[i]:(lastcheck[i]-1)){ S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]); } } // priors b[1]~normal(0,1.5); b[2]~normal(0,3); b[3]~normal(0,3); // likelihood for (i in 1:Nnests) { for(t in (first[i]+1):lastlive[i]){ 1~bernoulli(S[i,t-1]); } if(gap[i]==1){ target += log(1-S[i,lastlive[i]]); // } if(gap[i]==2){ target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1])); // } if(gap[i]==3){ target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) + prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2])); // } if(gap[i]==4){ target 
+= log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) + prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2]) + prod(S[i,lastlive[i]:(lastlive[i]+2)])*(1-S[i,lastlive[i]+3])); // } } } # Run STAN mod1 <- stan(file = "stanmodels/daily_nest_survival_irreg.stan", data=datax1, chains=5, iter=2500, control=list(adapt_delta=0.9), verbose = FALSE) Further reading Helpful links: https://deepai.org/publication/bayesian-survival-analysis-using-the-rstanarm-r-package (Brilleman et al. 2020) https://www.hammerlab.org/2017/06/26/introducing-survivalstan/ "],["cjs_with_mix.html", "25 Capture-mark recapture model with a mixture structure to account for missing sex-variable for parts of the individuals 25.1 Introduction 25.2 Data description 25.3 Model description 25.4 The Stan code 25.5 Call Stan from R, check convergence and look at results", " 25 Capture-mark recapture model with a mixture structure to account for missing sex-variable for parts of the individuals 25.1 Introduction In some species the identification of the sex is not possible for all individuals without sampling DNA. For example, morphological dimorphism is absent or so weak that some individuals cannot be assigned to one of the sexes. Particularly in ornithological long-term capture-recapture data sets, which are typically obtained by voluntary bird ringers who normally do not have the means to analyse DNA, the sex identification is often missing for part of the individuals. For estimating survival, it would nevertheless be valuable to include the data of all individuals, to use the information on sex-specific effects on survival wherever possible, and to account for the fact that the sex of some individuals is not known. We here explain how a Cormack-Jolly-Seber model can be integrated with a mixture model in order to allow for a combined analysis of individuals with and without identified sex. 
We gave an introduction to the Cormack-Jolly-Seber model in Chapter 14.5 of the book Korner-Nievergelt et al. (2015). We here expand this model by a mixture structure that allows including individuals with a missing categorical predictor variable, such as sex. 25.2 Data description ## simulate data # true parameter values theta <- 0.6 # proportion of males nocc <- 15 # number of years in the data set b0 <- matrix(NA, ncol=nocc-1, nrow=2) b0[1,] <- rbeta((nocc-1), 3, 4) # capture probability of males b0[2,] <- rbeta((nocc-1), 2, 4) # capture probability of females a0 <- matrix(NA, ncol=2, nrow=2) a1 <- matrix(NA, ncol=2, nrow=2) a0[1,1]<- qlogis(0.7) # average annual survival for adult males a0[1,2]<- qlogis(0.3) # average annual survival for juveniles a0[2,1] <- qlogis(0.55) # average annual survival for adult females a0[2,2] <- a0[1,2] a1[1,1] <- 0 a1[1,2] <- -0.5 a1[2,1] <- -0.8 a1[2,2] <- a1[1,2] nindi <- 1000 # number of individuals with identified sex nindni <- 1500 # number of individuals with non-identified sex nind <- nindi + nindni # total number of individuals y <- matrix(ncol=nocc, nrow=nind) z <- matrix(ncol=nocc, nrow=nind) first <- sample(1:(nocc-1), nind, replace=TRUE) sex <- sample(c(1,2), nind, prob=c(theta, 1-theta), replace=TRUE) juvfirst <- sample(c(0,1), nind, prob=c(0.5, 0.5), replace=TRUE) juv <- matrix(0, nrow=nind, ncol=nocc) for(i in 1:nind) juv[i,first[i]] <- juvfirst[i] x <- runif(nocc-1, -2, 2) # a time-dependent covariate p <- b0 # recapture probability phi <- array(NA, dim=c(2, 2, nocc-1)) # for ad males phi[1,1,] <- plogis(a0[1,1]+a1[1,1]*x) # for ad females phi[2,1,] <- plogis(a0[2,1]+a1[2,1]*x) # for juvs phi[1,2,] <- phi[2,2,] <- plogis(a0[2,2]+a1[2,2]*x) for(i in 1:nind){ z[i,first[i]] <- 1 y[i, first[i]] <- 1 for(t in (first[i]+1):nocc){ z[i, t] <- rbinom(1, size=1, prob=z[i,t-1]*phi[sex[i],juv[i,t-1]+1, t-1]) y[i, t] <- rbinom(1, size=1, prob=z[i,t]*p[sex[i],t-1]) } } y[is.na(y)] <- 0 The mark-recapture data set consists 
of capture histories of 2500 individuals over 15 time periods. For each time period \\(t\\) and individual \\(i\\) the capture history matrix \\(y\\) contains \\(y_{it}=1\\) if the individual \\(i\\) is captured during time period \\(t\\), or \\(y_{it}=0\\) if the individual \\(i\\) is not captured during time period \\(t\\). The marking time period varies between individuals from 1 to 14. At the marking time period, the age of the individuals was classified either as juvenile or as adult. Juveniles turn into adults after one time period, thus age is known for all individuals during all time periods after marking. For 1000 individuals of the 2500 individuals, the sex is identified, whereas for 1500 individuals, the sex is unknown. The example data contain one covariate \\(x\\) that takes on one value for each time period. # bundle the data for Stan i <- 1:nindi ni <- (nindi+1):nind datax <- list(yi=y[i,], nindi=nindi, sex=sex[i], nocc=nocc, yni=y[ni,], nindni=nindni, firsti=first[i], firstni=first[ni], juvi=juv[i,]+1, juvni=juv[ni,]+1, year=1:nocc, x=x) 25.3 Model description The observations \\(y_{it}\\), an indicator of whether individual i was recaptured during time period \\(t\\) is modelled conditional on the latent true state of the individual birds \\(z_{it}\\) (0 = dead or permanently emigrated, 1 = alive and at the study site) as a Bernoulli variable. The probability \\(P(y_{it} = 1)\\) is the product of the probability that an alive individual is recaptured, \\(p_{it}\\), and the state of the bird \\(z_{it}\\) (alive = 1, dead = 0). 
Thus, a dead bird cannot be recaptured, whereas for a bird alive during time period \\(t\\), the recapture probability equals \\(p_{it}\\): \\[y_{it} \\sim Bernoulli(z_{it}p_{it})\\] The latent state variable \\(z_{it}\\) is a Markovian variable with the state at time \\(t\\) being dependent on the state at time \\(t-1\\) and the apparent survival probability \\(\\phi_{it}\\): \\[z_{it} \\sim Bernoulli(z_{it-1}\\phi_{it})\\] We use the term apparent survival in order to indicate that the parameter \\(\\phi\\) is a product of site fidelity and survival. Thus, individuals that permanently emigrated from the study area cannot be distinguished from dead individuals. In both models, the parameters \\(\\phi\\) and \\(p\\) were modelled as sex-specific. However, for parts of the individuals, sex could not be identified, i.e. sex was missing. Ignoring these missing values would most likely lead to a bias because they were not missing at random. The probability that sex can be identified is increasing with age and most likely differs between sexes. Therefore, we included a mixture model for the sex: \\[Sex_i \\sim Categorical(q_i)\\] where \\(q_i\\) is a vector of length 2, containing the probability of being a male and a female, respectively. In this way, the sex of the non-identified individuals was assumed to be male or female with probability \\(q[1]\\) and \\(q[2]=1-q[1]\\), respectively. This model corresponds to the finite mixture model introduced by Pledger, Pollock, and Norris (2003) in order to account for unknown classes of birds (heterogeneity). However, in our case, for parts of the individuals the class (sex) was known. In the example model, we constrain apparent survival to be linearly dependent on a covariate x with different slopes for males, females and juveniles using the logit link function. 
\\[logit(\\phi_{it}) = a0_{sex-age-class[it]} + a1_{sex-age-class[it]}x_i\\] Annual recapture probability was modelled for each year and age and sex class independently: \\[p_{it} = b0_{t,sex-age-class[it]}\\] Uniform prior distributions were used for all parameters with a parameter space limited to values between 0 and 1 (probabilities) and a normal distribution with a mean of 0 and a standard deviation of 1.5 for the intercept \\(a0\\), and a standard deviation of 5 was used for \\(a1\\). 25.4 The Stan code The trick for coding the CMR-mixture model in Stan is to formulate the model 3 times: 1. For the individuals with identified sex 2. For the males that were not identified 3. For the females that were not identified Then for the non-identified individuals a mixture model is formulated that assigns a probability of being a female or a male to each individual. data { int<lower=2> nocc; // number of capture events int<lower=0> nindi; // number of individuals with identified sex int<lower=0> nindni; // number of individuals with non-identified sex int<lower=0,upper=2> yi[nindi,nocc]; // CH[i,k]: individual i captured at k int<lower=0,upper=nocc-1> firsti[nindi]; // year of first capture int<lower=0,upper=2> yni[nindni,nocc]; // CH[i,k]: individual i captured at k int<lower=0,upper=nocc-1> firstni[nindni]; // year of first capture int<lower=1, upper=2> sex[nindi]; int<lower=1, upper=2> juvi[nindi, nocc]; int<lower=1, upper=2> juvni[nindni, nocc]; int<lower=1> year[nocc]; real x[nocc-1]; // a covariate } transformed data { int<lower=0,upper=nocc+1> lasti[nindi]; // last[i]: ind i last capture int<lower=0,upper=nocc+1> lastni[nindni]; // last[i]: ind i last capture lasti = rep_array(0,nindi); lastni = rep_array(0,nindni); for (i in 1:nindi) { for (k in firsti[i]:nocc) { if (yi[i,k] == 1) { if (k > lasti[i]) lasti[i] = k; } } } for (ii in 1:nindni) { for (kk in firstni[ii]:nocc) { if (yni[ii,kk] == 1) { if (kk > lastni[ii]) lastni[ii] = kk; } } } } parameters { 
real<lower=0, upper=1> theta[nindni]; // probability of being male for non-identified individuals real<lower=0, upper=1> b0[2,nocc-1]; // intercept of p real a0[2,2]; // intercept for phi real a1[2,2]; // coefficient for phi } transformed parameters { real<lower=0,upper=1>p_male[nindni,nocc]; // capture probability real<lower=0,upper=1>p_female[nindni,nocc]; // capture probability real<lower=0,upper=1>p[nindi,nocc]; // capture probability real<lower=0,upper=1>phi_male[nindni,nocc-1]; // survival probability real<lower=0,upper=1>chi_male[nindni,nocc+1]; // probability that an individual // is never recaptured after its // last capture real<lower=0,upper=1>phi_female[nindni,nocc-1]; // survival probability real<lower=0,upper=1>chi_female[nindni,nocc+1]; // probability that an individual // is never recaptured after its // last capture real<lower=0,upper=1>phi[nindi,nocc-1]; // survival probability real<lower=0,upper=1>chi[nindi,nocc+1]; // probability that an individual // is never recaptured after its // last capture { int k; int kk; for(ii in 1:nindi){ if (firsti[ii]>1) { for (z in 1:(firsti[ii]-1)){ phi[ii,z] = 1; } } for(tt in firsti[ii]:(nocc-1)) { // linear predictor for phi: phi[ii,tt] = inv_logit(a0[sex[ii], juvi[ii,tt]] + a1[sex[ii], juvi[ii,tt]]*x[tt]); } } for(ii in 1:nindni){ if (firstni[ii]>1) { for (z in 1:(firstni[ii]-1)){ phi_female[ii,z] = 1; phi_male[ii,z] = 1; } } for(tt in firstni[ii]:(nocc-1)) { // linear predictor for phi: phi_male[ii,tt] = inv_logit(a0[1, juvni[ii,tt]] + a1[1, juvni[ii,tt]]*x[tt]); phi_female[ii,tt] = inv_logit(a0[2, juvni[ii,tt]]+ a1[2, juvni[ii,tt]]*x[tt]); } } for(i in 1:nindi) { // linear predictor for p for identified individuals for(w in 1:firsti[i]){ p[i,w] = 1; } for(kkk in (firsti[i]+1):nocc) p[i,kkk] = b0[sex[i],year[kkk-1]]; chi[i,nocc+1] = 1.0; k = nocc; while (k > firsti[i]) { chi[i,k] = (1 - phi[i,k-1]) + phi[i,k-1] * (1 - p[i,k]) * chi[i,k+1]; k = k - 1; } if (firsti[i]>1) { for (u in 1:(firsti[i]-1)){ chi[i,u] = 
0; } } chi[i,firsti[i]] = (1 - p[i,firsti[i]]) * chi[i,firsti[i]+1]; }// close definition of transformed parameters for identified individuals for(i in 1:nindni) { // linear predictor for p for non-identified individuals for(w in 1:firstni[i]){ p_male[i,w] = 1; p_female[i,w] = 1; } for(kkkk in (firstni[i]+1):nocc){ p_male[i,kkkk] = b0[1,year[kkkk-1]]; p_female[i,kkkk] = b0[2,year[kkkk-1]]; } chi_male[i,nocc+1] = 1.0; chi_female[i,nocc+1] = 1.0; k = nocc; while (k > firstni[i]) { chi_male[i,k] = (1 - phi_male[i,k-1]) + phi_male[i,k-1] * (1 - p_male[i,k]) * chi_male[i,k+1]; chi_female[i,k] = (1 - phi_female[i,k-1]) + phi_female[i,k-1] * (1 - p_female[i,k]) * chi_female[i,k+1]; k = k - 1; } if (firstni[i]>1) { for (u in 1:(firstni[i]-1)){ chi_male[i,u] = 0; chi_female[i,u] = 0; } } chi_male[i,firstni[i]] = (1 - p_male[i,firstni[i]]) * chi_male[i,firstni[i]+1]; chi_female[i,firstni[i]] = (1 - p_female[i,firstni[i]]) * chi_female[i,firstni[i]+1]; } // close definition of transformed parameters for non-identified individuals } // close block of transformed parameters exclusive parameter declarations } // close transformed parameters model { // priors theta ~ beta(1, 1); for (g in 1:(nocc-1)){ b0[1,g]~beta(1,1); b0[2,g]~beta(1,1); } a0[1,1]~normal(0,1.5); a0[1,2]~normal(0,1.5); a1[1,1]~normal(0,3); a1[1,2]~normal(0,3); a0[2,1]~normal(0,1.5); a0[2,2]~normal(a0[1,2],0.01); // for juveniles, we assume that the effect of the covariate is independet of sex a1[2,1]~normal(0,3); a1[2,2]~normal(a1[1,2],0.01); // likelihood for identified individuals for (i in 1:nindi) { if (lasti[i]>0) { for (k in firsti[i]:lasti[i]) { if(k>1) target+= (log(phi[i, k-1])); if (yi[i,k] == 1) target+=(log(p[i,k])); else target+=(log1m(p[i,k])); } } target+=(log(chi[i,lasti[i]+1])); } // likelihood for non-identified individuals for (i in 1:nindni) { real log_like_male = 0; real log_like_female = 0; if (lastni[i]>0) { for (k in firstni[i]:lastni[i]) { if(k>1){ log_like_male += (log(phi_male[i, 
k-1])); log_like_female += (log(phi_female[i, k-1])); } if (yni[i,k] == 1){ log_like_male+=(log(p_male[i,k])); log_like_female+=(log(p_female[i,k])); } else{ log_like_male+=(log1m(p_male[i,k])); log_like_female+=(log1m(p_female[i,k])); } } } log_like_male += (log(chi_male[i,lastni[i]+1])); log_like_female += (log(chi_female[i,lastni[i]+1])); target += log_mix(theta[i], log_like_male, log_like_female); } } 25.5 Call Stan from R, check convergence and look at results # Run STAN library(rstan) fit <- stan(file = "stanmodels/cmr_mixture_model.stan", data=datax, verbose = FALSE) # for above simulated data (2500 individuals x 15 time periods) # computing time is around 48 hours on an Intel Core i7 laptop # for larger data sets, we recommend moving the transformed parameters block # to the model block in order to avoid monitoring of p_male, p_female, # phi_male and phi_female producing memory problems # launch_shinystan(fit) # diagnostic plots summary(fit) ## mean se_mean sd 2.5% 25% ## b0[1,1] 0.60132367 0.0015709423 0.06173884 0.48042366 0.55922253 ## b0[1,2] 0.70098709 0.0012519948 0.04969428 0.60382019 0.66806698 ## b0[1,3] 0.50293513 0.0010904085 0.04517398 0.41491848 0.47220346 ## b0[1,4] 0.28118209 0.0008809447 0.03577334 0.21440931 0.25697691 ## b0[1,5] 0.34938289 0.0009901335 0.03647815 0.27819918 0.32351323 ## b0[1,6] 0.13158569 0.0006914740 0.02627423 0.08664129 0.11286629 ## b0[1,7] 0.61182981 0.0010463611 0.04129602 0.53187976 0.58387839 ## b0[1,8] 0.48535193 0.0010845951 0.04155762 0.40559440 0.45750793 ## b0[1,9] 0.52531291 0.0008790063 0.03704084 0.45247132 0.50064513 ## b0[1,10] 0.87174780 0.0007565552 0.03000936 0.80818138 0.85259573 ## b0[1,11] 0.80185454 0.0009425675 0.03518166 0.73173810 0.77865187 ## b0[1,12] 0.33152443 0.0008564381 0.03628505 0.26380840 0.30697293 ## b0[1,13] 0.42132288 0.0012174784 0.04140382 0.34062688 0.39305210 ## b0[1,14] 0.65180372 0.0015151039 0.05333953 0.55349105 0.61560493 ## b0[2,1] 0.34237039 0.0041467200 0.12925217 
0.12002285 0.24717176 ## b0[2,2] 0.18534646 0.0023431250 0.07547704 0.05924694 0.12871584 ## b0[2,3] 0.61351083 0.0024140550 0.07679100 0.46647727 0.56242546 ## b0[2,4] 0.37140208 0.0024464965 0.06962399 0.24693888 0.32338093 ## b0[2,5] 0.19428215 0.0034618302 0.11214798 0.02800056 0.11146326 ## b0[2,6] 0.27371336 0.0026553769 0.09054020 0.11827243 0.20785316 ## b0[2,7] 0.18611173 0.0014387436 0.05328492 0.09122869 0.14789827 ## b0[2,8] 0.25648337 0.0018258589 0.05287800 0.16255769 0.21913271 ## b0[2,9] 0.20378754 0.0021367769 0.07380004 0.07777998 0.15215845 ## b0[2,10] 0.52679548 0.0024625568 0.08696008 0.36214334 0.46594844 ## b0[2,11] 0.47393354 0.0032593161 0.10555065 0.28843967 0.39781278 ## b0[2,12] 0.22289155 0.0017082729 0.05551514 0.12576797 0.18203335 ## b0[2,13] 0.26191486 0.0024159794 0.07016314 0.14106495 0.21234017 ## b0[2,14] 0.65111737 0.0055743944 0.18780555 0.29279480 0.50957591 ## a0[1,1] 0.95440670 0.0013771881 0.04808748 0.86301660 0.92146330 ## a0[1,2] 0.01529770 0.0469699511 1.46995922 -2.82218067 -0.95533706 ## a0[2,1] 0.16384995 0.0049928331 0.12634422 -0.06399631 0.07533962 ## a0[2,2] 0.01535679 0.0469634175 1.47006964 -2.81864060 -0.95515751 ## a1[1,1] 0.15937249 0.0028992587 0.08864790 -0.01288607 0.10017613 ## a1[1,2] 0.08055953 0.1007089857 3.02148727 -5.95525636 -1.96662599 ## a1[2,1] -0.83614134 0.0074143920 0.18655882 -1.21033848 -0.95698565 ## a1[2,2] 0.08071668 0.1006904255 3.02145647 -5.94617355 -1.96508733 ## 50% 75% 97.5% n_eff Rhat ## b0[1,1] 0.60206306 0.6431566 0.7206343 1544.5301 1.002331 ## b0[1,2] 0.70165494 0.7355204 0.7946280 1575.4617 1.001482 ## b0[1,3] 0.50367411 0.5330078 0.5898079 1716.3196 1.001183 ## b0[1,4] 0.27997512 0.3046483 0.3544592 1649.0040 1.000760 ## b0[1,5] 0.34936442 0.3751935 0.4191138 1357.3073 1.002072 ## b0[1,6] 0.12987449 0.1481661 0.1873982 1443.8040 1.003676 ## b0[1,7] 0.61203228 0.6397577 0.6933929 1557.5904 1.001458 ## b0[1,8] 0.48513822 0.5134314 0.5672066 1468.1355 1.002511 ## b0[1,9] 
0.52534212 0.5501747 0.5994060 1775.7335 1.000824 ## b0[1,10] 0.87324112 0.8934047 0.9258033 1573.3747 1.000719 ## b0[1,11] 0.80300311 0.8261868 0.8675033 1393.1817 1.001172 ## b0[1,12] 0.33044476 0.3552199 0.4052902 1794.9956 1.000566 ## b0[1,13] 0.42116690 0.4492297 0.5026942 1156.5339 1.000289 ## b0[1,14] 0.64956850 0.6864706 0.7607107 1239.4056 1.004061 ## b0[2,1] 0.33493631 0.4251416 0.6150923 971.5524 1.004049 ## b0[2,2] 0.17981663 0.2358847 0.3446097 1037.6210 1.001474 ## b0[2,3] 0.61326419 0.6644156 0.7628427 1011.8737 1.005727 ## b0[2,4] 0.36837778 0.4158585 0.5190457 809.8949 1.003803 ## b0[2,5] 0.17910449 0.2591418 0.4533117 1049.4733 1.001499 ## b0[2,6] 0.26739172 0.3299594 0.4685139 1162.6006 1.001170 ## b0[2,7] 0.18254607 0.2198969 0.3003156 1371.6455 1.000878 ## b0[2,8] 0.25280556 0.2895585 0.3704113 838.7174 1.005624 ## b0[2,9] 0.19724053 0.2501298 0.3694806 1192.8747 1.003687 ## b0[2,10] 0.52587075 0.5845730 0.7061694 1247.0027 1.002851 ## b0[2,11] 0.46874445 0.5392302 0.7046892 1048.7425 0.999473 ## b0[2,12] 0.21961656 0.2580782 0.3397127 1056.1081 1.000907 ## b0[2,13] 0.25601959 0.3056204 0.4142888 843.3960 1.003130 ## b0[2,14] 0.65824835 0.7973674 0.9698829 1135.0669 1.003838 ## a0[1,1] 0.95368445 0.9862439 1.0515747 1219.2071 1.003898 ## a0[1,2] 0.01633534 0.9911055 2.9717839 979.4231 1.003726 ## a0[2,1] 0.15519648 0.2472483 0.4230776 640.3489 1.004625 ## a0[2,2] 0.01587281 0.9898084 2.9659552 979.8429 1.003744 ## a1[1,1] 0.15647489 0.2205720 0.3354845 934.8953 1.007190 ## a1[1,2] 0.06683287 2.1568781 6.0295208 900.1297 1.003701 ## a1[2,1] -0.83503982 -0.7075691 -0.4814539 633.1119 1.010568 ## a1[2,2] 0.06586905 2.1557247 6.0239735 900.4432 1.003704 "],["samplesize.html", "26 What sample size? 26.1 Introduction", " 26 What sample size? 26.1 Introduction What sample size is needed is an important question when planning an empirical study. Some authorities even ask for a justification for the planned sample size of an animal experiment. 
"],["referenzen.html", "References", " References Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell. 2009. Statistical Modelling in R. Oxford: Oxford University Press. Almasi, B, A Roulin, S Jenni-Eiermann, C W Breuner, and L Jenni. 2009. “Regulation of Free Corticosterone and CBG Capacity Under Different Environmental Conditions in Altricial Nestlings.” Gen. Comp. Endocr. 164: 117–24. Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Retire Statistical Significance.” Nature 567: 305–7. Anderson, J A. 1974. “Diagnosis by Logistic Discriminant Function: Further Practical Problems and Results.” Journal of Applied Statistics 23: 397–404. Betancourt, M. J. 2013. “Generalizing the No-U-Turn Sampler to Riemannian Manifolds.” ArXiv e-Prints, April. https://arxiv.org/abs/1304.1920. Betancourt, M. J., and M. Girolami. 2013. “Hamiltonian Monte Carlo for Hierarchical Models.” ArXiv e-Prints. https://arxiv.org/abs/1312.0906. Brilleman, Samuel L., Eren M. Elci, Jacqueline Buros Novik, and Rory Wolfe. 2020. “Bayesian Survival Analysis Using the rstanarm R Package.” http://arxiv.org/pdf/2002.09633v1. Davison, A C, and E J Snell. 1991. “Residuals and Diagnostics.” In Statistical Theory and Modelling. In Honour of Sir David Cox, FRS, edited by D V Hinkley, N Reid, and E J Snell. London: Chapman & Hall. Efron, Bradley, and Trevor Hastie. 2016. Computer age statistical inference: Algorithms, evidence, and data science. Institute of Mathematical Statistics Monographs. Ellenberg, H. 1953. “Physiologisches Und Oekologisches Verhalten Derselben Pflanzenarten.” Berichte Der Deutschen Botanischen Gesellschaft 65: 350361. Gabry, Jonah. 2017. “Shinystan: Interactive Visual and Numerical Diagnostics and Posterior Analysis for Bayesian Models.” Gelman, A. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models.” Bayesian Analysis 1: 515–33. Gelman, A., John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014a. 
Bayesian Data Analysis. Third. New York: CRC Press. Gelman, A, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2014b. Bayesian Data Analysis. Third. New York: CRC Press. Gelman, A, and J Hill. 2007. Data Analysis Using Regression and Multilevel / Hierarchical Models. Cambridge: Cambridge University Press. Gelman, Andrew, and Sander Greenland. 2019. “Are Confidence Intervals Better Termed Uncertainty Intervals?” BMJ (Clinical Research Ed.) 366: l5381. https://doi.org/10.1136/bmj.l5381. Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel / Hierarchical Models. Cambridge University Press. Gottschalk, Thomas, Klemens Ekschmitt, and Volkmar Wolters. 2011. “Efficient Placement of Nest Boxes for the Little Owl (Athene Noctua).” The Journal of Raptor Research 45: 1–14. Grüebler, Martin U, Fränzi Korner-Nievergelt, and Johann Von Hirschheydt. 2010. “The Reproductive Benefits of Livestock Farming in Barn Swallows Hirundo Rustica: Quality of Nest Site or Foraging Habitat?” Journal of Applied Ecology 47 (6): 1340–47. Harju, S. 2016. “Book review: Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan.” The Journal of Wildlife Management 80: 771. Harrison, Xavier A. 2014. “Using Observation-Level Random Effects to Model Overdispersion in Count Data in Ecology and Evolution.” PeerJ 2: e616. https://doi.org/10.7717/peerj.616. Hastie, T, R Tibshirani, and J Friedman. 2009. The Elements of Statistical Learning, Data Mining, Inference, and Prediction. New York: Springer. Hemming, Victoria, Abbey E. Camaclang, Megan S. Adams, Mark Burgman, Katherine Carbeck, Josie Carwardine, Iadine Chadès, et al. 2022. “An Introduction to Decision Science for Conservation.” Conservation Biology. John Wiley & Sons Inc. https://doi.org/10.1111/cobi.13868. Hoffman, Matthew D, and Andrew Gelman. 2014. 
“The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–623. Hoyle, Rick H. 2012. Handbook of Structural Equation Modeling. New York: The Guilford Press. Jenni, L, and R Winkler. 1989. “The Feather-Length of Small Passerines: A Measurement for Wing-Length in Live Birds and Museum Skins.” Bird Study 36: 1–15. Korner-Nievergelt, F, T Roth, Stefanie von Felten, J Guélat, B Almasi, and P Korner-Nievergelt. 2015. Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan. New York: Elsevier. Lemoine, Nathan P. 2019. “Moving Beyond Noninformative Priors: Why and How to Choose Weakly Informative Priors in Bayesian Analyses.” Oikos 128 (7): 912–28. https://doi.org/10.1111/oik.05985. MacKenzie, Darryl I, James D Nichols, G B Lachman, S Droege, J A Royle, and C A Langtimm. 2002. “Estimating Site Occupancy Rates When Detection Probabilities Are Less Than One.” Ecology 83: 2248–55. Manly, Bryan F J. 1994. Multivariate Statistical Methods, A Primer. London: 2nd ed. Chapman & Hall. Mayfield, Harold F. 1975. “Suggestions for Calculating Nest Success.” Wilson Bulletin 87: 456–66. McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. New York: Max Planck Institute for Evolutionary Anthropology; CRC Press. Nakagawa, Shinichi, and Holger Schielzeth. 2013. “A General and Simple Method for Obtaining R2 from Generalized Linear Mixed-Effects Models.” Methods in Ecology and Evolution 4: 133–42. https://doi.org/10.1111/j.2041-210x.2012.00261.x. Pledger, S., K. H. Pollock, and James L. Norris. 2003. “Open Capture-Recapture Models with Heterogeneity: I. Cormack-Jolly-Seber Model.” Biometrics 59: 786–94. Royle, J Andrew. 2004. “N-Mixture Models for Estimating Population Size from Spatially Replicated Counts.” Biometrics 60: 108–15. Schano, Christian, Carole Niffenegger, Tobias Jonas, and Fränzi Korner-Nievergelt. 2021. 
“Hatching phenology is lagging behind an advancing snowmelt pattern in a high-alpine bird.” Scientific Reports 11 (1): 20130016. https://doi.org/10.1038/s41598-021-01497-8. Shaffer, Terry L. 2004. “A Unified Approach to Analyzing Nest Success.” The Auk 121: 526–40. Shipley, Bill. 2009. “Confirmatory path analysis in a generalized multilevel context.” Ecology 90: 363–68. Thomson, D L, M J Conroy, D R Anderson, K P Burnham, E G Cooch, C M Francis, J.-D. Lebreton, et al. 2009. “Standardising Terminology and notation for the Analysis of Demographic Processes in Marked Populations.” In Modeling Demographic Processes in Marked Populations, edited by D L Thomson, E G Cooch, and M J Conroy, 1099–1106. Environmental and Ecological Statistics 3. Berlin: Springer. Tredennick, Andrew T., Giles Hooker, Stephen P. Ellner, and Peter B. Adler. 2021. “A practical guide to selecting models for exploration, inference, and prediction in ecology.” Ecology 102 (6). https://doi.org/10.1002/ecy.3336. Walters, G. 2012. “Customary Fire Regimes and Vegetation Structure in Gabon’s Bateke Plateaux.” Human Ecology 40: 943–55. Zbinden, Niklaus, Marco Salvioni, Fränzi Korner-Nievergelt, and Verena Keller. 2018. “Evidence for an Additive Effect of Hunting Mortality in an Alpine Black Grouse Lyrurus Tetrix Population.” Wildlife Biology 2018: xx–xxx. Zeileis, Achim, Christian Kleiber, and Simon Jackman. 2008. “Regression Models for Count Data in R.” Journal of Statistical Software 27: 1–25. Zollinger, J.-L., S. Birrer, N. Zbinden, and F. Korner-Nievergelt. 2013. “The Optimal Age of Sown Field Margins for Breeding Farmland Birds.” Ibis 155 (4). https://doi.org/10.1111/ibi.12072. Zuur, Alain F, Elena N Ieno, Neil J Walker, Anatoly A Saveliev, and Graham M Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. Springer. "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). 
You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]]
diff --git a/docs/spatial-glmm.html b/docs/spatial-glmm.html
deleted file mode 100644
index 5c1d62f..0000000
--- a/docs/spatial-glmm.html
+++ /dev/null
@@ -1,459 +0,0 @@
- 21 Modeling spatial data using GLMM | Bayesian Data Analysis in Ecology with R and Stan
The statistical model is written in the Stan language and saved in a text file. The Stan language is rather strict, forcing the user to write unambiguous models. Stan is very well documented and the Stan Documentation contains a comprehensive Language Manual, a Wiki documentation and various tutorials.
Here we provide a normal regression with one predictor variable as a worked example. The entire Stan model is as follows (saved as linreg.stan):
A Stan model consists of different named blocks. These blocks are (from first to last): data, transformed data, parameters, transformed parameters, model, and generated quantities. The blocks must appear in this order. The model block is mandatory; all other blocks are optional.
In the data block, the type, dimension, and name of every variable have to be declared. Optionally, the range of possible values can be specified. For example, vector[N] y; means that y is a vector (type real) of length N, and int<lower=0> N; means that N is an integer with nonnegative values (the bounds, here 0, are included). Note that restricting a variable to a range of possible values is not strictly necessary, but it helps to specify the correct model and it improves speed. We also see that each statement needs to be closed by a semicolon. In the parameters block, all model parameters have to be defined. The coefficients of the linear predictor constitute a vector of length 2, vector[2] beta;. Alternatively, real beta[2]; could be used (written array[2] real beta; in recent Stan versions). The sigma parameter is a single number that has to be positive, therefore real<lower=0> sigma;.
The model block contains the model specification. Stan functions can handle vectors, so we do not have to loop over all observations as is typical in BUGS. Here, we use a Cauchy distribution as a prior distribution for sigma. This distribution can take negative values, but because we defined the lower limit of sigma to be 0 in the parameters block, the prior distribution actually used in the model is a truncated Cauchy distribution (truncated at zero). In Chapter 10.2 we explain how to choose prior distributions.
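The listing of linreg.stan itself is not reproduced in this excerpt. As a minimal sketch consistent with the description above, the following R code writes a Stan program containing the declarations quoted in the text to a file. The predictor name x, the normal prior on beta, the scale of the Cauchy prior, and the temporary file path are assumptions made for illustration; this is not necessarily the authors' file.

```r
# Hypothetical reconstruction of a linear regression model in Stan,
# following the block structure described in the text.
stan_code <- "
data {
  int<lower=0> N;        // number of observations
  vector[N] y;           // outcome
  vector[N] x;           // predictor (name assumed)
}
parameters {
  vector[2] beta;        // intercept and slope
  real<lower=0> sigma;   // residual standard deviation
}
model {
  // priors (the specific scales are assumptions)
  beta ~ normal(0, 5);
  sigma ~ cauchy(0, 5);  // truncated at zero by the lower bound on sigma
  // likelihood, vectorized over all observations
  y ~ normal(beta[1] + beta[2] * x, sigma);
}
"

# Save the model to a text file (a temporary path is used here)
model_file <- file.path(tempdir(), "linreg_sketch.stan")
writeLines(stan_code, model_file)
```

The names in the data list passed to stan() have to match the names declared in the data block. A list with an element called n, for example, would require the declaration int<lower=0> n; instead of N.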
We fit the model to simulated data. Stan needs a vector containing the names of the data objects. In our case, x, y, and N are objects that exist in the R console.
The function stan() starts Stan and returns an object containing MCMCs for every model parameter. We have to specify the name of the file that contains the model specification, the data, the number of chains, and the number of iterations per chain we would like to have. The first half of the iterations of each chain is declared as the warm-up. During the warm-up, Stan is not simulating a Markov chain, because in every step the algorithm is adapted. After the warm-up the algorithm is fixed and Stan simulates Markov chains.
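To make the bookkeeping of warm-up and saved iterations concrete, the following sketch computes how many posterior draws remain after the warm-up is discarded. It assumes the rstan defaults (4 chains, 2000 iterations per chain, the first half used as warm-up); the variable names are chosen for illustration only.

```r
# Bookkeeping for the number of saved draws, using the rstan defaults
chains <- 4                    # default number of chains in stan()
iter   <- 2000                 # default iterations per chain, warm-up included
warmup <- floor(iter / 2)      # default: first half of each chain is warm-up

draws_per_chain   <- iter - warmup
post_warmup_draws <- chains * draws_per_chain
post_warmup_draws

# The same settings written out explicitly (needs rstan and a model file,
# so the call is left commented out here):
# fit <- stan(file = "stanmodels/linreg.stan", data = datax,
#             chains = chains, iter = iter, warmup = warmup)
```

Only the post-warm-up draws are used for inference, so increasing iter while keeping the default warm-up fraction increases both the adaptation phase and the number of saved draws.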
-
library(rstan)
-
-# Simulate fake data
-n <- 50 # sample size
-sigma <- 5 # standard deviation of the residuals
-b0 <- 2 # intercept
-b1 <- 0.7 # slope
-
-x <- runif(n, 10, 30) # random numbers of the covariate
-simresid <- rnorm(n, 0, sd = sigma) # residuals
-
-y <- b0 + b1 * x + simresid # calculate y, i.e. the data
-
-# Bundle data into a list
-datax <- list(n = length(y), y = y, x = x)
-
-# Run STAN
-fit <- stan(file = "stanmodels/linreg.stan", data = datax, verbose = FALSE)
+
library(rstan)
+
+# Simulate fake data
+n <- 50 # sample size
+sigma <- 5 # standard deviation of the residuals
+b0 <- 2 # intercept
+b1 <- 0.7 # slope
+
+x <- runif(n, 10, 30) # random numbers of the covariate
+simresid <- rnorm(n, 0, sd = sigma) # residuals
+
+y <- b0 + b1 * x + simresid # calculate y, i.e. the data
+
+# Bundle data into a list
+datax <- list(n = length(y), y = y, x = x)
+
+# Run STAN
+fit <- stan(file = "stanmodels/linreg.stan", data = datax, verbose = FALSE)
##
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
## Chain 1:
diff --git a/docs/zeroinflated-poisson-lmm.html b/docs/zeroinflated-poisson-lmm.html
index a9b6432..2fa7ada 100644
--- a/docs/zeroinflated-poisson-lmm.html
+++ b/docs/zeroinflated-poisson-lmm.html
@@ -23,7 +23,7 @@
-
+
@@ -262,7 +262,7 @@
Usually we describe the outcome variable with a single distribution, such as the normal distribution in the case of linear (mixed) models, and Poisson or binomial distributions in the case of generalized linear (mixed) models. In life sciences, however, quite often the data are actually generated by more than one process. In such cases the distribution of the data could be the result of two or more different distributions. If we do not account for these different processes, our inferences are likely to be biased. In this chapter, we introduce a mixture model that explicitly includes two processes that generated the data. The zero-inflated Poisson model is a mixture of a binomial and a Poisson distribution. We believe that two (or more)-level models are very useful tools in life sciences because they can help uncover the different processes that generate the data we observe.
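The idea of two data-generating processes can be illustrated with a small simulation. The sketch below is not taken from the chapter; the sample size, the zero-inflation probability theta, and the Poisson mean lambda are assumed values chosen for illustration. A binomial (Bernoulli) process decides whether an observation is a structural zero, and a Poisson process generates the counts otherwise.

```r
set.seed(1)
n      <- 1000   # number of observations (assumed)
theta  <- 0.3    # probability of a structural zero (assumed)
lambda <- 3      # mean of the Poisson count process (assumed)

# Process 1: is the observation a structural zero?
structural_zero <- rbinom(n, size = 1, prob = theta)

# Process 2: Poisson counts, observed only where process 1 allows it
y <- ifelse(structural_zero == 1, 0, rpois(n, lambda))

# Proportion of zeros in the mixture vs. under a pure Poisson distribution
prop_zero_mixture <- mean(y == 0)
prop_zero_poisson <- dpois(0, lambda)
c(mixture = prop_zero_mixture, poisson = prop_zero_poisson)
```

The simulated data contain many more zeros than a Poisson distribution with the same lambda would produce, which is the pattern a zero-inflated Poisson model is designed to capture.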
diff --git a/references/References_new.bib b/references/References_new.bib
index 6b24ecc..4070984 100644
--- a/references/References_new.bib
+++ b/references/References_new.bib
@@ -112,6 +112,13 @@ @article{Ellenberg1953
year = {1953},
}
+@misc{StanDevelopmentTeam.2017b,
+ title = {shinystan: Interactive Visual and Numerical Diagnostics and Posterior Analysis for Bayesian Models},
+ author = {Gabry, Jonah},
+ date = {2017},
+ note = {Place: https://mc-stan.org},
+}
+
@article{Gelfand1990,
author = {Gelfand, A. E. and Hills, S. E. and Racine-Poon, A. and Smith, A. F. M.},