# (PART) BASIC STATISTICS FOR ECOLOGISTS {-}

# Introduction to PART I {#PART-I}


------

During our courses we are sometimes asked to give an introduction to R-related topics covering data analysis, the presentation of results, or rather specialist subjects in ecology. In this part we present these collected introductions and try to keep them updated. This is also a commented collection of R code that we documented for our own work. We hope this might be useful for other readers as well.


## Further reading
- [R for Data Science by Garrett Grolemund and Hadley Wickham](http://r4ds.had.co.nz): Introduces the tidyverse framework. It explains how to get data into R, get it into the most useful structure, transform it, visualise it and model it.
- A really good introductory book on Bayesian data analysis is the one by McElreath (2016). It starts with a thorough introduction to applying the Bayes theorem for drawing inference from data. In addition, it carefully discusses what can and what cannot be concluded from statistical results. We like this very much.
- The developer of the `brms` package, Paul Bürkner, is writing a book that is already partly available online. It is a helpful cookbook with understandable explanations. We very much look forward to the finished book, which may bundle all the helpful vignettes and help files for the functions of the `brms` package.
- We like looking up statistical methods in papers and books written by Andrew Gelman (e.g. A. Gelman et al. 2014b) and Trevor Hastie (e.g. Efron and Hastie 2016) because both explain complicated things in a concise and understandable way.

# Introduction to PART II {#PART-II}

This part is a collection of more complicated ecological models for analysing data that may not be analysed with the traditional linear models that we covered in PART I of this book.

It is unavoidable that different authors use different notations for the same thing, or that the same notation is used for different things. We try to use, whenever possible, the notation that is commonly used at the International Statistical Ecology Congress (ISEC). Resulting from an earlier ISEC, Thomson et al. (2009) give guidelines on which letter should be used for which parameter in order to achieve a standard notation, at least among people working with classical mark-recapture models. However, the alphabet has fewer letters than there are ecological parameters. Therefore, the same letter cannot stand for the same parameter across all papers, books and chapters. Here, we try to use the same letter for the same parameter within the same chapter.
THIS CHAPTER IS UNDER CONSTRUCTION!!!

We should provide an example in Stan.

#------------------------------------------------------------------------------------------------------
# General settings
#------------------------------------------------------------------------------------------------------
library(MASS)
library(rjags)
library(MCMCpack)

#------------------------------------------------------------------------------------------------------
# Simulation
#------------------------------------------------------------------------------------------------------
n <- 100
heffM <- 0.6   # effect of H on M
heffCS <- 0.0  # effect of H on clutch size
meffCS <- 0.6  # effect of M on clutch size

SigmaM <- matrix(c(0.1,0.04,0.04,0.1),2,2)
meffm1 <- 0.6
meffm2 <- 0.7

SigmaH <- matrix(c(0.1,0.04,0.04,0.1),2,2)
meffh1 <- 0.6
meffh2 <- -0.7

# Latent variables
H <- rnorm(n, 0, 1)
M <- rnorm(n, heffM * H, 0.1)

# Clutch size
CS <- rnorm(n, heffCS * H + meffCS * M, 0.1)

# Indicators
eM <- cbind(meffm1 * M, meffm2 * M)
datM <- matrix(NA, ncol = 2, nrow = n)
eH <- cbind(meffh1 * H, meffh2 * H)
datH <- matrix(NA, ncol = 2, nrow = n)
for(i in 1:n) {
  datM[i,] <- mvrnorm(1, eM[i,], SigmaM)
  datH[i,] <- mvrnorm(1, eH[i,], SigmaH)
}

#------------------------------------------------------------------------------
# JAGS model
#------------------------------------------------------------------------------
dat <- list(datM = datM, datH = datH, n = n, CS = CS, #H = H, M = M,
            S3 = matrix(c(1,0,0,1),nrow=2)/1)

# Function to create initial values
inits <- function() {
  list(
    meffh = runif(2, 0, 0.1),
    meffm = runif(2, 0, 0.1),
    heffM = runif(1, 0, 0.1),
    heffCS = runif(1, 0, 0.1),
    meffCS = runif(1, 0, 0.1),
    tauCS = runif(1, 0.1, 0.3),
    tauMH = runif(1, 0.1, 0.3),
    tauH = rwish(3, matrix(c(.02,0,0,.04), nrow=2)),
    tauM = rwish(3, matrix(c(.02,0,0,.04), nrow=2))
#   M = as.numeric(rep(0, n))
    )
}

t.n.thin <- 50
t.n.chains <- 2
t.n.burnin <- 20000
t.n.iter <- 50000

# Run JAGS
jagres <- jags.model('JAGS/BUGSmod1.R', data = dat, n.chains = t.n.chains, inits = inits, n.adapt = t.n.burnin)

params <- c("meffh", "meffm", "heffM", "heffCS", "meffCS")
mod <- coda.samples(jagres, params, n.iter = t.n.iter, thin = t.n.thin)
res <- round(data.frame(summary(mod)$quantiles[, c(3, 1, 5)]), 3)
res$TRUEVALUE <- c(heffCS, heffM, meffCS, meffh1, meffh2, meffm1, meffm2)
res

# Traceplots
post <- data.frame(rbind(mod[[1]], mod[[2]]))
names(post) <- dimnames(mod[[1]])[[2]]
par(mfrow = c(3,3))
param <- c("meffh[1]", "meffh[2]", "meffm[1]", "meffm[2]", "heffM", "heffCS", "meffCS")
traceplot(mod[, match(param, names(post))])
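The model file `JAGS/BUGSmod1.R` that is read by `jags.model` is not shown in this chapter. The block below is only a hypothetical reconstruction of what such a file could contain, inferred from the simulation code and the initial values above (two latent variables H and M with two indicators each, scalar precisions tauCS and tauMH, and Wishart-distributed precision matrices tauH and tauM); the prior distributions are generic choices and not necessarily the ones used originally.

```r
# Hypothetical content of 'JAGS/BUGSmod1.R' (BUGS language stored in an .R file,
# as in the original path); structure and priors are assumptions.
model {
  for(i in 1:n) {
    H[i] ~ dnorm(0, 1)                                    # latent variable H
    M[i] ~ dnorm(heffM * H[i], tauMH)                     # latent variable M, depends on H
    CS[i] ~ dnorm(heffCS * H[i] + meffCS * M[i], tauCS)   # clutch size
    muH[i,1] <- meffh[1] * H[i]
    muH[i,2] <- meffh[2] * H[i]
    datH[i,1:2] ~ dmnorm(muH[i,1:2], tauH[1:2,1:2])       # two indicators of H
    muM[i,1] <- meffm[1] * M[i]
    muM[i,2] <- meffm[2] * M[i]
    datM[i,1:2] ~ dmnorm(muM[i,1:2], tauM[1:2,1:2])       # two indicators of M
  }
  # priors (generic, weakly informative choices)
  heffM ~ dnorm(0, 0.01)
  heffCS ~ dnorm(0, 0.01)
  meffCS ~ dnorm(0, 0.01)
  for(j in 1:2) {
    meffh[j] ~ dnorm(0, 0.01)
    meffm[j] ~ dnorm(0, 0.01)
  }
  tauCS ~ dgamma(0.01, 0.01)
  tauMH ~ dgamma(0.01, 0.01)
  tauH[1:2,1:2] ~ dwish(S3[,], 3)
  tauM[1:2,1:2] ~ dwish(S3[,], 3)
}
```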
THIS CHAPTER IS UNDER CONSTRUCTION!!!
When testing for a correlation between two categorical variables, the null hypothesis is "there is no correlation". The data can be displayed in cross-tables.

# Example: correlation between birthday preference and car ownership
load("RData/datacourse.RData")
table(dat$birthday, dat$car)
##          
##           N Y
##   flowers 6 1
##   wine    9 6

If the null hypothesis were true, we would expect the distribution of the data in each column of the cross-table to be similar to the distribution of the row sums, and the distribution of the data in each row to be similar to the distribution of the column sums. The chi-square test statistic \(\chi^2\) measures the deviation of the data from this expected distribution of the data in the cross-table.

To calculate the chi-square test statistic \(\chi^2\), we first have to obtain for each cell of the cross-table the expected value \(E_{ij} = \frac{\textrm{row sum}_i \times \textrm{column sum}_j}{\textrm{total}}\).

\(\chi^2\) measures the difference between the observed \(O_{ij}\) and expected \(E_{ij}\) values as \(\chi^2=\sum_{i=1}^{m}\sum_{j=1}^{k}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\), where \(m\) is the number of rows and \(k\) is the number of columns.

The \(\chi^2\)-distribution has one parameter, the degrees of freedom \(v = (m-1)(k-1)\).
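As a cross-check of these formulas, the expected values and the \(\chi^2\) value can also be computed by hand in R. The sketch below gives the uncorrected \(\chi^2\) value; note that `chisq.test` applied to a 2x2 table uses Yates' continuity correction by default, so its output further below differs slightly.

```r
# Computing E_ij and the chi-square value by hand (no continuity correction)
tab <- table(dat$birthday, dat$car)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # E_ij = rowsum * colsum / total
chi2 <- sum((tab - expected)^2 / expected)
chi2
# p-value: upper tail of the chi-square distribution with (m-1)(k-1) df
pchisq(chi2, df = (nrow(tab) - 1) * (ncol(tab) - 1), lower.tail = FALSE)
```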
Figure 5.1: Two examples of chi-square distributions.

R calculates the \(\chi^2\) value for a specific cross-table and also gives the p-value, i.e., the probability of obtaining the observed or a larger \(\chi^2\) value given the null hypothesis were true, by comparing the observed \(\chi^2\) with the corresponding chi-square distribution.

chisq.test(table(dat$birthday, dat$car))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(dat$birthday, dat$car)
## X-squared = 0.51084, df = 1, p-value = 0.4748

The warning (which is suppressed in the rmarkdown version, but which you will see if you run the code on your own computer) is given because in our example some cells have counts of less than 5. In such cases, Fisher's exact test should be preferred. This test calculates the p-value analytically using probability theory, whereas the chi-square test relies on the assumption that the \(\chi^2\) value follows a chi-square distribution. The latter assumption holds better for larger sample sizes.

fisher.test(table(dat$birthday, dat$car))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(dat$birthday, dat$car)
## p-value = 0.3501
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    0.3153576 213.8457248
## sample estimates:
## odds ratio 
##   3.778328

For a Bayesian analysis of cross-table data, a data model has to be found. There are several possibilities that could be used, for example a log-linear (Poisson) model for the counts in each cell of the cross-table, or a binomial or multinomial model for the proportions of the categories; the first two are shown below.

These models provide possibilities to explore the patterns in the data in more detail than a chi-square test.
# We arrange the data into a cross-table in a data-frame
# format. That is, the counts in each cell of the
# cross-table become a variable and the row and column names
# are also given in separate variables
datagg <- aggregate(dat$name_fictive, list(birthday=dat$birthday, car=dat$car),
                    length, drop=FALSE)
datagg$x[is.na(datagg$x)] <- 0
names(datagg) <- c("birthday", "car", "count")
datagg
##   birthday car count
## 1  flowers   N     6
## 2     wine   N     9
## 3  flowers   Y     1
## 4     wine   Y     6

# log-linear model
library(arm)
nsim <- 5000

mod <- glm(count~birthday+car + birthday:car,
           data=datagg, family=poisson)
bsim <- sim(mod, n.sim=nsim)
round(t(apply(bsim@coef, 2, quantile,
              prob=c(0.025, 0.5, 0.975))),2)
##                    2.5%   50% 97.5%
## (Intercept)        1.00  1.79  2.58
## birthdaywine      -0.64  0.41  1.48
## carY              -3.94 -1.79  0.29
## birthdaywine:carY -0.94  1.41  3.76
The interaction parameter measures the strength of the correlation. To understand quantitatively what a parameter value of 1.41 means, we have to look at the interpretation of all parameter values. We do that here quickly without a thorough explanation, because we already explained the Poisson model in chapter 8 of Korner-Nievergelt et al. (2015).

The intercept 1.79 corresponds to the logarithm of the count in the cell "flowers" and "N" (number of students who prefer flowers as a birthday present and who do not have a car), i.e., \(exp(\beta_0)\) = 6. The exponent of the second parameter corresponds to the multiplicative difference between the counts in the cells "flowers and N" and "wine and N", i.e., count in the cell "wine and N" = \(exp(\beta_0)exp(\beta_1)\) = exp(1.79)exp(0.41) = 9. The third parameter measures the multiplicative difference in the counts between the cells "flowers and N" and "flowers and Y", i.e., count in the cell "flowers and Y" = \(exp(\beta_0)exp(\beta_2)\) = exp(1.79)exp(-1.79) = 1. Thus, the third parameter is the difference in the logarithm of the counts between the car owners and the car-free students for those who prefer flowers. The interaction parameter is the difference of this difference between the students who prefer wine and those who prefer flowers. This is difficult to grasp intuitively. Another way to put it: the interaction parameter measures the difference in the logarithm of the counts in the cross-table between the row-differences between the columns. Maybe it becomes clear when we reconstruct the count in the cell "wine and Y" from the model parameters: \(exp(\beta_0)exp(\beta_1)exp(\beta_2)exp(\beta_3)\) = exp(1.79)exp(0.41)exp(-1.79)exp(1.41) \(\approx\) 6.
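To follow this arithmetic, the four cell counts can be reconstructed from the model in R. The sketch below uses the maximum likelihood point estimates (`coef(mod)`) of the saturated Poisson model, which reproduce the observed counts exactly; the quantiles shown above come from simulated posterior draws and therefore match only approximately.

```r
# Reconstructing the cell counts from the parameters of the log-linear model
b <- coef(mod)   # order: intercept, birthdaywine, carY, birthdaywine:carY
exp(b[1])                         # flowers & N
exp(b[1] + b[2])                  # wine & N
exp(b[1] + b[3])                  # flowers & Y
exp(b[1] + b[2] + b[3] + b[4])    # wine & Y
```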
Alternatively, we could estimate the proportions of students preferring flowers over wine within each group of car owners and car-free students using a binomial model. For an explanation of the binomial model, see chapter 8 of Korner-Nievergelt et al. (2015).

# binomial model
tab <- table(dat$car,dat$birthday)
mod <- glm(tab~rownames(tab), family=binomial)
bsim <- sim(mod, n.sim=nsim)

Figure 5.2: Estimated proportion of students that prefer flowers over wine as a birthday present among the car-free students (N) and the car owners (Y). Given is the median of the posterior distribution (circle). The bar extends between the 2.5% and 97.5% quantiles of the posterior distribution.
Monte Carlo integration is the numerical solution of an integral, such as \(\int_{-1}^{1.5} F(x) dx\), by simulation. The function `sim` solves such a mathematical problem by simulation: to obtain the marginal posterior distribution of a parameter, e.g. \(\mu\), it draws a large number of random values from the posterior distribution defined by the Bayes theorem,

\(p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}\)

For example, for one coin flip (Bernoulli model) with data \(y=0\) (a tail), the likelihood is \(p(y|\theta)=\theta^y(1-\theta)^{(1-y)}\).
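The following minimal sketch illustrates the idea of Monte Carlo integration for the integral written above; the function \(F\) is chosen arbitrarily here (a standard normal density) just for illustration.

```r
# Monte Carlo integration of F over [-1, 1.5]: average the function values at
# uniform random points and scale by the length of the interval
F <- function(x) dnorm(x)          # an arbitrary example function
set.seed(123)
x <- runif(100000, -1, 1.5)
(1.5 - (-1)) * mean(F(x))          # Monte Carlo approximation
integrate(F, -1, 1.5)$value        # deterministic numerical integration, for comparison
```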
The aim of an ANOVA is to compare means of groups. In a frequentist analysis, this is done by comparing the between-group with the within-group variance. The result of a Bayesian analysis is the joint posterior distribution of the group means.
Figure 5.3: Number of stats courses students have taken before starting a PhD in relation to their feeling about statistics.
In the frequentist ANOVA, the following three sums of squared distances (SS) are used to calculate the total, the between-group and the within-group variance:

Total sum of squares = SST = \(\sum_{i=1}^n{(y_i-\bar{y})^2}\)

Within-group SS = SSW = \(\sum_{i=1}^n{(y_i-\bar{y}_{j[i]})^2}\): unexplained variance (\(\bar{y}_{j[i]}\) is the mean of the group that observation \(i\) belongs to)

Between-group SS = SSB = \(\sum_{j=1}^g{n_j(\bar{y}_j-\bar{y})^2}\): explained variance (\(g\) is the number of groups and \(n_j\) the number of observations in group \(j\))

The between-group and within-group SS sum to the total sum of squares: SST = SSB + SSW. Attention: this equation holds in all cases only for a simple one-way ANOVA (just one grouping factor). If the data are grouped according to more than one factor (such as in a two- or three-way ANOVA), then the equation has a unique solution only when the data are completely balanced, i.e., when there are the same number of observations in all combinations of factor levels. For non-balanced data with more than one grouping factor, there are different ways of calculating the SSBs, and the result of the F-test described below depends on the order of the predictors in the model.

Figure 5.4: Visualisation of the total, between-group and within-group sum of squares. Points are observations; the long horizontal line is the overall mean; the short horizontal lines are group-specific means.

In order to make SSB and SSW comparable, we have to divide them by their degrees of freedom. For the within-group SS, SSW, the degrees of freedom is the number of observations minus the number of groups (\(g\)), because \(g\) means have been estimated from the data. If the \(g\) means are fixed and \(n-g\) data points are known, then the last \(g\) data points are defined, i.e., they cannot be chosen freely. For the between-group SS, SSB, the degrees of freedom is the number of groups minus 1 (the minus 1 stands for the overall mean).

It can be shown (by mathematicians) that, if the null hypothesis that the means of all groups are equal (\(m_1 = m_2 = m_3\)) is true, then the mean squared error between groups (MSB, i.e., SSB divided by its degrees of freedom) is expected to be equal to the mean squared error within groups (MSW). Therefore, the ratio MSB/MSW is expected to follow an F-distribution if the null hypothesis is true.
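The following sketch illustrates these quantities with made-up data (three groups of ten observations): it computes SST, SSB and SSW by hand and compares the resulting F-ratio with the one reported by the classical one-way ANOVA in R.

```r
set.seed(1)
y <- rnorm(30, mean = rep(c(10, 12, 11), each = 10))   # invented example data
group <- factor(rep(c("A", "B", "C"), each = 10))
SST <- sum((y - mean(y))^2)
SSW <- sum((y - ave(y, group))^2)                      # ave() returns each observation's group mean
SSB <- sum(table(group) * (tapply(y, group, mean) - mean(y))^2)
c(SST = SST, check = SSB + SSW)                        # SST = SSB + SSW
MSB <- SSB / (3 - 1)                                   # between-group mean square
MSW <- SSW / (30 - 3)                                  # within-group mean square
MSB / MSW                                              # F-ratio
anova(lm(y ~ group))                                   # gives the same F value
```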
The Bayesian analysis for comparing group means consists of calculating the posterior distribution for each group mean and then drawing inference from these posterior distributions.

A Bayesian one-way ANOVA involves the following steps:

1. Decide on a data model: We here assume that the measurements are normally distributed around the group means. In this example, we transform the outcome variable in order to better meet the normal assumption. Note: the frequentist ANOVA makes exactly the same assumptions. We can write the data model as \(y_i\sim Norm(\mu_i,\sigma)\) with \(\mu_i= \beta_0 + \beta_1I(group=2) +\beta_2I(group=3)\), where the \(I()\)-function is an indicator function taking on 1 if the expression is true and 0 otherwise. This model has 4 parameters: \(\beta_0\), \(\beta_1\), \(\beta_2\) and \(\sigma\).

2. Choose a prior distribution for each model parameter: In this example, we choose flat prior distributions for each parameter. By using these priors, the result should not be markedly affected by the prior distributions but almost exclusively reflect the information in the data. We choose so-called improper prior distributions. These are completely flat distributions that give all parameter values the same probability. Such distributions are called improper because the area under the curve does not sum to 1 and, therefore, they cannot be considered proper probability distributions. However, they can still be used to solve the Bayes theorem.

3. Solve the Bayes theorem: The solution of the Bayes theorem for the above priors and model is implemented in the function `sim` of the package `arm`.
# calculate numerically the posterior distributions of the model
# parameters using flat prior distributions
# (mod is the linear model of log(number of courses + 1) on statsfeeling,
# fitted to the data beforehand)
nsim <- 5000
set.seed(346346)
bsim <- sim(mod, n.sim=nsim)
# calculate group means from the model parameters
newdat <- data.frame(statsfeeling=levels(factor(dat$statsfeeling)))
X <- model.matrix(~statsfeeling, data=newdat)
fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
for(i in 1:nsim) fitmat[,i] <- X%*%bsim@coef[i,]
hist(fitmat[1,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA,
     xlab="Group mean of log(number of courses +1)", las=1, ylim=c(0, 2.2))
hist(fitmat[2,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA, xlab="", las=1,
     add=TRUE, col=rgb(0,0,1,0.5))
hist(fitmat[3,], freq=FALSE, breaks=seq(-2.5, 4.2, by=0.1), main=NA, xlab="", las=1,
     add=TRUE, col=rgb(1,0,0,0.5))
legend(2,2, fill=c("white",rgb(0,0,1,0.5), rgb(1,0,0,0.5)),
       legend=levels(factor(dat$statsfeeling)))
Figure 5.5: Posterior distributions of the mean number of stats courses PhD students visited before starting the PhD, grouped according to their feelings about statistics.

Based on the posterior distributions of the group means, we can extract derived quantities depending on our interest and questions. Here, for example, we could extract the posterior probabilities of the hypotheses that students with a positive feeling about statistics have a better education in statistics than those with a neutral or with a negative feeling about statistics.

## [1] 0.8754

## [1] 0.9798
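The two probabilities above were obtained from the simulated group means in `fitmat`. A hypothetical sketch of how such probabilities can be computed is given below; which row of `fitmat` corresponds to which feeling depends on the (alphabetical) ordering of the factor levels in `newdat`, so the row indices used here are assumptions.

```r
# Posterior probability that the group mean of the "positive" group is larger
# than that of the "negative" and the "neutral" group, respectively
mean(fitmat[3,] > fitmat[1,])
mean(fitmat[3,] > fitmat[2,])
```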
In this chapter we provide a checklist with some guidance for data analysis. However, do not expect the list to be complete; for different studies, a different order of the steps may make more sense. We usually repeat steps 3.2 to 3.8 until we find a model that fits the data well and that is realistic enough to be useful for the intended purpose. Data analysis is always a lot of work and, often, the following steps have to be repeated many times until we find a useful model.

This is a chance and a danger at the same time: we may find interesting results that answer different questions than we asked originally. They may be very exciting and important, however they may be biased. We can report such findings, but we should state that they appeared (more or less by chance) during the data exploration and model fitting phase, and we have to be aware that the estimates may be biased because the study was not optimally designed with respect to these findings. It is important to always keep the original aim of the study in mind. Do not adjust the study question according to the data. We also recommend reporting what the model started with at the first iteration and describing the strategy and reasoning behind the model development process.

Prepare the data and check graphically, or via summary statistics, whether all the data are plausible. Prepare the data so that errors (typos, etc.) are minimal, for example, by double-checking the entries. See chapter 5 for useful R code that can be used for data preparation and plausibility controls.

Think about the direct and indirect relationships among the variables of the study. We normally start a data analysis by drawing a sketch of the model including all explanatory variables and interactions that may be biologically meaningful. We will most likely repeat this step after having looked at the model fit. To make the data analysis transparent we should report every model that was considered. A short note about why a specific model was considered and why it was discarded helps make the modeling process reproducible.
What is the nature of the variable of interest (outcome, dependent variable)? At this stage, there is no use in formally comparing the distribution of the outcome variable to a statistical distribution, because the raw data are not required to follow a specific distribution. The models assume that, conditional on the explanatory variables and the model structure, the outcome variable follows a specific distribution. Therefore, checking how well the chosen distribution fits the data is done after the model fit (step 3.8). This first choice is based solely on the nature of the data. Normally, our first choice is one of the classical distributions for which robust software for model fitting is available.

Here is a rough guideline for this first choice:

- continuous measurements \(\Longrightarrow\) normal distribution (exception: time-to-event data \(\Longrightarrow\) see survival analysis)
- count \(\Longrightarrow\) Poisson or negative-binomial distribution
- count with upper bound (proportion) \(\Longrightarrow\) binomial distribution
- binary \(\Longrightarrow\) Bernoulli distribution
- rate (count by a reference) \(\Longrightarrow\) Poisson distribution including an offset
- nominal \(\Longrightarrow\) multinomial distribution

Chapter 4 gives an overview of the distributions that are most relevant for ecologists.
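To connect this guideline to model fitting in R, the sketch below lists the standard fitting functions that correspond to these first choices; `y`, `x`, `succ`, `fail` and `area` are placeholders, not variables of this book.

```r
# placeholders: y (outcome), x (predictor), succ/fail (successes/failures), area (exposure)
lm(y ~ x)                                         # continuous: normal distribution
glm(y ~ x, family = poisson)                      # count: Poisson
MASS::glm.nb(y ~ x)                               # count: negative binomial
glm(cbind(succ, fail) ~ x, family = binomial)     # count with upper bound: binomial
glm(y ~ x, family = binomial)                     # binary: Bernoulli
glm(y ~ x + offset(log(area)), family = poisson)  # rate: Poisson with an offset
nnet::multinom(y ~ x)                             # nominal: multinomial
```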
- Is the variance (of the explanatory variable) big enough so that an effect of the variable can be measured?
- Is the distribution skewed? If an explanatory variable is highly skewed, it may make sense to transform the variable (e.g., log, square-root).
- Does it show a bimodal distribution? Consider making the variable binary.
- Is it expected that a change of 1 at lower values for x has the same biological effect as a change of 1 at higher values of x? If not, a transformation (e.g., log) could linearize the relationship between x and y.
- Centering: Centering (\(x.c = x-mean(x)\)) is a transformation that produces a variable with a mean of 0. Centering is optional. We have two reasons to center a predictor variable. First, it helps the model fitting algorithm to converge better because it reduces correlations among model parameters. Second, with centered predictors, the intercept and main effects in the linear model are easier to interpret (they are measured at the center of the data instead of at the covariate value of 0, which may be far off).
- Scaling: Scaling (\(x.s = x/c\), where \(c\) is a constant) is a transformation that changes the unit of the variable. Scaling is also optional. We have three reasons to scale a predictor variable. First, to make the effect sizes easier to understand. For example, a population change from one year to the next may be very small and hard to interpret. When we give the change for a 10-year period, its ecological meaning is easier to grasp. Second, to make the estimates of the effect sizes comparable between variables, we may use \(x.s = x/sd(x)\). The resulting variable has a unit of one standard deviation. A standard deviation may be comparable between variables that originally are measured in different units (meters, seconds, etc.). A. Gelman and Hill (2007) (p. 55 f) propose scaling the variables by two times the standard deviation (\(x.s = x/(2*sd(x))\)) to make effect sizes comparable between numeric and binary variables. Third, scaling can be important for model convergence, especially when polynomials are included. Also, consider the use of orthogonal polynomials, see Chapter 4.2.9 in Korner-Nievergelt et al. (2015). A short sketch of centering and scaling in R is given after this list.
- Collinearity: Look at the correlation among the explanatory variables (pairs plot or correlation matrix). If the explanatory variables are correlated, go back to step 2 and add this relationship. Further, read our thoughts about collinearity.
- Are polynomial terms needed in the model? If not already done in step 2, think about the relationship between each explanatory variable and the dependent variable. Is it linear or do polynomial terms have to be included in the model? If the relationship cannot be described appropriately by polynomial terms, think of a nonlinear model or a generalized additive model (GAM).
- Interactions: May the effect of one explanatory variable depend on the value of another explanatory variable (interaction)?
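Here is the short sketch of centering and scaling referred to in the list above; `x` is an arbitrary numeric predictor simulated only for illustration.

```r
x <- rnorm(20, mean = 50, sd = 10)    # an arbitrary numeric predictor
x.c <- x - mean(x)                    # centered: mean 0, original units
x.s <- x / sd(x)                      # scaled: unit = 1 standard deviation
x.cs <- (x - mean(x)) / (2 * sd(x))   # centered and scaled by 2 SD (Gelman & Hill 2007)
```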
After having taken into account all of the (fixed effect) terms from step 4: are the observations independent or grouped/structured? What random factors are needed in the model? Are the data obviously temporally or spatially correlated? Or are other correlation structures present, such as phylogenetic relationships? Our strategy is to start with a rather simple model that may not account for all correlation structures that are in fact present in the data. We first include only those that are known to be important a priori. Only when residual analyses reveal important additional correlation structures do we include them in the model.

Decide whether we would like to use informative prior distributions or priors that have only a negligible effect on the results. When the results are later used for informing authorities or for making a decision (as usual in applied sciences), then we would like to base the results on all information available. Information from the literature is then used to construct informative prior distributions. In contrast to applied sciences, in basic research we often would like to show only the information in the data, which should not be influenced by earlier results. Therefore, in basic research we look for priors that do not influence the results.

We assess model fit by graphical analyses of the residuals and by predictive model checking.

For non-Gaussian models it is often easier to assess model fit using posterior predictive checks rather than residual analyses. Posterior predictive checks usually show clearly in which aspect the model failed, so we can go back to step 2 of the analysis. Recognizing in what aspect a model does not fit the data based on residual plots improves with experience. Therefore, in Chapter 16 of Korner-Nievergelt et al. (2015) we list some patterns that can appear in residual plots together with what these patterns possibly indicate. We also indicate what could be done in the specific cases.

If, while working through steps 1 to 8 (possibly repeatedly), we came up with more than one model that fits the data reasonably well, then model comparison is needed or inference may be drawn from more than one model. If we have only one model, we proceed to step 3.10.

Simulate values from the joint posterior distribution of the model parameters (`sim` or Stan). Use these samples to present parameter uncertainty, to obtain posterior distributions for predictions, probabilities of specific hypotheses, and derived quantities.
# Basics of statistics {#basics}

This chapter introduces some important terms useful for doing data analyses. It also introduces the essentials of the classical frequentist tests, such as the t-test. Even though we will not use null hypothesis tests later (Amrhein, Greenland, and McShane 2019), we introduce them here because we need to understand the scientific literature. For each classical test, we provide a suggestion for how to present the statistical results without using null hypothesis tests. We further discuss some differences between Bayesian and frequentist statistics.

## Variables and observations

Empirical research involves data collection. Data are collected by recording measurements of variables for observational units. An observational unit may be, for example, an individual, a plot, a time interval or a combination of those. The collection of all units ideally builds a random sample of the entire population of units in which we are interested. The measurements (or observations) of the random sample are stored in a data table (sometimes also called a data set, but a data set may include several data tables; a collection of data tables belonging to the same study or system is normally bundled and stored in a data base). A data table is a collection of variables (columns). Data tables are normally handled as objects of class `data.frame` in R. All measurements on a row of a data table belong to the same observational unit. The variables can be of different scales (Table 2.1).

Table 2.1: Scales of measurement

Scale   | Examples                                                | Properties                                                   | Coding in R
:-------|:--------------------------------------------------------|:--------------------------------------------------------------|:--------------------------
Nominal | Sex, genotype, habitat                                  | Identity (values have a unique meaning)                      | `factor()`
Ordinal | Elevational zones                                       | Identity and magnitude (values have an ordered relationship) | `ordered()`
Numeric | Discrete: counts; continuous: body weight, wing length  | Identity, magnitude, and intervals or ratios                 | `integer()`, `numeric()`
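A minimal sketch of how the three scales of measurement are coded in R; the example values are invented.

```r
sex <- factor(c("f", "m", "f"))                      # nominal
zone <- ordered(c("low", "mid", "high"),
                levels = c("low", "mid", "high"))    # ordinal
counts <- as.integer(c(0, 2, 5))                     # numeric, discrete
weight <- c(12.3, 15.1, 9.8)                         # numeric, continuous
```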
The aim of many studies is to describe how a variable of interest (\(y\)) is related to one or more predictor variables (\(x\)). How these variables are named differs between authors. The y-variable is called "outcome variable", "response" or "dependent variable". The x-variables are called "predictors", "explanatory variables" or "independent variables". The choice of terms for x and y is a matter of taste. We avoid the terms "dependent" and "independent" variables because often we do not know whether the variable \(y\) in fact depends on the \(x\) variables, and also because the x-variables are often not independent of each other. In this book, we try to use "outcome" and "predictor" variables because these terms sound most neutral to us in that they refer to how the statistical model is constructed rather than to a real-life relationship.

## Displaying and summarizing data

### Histogram

While nominal and ordinal variables are summarized by giving the absolute number or the proportion of observations for each category, numeric variables are normally summarized by a location and a scatter statistic, such as the mean and the standard deviation or the median and some quantiles. The distribution of a numeric variable is graphically displayed in a histogram (Fig. 2.1).

Figure 2.1: Histogram of the length of ell of statistics course participants.
To draw a histogram, the variable is displayed on the x-axis and the \(x_i\)-values are assigned to classes. The edges of the classes are called 'breaks'. They can be set with the argument `breaks=` within the function `hist`. The values given in the `breaks=` argument must at least span the values of the variable. If the argument `breaks=` is not specified, R searches for break values that make the histogram look smooth. The number of observations falling in each class is given on the y-axis. The y-axis can be re-scaled so that the area of the histogram equals 1 by setting the argument `freq=FALSE`. In that case, the values on the y-axis correspond to the density values of a probability distribution (Chapter 4).
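A minimal sketch of these arguments using the ell lengths of the course participants (`dat$ell`); the break positions are derived from the data here only for illustration.

```r
hist(dat$ell,
     breaks = seq(floor(min(dat$ell)), ceiling(max(dat$ell)), by = 1),
     freq = FALSE, xlab = "Length of ell [cm]", main = NA, las = 1)
```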
Location statistics are the mean, the median and the mode. A commonly used mean is the arithmetic mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\) (R function `mean`). The median is the 50% quantile: we find 50% of the measurements below and 50% above the median. If \(x_{(1)},..., x_{(n)}\) are the ordered measurements of a variable, then the median is the middle value \(x_{(\frac{n+1}{2})}\) if \(n\) is odd, and the mean of the two middle values if \(n\) is even (R function `median`). The mode is the value that occurs with the highest frequency or that has the highest density.
Scatter is also called spread, scale or variance. Variance parameters describe how far away from the location parameter single observations can be found, or how the measurements are scattered around their mean. The variance is defined as the average squared difference between the observations and the mean, estimated from a sample as \(\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\) (R function `var`; the denominator \(n-1\) instead of \(n\) makes the sample variance an unbiased estimate of the population variance).

The variance is used to compare the degree of scatter among different groups. However, its values are difficult to interpret because of the squared unit. Therefore, the square root of the variance, the standard deviation \(\hat{\sigma} = \sqrt{\hat{\sigma}^2}\), is normally reported (R function `sd`).

The standard deviation is approximately the average deviation of an observation from the sample mean. In the case of a normal distribution, about two thirds (68%) of the data are expected within one standard deviation around the mean.

The variance and standard deviation each describe the scatter with a single value. Thus, we have to assume that the observations are scattered symmetrically around their mean in order to get a picture of the distribution of the measurements. When the measurements are spread asymmetrically (skewed distribution), then it may be more precise to describe the scatter with more than one value. Such statistics could be quantiles from the lower and upper tail of the data.

Quantiles inform us about both the location and the spread of a distribution. The \(p\)th quantile is the value with the property that a proportion \(p\) of all values are less than or equal to the value of the quantile. The median is the 50% quantile. The 25% quantile and the 75% quantile are also called the lower and upper quartiles, respectively. The range between the 25% and the 75% quartiles is called the interquartile range. This range includes 50% of the distribution and is also used as a measure of scatter. The R function `quantile` extracts sample quantiles. The median, the quartiles, and the interquartile range can be graphically displayed using box-and-whisker plots (boxplots in short, R function `boxplot`). The horizontal fat bars are the medians (Fig. 2.2). The boxes mark the interquartile range. The whiskers reach out to the last observation within 1.5 times the interquartile range from the quartile. Circles mark observations beyond 1.5 times the interquartile range from the quartile.
par(mar=c(5,4,1,1))
boxplot(ell~car, data=dat, las=1, ylab="Length of ell [cm]",
        col="tomato", xaxt="n")
axis(1, at=c(1,2), labels=c("Not owning a car", "Car owner"))
n <- table(dat$car)
axis(1, at=c(1,2), labels=paste("n=", n, sep=""), mgp=c(3,2,0))

Figure 2.2: Boxplot of the length of ell of statistics course participants who do or do not own a car.

The boxplot is an appealing tool for comparing location, variance and distribution of measurements among groups.

A correlation measures the strength with which characteristics of two variables are associated with each other (co-occur). If both variables are numeric, we can visualize the correlation using a scatterplot.

par(mar=c(5,4,1,1))
plot(temp~ell, data=dat, las=1, xlab="Length of ell [cm]",
     ylab="Comfort temperature [°C]",
     pch=16)

Figure 2.3: Scatterplot of the length of ell and the comfort temperature of statistics course participants.
The covariance between two variables \(x\) and \(y\) is defined as \(cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\) (R function `cov`). As for the variance, the units of the covariance are squared and therefore covariance values are difficult to interpret. A standardized covariance is the Pearson correlation coefficient, \(r = \frac{cov(x,y)}{s_x s_y}\), where \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\) (R function `cor`).

Means, variances, standard deviations, covariances and correlations are sensitive to outliers. Single observations with extreme values normally have a disproportionate influence on these statistics. When outliers are present in the data, we may prefer a more robust correlation measure such as the Spearman correlation or Kendall's tau. Both are based on the ranks of the measurements instead of the measurements themselves.

Spearman correlation coefficient: correlation between rank(x) and rank(y) (R function `cor(x,y, method="spearman")`)

Kendall's tau: \(\tau = 1-\frac{4I}{(n(n-1))}\), where \(I\) = number of pairs \((i,k)\) for which \((x_i < x_k)\) & \((y_i > y_k)\), or vice versa (R function `cor(x,y, method="kendall")`)
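A minimal sketch relating these measures to each other, using the ell length and comfort temperature from the scatterplot above.

```r
cov(dat$ell, dat$temp)                                  # covariance
cor(dat$ell, dat$temp)                                  # Pearson correlation
cov(dat$ell, dat$temp) / (sd(dat$ell) * sd(dat$temp))   # identical to the Pearson correlation
cor(dat$ell, dat$temp, method = "spearman")             # robust, rank-based alternative
cor(dat$ell, dat$temp, method = "kendall")              # robust, rank-based alternative
```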
The principal components analysis (PCA) is a multivariate correlation analysis. A multidimensional data set with \(k\) variables can be seen as a cloud of points (observations) in a \(k\)-dimensional space. Imagine we could move around in that space and look at the cloud from different locations. From some locations the data look highly correlated, whereas from others we cannot see the correlation. That is what PCA does: it rotates the coordinate system (defined by the original variables) of the data cloud so that the correlations are no longer visible. The axes of the new coordinate system are linear combinations of the original variables. They are called principal components. There are as many principal components as there are original variables, i.e. \(k\), \(p_1, ..., p_k\). The principal components meet further requirements: the first component explains the largest part of the variance in the data, each subsequent component explains the largest part of the remaining variance, and all components are uncorrelated with each other.

For example, in a two-dimensional data set \((x_1, x_2)\) the principal components become

\(pc_{1i} = b_{11}x_{1i} + b_{12}x_{2i}\)

\(pc_{2i} = b_{21}x_{1i} + b_{22}x_{2i}\)

with \(b_{jk}\) being the loading of principal component \(j\) on original variable \(k\). Fig. 2.4 shows the two principal components for a two-dimensional data set. They can be calculated using matrix algebra: the principal components are eigenvectors of the covariance or correlation matrix.

Figure 2.4: Principal components of a two-dimensional data set based on the covariance matrix (green) and the correlation matrix (brown).

The choice between the correlation and the covariance matrix is essential and important. The correlation matrix is a standardized covariance matrix, so a PCA based on the covariance matrix depends on the units of the variables: the results will change when we change the unit of one variable, e.g., from cm to m. Basing the PCA on the covariance matrix therefore only makes sense when the variances are comparable among the variables, i.e., if all variables are measured in the same unit and we would like to weight each variable according to its variance. If this is not the case, the PCA must be based on the correlation matrix.
+pca <- princomp(cbind(x1,x2)) # PCA based on covariance matrix
+
+pca <- princomp(cbind(x1,x2), cor=TRUE) # PCA based on correlation matrix
+
+loadings(pca)
##
+## Loadings:
+## Comp.1 Comp.2
+## x1 0.707 0.707
+## x2 0.707 -0.707
+##
+## Comp.1 Comp.2
+## SS loadings 1.0 1.0
+## Proportion Var 0.5 0.5
+## Cumulative Var 0.5 1.0
+The loadings measure the correlation of each variable with the principal components. They inform about what aspects of the data each component is measuring. The signs of the loadings are arbitrary, thus we can multiplied them by -1 without changing the PCA. Sometimes this can be handy for describing the meaning of the principal component in a paper. For example, Zbinden et al. (2018) combined the number of hunting licenses, the duration of the hunting period and the number of black grouse cocks that were allowed to be hunted per hunter in a principal component in order to measure hunting pressure. All three variables had a negative loading in the first component, so that high values of the component meant low hunting pressure. Before the subsequent analyses, for which a measure of hunting pressure was of interest, the authors changed the signs of the loadings so that this component measured hunting pressure.
+Besides the loadings, the proportion of variance explained by each component is important information. If the first few components explain the main part of the variance, then maybe not all \(k\) variables are necessary to describe the data, or, in other words, the original \(k\) variables contain a lot of redundant information.
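The importance-of-components output below was presumably obtained by applying `summary()` to the fitted PCA object from above:
summary(pca) # proportions of variance explained by the principal components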
+## Importance of components:
+## Comp.1 Comp.2
+## Standard deviation 1.2679406 0.6263598
+## Proportion of Variance 0.8038367 0.1961633
+## Cumulative Proportion 0.8038367 1.0000000
+Ridge regression is similar to doing a PCA within a linear model, whereby components with low variance are shrunk to a higher degree than components with a high variance.
+- there is never a “yes-or-no” answer
+- there will always be uncertainty
+
+([Amrhein 2017](https://peerj.com/preprints/26857))
The decision whether an effect is important or not cannot be made based on data alone. For making a decision we should, besides the data, carefully consider the consequences of each decision, the aims we would like to achieve, and the risk, i.e. how bad it is to make the wrong decision. Structured decision making or decision analyses provide methods to combine consequences of decisions, objectives of different stakeholders and risk attitudes of decision makers to make optimal decisions (Hemming et al. 2022; Runge 2020). In most data analyses, particularly in basic research and when working on case studies, we normally do not consider consequences of decisions. However, the results will be more useful when presented in a way that other scientists can use them for a meta-analysis, or stakeholders and politicians can use them for making better decisions. Useful results always include information on the size of a parameter of interest, e.g. the effect of a drug or an average survival, together with an uncertainty measure.
+Therefore, statistics describes patterns of the process that presumably has generated the data, and it quantifies the uncertainty of the described patterns that is due to the fact that the data are just a random sample from the larger population we would like to know the patterns of.
+Quantification of uncertainty is only possible if:
+1. the mechanisms that generated the data are known
+2. the observations are a random sample from the population of interest
Most studies aim at understanding the mechanisms that generated the data, thus they are most likely not known beforehand. To overcome that problem, we construct models, e.g. statistical models, that are (strong) abstractions of the data generating process. And we report the model assumptions. All uncertainty measures are conditional on the model we used to analyze the data, i.e., they are only reliable, if the model describes the data generating process realistically. Because most statistical models do not describe the data generating process well, the true uncertainty almost always is much higher than the one we report.
+In order to obtain a random sample from the population under study, a good study design is a prerequisite.
To illustrate how inference about a big population is drawn from a small sample, we here use simulated data. The advantage of using simulated data is that the mechanism that generated the data is known as well as the big population.
+Imagine there are 300000 PhD students in the world and we would like to know how many statistics courses they have taken on average before they started their PhD (Fig. 2.5). We use random number generators (`rpois` and `rgamma`) to simulate a number for each of the 300000 virtual students. We here use these 300000 numbers as the big population that in real life we almost never can sample in total. Normally, we know the number of courses just for a small sample of students. To simulate that situation we draw 12 numbers at random from the 300000 (R function `sample`). Then, we estimate the average number of statistics courses students take before they start a PhD from the sample of 12 students, and we compare that mean to the true mean of the 300000 students.
# simulate the virtual true population
+set.seed(235325) # set seed for random number generator
+
+# simulate fake data of the whole population
+# using an overdispersed Poisson distribution,
+# i.e. a Poisson distribution of which the mean
+# has a gamma distribution
+
+statscourses <- rpois(300000, rgamma(300000, 2, 3))
+
+# draw a random sample from the population
+n <- 12 # sample size
+y <- sample(statscourses, n, replace=FALSE)
+Figure 2.5: Histogram of the number of statistics courses that 300000 virtual PhD students have taken before starting their PhD. The rugs on the x-axis indicate the random sample of 12 out of the 300000 students. The black vertical line indicates the mean of the 300000 students (true mean) and the blue line indicates the mean of the sample (sample mean).
+We observe the sample mean; what do we know about the population mean? There are two different approaches to answer this question. 1) We could ask how much the sample mean would scatter if we repeated the study many times. This approach is called frequentist statistics. 2) We could ask, for any possible value, what is the probability that it is the true population mean. To do so, we use probability theory, and that is called Bayesian statistics.
+Both approaches use (essentially similar) models. Only the mathematical techniques to calculate uncertainty measures differ between the two approaches. In cases when, besides the data, no other information is used to construct the model, the results are approximately identical (at least for large enough sample sizes).
+A frequentist 95% confidence interval (blue horizontal segment in Fig. 2.6) is constructed such that, if you were to (hypothetically) repeat the experiment or sampling many many times, 95% of the intervals constructed would contain the true value of the parameter (here the mean number of courses). From the Bayesian posterior distribution (pink in Fig. 2.6) we could construct a 95% interval (e.g., by using the 2.5% and 97.5% quantiles). This interval has traditionally been called credible interval. It can be interpreted that we are 95% sure that the true mean is inside that interval.
+Both, confidence interval and posterior distribution, correspond to the statistical uncertainty of the sample mean, i.e., they measure how far away the sample mean could be from the true mean. In this virtual example, we know the true mean is 0.66, thus somewhere at the lower part of the 95% CI or in the lower quantiles of the posterior distribution. In real life, we do not know the true mean. The grey histogram in Fig. 2.6 shows how means of many different virtual samples of 12 students scatter around the true mean. The 95% interval of these virtual means corresponds to the 95% CI, and the variance of these virtual means correspond to the variance of the posterior distribution. This virtual example shows that posterior distribution and 95% CI correctly measure the statistical uncertainty (variance, width of the interval), however we never know exactly how far the sample mean is from the true mean.
+Figure 2.6: Histogram of means of repeated samples from the true population. The scatter of these means visualizes the true uncertainty of the mean in this example. The blue vertical line indicates the mean of the original sample. The blue segment shows the 95% confidence interval (obtained by frequentist methods) and the violet line shows the posterior distribution of the mean (obtained by Bayesian methods).
+Uncertainty intervals are only reliable if the model is a realistic abstraction of the data generating process (or if the model assumptions are realistic).
+Because both terms, confidence and credible interval, suggest that the interval indicates confidence or credibility, while the intervals actually show uncertainty, it has been suggested to rename the interval a compatibility or uncertainty interval (Andrew Gelman and Greenland 2019).
+The standard error (SE) is, besides the uncertainty interval, an alternative way to measure uncertainty. It measures an average deviation of the sample mean from the (unknown) true population mean. The frequentist method for obtaining the SE is based on the central limit theorem. According to the central limit theorem, the sum of independent, not necessarily normally distributed random numbers is normally distributed when sample size is large enough (Chapter 4). Because the mean is a sum (divided by a constant, the sample size), the distribution of means of many samples can be assumed to be normal. The standard deviation (SD) of the many means is called the standard error (SE). It can be shown mathematically that the standard error equals the standard deviation of the sample divided by the square root of the sample size:
+frequentist SE: \(SE = \frac{SD}{\sqrt{n}} = \frac{\hat{\sigma}}{\sqrt{n}}\)
Bayesian SE: Using Bayesian methods, the SE is the SD of the posterior distribution.
It is very important to keep the difference between SE and SD in mind! SD measures the scatter of the data, whereas SE measures the statistical uncertainty of the mean (or of another estimated parameter, Fig. 2.7). SD is a descriptive statistic describing a characteristic of the data, whereas SE is an inferential statistic showing us how far away the sample mean possibly is from the true mean. When sample size increases, SE becomes smaller, whereas SD does not change (given the added observations are drawn at random from the same big population as the ones already in the sample).
+Figure 2.7: Illustration of the difference between SD and SE. The SD measures the scatter in the data (sample, tickmarks on the x-axis) and is an estimate of the scatter in the big population (grey histogram, normally not known). The SE measures the uncertainty of the sample mean (blue) and shows approximately how far, on average, the sample mean (blue) is from the true mean (brown).
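As a minimal sketch of this difference, assuming the sample of 12 simulated course counts from above is stored in `y`:
sd(y)                  # SD: scatter of the data, does not shrink with larger samples
sd(y)/sqrt(length(y))  # SE of the mean: statistical uncertainty, shrinks with sqrt(n)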
+The Bayes theorem describes the probability of event A conditional on event B (the probability of A after B has already occurred) from the probability of B conditional on A and the two probabilities of the events A and B:
+\(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)
+Imagine event A is “The person likes wine as a birthday present.” and event B is “The person has no car.”. The conditional probability of A given B is the probability that a person not owning a car likes wine. Answers from students on whether they have a car and what they like as a birthday present are summarized in Table 2.2.
+car/birthday | flowers | wine | sum |
+:------------|--------:|-----:|----:|
+no car       |       6 |    9 |  15 |
+car          |       1 |    6 |   7 |
+sum          |       7 |   15 |  22 |
We can apply the Bayes theorem to obtain the probability that a student likes wine given they have no car, \(P(A|B)\). Let’s assume that only the ones who prefer wine go together for a glass of wine at the bar after the statistics course. While they drink wine, they tell each other about their cars and obtain the probability that a student who likes wine has no car, \(P(B|A) = 0.6\). During the statistics class the teacher asked the students about their car ownership and birthday preference. Therefore, they know that \(P(A) =\) likes wine \(= 0.68\) and \(P(B) =\) no car \(= 0.68\). With this information, they can obtain the probability that a student likes wine given they have no car, even if not all students without cars went to the bar: \(P(A|B) = \frac{0.6*0.68}{0.68} = 0.6\).
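A minimal sketch reproducing this calculation in R from the counts in Table 2.2 (all object names are our own):
# counts from Table 2.2: rows = car ownership, columns = preferred present
tab <- matrix(c(6, 9, 1, 6), nrow=2, byrow=TRUE,
              dimnames=list(c("no car", "car"), c("flowers", "wine")))
pBgivenA <- tab["no car", "wine"]/sum(tab[, "wine"])  # P(no car | wine) = 0.6
pA <- sum(tab[, "wine"])/sum(tab)                     # P(likes wine) = 0.68
pB <- sum(tab["no car", ])/sum(tab)                   # P(no car) = 0.68
pBgivenA*pA/pB                                        # P(wine | no car) = 0.6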
+When we use the Bayes theorem for analyzing data, the aim is to make probability statements about parameters. Because most parameters are measured on a continuous scale, we use probability density functions to describe what we know about them. The Bayes theorem can be formulated for probability density functions, denoted with \(p(\theta)\) for a parameter \(\theta\) (for example probability density functions see Chapter 4).
+What we are interested in is the probability of the parameter \(\theta\) given the data, i.e., \(p(\theta|y)\). This probability density function is called the posterior distribution of the parameter \(\theta\). Below is the Bayes theorem formulated for obtaining the posterior distribution of a parameter from the data \(y\), the prior distribution of the parameter \(p(\theta)\), and a model for the data generating process. The data model is defined by the likelihood, which specifies how the data \(y\) are distributed given the parameters, \(p(y|\theta)\):
+\(p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{\int p(y|\theta)p(\theta) d\theta}\)
+The probability of the data \(p(y)\) is also called the scaling constant, because it is a constant. It is the integral of the likelihood over all possible values of the parameter(s) of the model.
+For illustration, we first describe a simple (unrealistic) example for which it is almost possible to follow the mathematical steps of solving the Bayes theorem even for non-mathematicians. Even if we cannot follow all steps, this example illustrates the principle of how the Bayes theorem works for continuous parameters. The example is unrealistic because we assume that the variance \(\sigma^2\) of the data \(y\) is known.
+We construct a data model by assuming that \(y\) is normally distributed:
+\(p(y|\theta) = normal(\theta, \sigma)\), with \(\sigma\) known. The function \(normal\) defines the probability density function of the normal distribution (Chapter 4).
+The parameter for which we would like to get the posterior distribution is \(\theta\), the mean. We specify a prior distribution for \(\theta\). Because the normal distribution is a conjugate prior for a normal data model with known variance, we use the normal distribution. Conjugate priors have nice mathematical properties (see Chapter 10) and are therefore preferred when the posterior distribution is obtained algebraically. This is the prior:
+\(p(\theta) = normal(\mu_0, \tau_0)\)
+With the above data, data model and prior, the posterior distribution of the mean \(\theta\) is defined by:
+\(p(\theta|y) = normal(\mu_n, \tau_n)\), where
+\(\mu_n= \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}}\) and \(\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\)
+\(\bar{y}\) is the arithmetic mean of the data. Because only this value (together with the sample size) is needed to obtain the posterior distribution, it is called the sufficient statistic.
+From the mathematical formulas above and also from Fig. 2.8 we see that the mean of the posterior distribution is a weighted average between the prior mean and \(\bar{y}\) with weights equal to the precisions (\(\frac{1}{\tau_0^2}\) and \(\frac{n}{\sigma^2}\)).
+Figure 2.8: Hypothetical example showing the result of applying the Bayes theorem for obtaining the posterior distribution of a continuous parameter. The likelihood is defined by the data and the model, the prior expresses the knowledge about the parameter before looking at the data. Combining the two distributions using the Bayes theorem results in the posterior distribution.
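A minimal sketch of this updating formula with hypothetical prior values and data (all values and object names below are assumptions for illustration only):
mu0 <- 0; tau0 <- 2          # prior mean and prior standard deviation (assumed)
sigma <- 1                   # known standard deviation of the data (assumed)
yh <- c(0.8, 1.5, 1.1, 0.9)  # hypothetical data
n <- length(yh)
taun <- sqrt(1/(1/tau0^2 + n/sigma^2))           # posterior standard deviation
mun <- (mu0/tau0^2 + n*mean(yh)/sigma^2)*taun^2  # posterior mean: precision-weighted average
c(mun, taun)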
+We now move to a more realistic example, which is estimating the mean and the variance of a sample of weights of Snowfinches Montifringilla nivalis (Fig. 2.9). To analyze those data, a model with two parameters (the mean and the variance or standard deviation) is needed. The data model (or likelihood) is specified as \(p(y|\theta, \sigma) = normal(\theta, \sigma)\).
+Figure 2.9: Snowfinches stay above the treeline for winter. They come to feeders.
+Because there are two parameters, we need to specify a two-dimensional prior distribution. We looked up in A. Gelman et al. (2014b) that the conjugate prior distribution in our case is a Normal-Inverse-Chisquare distribution:
+\(p(\theta, \sigma) = N-Inv-\chi^2(\mu_0, \sigma_0^2/\kappa_0; v_0, \sigma_0^2)\)
+From the same reference we looked up what the posterior distribution looks like in our case:
+\(p(\theta,\sigma|y) = \frac{p(y|\theta, \sigma)p(\theta, \sigma)}{p(y)} = N-Inv-\chi^2(\mu_n, \sigma_n^2/\kappa_n; v_n, \sigma_n^2)\), with
+\(\mu_n= \frac{\kappa_0}{\kappa_0+n}\mu_0 + \frac{n}{\kappa_0+n}\bar{y}\),
+\(\kappa_n = \kappa_0+n\),
+\(v_n = v_0 +n\),
+\(v_n\sigma_n^2=v_0\sigma_0^2+(n-1)s^2+\frac{\kappa_0 n}{\kappa_0+n}(\bar{y}-\mu_0)^2\)
+For this example, we need the arithmetic mean \(\bar{y}\) and the sample variance \(s^2\) for obtaining the posterior distribution. Therefore, these two statistics are the sufficient statistics.
+The above formulas look intimidating, but we rarely do these calculations by hand. In most cases we let R do them for us by simulating many numbers from the posterior distribution, e.g., using the function `sim` from the package arm (Andrew Gelman and Hill 2007). In the end, we can visualize the distribution of these many numbers to look at the posterior distribution.
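A minimal sketch of this simulation step, assuming the Snowfinch weights are stored in a vector `y` and using `sim` from the arm package as mentioned above:
library(arm)
mod <- lm(y ~ 1)               # intercept-only model: estimates the mean and sigma of y
bsim <- sim(mod, n.sim=2000)   # 2000 draws from the joint posterior of (mean, sigma)
head(bsim@coef)                # simulated values of the mean
head(bsim@sigma)               # simulated values of the standard deviation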
In Fig. 2.10 the two-dimensional \((\theta, \sigma)\) posterior distribution is visualized using simulated values. The two-dimensional distribution is called the joint posterior distribution. The mountain of dots in Fig. 2.10 visualizes the Normal-Inverse-Chisquare distribution that we described above. When all values of one parameter are displayed in a histogram, ignoring the values of the other parameter, the result is called the marginal posterior distribution. Algebraically, the marginal distribution is obtained by integrating the joint posterior distribution over the other parameter. This step is definitely much easier when simulated values from the posterior distribution are available!
+Figure 2.10: Visualization of the joint posterior distribution of the mean and standard deviation of Snowfinch weights. The lower left panel shows the two-dimensional joint posterior distribution, whereas the upper and right panels show the marginal posterior distributions of each parameter separately.
+The marginal posterior distribution of each parameter is what we normally report in a paper to describe statistical uncertainty.
+In our example, the marginal posterior distribution of the mean is a t-distribution (Chapter 4). Frequentist statistical methods also use a t-distribution to describe the uncertainty of an estimated mean when the variance is not known. Thus, frequentist methods arrive at the same solution using a completely different approach and different techniques. Doesn’t that dramatically increase our trust in statistical methods?
+Null hypothesis testing constructs a model that describes how the data would look if the effect we expect did not exist. Then, the observed data are compared to how that null model expects the data to look. If the data do not look like the null model expects, we reject that model and accept that our expectation may be true.
+To decide whether the data look like the null model expects, the p-value is used. The p-value is the probability of observing the data or more extreme data given the null hypothesis is true.
+Small p-values indicate that it is rather unlikely to observe the data, or more extreme data, given the null hypothesis \(H_0\) is true.
+Null hypothesis testing is problematic. We discuss some of the problems after having introduced the most commonly used classical tests.
+In some studies, we would like to compare the data to a theoretical value. The theoretical value is a fixed value, e.g. calculated based on physical, biochemical, ecological or any other theory. The statistical task is then to compare the mean of the data including its uncertainty with the theoretical value. The result of such a comparison may be an estimate of the mean of the data with its uncertainty or an estimate of the difference of the mean of the data to the theoretical value with the uncertainty of this difference.
+For example, a null hypothesis could be \(H_0\): “The mean of Snowfinch weights is exactly 40g.”
+A normal distribution with a mean of \(\mu_0=40\) and a variance equal to the estimated variance in the data, \(s^2\), is then assumed to describe how we would expect the data to look if the null hypothesis were true. From that model it is possible to calculate the distribution of hypothetical means of many different hypothetical samples of sample size \(n\). The result is a t-distribution (Fig. 2.11). In classical tests, this distribution is standardized so that its variance is one. Then the sample mean, or in classical tests a standardized difference between the sample mean and the hypothetical mean of the null hypothesis (here 40g), called the test statistic \(t = \frac{\bar{y}-\mu_0}{\frac{s}{\sqrt{n}}}\), is compared to that (standardized) t-distribution. If the test statistic falls well within the expected distribution, the null hypothesis is accepted: the data are well compatible with the null hypothesis. However, if the test statistic falls in the tails or outside the distribution, the null hypothesis is rejected and we could write that the mean weight of Snowfinches is statistically significantly different from 40g. Unfortunately, we cannot infer the probability of the null hypothesis, nor how relevant the result is.
+Figure 2.11: Illustration of a one-sample t-test. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the t-distribution that shows how hypothetical sample means are expected to be distributed if the big population of Snowfinches has a mean weight of 40g (i.e., if the null hypothesis were true). The orange area shows the part of the t-distribution that lies equally far or farther away from 40g than the sample mean. The orange area is the p-value.
+We can use the R function `t.test` to calculate the p-value of a one-sample t-test.
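The output below was presumably produced by a call of the following form, with `y` holding the Snowfinch weights and the theoretical value passed as `mu`:
t.test(y, mu=40) # one-sample t-test against the theoretical value of 40g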
##
+## One Sample t-test
+##
+## data: y
+## t = 3.0951, df = 7, p-value = 0.01744
+## alternative hypothesis: true mean is not equal to 40
+## 95 percent confidence interval:
+## 40.89979 46.72521
+## sample estimates:
+## mean of x
+## 43.8125
+The output of the R function `t.test` also includes the mean and the 95% confidence interval (or compatibility or uncertainty interval) of the mean. The CI could, alternatively, be obtained as the 2.5% and 97.5% quantiles of a t-distribution with a mean equal to the sample mean, degrees of freedom equal to the sample size minus one, and a standard deviation equal to the standard error of the mean.
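A minimal sketch of this alternative calculation, again assuming the weights are stored in `y`:
mean(y) + qt(c(0.025, 0.975), df=length(y)-1) * sd(y)/sqrt(length(y)) # 95% CI limits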
## [1] 40.89979
+
+## [1] 46.72521
+When applying the Bayes theorem for obtaining the posterior distribution of the mean we end up with the same t-distribution as described above, in case we use flat prior distributions for the mean and the standard deviation. Thus, the two different approaches end up with the same result!
+par(mar=c(4.5, 5, 2, 2))
+hist(y, col="blue", xlim=c(30,52), las=1, freq=FALSE, main=NA, ylim=c(0, 0.3)) # data
+abline(v=mean(y), lwd=2, col="lightblue") # sample mean
+abline(v=40, lwd=2)                       # theoretical value of the null hypothesis
+lines(density(bsim@coef))                 # posterior distribution of the mean
+text(45, 0.3, "posterior distribution\nof the mean of y", cex=0.8, adj=c(0,1), xpd=NA)
+Figure 2.12: Illustration of the posterior distribution of the mean. The blue histogram shows the distribution of the measured weights with the sample mean (lightblue) indicated as a vertical line. The black line is the posterior distribution that shows what we know about the mean after having looked at the data. The area under the posterior density function that is larger than 40 is the posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g.
+The posterior probability of the hypothesis that the true mean Snowfinch weight is larger than 40g, \(P(H:\mu>40)\), is equal to the proportion of simulated random values from the posterior distribution, saved in the vector `bsim@coef`, that are larger than 40.
# Two ways of calculating the proportion of values
+# larger than a specific value within a vector of values
+round(sum(bsim@coef[,1]>40)/nrow(bsim@coef),2)
## [1] 0.99
+round(mean(bsim@coef[,1]>40),2)
+## [1] 0.99
+# Note: logical values TRUE and FALSE become
+# the numeric values 1 and 0 within the functions sum() and mean()
We, thus, can be pretty sure that the mean Snowfinch weight (in the big world population) is larger than 40g. Such a conclusion is not very informative, because it does not tell us how much larger we can expect the mean Snowfinch weight to be. Therefore, we prefer reporting a credible interval (or compatibility interval or uncertainty interval) that tells us what values for the mean Snowfinch weight are compatible with the data (given the data model we used realistically reflects the data generating process). Based on such an interval, we can conclude that we are pretty sure that the mean Snowfinch weight is between 40 and 48g.
+# 80% credible interval, compatibility interval, uncertainty interval
+quantile(bsim@coef[,1], probs=c(0.1, 0.9))
## 10% 90%
+## 42.07725 45.54080
+# 95% credible interval, compatibility interval, uncertainty interval
+quantile(bsim@coef[,1], probs=c(0.025, 0.975))
## 2.5% 97.5%
+## 40.90717 46.69152
+# 99% credible interval, compatibility interval, uncertainty interval
+quantile(bsim@coef[,1], probs=c(0.005, 0.995))
## 0.5% 99.5%
+## 39.66181 48.10269
+Many research questions aim at measuring differences between groups. For example, we could be curious to know how different in size car owners are from people not owning a car.
+A boxplot is a nice possibility to visualize the ell length measurements of two (or more) groups (Fig. 2.13). From the boxplot, we do not see how many observations are in the two samples, but we can add that information to the plot. The boxplot visualizes the samples, but it does not show what we know about the big (unmeasured) population means. To show that, we need to add a compatibility interval (or uncertainty interval, credible interval, confidence interval; brown in Fig. 2.13).
+Figure 2.13: Ell length of car owners (Y) and people not owning a car (N). Horizontal bar = median, box = interquartile range, whiskers = most extreme observation within 1.5 times the interquartile range from the quartile, circles = observations farther than 1.5 times the interquartile range from the quartile. Filled brown circles = means, vertical brown bars = 95% compatibility intervals.
+When we add the two means with a compatibility interval, we see what we know about the two means, but we still do not see what we know about the difference between the two means. The uncertainties of the means do not show the uncertainty of the difference between the means. To obtain that, we need to extract the difference between the two means from a model that describes (abstractly) how the data have been generated. Such a model is a linear model, which we will introduce in Chapter 11. Its second parameter measures the difference between the means of the two groups, and from the simulated posterior distribution we can extract a 95% compatibility interval.
+Thus, we can conclude that the average ell length of car owners is with high probability between 0.5 cm smaller and 2.5 cm larger than the average ell length of people not owning a car.
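The model output and interval below were presumably produced by code of roughly the following form; the data frame `dat` and the variables `ell` and `car` appear in the output, the object names `mod` and `bsim` are reused from above:
mod <- lm(ell ~ car, data=dat)   # linear model: second coefficient = difference Y minus N
mod
bsim <- sim(mod, n.sim=2000)     # simulate from the posterior distribution of the parameters
quantile(bsim@coef[,2], probs=c(0.025, 0.5, 0.975)) # 95% compatibility interval of the difference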
+ +##
+## Call:
+## lm(formula = ell ~ car, data = dat)
+##
+## Coefficients:
+## (Intercept) carY
+## 43.267 1.019
+
+## 2.5% 50% 97.5%
+## -0.501348 1.014478 2.494324
+The corresponding two-sample t-test gives a p-value for the null hypothesis “The difference between the two means equals zero.”, a confidence interval for the difference, and the two means. While the function `lm` gives the difference Y minus N, the function `t.test` gives the difference N minus Y. Luckily the two means are also given in the output, so we know which group mean is the larger one.
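The output below was presumably produced by a call of the following form; `var.equal=TRUE` is our assumption, based on the header “Two Sample t-test” (rather than the Welch test):
t.test(ell ~ car, data=dat, var.equal=TRUE) # classical two-sample t-test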
##
+## Two Sample t-test
+##
+## data: ell by car
+## t = -1.4317, df = 20, p-value = 0.1677
+## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
+## 95 percent confidence interval:
+## -2.5038207 0.4657255
+## sample estimates:
+## mean in group N mean in group Y
+## 43.26667 44.28571
+In both approaches that we used to compare the two means, the Bayesian posterior distribution of the difference and the t-test with the confidence interval of the difference, we used a data model. We thus assumed that the observations are normally distributed. In some cases, such an assumption is not reasonable, and then the result is not reliable. In such cases, we can either search for a more realistic model or use non-parametric (also called distribution-free) methods. Nowadays, we have almost infinite possibilities to construct data models (e.g. generalized linear models and beyond). Therefore, we normally start looking for a model that fits the data better. However, in former days, all these possibilities did not exist (or were not easily available for non-mathematicians). Therefore, we here introduce two such non-parametric methods, the Wilcoxon-test (or Mann-Whitney-U-test) and the randomisation test.
+Some of the distribution-free statistical methods are based on the ranks instead of the values of the observations. The principle of the Wilcoxon-test is to rank the observations and sum the ranks per group. It is not completely true that non-parametric methods do not have a model. The model of the Wilcoxon-test “knows” how the difference in the rank sums between two groups is distributed given the means of the two groups do not differ (null hypothesis). Therefore, it is possible to get a p-value, e.g. by the function `wilcox.test`.
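The output below was presumably produced by a call of the following form:
wilcox.test(ell ~ car, data=dat) # Wilcoxon (Mann-Whitney-U) test for the ell length data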
##
+## Wilcoxon rank sum test with continuity correction
+##
+## data: ell by car
+## W = 34.5, p-value = 0.2075
+## alternative hypothesis: true location shift is not equal to 0
+The note in the output tells us that ranking is ambiguous when some values are equal. Equal values that need to be ranked are called ties.
+The result of the Wilcoxon-test tells us how probable it is to observe the difference in the rank sums between the two samples, or a more extreme difference, given the means of the two groups are equal. That is at least something.
+A similar result is obtained by using a randomisation test. This test is not based on ranks but on the original values. The aim of the randomisation is to simulate a distribution of the difference in the arithmetic mean between the two groups assuming this difference would be zero. To do so, the observed values are randomly distributed among the two groups. Because of the random distribution among the two groups, we expect that, if we repeat that virtual experiment many times, the average difference between the group means would be zero (both virtual samples are drawn from the same big population).
+We can use a loop in R for repeating the random re-assignment to the two groups and, each time, extracting the difference between the group means. As a result, we have a vector of many (`nsim`) values that all are possible differences between the group means given the two samples were drawn from the same population. The proportion of these values that have an equal or larger absolute value than the observed difference gives the probability of observing the observed or a larger difference between the group means given the null hypothesis is true; thus, that is a p-value.
diffH0 <- numeric(nsim)   # nsim = number of randomisations, defined beforehand (e.g. nsim <- 10000)
+for(i in 1:nsim){
+  randomcars <- sample(dat$car)            # randomly re-assign car ownership
+  rmod <- lm(ell~randomcars, data=dat)     # refit the model to the randomised data
+  diffH0[i] <- coef(rmod)["randomcarsY"]   # difference between group means under H0
+}
+mean(abs(diffH0)>abs(coef(mod)["carY"])) # p-value
## [1] 0.1858
+Visualizing the possible differences between the group means given the null hypothesis was true shows that the observed difference is well within what is expected if the two groups would not differ in their means (Fig. 2.14).
+Figure 2.14: Histogram of the differences between the means of randomly assigned groups (grey) and the difference between the means of the two observed groups (red).
+The randomization test results in a p-value and, we could also report the observed difference between the group means. However, it does not tell us, what values of the difference all would be compatible with the data. We do not get an uncertainty measurement for the difference.
+In order to get a compatibility interval without assuming a distribution for the data (thus non-parametric) we could bootstrap the samples.
+Bootstrapping is sampling observations from the data with replacement. For example, if we have a sample of 8 observations, we draw 8 times a random observation from these 8 observations, each time assuming that all 8 observations are available. Thus a bootstrapped sample can include some observations several times, whereas others are missing. In this way, we simulate the variance in the data that is due to the fact that our data do not contain the whole big population.
+Also bootstrapping can be programmed in R using a loop.
+diffboot <- numeric(nsim)   # one difference per bootstrap sample
+for(i in 1:nsim){
+  ngroups <- 1
+  while(ngroups==1){        # make sure both groups are present in the bootstrap sample
+    bootrows <- sample(1:nrow(dat), replace=TRUE)
+    ngroups <- length(unique(dat$car[bootrows]))
+  }
+  rmod <- lm(ell~car, data=dat[bootrows,])   # refit the model to the bootstrapped data
+  diffboot[i] <- coef(rmod)[2]               # difference between the group means
+}
+quantile(diffboot, prob=c(0.025, 0.975))
## 2.5% 97.5%
+## -0.3395643 2.4273810
+The resulting values for the difference between the two group means can be interpreted as the distribution of those differences if we had repeated the study many times (Fig. 2.15).
+A 95% interval of this distribution corresponds to a 95% compatibility interval (or confidence interval or uncertainty interval).
+Figure 2.15: Histogram of the bootstrapped differences between the group means (grey) and the observed difference.
+For both methods, the randomisation test and bootstrapping, we have to assume that all observations are independent. Randomisation and bootstrapping become complicated or even unfeasible when data are structured.
+Reverend Thomas Bayes (1701 or 1702 - 1761) developed the Bayes theorem. Based on this theorem, he described how to obtain the probability of a hypothesis given an observation, that is, data. However, he was so worried whether it would be acceptable to apply his theory to real-world examples that he did not dare to publish it. His methods were only published posthumously (Bayes 1763). Without the help of computers, Bayes’ methods were applicable to just simple problems.
+Much later, the concept of null hypothesis testing was introduced by Ronald A. Fisher (1890 - 1962) in his book Statistical Methods for Research Workers (Fisher 1925) and many other publications. Egon Pearson (1895 - 1980) and others developed the frequentist statistical methods, which are based on the probability of the data given a null hypothesis. These methods are solvable for many simple and some moderately complex examples.
+The rapidly improving capacity of computers in recent years now enables us to use Bayesian methods also for more (and even very) complex problems using simulation techniques (Gilks, Richardson, and Spiegelhalter 1996).
Bayesian methods use Bayes’ theorem to update prior knowledge about a parameter with information coming from the data to obtain posterior knowledge. The prior and posterior knowledge are mathematically described by probability distributions (the prior and posterior distributions of the parameter). Bayes’ theorem for discrete events says that the probability of event \(A\) given event \(B\) has occurred, \(P(A|B)\), equals the probability of event \(A\), \(P(A)\), times the probability of event \(B\) conditional on \(A\), \(P(B|A)\), divided by the probability of event \(B\), \(P(B)\):
+\[ P(A|B) = \frac{P(A)P(B|A)}{P(B)}\]
+When using Bayes’ theorem for drawing inference from data, we are interested in the probability distribution of one or several parameters, \(\theta\) (called “theta”), after having looked at the data \(y\), that is, the posterior distribution, \(p(\theta|y)\). To this end, Bayes’ theorem is reformulated for continuous parameters using probability distributions rather than probabilities for discrete events:
+\[ p(\theta|y) = \frac{p(\theta)p(y|\theta)}{p(y)}\]
+The posterior distribution, \(p(\theta|y)\), describes what we know about the parameter (or about the set of parameters), \(\theta\), after having looked at the data and given the prior knowledge and the model. The prior distribution of \(\theta\), \(p(\theta)\), describes what we know about \(\theta\) before having looked at the data. This is often very little, but it can include information from earlier studies. The probability of the data conditional on \(\theta\), \(p(y|\theta)\), is called the likelihood.
+The word likelihood is used in Bayesian statistics with a slightly different meaning than it is used in frequentist statistics. The frequentist likelihood, \(L(\theta|y)\), is a relative measure for the probability of the observed data given specific parameter values. The likelihood is a number often close to zero. In contrast, Bayesians use likelihood for the density distribution of the data conditional on the parameters of a model. Thus, in Bayesian statistics the likelihood is a distribution (i.e., the area under the curve is 1) whereas in frequentist statistics it is a scalar.
+The prior probability of the data, \(p(y)\), equals the integral of \(p(y|\theta)p(\theta)\) over all possible values of \(\theta\); thus \(p(y)\) is a constant.
+The integral can be solved numerically only for a few simple cases. For this reason, Bayesian statistics were not widely applied before the computer age. Nowadays, a variety of different simulation algorithms exist that allow sampling from distributions that are only known to proportionality (e.g., Gilks, Richardson, and Spiegelhalter (1996)). Dropping the term \(p(y)\) in the Bayes theorem leads to a term that is proportional to the posterior distribution: \(p(\theta|y)\propto p(\theta)p(y|\theta)\). Simulation algorithms such as Markov chain Monte Carlo simulation (MCMC) can, therefore, sample from the posterior distribution without having to know \(p(y)\). A large enough sample of the posterior distribution can then be used to draw inference about the parameters of interest.
+In Bayesian statistics the likelihood is the probability distribution of the data given the model, \(p(y|\theta)\), also called the predictive density. In contrast, frequentists use the likelihood as a relative measure of the probability of the data given a specific model (i.e., a model with specified parameter values). Often, we see the notation \(L(\theta|y) = p(y|\theta)\) for the likelihood of a model. Let’s look at an example. According to values that we found on the internet, black-tailed prairie dogs Cynomys ludovicianus weigh on average 1 kg with a standard deviation of 0.2 kg.
+Using these values, we construct a model for the weight of prairie dogs \(y_i \sim normal(\mu, \sigma)\) with \(\mu\) = 1 kg and \(\sigma\) = 0.2 kg. We then go to the prairie and catch three prairie dogs. Their weights are 0.8, 1.2, and 1.1 kg. What is the likelihood of our model given this data? The likelihood is the product of the density function (the Gaussian function with \(\mu\) = 1 and \(\sigma\) = 0.2) defined by the model and evaluated for each data point (Figure 9.1).
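The value below is presumably the product of the three density values under the first model; a sketch of that calculation (the object name `l1` is our choice, mirroring `l2` further below):
l1 <- dnorm(x=0.8, mean=1, sd=0.2)*dnorm(x=1.2, mean=1, sd=0.2)*
  dnorm(x=1.1, mean=1, sd=0.2)
l1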
+## [1] 2.576671
+x <- seq(0,2, length=100)
+dx <- dnorm(x, mean=1, sd=0.2)
+plot(x, dx, type="l", lwd=2, las=1, xlab="Weight [kg]")
+y <- c(0.8, 1.1, 1.2)
+segments(y, dnorm(y, mean=1, sd=0.2), y, rep(0, length(y)))
+Figure 9.1: Density function of a Gaussian (= normal) distribution with mean 1 kg and standard deviation 0.2 kg. The vertical lines indicate the three observations in the data. The curve has been produced using the function dnorm.
+The likelihood (2.6) does not tell us much. But, another web page says that prairie dogs weigh on average 1.2 kg with a standard deviation of 0.4 kg. The likelihood for this alternative model is:
+l2 <- dnorm(x=0.8, mean=1.2, sd=0.4)*dnorm(x=1.2, mean=1.2, sd=0.4)*
+  dnorm(x=1.1, mean=1.2, sd=0.4)
+l2
## [1] 0.5832185
+Thus, the probability of observing the three weights (0.8, 1.2, and 1.1 kg) is about 4.4 times larger given the first model than given the second model.
+The normal density function is a function of \(y\) with fixed parameters \(\mu\) and \(\sigma\). When we fix the data to a value \(y\) (i.e., we have a data file with \(n = 1\)) and vary the parameters, we get the likelihood function given the data, assuming the normal distribution as our data model. When the data contain more than one observation, the likelihood function becomes a product of the normal density values of the observations.
+\[ L(\mu, \sigma|y) = \prod_{i=1}^n p(y_i|\mu,\sigma) \]
+We can insert the three observations into this function.
Applying the likelihood function to the model from the first web page, we get the following likelihood.
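A sketch of how this can be done in R, defining a likelihood function for the observations `y` from above; the function name `lf` matches its use in the contour-plot code further below:
lf <- function(mu, sigma) prod(dnorm(y, mean=mu, sd=sigma)) # likelihood function for the data y
lf(1, 0.2)  # likelihood of the model from the first web page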
+## [1] 2.576671
+A common method of obtaining parameter estimates is the maximum likelihood (ML) method. Specifically, we search for the parameter combination for which the likelihood function is maximized. Also in Bayesian statistics, ML is used to optimize algorithms that produce posterior distributions (e.g., in the function `sim`).
To obtain ML estimates for the mean and standard deviation of the prairie dog weights, we calculate the likelihood function for many possible combinations of means and standard deviations and look for the pair of parameter values that is associated with the largest likelihood value.
+mu <- seq(0.6, 1.4, length=100)
+sigma <- seq(0.05, 0.6, length=100)
+lik <- matrix(nrow=length(mu), ncol=length(sigma))
+for(i in 1:length(mu)){
+ for(j in 1:length(sigma)){
+ lik[i,j] <- lf(mu[i], sigma[j])
+ }
+}
+contour(mu, sigma, lik, nlevels=20, xlab=expression(mu),
+ylab=expression(sigma), las=1, cex.lab=1.4)
+
+# find the ML estimates
+neglf <- function(x) -prod(dnorm(y, x[1], x[2]))
+MLest <- optim(c(0.5, 0.5), neglf) # the vector with numbers are the initial values
+
+
+points(MLest$par[1],MLest$par[2], pch=13)
+Figure 9.2: Likelihood function for y = 0.8, 1.2, 1.1 and the data model y ~ normal(mu, sigma). The cross indicates the highest likelihood value.
+## [1] 1.0333330 0.1699668
+We can get the \(\mu\) and \(\sigma\) values that produce the largest likelihood with the function `optim`. This function finds the parameter values associated with the lowest value of a user-defined function. Therefore, we define a function that calculates the negative likelihood.
The combination of mean = 1.03 and standard deviation = 0.17 is associated with the highest likelihood value (the “mountain top” in Figure 9.2). In +the case of models with more than two parameters, the likelihood function cannot be visualized easily. Often, some parameters are fixed to one value, for example, their ML estimate, to show how the likelihood function changes in relation to one or two other parameters.
+Likelihoods are often numbers very close to zero, especially when sample size is large. Then, likelihood values drop below the computing ability of computers (underflow). Therefore, the logarithm of the likelihood is often used. In addition, many standard optimizers are, in fact, minimizers such that the negative log-likelihood is used to find the ML estimates.
+The ratio between the likelihoods of two different models, for example \(L(\mu=1, \sigma=0.2|y)/L(\mu=1.2, \sigma=0.4|y) = 2.58/0.58 = 4.42\), is called the likelihood ratio, and the difference in the logarithm of the likelihoods is called the log-likelihood ratio. It means, as just stated, that the probability of the data is 4.42 times higher given model 1 than given model 2 (note that with “model” we mean here a parameterized model, i.e., with specific values for the parameters \(\mu\) and \(\sigma\)). From this, we may conclude that there is more support in the data for model 1 than for model 2.
+The likelihood function can be used to obtain confidence intervals for the ML estimates. If the “mountain” in Figure 9.2 is very steep, the uncertainty about the ML estimate is low and the confidence intervals are small, whereas when the mountain has shallow slopes, the intervals will be larger. Such confidence intervals are called profile likelihood confidence intervals (e.g., Venzon and Moolgavkar (1988)).
+Up to now, we have discussed likelihood in a frequentist framework because it is so important in statistics. Let’s now turn to the Bayesian framework. When Bayesians use the word likelihood, they mean the probability distribution of the data given the model, \(p(y|\theta)\). However, the log pointwise predictive density (lppd) could be seen as a Bayesian analog to the “frequentist” log-likelihood. The lppd is the log of the posterior density integrated over the posterior distribution of the model parameters and summed over all observations in the data:
+\[ lppd = \sum_{i=1}^{n}log\int p(y_i|\theta)p(\theta|y)d\theta \]
+Based on simulated values from the joint posterior distribution of the model parameters, the log pointwise predictive density can be calculated in R as follows.
+library(arm)
+mod <- lm(y~1) # fit model by LS method
+nsim <- 2000
+bsim <- sim(mod, n.sim=nsim) # simulate from posterior dist. of parameters
We prepare a matrix “pyi” to be filled up with the posterior densities for every observation \(y_i\) and for each of the simulated sets of model parameters.
+pyi <- matrix(nrow=length(y), ncol=nsim)
+for(i in 1:nsim) pyi[,i] <- dnorm(y, mean=bsim@coef[i,1], sd=bsim@sigma[i])
Then, we average the posterior density values over all simulations (this corresponds to the integral in the previous formula). Finally, we calculate the sum of the logarithm of the averaged density values.
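The value below was presumably obtained with a line like the following, which averages the density values over the simulations (the integral) and then sums the logarithms over the observations:
sum(log(apply(pyi, 1, mean))) # log pointwise predictive density (lppd)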
+## [1] 0.2197509
+The log pointwise predictive density can be used as a measure of model fit. We will come back to this measure when we discuss cross-validation as a method for model comparison.
+Link and Barker (2010) provide a very understandable chapter on the likelihood.
+Also, K. P. Burnham and White (2002) give a thorough introduction to the likelihood and how it is used in the multimodel inference framework. An introduction to likelihood for mathematicians, but also quite helpful for biologists, is contained in Dekking et al. (2005).
+Davidson and Solomon (1974) discuss the relationship between LS and ML methods.
+The difference between the ML- and LS-estimates of the variance is explained in J. Andrew Royle and Dorazio (2008).
+A. Gelman et al. (2014b) introduce the log predictive density and explain how it is computed (p. 167).
+In some species the identification of the sex is not possible for all individuals without sampling DNA. For example, morphological dimorphism is absent or so weak that part of the individuals cannot be assigned to one of the sexes. Particularly in ornithological long-term capture-recapture data sets, which typically are collected by voluntary bird ringers who normally do not have the possibility to analyse DNA, the sex identification is often missing for part of the individuals. For estimating survival, it would nevertheless be valuable to include the data of all individuals, use the information on sex-specific effects on survival wherever possible, but account for the fact that for part of the individuals the sex is not known. We here explain how a Cormack-Jolly-Seber model can be integrated with a mixture model in order to allow a combined analysis of individuals with and without identified sex.
+We gave an introduction to the Cormack-Jolly-Seber model in Chapter 14.5 of the book Korner-Nievergelt et al. (2015). We here expand this model by a mixture structure that allows including individuals with a missing categorical predictor variable, such as sex.
+## simulate data
+# true parameter values
+theta <- 0.6 # proportion of males
+nocc <- 15 # number of years in the data set
+b0 <- matrix(NA, ncol=nocc-1, nrow=2)
+b0[1,] <- rbeta((nocc-1), 3, 4) # capture probability of males
+b0[2,] <- rbeta((nocc-1), 2, 4) # capture probability of females
+a0 <- matrix(NA, ncol=2, nrow=2)
+a1 <- matrix(NA, ncol=2, nrow=2)
+a0[1,1]<- qlogis(0.7) # average annual survival for adult males
+a0[1,2]<- qlogis(0.3) # average annual survival for juveniles
+a0[2,1] <- qlogis(0.55) # average annual survival for adult females
+a0[2,2] <- a0[1,2]
+a1[1,1] <- 0
+a1[1,2] <- -0.5
+a1[2,1] <- -0.8
+a1[2,2] <- a1[1,2]
+
+nindi <- 1000 # number of individuals with identified sex
+nindni <- 1500 # number of individuals with non-identified sex
+nind <- nindi + nindni # total number of individuals
+y <- matrix(ncol=nocc, nrow=nind)
+z <- matrix(ncol=nocc, nrow=nind)
+first <- sample(1:(nocc-1), nind, replace=TRUE)
+sex <- sample(c(1,2), nind, prob=c(theta, 1-theta), replace=TRUE)
+juvfirst <- sample(c(0,1), nind, prob=c(0.5, 0.5), replace=TRUE)
+juv <- matrix(0, nrow=nind, ncol=nocc)
+for(i in 1:nind) juv[i,first[i]] <- juvfirst[i]  # juvenile (1) or adult (0) at first capture
+
+x <- runif(nocc-1, -2, 2) # a time dependent covariate
+p <- b0 # recapture probability
+phi <- array(NA, dim=c(2, 2, nocc-1))
+# for ad males
+phi[1,1,] <- plogis(a0[1,1]+a1[1,1]*x)
+# for ad females
+phi[2,1,] <- plogis(a0[2,1]+a1[2,1]*x)
+# for juvs
+phi[1,2,] <- phi[2,2,] <- plogis(a0[2,2]+a1[2,2]*x)
+for(i in 1:nind){
+ z[i,first[i]] <- 1
+ y[i, first[i]] <- 1
+ for(t in (first[i]+1):nocc){
+ z[i, t] <- rbinom(1, size=1, prob=z[i,t-1]*phi[sex[i],juv[i,t-1]+1, t-1])
+ y[i, t] <- rbinom(1, size=1, prob=z[i,t]*p[sex[i],t-1])
+ }
+}
+y[is.na(y)] <- 0
The mark-recapture data set consists of capture histories of 2500 individuals over 15 time periods. For each time period \(t\) and individual \(i\) the capture history matrix \(y\) contains \(y_{it}=1\) if the individual \(i\) is captured during time period \(t\), or \(y_{it}=0\) if the individual \(i\) is not captured during time period \(t\). The marking time period varies between individuals from 1 to 14. At the marking time period, the age of the individuals was classified either as juvenile or as adult. Juveniles turn into adults after one time period, thus age is known for all individuals during all time periods after marking. For 1000 individuals of the 2500 individuals, the sex is identified, whereas for 1500 individuals, the sex is unknown. The example data contain one covariate \(x\) that takes on one value for each time period.
+The observations \(y_{it}\), indicators of whether individual \(i\) was recaptured during time period \(t\), are modelled conditional on the latent true state of the individual birds \(z_{it}\) (0 = dead or permanently emigrated, 1 = alive and at the study site) as a Bernoulli variable. The probability \(P(y_{it} = 1)\) is the product of the probability that an alive individual is recaptured, \(p_{it}\), and the state of the bird \(z_{it}\) (alive = 1, dead = 0). Thus, a dead bird cannot be recaptured, whereas for a bird alive during time period \(t\), the recapture probability equals \(p_{it}\):
+\[y_{it} \sim Bernoulli(z_{it}p_{it})\]
+The latent state variable \(z_{it}\) is a Markovian variable with the state at time \(t\) being dependent on the state at time \(t-1\) and the apparent survival probability \(\phi_{it}\):
+\[z_{it} \sim Bernoulli(z_{it-1}\phi_{it})\]
+We use the term apparent survival in order to indicate that the parameter \(\phi\) is a product of site fidelity and survival. Thus, individuals that permanently emigrated from the study area cannot be distinguished from dead individuals.
+In both models, the parameters \(\phi\) and \(p\) were modelled as sex-specific. However, for part of the individuals, sex could not be identified, i.e. sex was missing. Ignoring these missing values would most likely lead to a bias because they were not missing at random: the probability that sex can be identified increases with age and most likely differs between the sexes. Therefore, we included a mixture model for the sex:
+\[Sex_i \sim Categorical(q_i)\]
+where \(q_i\) is a vector of length 2, containing the probabilities of being a male and a female, respectively. In this way, the sex of the non-identified individuals was assumed to be male or female with probability \(q[1]\) and \(q[2]=1-q[1]\), respectively. This model corresponds to the finite mixture model introduced by Pledger, Pollock, and Norris (2003) in order to account for unknown classes of birds (heterogeneity). However, in our case, for part of the individuals the class (sex) was known.
+In the example model, we constrain apparent survival to be linearly dependent on a covariate \(x\) with different slopes for males, females and juveniles, using the logit link function:
+\[logit(\phi_{it}) = a0_{sex-age-class[it]} + a1_{sex-age-class[it]}x_t\]
+Annual recapture probability was modelled for each year and each age and sex class independently:
+\[p_{it} = b0_{t,sex-age-class[it]}\]
+Uniform prior distributions were used for all parameters with a parameter space limited to values between 0 and 1 (probabilities). Normal distributions with a mean of 0 were used as priors for the other parameters, with a standard deviation of 1.5 for the intercepts \(a0\) and a standard deviation of 5 for the slopes \(a1\).
+The trick for coding the CMR-mixture model in Stan is to formulate the model 3 times:
+1. For the individuals with identified sex
+2. For the males that were not identified
+3. For the females that were not identified
Then for the non-identified individuals a mixture model is formulated that assigns a probability of being a female or a male to each individual.
+data {
+ int<lower=2> nocc; // number of capture events
+ int<lower=0> nindi; // number of individuals with identified sex
+ int<lower=0> nindni; // number of individuals with non-identified sex
+ int<lower=0,upper=2> yi[nindi,nocc]; // CH[i,k]: individual i captured at k
+ int<lower=0,upper=nocc-1> firsti[nindi]; // year of first capture
+ int<lower=0,upper=2> yni[nindni,nocc]; // CH[i,k]: individual i captured at k
+ int<lower=0,upper=nocc-1> firstni[nindni]; // year of first capture
+ int<lower=1, upper=2> sex[nindi];
+ int<lower=1, upper=2> juvi[nindi, nocc];
+ int<lower=1, upper=2> juvni[nindni, nocc];
+ int<lower=1> year[nocc];
+ real x[nocc-1]; // a covariate
+}
+
+transformed data {
+ int<lower=0,upper=nocc+1> lasti[nindi]; // last[i]: ind i last capture
+ int<lower=0,upper=nocc+1> lastni[nindni]; // last[i]: ind i last capture
+ lasti = rep_array(0,nindi);
+ lastni = rep_array(0,nindni);
+ for (i in 1:nindi) {
+ for (k in firsti[i]:nocc) {
+ if (yi[i,k] == 1) {
+ if (k > lasti[i]) lasti[i] = k;
+ }
+ }
+ }
+ for (ii in 1:nindni) {
+ for (kk in firstni[ii]:nocc) {
+ if (yni[ii,kk] == 1) {
+ if (kk > lastni[ii]) lastni[ii] = kk;
+ }
+ }
+ }
+
+}
+
+
+parameters {
+ real<lower=0, upper=1> theta[nindni]; // probability of being male for non-identified individuals
+ real<lower=0, upper=1> b0[2,nocc-1]; // intercept of p
+ real a0[2,2]; // intercept for phi
+ real a1[2,2]; // coefficient for phi
+}
+
+transformed parameters {
+ real<lower=0,upper=1>p_male[nindni,nocc]; // capture probability
+ real<lower=0,upper=1>p_female[nindni,nocc]; // capture probability
+ real<lower=0,upper=1>p[nindi,nocc]; // capture probability
+
+ real<lower=0,upper=1>phi_male[nindni,nocc-1]; // survival probability
+ real<lower=0,upper=1>chi_male[nindni,nocc+1]; // probability that an individual
+ // is never recaptured after its
+ // last capture
+ real<lower=0,upper=1>phi_female[nindni,nocc-1]; // survival probability
+ real<lower=0,upper=1>chi_female[nindni,nocc+1]; // probability that an individual
+ // is never recaptured after its
+ // last capture
+ real<lower=0,upper=1>phi[nindi,nocc-1]; // survival probability
+ real<lower=0,upper=1>chi[nindi,nocc+1]; // probability that an individual
+ // is never recaptured after its
+ // last capture
+
+ {
+ int k;
+ int kk;
+ for(ii in 1:nindi){
+ if (firsti[ii]>1) {
+ for (z in 1:(firsti[ii]-1)){
+ phi[ii,z] = 1;
+ }
+ }
+ for(tt in firsti[ii]:(nocc-1)) {
+ // linear predictor for phi:
+ phi[ii,tt] = inv_logit(a0[sex[ii], juvi[ii,tt]] + a1[sex[ii], juvi[ii,tt]]*x[tt]);
+
+ }
+ }
+
+ for(ii in 1:nindni){
+ if (firstni[ii]>1) {
+ for (z in 1:(firstni[ii]-1)){
+ phi_female[ii,z] = 1;
+ phi_male[ii,z] = 1;
+ }
+ }
+ for(tt in firstni[ii]:(nocc-1)) {
+ // linear predictor for phi:
+ phi_male[ii,tt] = inv_logit(a0[1, juvni[ii,tt]] + a1[1, juvni[ii,tt]]*x[tt]);
+ phi_female[ii,tt] = inv_logit(a0[2, juvni[ii,tt]]+ a1[2, juvni[ii,tt]]*x[tt]);
+
+ }
+ }
+
+ for(i in 1:nindi) {
+ // linear predictor for p for identified individuals
+ for(w in 1:firsti[i]){
+ p[i,w] = 1;
+ }
+ for(kkk in (firsti[i]+1):nocc)
+ p[i,kkk] = b0[sex[i],year[kkk-1]];
+ chi[i,nocc+1] = 1.0;
+ k = nocc;
+ while (k > firsti[i]) {
+ chi[i,k] = (1 - phi[i,k-1]) + phi[i,k-1] * (1 - p[i,k]) * chi[i,k+1];
+ k = k - 1;
+ }
+ if (firsti[i]>1) {
+ for (u in 1:(firsti[i]-1)){
+ chi[i,u] = 0;
+ }
+ }
+ chi[i,firsti[i]] = (1 - p[i,firsti[i]]) * chi[i,firsti[i]+1];
+ }// close definition of transformed parameters for identified individuals
+
+ for(i in 1:nindni) {
+ // linear predictor for p for non-identified individuals
+ for(w in 1:firstni[i]){
+ p_male[i,w] = 1;
+ p_female[i,w] = 1;
+ }
+ for(kkkk in (firstni[i]+1):nocc){
+ p_male[i,kkkk] = b0[1,year[kkkk-1]];
+ p_female[i,kkkk] = b0[2,year[kkkk-1]];
+ }
+ chi_male[i,nocc+1] = 1.0;
+ chi_female[i,nocc+1] = 1.0;
+ k = nocc;
+ while (k > firstni[i]) {
+ chi_male[i,k] = (1 - phi_male[i,k-1]) + phi_male[i,k-1] * (1 - p_male[i,k]) * chi_male[i,k+1];
+ chi_female[i,k] = (1 - phi_female[i,k-1]) + phi_female[i,k-1] * (1 - p_female[i,k]) * chi_female[i,k+1];
+ k = k - 1;
+ }
+ if (firstni[i]>1) {
+ for (u in 1:(firstni[i]-1)){
+ chi_male[i,u] = 0;
+ chi_female[i,u] = 0;
+ }
+ }
+ chi_male[i,firstni[i]] = (1 - p_male[i,firstni[i]]) * chi_male[i,firstni[i]+1];
+ chi_female[i,firstni[i]] = (1 - p_female[i,firstni[i]]) * chi_female[i,firstni[i]+1];
+ } // close definition of transformed parameters for non-identified individuals
+
+
+ } // close block of transformed parameters exclusive parameter declarations
+} // close transformed parameters
+
+model {
+ // priors
+ theta ~ beta(1, 1);
+ for (g in 1:(nocc-1)){
+ b0[1,g]~beta(1,1);
+ b0[2,g]~beta(1,1);
+ }
+ a0[1,1]~normal(0,1.5);
+ a0[1,2]~normal(0,1.5);
+ a1[1,1]~normal(0,3);
+ a1[1,2]~normal(0,3);
+
+ a0[2,1]~normal(0,1.5);
+ a0[2,2]~normal(a0[1,2],0.01); // for juveniles, we assume that the effect of the covariate is independent of sex
+ a1[2,1]~normal(0,3);
+ a1[2,2]~normal(a1[1,2],0.01);
+
+ // likelihood for identified individuals
+ for (i in 1:nindi) {
+ if (lasti[i]>0) {
+ for (k in firsti[i]:lasti[i]) {
+ if(k>1) target+= (log(phi[i, k-1]));
+ if (yi[i,k] == 1) target+=(log(p[i,k]));
+ else target+=(log1m(p[i,k]));
+ }
+ }
+ target+=(log(chi[i,lasti[i]+1]));
+ }
+
+ // likelihood for non-identified individuals
+ for (i in 1:nindni) {
+ real log_like_male = 0;
+ real log_like_female = 0;
+
+ if (lastni[i]>0) {
+ for (k in firstni[i]:lastni[i]) {
+ if(k>1){
+ log_like_male += (log(phi_male[i, k-1]));
+ log_like_female += (log(phi_female[i, k-1]));
+ }
+ if (yni[i,k] == 1){
+ log_like_male+=(log(p_male[i,k]));
+ log_like_female+=(log(p_female[i,k]));
+ }
+ else{
+ log_like_male+=(log1m(p_male[i,k]));
+ log_like_female+=(log1m(p_female[i,k]));
+ }
+
+ }
+ }
+ log_like_male += (log(chi_male[i,lastni[i]+1]));
+ log_like_female += (log(chi_female[i,lastni[i]+1]));
+
+ target += log_mix(theta[i], log_like_male, log_like_female);
+ }
+
+}
# Run STAN
+library(rstan)
+fit <- stan(file = "stanmodels/cmr_mixture_model.stan", data=datax, verbose = FALSE)
+# for the above simulated data (25000 individuals x 15 time periods),
+# computing time is around 48 hours on an Intel Core i7 laptop
+# for larger data sets, we recommend moving the transformed parameters block
+# to the model block in order to avoid monitoring p_male, p_female,
+# phi_male and phi_female, which can produce memory problems
+
+# launch_shinystan(fit) # diagnostic plots
+summary(fit)
## mean se_mean sd 2.5% 25%
+## b0[1,1] 0.60132367 0.0015709423 0.06173884 0.48042366 0.55922253
+## b0[1,2] 0.70098709 0.0012519948 0.04969428 0.60382019 0.66806698
+## b0[1,3] 0.50293513 0.0010904085 0.04517398 0.41491848 0.47220346
+## b0[1,4] 0.28118209 0.0008809447 0.03577334 0.21440931 0.25697691
+## b0[1,5] 0.34938289 0.0009901335 0.03647815 0.27819918 0.32351323
+## b0[1,6] 0.13158569 0.0006914740 0.02627423 0.08664129 0.11286629
+## b0[1,7] 0.61182981 0.0010463611 0.04129602 0.53187976 0.58387839
+## b0[1,8] 0.48535193 0.0010845951 0.04155762 0.40559440 0.45750793
+## b0[1,9] 0.52531291 0.0008790063 0.03704084 0.45247132 0.50064513
+## b0[1,10] 0.87174780 0.0007565552 0.03000936 0.80818138 0.85259573
+## b0[1,11] 0.80185454 0.0009425675 0.03518166 0.73173810 0.77865187
+## b0[1,12] 0.33152443 0.0008564381 0.03628505 0.26380840 0.30697293
+## b0[1,13] 0.42132288 0.0012174784 0.04140382 0.34062688 0.39305210
+## b0[1,14] 0.65180372 0.0015151039 0.05333953 0.55349105 0.61560493
+## b0[2,1] 0.34237039 0.0041467200 0.12925217 0.12002285 0.24717176
+## b0[2,2] 0.18534646 0.0023431250 0.07547704 0.05924694 0.12871584
+## b0[2,3] 0.61351083 0.0024140550 0.07679100 0.46647727 0.56242546
+## b0[2,4] 0.37140208 0.0024464965 0.06962399 0.24693888 0.32338093
+## b0[2,5] 0.19428215 0.0034618302 0.11214798 0.02800056 0.11146326
+## b0[2,6] 0.27371336 0.0026553769 0.09054020 0.11827243 0.20785316
+## b0[2,7] 0.18611173 0.0014387436 0.05328492 0.09122869 0.14789827
+## b0[2,8] 0.25648337 0.0018258589 0.05287800 0.16255769 0.21913271
+## b0[2,9] 0.20378754 0.0021367769 0.07380004 0.07777998 0.15215845
+## b0[2,10] 0.52679548 0.0024625568 0.08696008 0.36214334 0.46594844
+## b0[2,11] 0.47393354 0.0032593161 0.10555065 0.28843967 0.39781278
+## b0[2,12] 0.22289155 0.0017082729 0.05551514 0.12576797 0.18203335
+## b0[2,13] 0.26191486 0.0024159794 0.07016314 0.14106495 0.21234017
+## b0[2,14] 0.65111737 0.0055743944 0.18780555 0.29279480 0.50957591
+## a0[1,1] 0.95440670 0.0013771881 0.04808748 0.86301660 0.92146330
+## a0[1,2] 0.01529770 0.0469699511 1.46995922 -2.82218067 -0.95533706
+## a0[2,1] 0.16384995 0.0049928331 0.12634422 -0.06399631 0.07533962
+## a0[2,2] 0.01535679 0.0469634175 1.47006964 -2.81864060 -0.95515751
+## a1[1,1] 0.15937249 0.0028992587 0.08864790 -0.01288607 0.10017613
+## a1[1,2] 0.08055953 0.1007089857 3.02148727 -5.95525636 -1.96662599
+## a1[2,1] -0.83614134 0.0074143920 0.18655882 -1.21033848 -0.95698565
+## a1[2,2] 0.08071668 0.1006904255 3.02145647 -5.94617355 -1.96508733
+## 50% 75% 97.5% n_eff Rhat
+## b0[1,1] 0.60206306 0.6431566 0.7206343 1544.5301 1.002331
+## b0[1,2] 0.70165494 0.7355204 0.7946280 1575.4617 1.001482
+## b0[1,3] 0.50367411 0.5330078 0.5898079 1716.3196 1.001183
+## b0[1,4] 0.27997512 0.3046483 0.3544592 1649.0040 1.000760
+## b0[1,5] 0.34936442 0.3751935 0.4191138 1357.3073 1.002072
+## b0[1,6] 0.12987449 0.1481661 0.1873982 1443.8040 1.003676
+## b0[1,7] 0.61203228 0.6397577 0.6933929 1557.5904 1.001458
+## b0[1,8] 0.48513822 0.5134314 0.5672066 1468.1355 1.002511
+## b0[1,9] 0.52534212 0.5501747 0.5994060 1775.7335 1.000824
+## b0[1,10] 0.87324112 0.8934047 0.9258033 1573.3747 1.000719
+## b0[1,11] 0.80300311 0.8261868 0.8675033 1393.1817 1.001172
+## b0[1,12] 0.33044476 0.3552199 0.4052902 1794.9956 1.000566
+## b0[1,13] 0.42116690 0.4492297 0.5026942 1156.5339 1.000289
+## b0[1,14] 0.64956850 0.6864706 0.7607107 1239.4056 1.004061
+## b0[2,1] 0.33493631 0.4251416 0.6150923 971.5524 1.004049
+## b0[2,2] 0.17981663 0.2358847 0.3446097 1037.6210 1.001474
+## b0[2,3] 0.61326419 0.6644156 0.7628427 1011.8737 1.005727
+## b0[2,4] 0.36837778 0.4158585 0.5190457 809.8949 1.003803
+## b0[2,5] 0.17910449 0.2591418 0.4533117 1049.4733 1.001499
+## b0[2,6] 0.26739172 0.3299594 0.4685139 1162.6006 1.001170
+## b0[2,7] 0.18254607 0.2198969 0.3003156 1371.6455 1.000878
+## b0[2,8] 0.25280556 0.2895585 0.3704113 838.7174 1.005624
+## b0[2,9] 0.19724053 0.2501298 0.3694806 1192.8747 1.003687
+## b0[2,10] 0.52587075 0.5845730 0.7061694 1247.0027 1.002851
+## b0[2,11] 0.46874445 0.5392302 0.7046892 1048.7425 0.999473
+## b0[2,12] 0.21961656 0.2580782 0.3397127 1056.1081 1.000907
+## b0[2,13] 0.25601959 0.3056204 0.4142888 843.3960 1.003130
+## b0[2,14] 0.65824835 0.7973674 0.9698829 1135.0669 1.003838
+## a0[1,1] 0.95368445 0.9862439 1.0515747 1219.2071 1.003898
+## a0[1,2] 0.01633534 0.9911055 2.9717839 979.4231 1.003726
+## a0[2,1] 0.15519648 0.2472483 0.4230776 640.3489 1.004625
+## a0[2,2] 0.01587281 0.9898084 2.9659552 979.8429 1.003744
+## a1[1,1] 0.15647489 0.2205720 0.3354845 934.8953 1.007190
+## a1[1,2] 0.06683287 2.1568781 6.0295208 900.1297 1.003701
+## a1[2,1] -0.83503982 -0.7075691 -0.4814539 633.1119 1.010568
+## a1[2,2] 0.06586905 2.1557247 6.0239735 900.4432 1.003704
+
+Analysis of nest survival is important for understanding the mechanisms of population dynamics. The life-span of a nest could be used as a measure of nest survival. However, this measure is often biased towards nests that survived longer, because such nests are detected by ornithologists with higher probability (Mayfield 1975). In order not to overestimate nest survival, daily nest survival conditional on survival to the previous day can be estimated.
+Which model is best used depends on the type of data available.
+Model | Data | Software, R-code |
+:-----------------------------|:------------|:---------------------------------------------------------------------|
+Binomial or Bernoulli model | 1, (3) | `glm`, `glmer`, … |
+Cox proportional hazard model | 1, 2, 3, 4 | `brm`, soon: `stan_cox` |
+Known fate model | 1, 2 | Stan code below |
+Known fate model | 3, 4 | Stan code below |
+Logistic exposure model | 1, 2, 3, 4 | `glm`, `glmer` using a link function that depends on exposure time |
+Shaffer (2004) explains how to adapt the link function in a Bernoulli model to account for the nests having been found at different nest ages (exposure time). Ben Bolker explains how to implement the logistic exposure model in R here (see also the sketch below).
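+Because this link function is not built into R, it has to be defined as a custom `link-glm` object. The following is a minimal sketch along the lines of Ben Bolker's published implementation; the data frame `nestdat` and its columns `survived`, `cover` and `exposure` in the commented usage line are hypothetical.
+# minimal sketch of the logistic exposure link (Shaffer 2004, implementation following Ben Bolker);
+# 'exposure' is the number of days a nest was exposed between two checks
+logexp <- function(exposure = 1) {
+  linkfun <- function(mu) qlogis(mu^(1/exposure))            # eta = logit of daily survival
+  linkinv <- function(eta) plogis(eta)^exposure              # survival over the whole interval
+  mu.eta  <- function(eta) exposure * plogis(eta)^(exposure - 1) * dlogis(eta)
+  valideta <- function(eta) TRUE
+  structure(list(linkfun = linkfun, linkinv = linkinv, mu.eta = mu.eta,
+                 valideta = valideta, name = "logexp"),
+            class = "link-glm")
+}
+# hypothetical usage: 'survived' (0/1 per observation interval), 'cover', 'exposure' in days
+# mod <- glm(survived ~ cover, data = nestdat,
+#            family = binomial(link = logexp(nestdat$exposure)))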
+A natural model that allows estimating daily nest survival is the known-fate survival model. It is a Markov model that models the state of a nest \(i\) at day \(t\) (whether it is alive, \(y_{it}=1\) or not \(y_{it}=0\)) as a Bernoulli variable dependent on the state of the nest the day before.
+\[ y_{it} \sim Bernoulli(y_{it-1}S_{it})\]
+The daily nest survival \(S_{it}\) can be linearly related to predictor variables that are measured on the nest or on the day level.
+\[logit(S_{it}) = \textbf{X} \boldsymbol{\beta}\]
+It is also possible to add random effects if needed.
+The following Stan model code is saved as `daily_nest_survival.stan`.
data {
+ int<lower=0> Nnests; // number of nests
+ int<lower=0> last[Nnests]; // day of last observation (alive or dead)
+ int<lower=0> first[Nnests]; // day of first observation (alive or dead)
+ int<lower=0> maxage; // maximum of last
+ int<lower=0> y[Nnests, maxage]; // indicator of alive nests
+ real cover[Nnests]; // a covariate of the nest
+ real age[maxage]; // a covariate of the date
+}
+
+parameters {
+ vector[3] b; // coef of linear pred for S
+}
+
+model {
+ real S[Nnests, maxage-1]; // survival probability
+
+ for(i in 1:Nnests){
+ for(t in first[i]:(last[i]-1)){
+ S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]);
+ }
+ }
+
+ // priors
+ b[1]~normal(0,5);
+ b[2]~normal(0,3);
+ b[3]~normal(0,3);
+
+ // likelihood
+ for (i in 1:Nnests) {
+ for(t in (first[i]+1):last[i]){
+ y[i,t]~bernoulli(y[i,t-1]*S[i,t-1]);
+ }
+ }
+}
+Data are from Grendelmeier (2018).
+## List of 7
+## $ y : int [1:156, 1:31] 1 NA 1 NA 1 NA NA 1 1 1 ...
+## $ Nnests: int 156
+## $ last : int [1:156] 26 30 31 27 31 30 31 31 31 31 ...
+## $ first : int [1:156] 1 14 1 3 1 24 18 1 1 1 ...
+## $ cover : num [1:156] -0.943 -0.215 0.149 0.149 -0.215 ...
+## $ age : num [1:31] -1.65 -1.54 -1.43 -1.32 -1.21 ...
+## $ maxage: int 31
+
+
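+The call that fitted the model is not shown in the source; based on the output below (5 chains with 2500 iterations each), it could have looked roughly like this (the path to the Stan file is an assumption, analogous to the capture-recapture example above):
+library(rstan)
+mod <- stan(file = "stanmodels/daily_nest_survival.stan", data = datax,
+            chains = 5, iter = 2500)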
+We love exploring the performance of the Markov chains by using the function `launch_shinystan` from the package `shinystan`.
It looks like cover does not affect daily nest survival, but daily nest survival decreases with the age of the nestlings.
+ +## Inference for Stan model: anon_model.
+## 5 chains, each with iter=2500; warmup=1250; thin=1;
+## post-warmup draws per chain=1250, total post-warmup draws=6250.
+##
+## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
+## b[1] 4.04 0.00 0.15 3.76 3.94 4.04 4.14 4.35 3828 1
+## b[2] 0.00 0.00 0.13 -0.25 -0.09 -0.01 0.08 0.25 4524 1
+## b[3] -0.70 0.00 0.16 -1.02 -0.81 -0.69 -0.59 -0.39 3956 1
+## lp__ -298.98 0.03 1.30 -302.39 -299.52 -298.65 -298.05 -297.53 2659 1
+##
+## Samples were drawn using NUTS(diag_e) at Thu Jan 19 22:33:33 2023.
+## For each parameter, n_eff is a crude measure of effective sample size,
+## and Rhat is the potential scale reduction factor on split chains (at
+## convergence, Rhat=1).
+# effect plot
+bsim <- as.data.frame(mod)
+nsim <- nrow(bsim)
+
+newdat <- data.frame(age=seq(1, datax$maxage, length=100))
+newdat$age.z <- (newdat$age-mean(1:datax$maxage))/sd((1:datax$maxage))
+Xmat <- model.matrix(~age.z, data=newdat)
+fitmat <- matrix(ncol=nsim, nrow=nrow(newdat))
+for(i in 1:nsim) fitmat[,i] <- plogis(Xmat%*%as.numeric(bsim[i,c(1,3)]))
+newdat$fit <- apply(fitmat, 1, median)
+newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025)
+newdat$upr <- apply(fitmat, 1, quantile, prob=0.975)
+
+plot(newdat$age, newdat$fit, ylim=c(0.8,1), type="l",
+ las=1, ylab="Daily nest survival", xlab="Age [d]")
+lines(newdat$age, newdat$lwr, lty=3)
+lines(newdat$age, newdat$upr, lty=3)
+Figure 24.1: Estimated daily nest survival probability in relation to nest age. Dotted lines are 95% uncertainty intervals of the regression line.
+When nests are checked only irregularly, it may happen that a nest is found predated or dead after a longer break between checks. In such cases, we only know that the nest was predated or died due to other causes sometime between the last check at which it was still alive and the check at which it was found dead. We then need to tell the model that the nest could have died at any time during the interval in which we were not checking.
+To do so, we create a variable that indicates the time (e.g., day since first egg) when the nest was last seen alive (`lastlive`). A second variable indicates the time of the last check, which is either equal to `lastlive` when the nest survived until the last check, or larger than `lastlive` when a nest failure has been recorded. A last variable, `gap`, measures the time interval in which the nest failure occurred. A `gap` of zero means that the nest was still alive at the last check, a `gap` of 1 means that the nest failure occurred during the first day after `lastlive`, and a `gap` of 2 means that the failure occurred on either the first or the second day after `lastlive`.
# time when nest was last observed alive
+lastlive <- apply(datax$y, 1, function(x) max(c(1:length(x))[x==1]))
+
+# time when nest was last checked (alive or dead)
+lastcheck <- lastlive+1
+# here, we turn the above data into a format that can be used for
+# irregular nest controls. WOULD BE NICE TO HAVE A REAL DATA EXAMPLE!
+
+# when nest was observed alive at the last check, then lastcheck equals lastlive
+lastcheck[lastlive==datax$last] <- datax$last[lastlive==datax$last]
+
+datax1 <- list(Nnests=datax$Nnests,
+ lastlive = lastlive,
+ lastcheck= lastcheck,
+ first=datax$first,
+ cover=datax$cover,
+ age=datax$age,
+ maxage=datax$maxage)
+# time between last seen alive and first seen dead (= lastcheck)
+datax1$gap <- datax1$lastcheck-datax1$lastlive
In the Stan model code, we specify the likelihood for each gap separately.
+data {
+ int<lower=0> Nnests; // number of nests
+ int<lower=0> lastlive[Nnests]; // day of last observation (alive)
+ int<lower=0> lastcheck[Nnests]; // day of observed death or, if alive, last day of study
+ int<lower=0> first[Nnests]; // day of first observation (alive or dead)
+ int<lower=0> maxage; // maximum of last
+ real cover[Nnests]; // a covariate of the nest
+ real age[maxage]; // a covariate of the date
+ int<lower=0> gap[Nnests]; // obsdead - lastlive
+}
+
+parameters {
+ vector[3] b; // coef of linear pred for S
+}
+
+model {
+ real S[Nnests, maxage-1]; // survival probability
+
+ for(i in 1:Nnests){
+ for(t in first[i]:(lastcheck[i]-1)){
+ S[i,t] = inv_logit(b[1] + b[2]*cover[i] + b[3]*age[t]);
+ }
+ }
+
+ // priors
+ b[1]~normal(0,1.5);
+ b[2]~normal(0,3);
+ b[3]~normal(0,3);
+
+ // likelihood
+ for (i in 1:Nnests) {
+ for(t in (first[i]+1):lastlive[i]){
+ 1~bernoulli(S[i,t-1]);
+ }
+ if(gap[i]==1){
+ target += log(1-S[i,lastlive[i]]); //
+ }
+ if(gap[i]==2){
+ target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1])); //
+ }
+ if(gap[i]==3){
+ target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) +
+ prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2])); //
+ }
+ if(gap[i]==4){
+ target += log((1-S[i,lastlive[i]]) + S[i,lastlive[i]]*(1-S[i,lastlive[i]+1]) +
+ prod(S[i,lastlive[i]:(lastlive[i]+1)])*(1-S[i,lastlive[i]+2]) +
+ prod(S[i,lastlive[i]:(lastlive[i]+2)])*(1-S[i,lastlive[i]+3])); //
+ }
+
+ }
+}
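+Assuming the above Stan code is saved in a file named, say, `daily_nest_survival_irregular.stan` (the file name is our assumption), it can be fitted to the reformatted data `datax1` in the same way as before:
+library(rstan)
+mod1 <- stan(file = "stanmodels/daily_nest_survival_irregular.stan", data = datax1)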
Helpful links:
+https://deepai.org/publication/bayesian-survival-analysis-using-the-rstanarm-r-package (Brilleman et al. 2020)
+https://www.hammerlab.org/2017/06/26/introducing-survivalstan/
+In Bayesian statistics, probability distributions are used for two fundamentally different purposes. First, they are used to describe distributions of data. These distributions are also called data distributions. Second, probability distributions are used to express information or knowledge about parameters. Such distributions are called prior or posterior distributions. The data distributions are part of descriptive statistics, whereas prior and posterior distributions are part of inferential statistics. The usage of probability distributions for describing data does not differ between frequentist and Bayesian statistics. Classically, the data distribution is known as the "model assumption". Specific to Bayesian statistics is the formal expression of statistical uncertainty (or "information" or "knowledge") by prior and posterior distributions. We here introduce some of the most often used probability distributions and present how they are used in statistics.
+Probability distributions are grouped into discrete and continuous distributions. Discrete distributions define for any discrete value the probability that exactly this value occurs. They are usually used as data distributions for discrete data such as counts. The function that describes a discrete distribution is called a probability function (their values are probabilities, i.e. a number between 0 and 1). Continuous distributions describe how continuous values are distributed. They are used as data distributions for continuous measurements such as body size and also as prior or posterior distributions for parameters such as the mean body size. Most parameters are measured on a continuous scale. The function that describes continuous distributions is called density function. Its values are non-negative and the area under the density function equals one. The area under a density function corresponds to probabilities. For example, the area under the density function above the value 2 corresponds to the proportion of data with values above 2 if the density function describes data, or it corresponds to the probability that the parameter takes on a value bigger than 2 if the density function is a posterior distribution.
+Bernoulli distributed data take on the exact values 0 or 1. The value 1 occurs with probability \(p\).
+\(x \sim Bernoulli(p)\)
+The probability function is \(p(x) = p^x(1-p)^{1-x}\).
+The expected value is \(E(x) = p\) and the variance is \(Var(x) = p(1-p)\).
+The flipping experiment of a fair coin produces Bernoulli distributed data with \(p=0.5\) if head is taken as one and tail is taken as zero. The Bernoulli distribution is usually used as a data model for binary data such as whether a nest box is used or not, whether a seed germinated or not, whether a species occurs or not in a plot etc.
+The binomial distribution describes the number of ones among a predefined number of Bernoulli trials. For example, the number of heads among 20 coin flips, the number of used nest boxes among the 50 nest boxes of the study area, or the number of seeds that germinated among the 10 seeds in the pot. Binomially distributed data are counts with an upper limit (\(n\)).
+\(x \sim binomial(p,n)\)
+The probability function is \(p(x) = {n\choose x} p^x(1-p)^{(n-x)}\).
+The expected value is \(E(x) = np\) and the variance is \(Var(x) = np(1-p)\).
+Figure 4.1: Two examples of a binomial distribution. size: number of trials (the argument in the corresponding R function, for example in rbinom, is called size). p: success probability.
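+These formulas are easy to check by simulation in R (a small illustration with arbitrary parameter values):
+n <- 20; p <- 0.3                       # arbitrary example values
+dbinom(5, size = n, prob = p)           # probability of exactly 5 successes among 20 trials
+x <- rbinom(10000, size = n, prob = p)  # simulate binomially distributed counts
+mean(x); n * p                          # sample mean vs. E(x) = np
+var(x); n * p * (1 - p)                 # sample variance vs. Var(x) = np(1-p)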
+The Poisson distribution describes the distribution of counts without upper boundary, i.e., when we know how many times something happened but we do not know how many times it did not happen. A typical Poisson distributed variable is the number of raindrops in equally-sized grid cells on the floor, if we can assume that every rain drop falls down completely independent of the other raindrops and at a completely random point (Figure 4.2).
+\(x \sim Poisson(\lambda)\)
+The probability function is \(p(x) = \frac{1}{x!}\lambda^x exp(-\lambda)\). It is implemented in the R-function `dpois`.
+The expected value is \(E(x) = \lambda\) and the variance is \(Var(x) = \lambda\).
set.seed(1338)
+n <- 500 # simulate 500 raindrops
+x <- runif(n) # they fall at some random point (x,y) in space
+y <- runif(n)
+
+par(mfrow=c(1,2))
+
+par(mar=c(1,1,1,1))
+plot(c(0,1), c(0,1), type="n", xaxs="i", yaxs="i", xlab="", ylab="", axes=F)
+box()
+points(x,y, pch=16)
+
+# add a grid
+grid(10, 10, col=1)
+
+# number of points per grid-cell
+xcell <- cut(x, breaks=seq(0,1, by=0.1))
+ycell <- cut(y, breaks=seq(0,1, by=0.1))
+counts <- as.numeric(table(xcell, ycell))
+
+par(mar=c(4,4,1,1))
+hist(counts, col="blue", cex.lab=1.4, las=1, cex.axis=1.2, main="")
+Figure 4.2: A natural process that produces Poisson distributed data is the number of raindrops falling (at random) into equally sized cells of a grid. Left: spatial distribution of raindrops, right: corresponding distribution of the number of raindrops per cell.
+An important property of the Poisson distribution is that it has only one parameter \(\lambda\). As a consequence, it does not allow for any combination of means and variances. In fact, they are assumed to be the same. In the real world, most count data do not behave like rain drops, that means variances of count data are in most real world examples not equal to the mean as assumed by the Poisson distribution. Therefore, when using the Poisson distribution as a data model, it is important to check for overdispersion.
+The property that in a Poisson distribution the mean equals the variance can be used to quickly assess whether the spatial distribution of observations, for example, nest locations, is clustered, random, or equally spaced. Animal locations could be clustered due to coloniality, social, or other attraction. More equally spaced locations may be due to territoriality. Let \(x\) be the number of observations per grid cell; if \(var(x)/mean(x)>>1\) the observations are clustered, whereas if \(var(x)/mean(x)<<1\), the observations are more equally spaced than expected by chance. Clustering will lead to overdispersion in the counts, whereas more equally spaced locations will lead to underdispersion.
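+For the raindrop counts simulated above, this check is a single line of R:
+var(counts)/mean(counts)  # close to 1: random; >>1: clustered; <<1: more evenly spaced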
+Further, note that not all variables measured as an integer number are count data! For example, the number of days an animal spends in a specific area before moving away looks like a count. However, it is a continuous measurement. The duration an animal spends in a specific area could also be measured in hours or minutes. The Poisson model assumes that the counts are all events that happened. However, the emigration of an animal is just one event, independent of how long it stayed.
+The negative-binomial distribution represents the number of zeros which occur in a sequence of Bernoulli trials before a target number of ones is reached. It is hard to see this situation in, e.g., the number of individuals counted on plots. Therefore, we were reluctant to introduce this distribution in our old book (Korner-Nievergelt et al. 2015). However, the negative-binomial distribution often fits much better to count data than the Poisson model because it has two parameters and therefore allows for fitting both the mean and the variance to the data. Therefore, we started using the negative-binomial distribution as a data model more often.
+\(x \sim negative-binomial(p,n)\)
+Its probability function is rather complex:
+\(p(x) = \frac{\Gamma(x+n)}{\Gamma(n) x!} p^n (1-p)^x\) with \(\Gamma\) being the Gamma-function. Luckily, the negative-binomial probability function is implemented in the R-function `dnbinom`.
+The expected value of the negative-binomial distribution is \(E(x) = n\frac{(1-p)}{p}\) and the variance is \(Var(x) = n\frac{(1-p)}{p^2}\).
+We like to specify the distribution using the mean and the scale parameter, \(x \sim negative-binomial(\mu,\theta)\), because in practice we often specify a linear predictor for the logarithm of the mean \(\mu\).
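+In R, this mean parameterisation corresponds to the `mu` and `size` arguments of `dnbinom` and `rnbinom` (a small illustration with arbitrary parameter values):
+mu <- 5; theta <- 2                        # arbitrary example values
+dnbinom(0:3, mu = mu, size = theta)        # probabilities of counts 0 to 3
+x <- rnbinom(10000, mu = mu, size = theta)
+mean(x)                                    # close to mu
+var(x); mu + mu^2/theta                    # variance is larger than the mean (overdispersion)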
+The beta distribution is restricted to the range [0,1]. It describes the knowledge about a probability parameter. Therefore, it is usually used as a prior or posterior distribution for probabilities. The beta distribution is sometimes also used as a data model for continuous proportions. However, it is often difficult to get a good fit of such models, because measured proportions frequently take on values of exactly zero or one, which is not allowed in most (but not all) beta distributions; thus, this distribution does not describe the variance of measured proportions correctly. However, for describing knowledge of a proportion parameter, it is a very convenient distribution with two parameters.
+\(x \sim beta(a,b)\)
+Its density function is \(p(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}\). The R-function `dbeta` does the rather complicated calculations for us.
The expected value of a beta distribution is \(E(x) = \frac{a}{(a+b)}\) and the variance is \(Var(x) = \frac{ab}{(a+b)^2(a+b+1)}\). The \(beta(1,1)\) distribution is equal to the \(uniform(0,1)\) distribution. The higher the sum of \(a\) and \(b\), the more narrow is the distribution (Figure 4.3).
+Figure 4.3: Beta distributions with different parameter values.
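+As a small illustration of how the beta distribution expresses knowledge about a probability: starting from a uniform \(beta(1,1)\) prior, observing \(y\) successes among \(n\) Bernoulli trials gives a \(beta(1+y, 1+n-y)\) posterior (standard beta-binomial conjugacy; the numbers below are arbitrary):
+y <- 7; n <- 10                  # arbitrary example data: 7 successes among 10 trials
+a <- 1 + y; b <- 1 + (n - y)     # posterior is beta(8, 4) when the prior is beta(1, 1)
+a/(a + b)                        # posterior mean of the probability
+qbeta(c(0.025, 0.975), a, b)     # 95% interval of the posterior distribution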
+The normal, or Gaussian, distribution has long been widely used in statistics. It describes the distribution of measurements that vary because of a sum of random errors. Based on the central limit theorem, sample averages are approximately normally distributed.
+\(x \sim normal(\mu, \sigma^2)\)
+The density function is \(p(x) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x -\mu)^2)\) and it is implemented in the R-function `dnorm`.
The expected value is \(E(x) = \mu\) and the variance is \(Var(x) = \sigma^2\).
+The variance parameter can be specified as a variance, a standard deviation or a precision. Different software (or authors) have different habits, e.g., R and Stan use the standard deviation \(\sigma\), whereas BUGS (WinBUGS, OpenBUGS or JAGS) uses the precision, which is the inverse of the variance, \(\tau = 1/\sigma^2\).
+The normal distribution is used as a data model for measurements that scatter symmetrically around a mean, such as body size (in m), food consumption (in g), or body temperature (°C). The normal distribution also serves as prior distribution for parameters that can take on negative or positive values. The larger the variance, the flatter (less informative) is the distribution.
+The standard normal distribution is a normal distribution with a mean of zero and a variance of one, \(z \sim normal(0, 1)\). The standard normal distribution is also called the z-distribution. Or, a z-variable is a variable with a mean of zero and a standard deviation of one.
+x <- seq(-3, 3, length=100)
+y <- dnorm(x) # density function of a standard normal distribution
+dat <- tibble(x=x,y=y)
+plot(x,y, type="l", lwd=2, col="#d95f0e", las=1, ylab="normal density of x")
+segments(0, dnorm(1), 1, dnorm(1), lwd=2)
+segments(0, dnorm(0), 0, 0)
+text(0.5, 0.23, expression(sigma))
+Figure 4.4: Standard normal distribution
+Plus minus one times the standard deviation (\(\sigma\)) from the mean includes around 68% of the area under the curve (corresponding to around 68% of the data points in case the normal distribution is used as a data model, or 68% of the prior or posterior mass if the normal distribution is used to describe the knowledge about a parameter). Plus minus two times the standard deviation includes around 95% of the area under the curve.
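+These proportions can be verified with the cumulative distribution function `pnorm`:
+pnorm(1) - pnorm(-1)   # ~0.68, area within one standard deviation of the mean
+pnorm(2) - pnorm(-2)   # ~0.95, area within two standard deviations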
+The gamma distribution is a continuous probability distribution for strictly positive values (zero is not included). The shape of the gamma distribution is right skewed with a long upper tail, whereas most of the mass is centered around a usually small value. It has two parameters, the shape \(\alpha\) and the inverse scale \(\beta\).
+\(x \sim gamma(\alpha,\beta)\)
+Its density function is \(p(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{(\alpha-1)} exp(-\beta x)\), or `dgamma` in R. The expected value is \(E(x) = \frac{\alpha}{\beta}\) and the variance is \(Var(x) = \frac{\alpha}{\beta^2}\).
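+A quick simulation check of the mean and variance formulas (arbitrary parameter values):
+a <- 2; b <- 0.5                        # arbitrary shape (alpha) and inverse scale (beta)
+x <- rgamma(10000, shape = a, rate = b)
+mean(x); a/b                            # E(x) = alpha/beta
+var(x); a/b^2                           # Var(x) = alpha/beta^2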
The gamma distribution is becoming more and more popular as a data model for durations (time to event) or other highly right skewed continuous measurements that do not have values of zero.
+The gamma distribution is a conjugate prior distribution for the mean of a Poisson distribution and for the precision parameter of a normal distribution. However, in hierarchical models with normally distributed random effects, it is not recommended to use the gamma distribution as a prior distribution for the among-group variance (A. Gelman 2006). The Cauchy or folded t-distribution seem to have less influence on the posterior distributions of the variance parameters.
+The Cauchy distribution is a symmetric distribution with much heavier tails compared to the normal distribution.
+\(x \sim Cauchy(a,b)\)
+Its probability density function is \(p(x) = \frac{1}{\pi b[1+(\frac{x-a}{b})^2]}\). The mean and the variance of the Cauchy distribution are not defined. The median is \(a\).
+The part of the Cauchy distribution for positive values, i.e., half of the Cauchy distribution, is often used as a prior distribution for variance parameters.
+The t-distribution is the marginal posterior distribution of the mean of a sample with unknown variance when conjugate prior distributions are used to obtain the posterior distribution. The t-distribution has three parameters, the degrees of freedom \(v\), the location \(\mu\) and the scale \(\sigma\).
+\(x \sim t(v, \mu, \sigma)\)
+Its density function is \(p(x) = \frac{\Gamma((v+1)/2)}{\Gamma(v/2)\sqrt{v\pi}\sigma}(1+\frac{1}{v}(\frac{x-\mu}{\sigma})^2)^{-(v+1)/2}\). Its expected value is \(E(x) = \mu\) for \(v>1\) and the variance is \(Var(x) = \frac{v}{v-2}\sigma ^2\) for \(v>2\).
+The t-distribution is sometimes used as data model. Because of its heavier tails compared to the normal model, the model parameters are less influenced by measurement errors when a t-distribution is used instead of a normal distribution. This is called “robust statistics”.
+Similar to the Cauchy distribution, the folded t-distribution, i.e., the positive part of the t-distribution, can serve as a prior distribution for variance parameters.
+The F-distribution is not important in Bayesian statistics. Ratios of sample variances drawn from populations with equal variances follow an F-distribution. The density function of the F-distribution is even more complicated than the one of the t-distribution! We do not copy it here. Further, we have not yet met any Bayesian example where the F-distribution is used (which does not mean that there is none). It is used in frequentist analyses in order to compare variances, e.g. within ANOVAs. If two variances only differ because of natural variation in the data (null hypothesis), then \(\frac{Var(X_1)}{Var(X_2)}\sim F_{df_1,df_2}\).
+Figure 4.5: Different density functions of the F statistic.
+To introduce the linear mixed model, we use repeated hormone measurements of nestling Barn Owls Tyto alba. The cortbowl data set contains stress hormone data (corticosterone, variable 'totCort') of nestling Barn Owls which were either treated with a corticosterone-implant or with a placebo-implant as the control group. The aim of the study was to quantify the corticosterone increase due to the corticosterone implants (Almasi et al. 2009). In each brood, one or two nestlings were implanted with a corticosterone-implant and one or two nestlings with a placebo-implant (variable 'Implant'). Blood samples were taken just before implantation, and at days 2 and 20 after implantation.
+data(cortbowl)
+dat <- cortbowl
+dat$days <- factor(dat$days, levels=c("before", "2", "20"))
+str(dat) # the data were sampled in 2004, 2005, and 2005 by the Swiss Ornithological Institute
## 'data.frame': 287 obs. of 6 variables:
+## $ Brood : Factor w/ 54 levels "231","232","233",..: 7 7 7 7 8 8 9 9 10 10 ...
+## $ Ring : Factor w/ 151 levels "898054","898055",..: 44 45 45 46 31 32 9 9 18 19 ...
+## $ Implant: Factor w/ 2 levels "C","P": 2 2 2 1 2 1 1 1 2 1 ...
+## $ Age : int 49 29 47 25 57 28 35 53 35 31 ...
+## $ days : Factor w/ 3 levels "before","2","20": 3 2 3 2 3 1 2 3 2 2 ...
+## $ totCort: num 5.76 8.42 8.05 25.74 8.04 ...
+In total, there are 287 measurements of 151 individuals (variable ‘Ring’) of 54 broods. Because the measurements from the same individual are non-independent, we use a mixed model to analyze these data: Two additional arguments for a mixed model are: a) the mixed model allows prediction of corticosterone levels for an ‘average’ individual, whereas the fixed effect model allows prediction of corticosterone levels only for the 151 individuals that were sampled; and b) fewer parameters are needed. If we include individual as a fixed factor, we would use 150 parameters, while the random factor needs a much lower number of parameters. +We first create a graphic to show the development for each individual, separately for owls receiving corticosterone versus owls receiving a placebo (Figure 14.1).
+Figure 14.1: Total corticosterone before and at day 2 and 20 after implantation of a corticosterone or a placebo implant. Lines connect measurements of the same individual.
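+The code that produced Figure 14.1 is not shown in the source; a minimal base-R sketch using the variables listed by `str(dat)` above could look like this:
+par(mfrow = c(1, 2))
+for(trt in levels(dat$Implant)){              # one panel per implant type (C, P)
+  datt <- dat[dat$Implant == trt, ]
+  plot(as.numeric(datt$days), datt$totCort, xaxt = "n", las = 1,
+       xlab = "Days after implantation", ylab = "Total corticosterone", main = trt)
+  axis(1, at = 1:3, labels = levels(dat$days))
+  for(ind in unique(datt$Ring)){              # connect measurements of the same individual
+    dd <- datt[datt$Ring == ind, ]
+    lines(as.numeric(dd$days), dd$totCort)
+  }
+}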
+This is a collection of short introductions or links with commented R code that cover other topics that might be useful for ecologists.
+Bioacoustic analyses are nicely covered in a blog by Marcelo Araya-Salas.
+Like R, python is a high-level programming language that is used by many ecologists. The reticulate package provides a comprehensive set of tools for interoperability between Python and R.
+Up to now, we have dealt with models that assume normally distributed residuals. Sometimes the nature of the outcome variable makes it impossible to fulfill this assumption, as might occur with binary variables (e.g., alive/dead, a specific behavior occurred/did not occur), proportions (which are confined to be between 0 and 1), or counts that cannot have negative values. For such cases, models for distributions other than the normal distribution are needed; such models are called generalized linear models (GLM). They consist of three elements:
+The linear predictor is exactly the same as in normal linear models. It is a linear function that defines the relationship between the dependent and the explanatory variables. The link function transforms the expected values of the outcome variable into the range of the linear predictor, which ranges from \(-\infty\) to \(+\infty\). Or, perhaps more intuitively, the inverse link function transforms the values of the linear predictor into the range of the outcome variable. Table 14.1 gives a list of possible link functions that work with different data distributions. Then, a specific data distribution, for example, binomial or Poisson, is used to describe how the observations scatter around the expected values. A general model formula for a generalized linear model is:
+\[\bf y \sim ExpDist(\bf\hat y, \boldsymbol\theta)\]
+\[g(\bf\hat y) = \bf X\boldsymbol \beta \]
+where ExpDist is a distribution of the exponential family and \(g()\) is the link function. The vector \(\bf y\) contains the observed values of the outcome variable, \(\boldsymbol \beta\) contains the model parameters in the linear predictor (also called the model coefficients), and \(\bf X\) is the model matrix containing the values of the predictor variables. \(\boldsymbol \theta\) is an optional vector of additional parameters needed to define the data distribution (e.g., the number of trials in the binomial distribution or the variance in the normal distribution).
+The normal linear model is a specific case of a generalized linear model, namely when ExpDist equals the normal distribution and \(g()\) is the identity function (\(g(x) = x\)). Statistical distributions of the exponential family are normal, Bernoulli, binomial, Poisson, inverse-normal, gamma, negative binomial, among others. The normal, Bernoulli, binomial, Poisson or negative binomial distributions are by far the most often used distributions. Most, but not all, data we gather in the life sciences can be analyzed assuming one of these few distributions.
+Table: Table 14.1: Possible (x) and default (D) link functions for different data distributions (empty cells: link not available).
+
+link | Gaussian | Binomial | Gamma | Inv_Gauss | Poisson | Negative_binomial |
+:----------------|:---------|:---------|:------|:----------|:--------|:------------------|
+logit |  | D |  |  |  |  |
+probit |  | x |  |  |  |  |
+cloglog |  | x |  |  |  |  |
+identity | D |  | x |  | x | x |
+inverse |  |  | D |  |  |  |
+log |  |  | x |  | D | D |
+1/mu^2 |  |  |  | D |  |  |
+sqrt |  |  |  |  |  | x |
+cauchit |  | x |  |  | x |  |
+exponent (mu^a) | x |  |  |  |  |  |
+Paul Buerkner has implemented many different distributions and link functions in the package `brms`; see here.
+If the outcome variable can only take one of two values (e.g., a species is present or absent, or an individual survived or died; coded as 1 or 0), we use a Bernoulli model, also called logistic regression. The Bernoulli distribution only allows for the values zero and one, and it has only one parameter \(p\), which defines the probability that the value is 1.
+When fitting a Bernoulli model to data, we have to estimate \(p\). Often we are interested in correlations between \(p\) and one or several explanatory variables. Therefore, we model \(p\) as linearly dependent on the explanatory variables. Because the values of \(p\) are squeezed between 0 and 1 (because it is a probability), \(p\) is transformed by the link function before the linear relationship is modeled.
+\[g(p_i) = \bf X\boldsymbol \beta \]
+Functions that can transform a probability into the scale of the linear predictor (\(-\infty\) to \(+\infty\)) are, for example, logit, probit, cloglog, or cauchit. These link functions differ slightly in the way they link the outcome variable to the explanatory variables (Figure 14.1). The logit link function is the most often used link function in binomial models. However, sometimes another link function might fit the data better. Kevin S. Van Horn gives useful tips on when to use which link function.
+Figure 14.1: Left panel: Shape of different link functions commonly used for modelling probabilities. Right panel: The relationship between the predictor x (x-axis) and p on the scale of the link function (y-axis) is assumed to be linear.
+Functions to fit a Bernoulli model are `glm`, `stan_glm`, and `brm`; there are many more, but we do not know them as well as the three we focus on in this book.
+We start by using the function `glm`. It uses the "iteratively reweighted least-squares method", which is an adaptation of the least-squares (LS) method for fitting generalized linear models. The argument `family` allows us to choose a data distribution. For fitting a Bernoulli model, we need to specify `binomial`. That is because the Bernoulli distribution is equal to the binomial distribution with only one trial (size parameter = 1). Note that if we forget the `family` argument, we fit a normal linear model, and there is no warning by R!
+With the specification of the distribution, we also choose the link function. The default link function for the binomial or Bernoulli model is the logit function. To change the link function, use e.g. `family=binomial(link=cloglog)`.
+As an example, we use presence-absence data of little owls Athene noctua in nest boxes during the breeding season. The original data are published in Gottschalk, Ekschmitt, and Wolters (2011); here we use only parts of these data. The variable `PA` contains the presence of a little owl: 1 indicates a nestbox used by little owls, whereas 0 stands for an empty nestbox. The variable `elevation` contains the elevation in meters above sea level. We are interested in how the presence of the little owl is associated with elevation within the study area, that is, how the probability of presence changes with elevation. Our primary interest, therefore, is the slope \(\beta_1\) of the regression line.
+\[ y_i \sim Bernoulli(p_i) \]
+\[ logit(p_i) = \beta_0 + \beta_1 elevation\]
+where \(logit(p_i) = log(p_i/(1-p_i))\).
+data(anoctua) # Athene noctua data in the blmeco package
+mod <- glm(PA~elevation, data=anoctua, family=binomial)
+mod
##
+## Call: glm(formula = PA ~ elevation, family = binomial, data = anoctua)
+##
+## Coefficients:
+## (Intercept) elevation
+## 0.579449 -0.006106
+##
+## Degrees of Freedom: 360 Total (i.e. Null); 359 Residual
+## Null Deviance: 465.8
+## Residual Deviance: 445.6 AIC: 449.6
+As for the normal linear model, the Bernoulli model (and any other statistical model) assumes that the residuals are independent and identically distributed (iid). Independent means that every observation \(i\) is independent of the other observations. Particularly, there are no groups in the data and no temporal or spatial correlation.
+For generalised linear models, different types of residuals exist. The standard residual plots obtained by `plot(mod)` produce the same four plots as for an `lm` object, but they use the deviance residuals for the first three plots (residuals versus fitted values, QQ plot, and residual variance versus fitted values) and the Pearson residuals for the last (residuals versus leverage). The deviance residuals are the contribution of each observation to the deviance of the model. This is the default type when the residuals are extracted from the model using the function `resid`. The Pearson residual for observation \(i\) is the difference between the observed and the fitted number of successes divided by the standard deviation given the number of trials and the fitted success probability: \(\epsilon_i = \frac{y_i-n_i \hat{p_i}}{\sqrt{n_i \hat{p_i}(1-\hat{p_i})}}\). Other types of residuals are "working", "response", or "partial" (see Davison and Snell 1991). For the residual plots, R chooses the type of residuals so that each plot looks roughly like the analogous plot for the normal linear model. However, in most cases the plots look awkward due to the discreteness of the data, especially when success probabilities are close to 0 or 1. We recommend thinking about why they do not look perfect; with experience, serious violations of model assumptions can be recognized. Often, however, posterior predictive model checking or graphical comparison of fitted values to the data are better suited to assess model fit in GLMs.
+For Bernoulli models, the residual plots normally look quite awful because the residual distribution very often has two peaks, a negative and a positive one, resulting from the binary nature of the outcome variable. However, it is still good to have a look at these plots using `plot(mod)`. At least the average should be roughly around zero and not show a trend.
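+The different residual types can also be extracted explicitly, which is useful for customised residual plots (using the model object `mod` fitted above):
+r.dev  <- resid(mod, type = "deviance")   # default residual type for glm objects
+r.pear <- resid(mod, type = "pearson")    # Pearson residuals
+plot(fitted(mod), r.dev, xlab = "Fitted values", ylab = "Deviance residuals", las = 1)
+abline(h = 0, lty = 3)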
+An often more informative plot to judge model fit for a binary logistic regression is to compare the fitted values with the data. To better see the observations, we slightly jitter them in the vertical direction.
+If the model fitted the data well, the data would, on average, be equal to the fitted values. Thus, we add the \(y = x\) line to the plot using the `abline` function with intercept 0 and slope 1.
+Of course, binary data cannot lie on this line because they can only take on the two discrete values 0 or 1. However, the mean of the 0 and 1 values should lie on the line if the model fits well. Therefore, we calculate the mean for suitably selected classes of fitted values. In our example, we choose a class width of 0.1. Then, we calculate means per class and add these to the plot, together with a classical standard error that tells us how reliable the means are. This can be an indication whether our arbitrarily chosen class width is reasonable.
plot(fitted(mod), jitter(anoctua$PA, amount=0.05),
+ xlab="Fitted values", ylab="Probability of presence",
+ las=1, cex.lab=1.2, cex=0.8)
+abline(0,1, lty=3)
+
+t.breaks <- cut(fitted(mod), seq(0,1, by=0.1))
+means <- tapply(anoctua$PA, t.breaks, mean)
+semean <- function(x) sd(x)/sqrt(length(x))
+means.se <- tapply(anoctua$PA, t.breaks, semean)
+points(seq(0.05, 0.95, by=0.1), means, pch=16, col="orange")
+segments(seq(0.05, 0.95, by=0.1), means-2*means.se,
+seq(0.05, 0.95,by=0.1), means+2*means.se,lwd=2, col="orange")
+
+mod <- glm(PA ~ elevation + I(elevation^2) + I(elevation^3) +
+ I(elevation^4), data=anoctua, family=binomial)
+t.breaks <- cut(fitted(mod), seq(0,1, by=0.1))
+means <- tapply(anoctua$PA, t.breaks, mean)
+semean <- function(x) sd(x)/sqrt(length(x))
+means.se <- tapply(anoctua$PA, t.breaks, semean)
+points(seq(0.05, 0.95, by=0.1)+0.01, means, pch=16, col="lightblue", cex=0.7)
+segments(seq(0.05, 0.95, by=0.1)+0.01, means-2*means.se,
+seq(0.05, 0.95,by=0.1)+0.01, means+2*means.se,lwd=2, col="lightblue")
+Figure 14.2: Goodness of fit plot for the Bernoulli model fitted to little owl presence-absence data. Open circles = observed presence (1) or absence (0) jittered in the vertical direction; orange dots = mean (and 95% compatibility intervals given as vertical bars) of the observations within classes of width 0.1 along the x-axis. The dotted line indicates perfect coincidence between observations and fitted values. Larger orange points are from the model assuming a linear effect of elevation, whereas the smaller light blue points are from a model assuming a non-linear effect.
+The means of the observed data (orange dots) deviate from the \(y = x\) line, indicating a lack of fit (Figure 14.2). For low presence probabilities, the model overestimates presence probabilities whereas, for medium presence probabilities, the model underestimates presence probability. This indicates that the relationship between little owl presence and elevation may not be linear. After including polynomials up to the fourth degree, we obtained a reasonable fit (light blue dots in Figure 14.2).
+Further aspects of model fit that may be checked in Bernoulli models:
+- Are all observations independent?
+- May spatial or temporal correlation be an issue?
+- Are all parameters well informed by the data? Some parameters may not be identifiable due to complete separation, i.e. when there is no overlap between the 0's and 1's with respect to one of the predictor variables. In such cases `glm` may fail to fit the model. Bayesian methods (`stan_glm` or `brm`) do not fail, but the result may be highly influenced by the prior distributions, so a prior sensitivity analysis is recommended.
+Note that we do not have to worry about overdispersion when the outcome variable is binary, even though the variance of the Bernoulli distribution is defined by \(p\) and no separate variance parameter exists: because the data can only take the values 0 and 1, the data cannot show a higher variance than the one assumed by the Bernoulli distribution.
+When we are ready to report and visualise the results (i.e., after assessing the model fit, when we think the model describes the data-generating process reasonably well), we can simulate the posterior distribution of \(\beta_1\) and obtain the 95% compatibility interval.
+library(arm)
+nsim <- 5000
+bsim <- sim(mod, n.sim=nsim) # sim from package arm
+apply(bsim@coef, 2, quantile, prob=c(0.5, 0.025, 0.975))
## (Intercept) elevation I(elevation^2) I(elevation^3) I(elevation^4)
+## 50% -24.37645 0.3976119 -0.002205633 0.000004885047 -0.0000000038244246
+## 2.5% -35.23476 0.2011637 -0.003495176 0.000001400280 -0.0000000071488834
+## 97.5% -13.46421 0.5976708 -0.000933469 0.000008346174 -0.0000000004667167
+To interpret this polynomial function, an effect plot is helpful. To that end, and as we have done before, we calculate fitted values over the range of the covariate, together with compatibility intervals.
+newdat <- data.frame(elevation = seq(80,600,by=1))
+Xmat <- model.matrix(~elevation+I(elevation^2)+I(elevation^3)+
+ I(elevation^4), data=newdat) # the model matrix
+fitmat <- matrix(nrow=nrow(newdat), ncol=nsim)
+for(i in 1:nsim) fitmat[,i] <- plogis(Xmat %*% bsim@coef[i,])
+newdat$lwr <- apply(fitmat,1,quantile,probs=0.025)
+newdat$fit <- plogis(Xmat %*% coef(mod))
+newdat$upr <- apply(fitmat,1,quantile,probs=0.975)
+We now can plot the data together with the estimate and its compatibility interval. We, again, use the function `jitter` to slightly scatter the points along the y-axis to make overlaying points visible.
plot(anoctua$elevation, jitter(anoctua$PA, amount=0.05),
+ las=1, cex.lab=1.4, cex.axis=1.2, xlab="Elevation",
+ ylab="Probability of presence")
+lines(newdat$elevation, newdat$fit, lwd=2)
+lines(newdat$elevation, newdat$lwr, lty=3)
+lines(newdat$elevation, newdat$upr, lty=3)
+Figure 14.3: Little owl presence data versus elevation with regression line and 95% compatibility interval (dotted lines). Open circles = observed presence (1) or absence (0) jittered in the vertical direction.
+Binary data do not contain a lot of information. Therefore, large sample sizes are needed to obtain robust results.
+Often presence/absence data are obtained by visiting plots several times during a distinct period, for example, a breeding period, and then it is reported whether a species has been seen or not. If it has been seen and if there is no misidentification in the data, it is present; however, if it has not been seen, we are usually not sure whether we have not detected it or whether it is absent. In the case of repeated visits to the same plot, it is possible to estimate the detection probability using occupancy models (MacKenzie et al. 2002) or point count models (Royle 2004).
+Finally, logistic regression can be used in the sense of a discriminant function analysis that aims to find predictors that discriminate members of two groups (Anderson 1974). However, if one wants to use the fitted value from such an analysis to assign group membership of a new subject, one has to take the prevalence of the two groups in the data into account.
+The binomial model is used when the response variable is a count with an upper limit, e.g., the number of seeds that germinated among a total number of seeds in a pot, or the number of chicks hatching from the total number of eggs. Thus, we can use the binomial model whenever the response is the sum of a predefined number of Bernoulli trials. Whether a seed germinates or not is a Bernoulli trial. If we have more than one seed, the number of germinated seeds follows a binomial distribution.
+As an example, we use data from a study on the effects of anthropogenic fire regimes traditionally applied to savanna habitat in Gabon, Central Africa (Walters 2012). Young trees survive fires better or worse depending, among other factors, on the fuel load, which, in turn, depends heavily on the time since the last fire happened. Thus, plots were burned after different lengths of time since the previous fire (4, 9, or 12 months ago). Trees that resprouted after the previous (first) fire were counted before and after the experimental (second) fire to estimate their survival of the experimental fire depending on the time since the previous fire.
+The outcome variable is the number of surviving trees among the total number of trees per plot \(y_i\). The explanatory variable is the time since the previous fire, a factor with three levels: “4m”, “9m”, and “12m”. Assuming that the data follow a binomial distribution, the following model can be fitted to the data:
\[ y_i \sim binomial(p_i, n_i) \]
+\[ logit(p_i) = \beta_0 + \beta_1 I(treatment_i=9m) + \beta_2 I(treatment_i=12m)\]
+where \(p_i\) is the survival probability and \(n_i\) the total number of tree sprouts on plot \(i\). Note that \(n_i\) should not be confused with the sample size of the data set, i.e. the number of rows in the data table.
+We normally use `glm`, `stan_glm` or `brm` for fitting a binomial model, depending on the complexity of the predictors and the correlation structure. Here, we again start by using the `glm` function.
+A peculiarity of binomial models is that the outcome is not just one number; it is the number of trees still alive \(y_i\) out of \(n_i\) trees that were alive before the experimental fire. Therefore, the outcome variable has to be given as a matrix with two columns. The first column contains the number of successes (number of survivors \(y_i\)) and the second column contains the number of failures (number of trees killed by the fire, \(n_i - y_i\)). We build this matrix using `cbind` ("column bind").
data(resprouts) # example data from package blmeco
+resprouts$succ <- resprouts$post
+resprouts$fail <- resprouts$pre - resprouts$post
+
+mod <- glm(cbind(succ, fail) ~ treatment, data=resprouts,
+family=binomial)
+mod
##
+## Call: glm(formula = cbind(succ, fail) ~ treatment, family = binomial,
+## data = resprouts)
+##
+## Coefficients:
+## (Intercept) treatment9m treatment12m
+## -1.241 1.159 -2.300
+##
+## Degrees of Freedom: 40 Total (i.e. Null); 38 Residual
+## Null Deviance: 845.8
+## Residual Deviance: 395 AIC: 514.4
+Experienced readers will be alarmed because the residual deviance is much larger than the residual degrees of freedom, which indicates overdispersion. We will soon discuss overdispersion, but, for now, we continue with the analysis for the sake of illustration.
+The estimated model parameters are \(\hat{b_0} =\) -1.24, \(\hat{b_1} =\) 1.16, and \(\hat{b_2} =\) -2.3. These estimates tell us that tree survival was higher for the 9-month fire lag treatment compared to the 4-month treatment (which is the reference level), but lowest in the 12-month treatment. To obtain the mean survival probabilities per treatment, some math is needed because we have to back-transform the linear predictor to the scale of the outcome variable. The mean survival probability for the 4-month treatment is \(logit^{-1}(-1.24) = 0.22\), for the 9-month treatment it is \(logit^{-1}(-1.24 + 1.16) = 0.48\), and for the 12-month treatment it is \(logit^{-1}(-1.24 - 2.3) = 0.03\). The function `plogis` gives the inverse of the logit function and can be used to estimate the survival probabilities, for example:
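+plogis(coef(mod)[1] + coef(mod)[2]) # reconstructed call (not in the source): mean survival in the 9-month treatment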
## (Intercept)
+## 0.4795799
+The direct interpretation of the model coefficients \(\beta_1\) and \(\beta_2\) is that they are the log of the ratio of the odds of two treatment levels (i.e., the log odds ratio). The odds for treatment "4 months" are 0.22/(1-0.22) = 0.29 (calculated using non-rounded values), which is the estimated ratio of survived to killed trees in this treatment. For treatment "9 months", the odds are 0.48/(1-0.48) = 0.92, and the log odds ratio is log(0.92/0.29) = 1.16 = \(\beta_1\).
+The model output includes the null deviance and the residual deviance. Deviance is a measure of the difference between the data and a model. It corresponds to the sum of squares in the normal linear model. The smaller the residual deviance the better the model fits to the data. Adding a predictor reduces the deviance, even if the predictor does not have any relation to the outcome variable. The Akaike information criterion (AIC) value in the model output (last line) is a deviance measure that is penalized for the number of model parameters. It can be used for model comparison.
+The residual deviance is defined as minus two times the difference of the log-likelihoods of the saturated model and our model. The saturated model is a model that uses the observed proportion of successes as the success probability for each observation \(y_i \sim binomial(y_i/n_i, n_i)\). The saturated model has the highest possible likelihood (given the data set and the binomial model). This highest possible likelihood is compared to the likelihood of the model at hand,
+\(y_i \sim binomial(p_i, n_i)\) with \(p_i\) dependent on some predictor variables. The null deviance is minus two times the difference of the log-likelihoods of the saturated model, and a model that contains only one overall mean success probability, the null model \(y_i \sim binomial(p, n_i)\). The null deviance corresponds to the total sum of squares, that is, it is a measure of the total variance in the data.
In the standard residual plots, we see that in our example data there are obviously a number of influential points (especially the data points with row numbers 7, 20, and 26; Figure 14.4). The corresponding data points may be inspected for errors, or additional predictors may be identified that help to explain why these points are extreme (Are they close/far from the village? Were they grazed? etc.).
+Figure 14.4: The four standard residual plots obtained by using the plot function.
+For whatever reason, the variance in the data is larger than assumed by the binomial distribution. We detect this higher variance in the mean of the absolute values of the standardized residuals that is clearly larger than one (lower left panel in Figure 14.4). This is called overdispersion, which we mentioned earlier and deal with next.
+The variance of a binomial model is defined by \(n\) and \(p\), that is, there is no separate variance parameter. In our example \(p\) is fully defined by \(\beta_0\), \(\beta_1\), and \(\beta_2\): \(p_i = logit^{-1}(\beta_0 + \beta_1 I(treatment_i = 9m) + \beta_2 I(treatment_i = 12m))\), and \(n_i\) is part of the data. Similarly, in a Poisson model (which we will introduce in the next chapter) the variance is defined by the mean. Unfortunately, real data, as in our example, often show higher and sometimes lower variance than expected by a binomial (or a Poisson) distribution (Figure 14.5). When the variance in the data is higher than expected by the binomial (or the Poisson) distribution we have overdispersion. The uncertainties for the parameter estimates will be underestimated if we do not take overdispersion into account. Overdispersion is indicated when the residual deviance is substantially larger than the residual degrees of freedom. This always has to be checked in the output of a binomial or a Poisson model. In our example, the residual deviance is 10 times larger than the residual degrees of freedom, thus we have strong overdispersion.
+Figure 14.5: Histogram of a binomial distribution without overdispersion (orange) and one with the same total number of trials and average success probability, but with overdispersion (blue).
+What can we do when we have overdispersion? The best way to deal with overdispersion is to find the reason for it. Overdispersion is common in biological data because animals do not behave like random objects but their behavior is sensitive to many factors that we cannot always measure such as social relationships, weather, habitat, experience, and genetics. In most cases, overdispersion is caused by influential factors that were not included in the model. If we find them and can include them in the model (as fixed or as random variables) overdispersion may disappear. If we do not find such predictor variables, we have at least three options.
+Fit a quasibinomial or quasi-Poisson model by specifying “quasibinomial” or “quasipoisson” in the family argument.
+This will fit a binomial (or Poisson) model that estimates, in addition to the other model parameters, a dispersion parameter, \(u\), that is multiplied by the binomial or Poisson variance to obtain the residual variance: \(var(y_i) = u\, n_i p_i(1 - p_i)\), or \(var(y_i) = u\lambda_i\), respectively. This inflated variance is then used to obtain the standard errors of the parameter estimates. However, the quasi-distributions are unnatural distributions (there is no physical justification for them, such as the number of coin flips that are tails among a defined number of coin flips). Quasi-models do not differ from the binomial or the Poisson model in any parameter except that the variance is stretched so that it fits the variance in the data. We can see quasi-models as a kind of post hoc correction for overdispersion. Thus, it is better to draw inference from a quasi-model than from an overdispersed binomial or Poisson model. However, the point estimates may be highly influenced by a few extreme observations. Therefore, we prefer options that explicitly model the additional variance.
+Adding an observation-level random factor (i.e., a factor with the levels 1 to \(n\), the sample size) models the additional variance as a normal distribution (in the scale of the link function). Adding such an additional variance parameter to the model allows and accounts for extra variance in the data (Harrison 2014). To do that, we have to fit a generalized linear mixed model (GLMM) instead of a GLM.
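+As a minimal sketch of this second option (not the authors' code; the tree data resprouts with columns succ and pre are taken from the plotting code further below), an observation-level random factor could be added like this:
+library(lme4)
+resprouts$obsid <- factor(1:nrow(resprouts))   # one factor level per observation
+mod.olre <- glmer(cbind(succ, pre - succ) ~ treatment + (1|obsid),
+                  data = resprouts, family = binomial)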
+What do we have to do when the residual deviance is smaller than the residual degrees of freedom, that is, when we have “underdispersion”? Some statisticians do not bother about underdispersion, because, when the variance in the data is smaller than assumed by the model, uncertainty is overestimated. This means that conclusions will be conservative (i.e., on the “safe” side). However, we think that underdispersion should bother us as biologists (or other applied scientists). In most cases, underdispersion means that the variance in the data is smaller than expected by a random process, that is, the variance may be constrained by something. Thus, we should be interested in thinking about the factors that constrain the variance in the data. An example is the number of surviving young in some raptor species (e.g., in the lesser spotted eagle *Aquila pomarina*). Most of the time two eggs are laid, but the first-hatched young will usually kill the second (which was only a “backup” in case the first egg does not yield a healthy young). Because of this behavior, the number of survivors among the number of eggs laid shows much less variance than expected from \(n_i\) and \(p_i\), leading to underdispersion. Clutch size is another example of data that often produce underdispersion (but it is a Poisson rather than a binomial process, because there is no \(n_i\)).
+Sometimes, apparent under- or overdispersion is caused by more zeros in the data than assumed by the binomial or Poisson model. For example, the number of black stork *Ciconia nigra* nestlings that survived the nestling phase is very often 0, because the whole nest was depredated or fell from the tree (black storks nest in trees). If the nest survives, the number of survivors varies between 0 and 5 depending on other factors such as food availability or weather conditions. A histogram of these data shows a bimodal distribution with one peak at 0 and another peak around 2.5. It looks like a Poisson distribution, but with a lot of additional 0 values. This is called zero-inflation. Zero-inflation is often the result of two different processes being involved in producing the data. The process that determines whether a nest survives differs from the process that determines how many nestlings survive, given that the nest survives. When we analyze such data using a model that assumes only one single process, it will be very hard to understand the system and the results are likely to be biased because the distributional assumptions are violated. In such cases, we will be more successful when our model explicitly models the two different processes. Such models are zero-inflated binomial or zero-inflated Poisson models.
+We normally check whether zero-inflation may be an issue by posterior predictive model checking. If we find zero-inflation in binomial data, we try using a zero-inflated binomial model as provided by Paul Buerkner in the package brms.
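+A minimal sketch of such a call (with hypothetical variable names y, n, x and a hypothetical data frame d; not taken from the original text):
+library(brms)
+mod.zib <- brm(y | trials(n) ~ x, data = d,
+               family = zero_inflated_binomial())  # zero-inflated binomial model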
+For the moment, we use the quasibinomial GLM to analyze the tree sprout data.
+We simulate 2000 values from the joint posterior distribution of the model parameters.
For each set of simulated model parameters, we derive the linear predictor by multiplying the model matrix with the corresponding set of model parameters.
+Then, the inverse logit function (\(logit^{-1}(x) = \frac{e^x}{(1+ e^x)}\); R function plogis
) is used to obtain the fitted value for each fire lag treatment. Lastly, we extract, for each treatment level, the 2.5% and 97.5% quantile of the posterior distribution of the fitted values and plot it together with the estimates (the fitted values) per treatment and the raw data.
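+The model fit and the simulation step themselves are not shown in this extract; a hedged sketch (the column names succ, for surviving trees, and pre, for trees present before the fire, are taken from the plotting code below) would be:
+library(arm)   # provides sim()
+mod <- glm(cbind(succ, pre - succ) ~ treatment,
+           data = resprouts, family = quasibinomial)
+nsim <- 2000
+bsim <- sim(mod, n.sim = nsim)  # simulations from the joint posterior of the model parameters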
newdat <- data.frame(treatment=factor(c("4m","9m","12m"),levels=c("4m","9m","12m")))
+Xmat <- model.matrix(~treatment, newdat)
+fitmat <- matrix(nrow=nrow(newdat), ncol=nsim)
+for(i in 1:nsim) fitmat[,i] <- plogis(Xmat %*% bsim@coef[i,])
+newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025)
+newdat$upr <- apply(fitmat, 1, quantile, prob=0.975)
+newdat$fit <- plogis(Xmat%*%coef(mod))
+
+
+newdat$lag <- c(4,9,12) # used for plotting
+resprouts$lag <- c(4,9,12)[match(resprouts$treatment,c("4m","9m","12m"))] # used for plotting
+
+plot(newdat$lag, newdat$fit, type="n", xlab="Fire lag [months]", ylab="Tree survival",
+ las=1, cex.lab=1.4, cex.axis=1, xaxt="n", xlim=c(0, 13), ylim=c(0,0.6))
+axis(1, at=c(0,4,9,12), labels=c("0","4","9","12"))
+segments(newdat$lag, newdat$lwr, newdat$lag, newdat$upr, lwd=2)
+points(newdat$lag, newdat$fit, pch=21, bg="gray")
+
+points(resprouts$lag+0.3,resprouts$succ/resprouts$pre, cex=0.7) # adds the raw data to the plot
+Figure 14.6: Proportion of surviving trees (circles) for three fire lag treatments with estimated mean proportion of survivors using a quasi-binomial model. Gray dots = fitted values. Vertical bars = 95% compatibility intervals.
+The Poisson distribution is a discrete probability distribution that naturally describes the distribution of count data. If we know how many times something happened, but we do not know how many times it did not happen (in contrast to the binomial model, where we know the number of trials), such counts usually follow a Poisson distribution. Count data are positive integers ranging from 0 to \(+\infty\).
+A Poisson distribution is positive-skewed (long tail to the right) if the mean \(\lambda\) is small and it approximates a normal distribution for large \(\lambda\). The Poisson distribution constitutes the stochastic part of a Poisson model. The deterministic part describes how \(\lambda\) is related to predictors. \(\lambda\) can only take on positive values. Therefore, we need a link function that transforms \(\lambda\) into the scale of the linear predictor (or, alternatively, an inverse link function that transforms the value from the linear predictor to nonnegative values). The most often used link function is the natural logarithm (log-link function).
+This link function transforms all \(\lambda\)-values between 0 and 1 to the interval \(-\infty\) to 0, and all \(\lambda\)-values higher than 1 are projected into the interval 0 to \(+\infty\). Sometimes, the identity link function is used instead of the log-link function, particularly when the predictor variable only contains positive values and the effect of the predictor is additive rather than multiplicative, that is, when a change in the predictor produces an addition of a specific value in the outcome rather than a multiplication by a specific value. Further, the cauchit function can also be used as a link function for +Poisson models.
+The same R functions that fit binomial models also fit Poisson models. As an example, we fit a Poisson model with log-link function to a simulated data set containing the number of (virtual) aphids on a square centimeter (\(y\)) and a numeric predictor variable representing, for example, an aridity index (\(x\)). Real ecological data without overdispersion or zero-inflation and with no random structure are rather rare. Therefore, we illustrate this model, which is the basis for more complex models, with simulated data. The model is:
+\[y_i \sim Poisson(\lambda_i)\]
+\[log(\lambda_i) = \bf X_i \boldsymbol \beta\]
+We use, similar to the R function log
, the notation \(log\) for the natural logarithm. We fit the model in R using the function glm
and use the argument “family” to specify that we assume a Poisson distribution as the error distribution.
+The log-link is used as the default link function. Then we add the regression line to the plot using the function curve
. Further, we add the compatibility interval to the plot (of course only after having checked the model assumptions).
set.seed(196855)
+n <- 50 # simulate 50 sampling sites, where we count aphids
+x <- rnorm(n) # the number of aphids depends, among others, on the aridity index x
+b0 <- 1 # intercept and
+b1 <- 0.5 # slope of the linear predictor
+y <- rpois(n, lambda=exp(b0+b1*x))
+
+
+mod <- glm(y~x, family="poisson")
+
+n.sim <- 2000
+bsim <- sim(mod, n.sim=n.sim)
+
+par(mar=c(4,4,1,1))
+plot(x,y, pch=16, las=1, cex.lab=1.4, cex.axis=1.2)
+curve(exp(coef(mod)[1] + coef(mod)[2]*x), add=TRUE, lwd=2)
+newdat <- data.frame(x=seq(-3, 2.5, length=100))
+Xmat <- model.matrix(~x, data=newdat)
+b <- coef(mod)
+newdat$fit <- exp(Xmat%*%b)
+fitmat <- matrix(ncol=n.sim, nrow=nrow(newdat))
+for(i in 1:n.sim) fitmat[,i] <- exp(Xmat%*%bsim@coef[i,])
+newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025)
+newdat$upr <- apply(fitmat, 1, quantile, prob=0.975)
+lines(newdat$x, newdat$fit, lwd=2)
+lines(newdat$x, newdat$lwr, lty=3)
+lines(newdat$x, newdat$upr, lty=3)
+Figure 14.7: Simulated data (dots) with a Poisson regression line (solid) and the lower and upper bound of the 95% compatibility interval.
+Because the residual variance in the Poisson model is defined by \(\lambda\) (the fitted value), it is not estimated as a separate parameter from the data. Therefore, we always have to check whether overdispersion is present. Ecological data are often overdispersed because not all influencing factors can be measured and included in the model. As with the binomial model, in a Poisson model overdispersion is present when the residual deviance is larger than the residual degrees of freedom. This is because if we add one independent observation to the data, the deviance increases, on average, by one if the variance equals \(\lambda\). If the variance is larger, the contribution of each observation to the deviance is, on average, larger than one. We can check this in the model output:
+ +##
+## Call: glm(formula = y ~ x, family = "poisson")
+##
+## Coefficients:
+## (Intercept) x
+## 1.1329 0.4574
+##
+## Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
+## Null Deviance: 85.35
+## Residual Deviance: 52.12 AIC: 198.9
+The residual deviance is 52 compared to 48 degrees of freedom. This is perfect (of course, because the model is fit to simulated data). If we are not sure, we could do a posterior predictive model checking and compare the variance in the data with the variance in data that were simulated from the model. If there is substantial overdispersion, we could fit a quasi-Poisson model that includes a dispersion parameter. However, as explained previously, we prefer to explicitly model the variance. A good alternative for overdispersed count data that we now like very much (in contrast to what we wrote in the first printed edition of this book) is the negative binomial model.
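+As a minimal sketch of such a posterior predictive check (not the code used in the book), one could compare the variance of the observed counts with the variances of replicated data sets simulated from the fitted model, using the objects x, y, n.sim and bsim created above:
+Xobs <- model.matrix(~x)                               # model matrix of the observed data
+varrep <- numeric(n.sim)
+for(i in 1:n.sim){
+  lambda.i <- as.numeric(exp(Xobs %*% bsim@coef[i,]))  # fitted means for draw i
+  varrep[i] <- var(rpois(length(y), lambda.i))         # variance of one replicated data set
+}
+mean(varrep >= var(y))  # proportions close to 0 or 1 would indicate misfit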
+The standard residual plots (Figure 14.8) are obtained in the usual way.
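+In R, “the usual way” means calling plot on the fitted model object, as for the binomial model (Figure 14.4):
+par(mfrow=c(2,2))
+plot(mod)  # the four standard residual plots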
+Figure 14.8: Standard residual plots for the Poisson model fitted to simulated data, thus they fit perfectly.
+Of course, again, they look perfect because we used simulated data. In a Poisson model, as for the binomial model, it is easier to detect lack of model fit using posterior predictive model checking. For example, data could be simulated from the model and the proportion of 0 values in the simulated data could be compared to the proportion of 0 values in the observations to assess whether zero-inflation is present or not.
+We can look at the posterior distributions of the model parameters.
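+A hedged example of how such quantiles can be extracted from the simulated coefficients stored in bsim:
+apply(bsim@coef, 2, quantile, prob=c(0.5, 0.025, 0.975))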
+ +## (Intercept) x
+## 50% 1.1370430 0.4569485
+## 2.5% 0.9692098 0.2974446
+## 97.5% 1.3000149 0.6143244
+The 95% compatibility interval of \(\beta_1\) is 0.3-0.6. Given that an effect of 0.2 or larger on the aridity scale would be considered biologically relevant, we can be quite confident that aridity has a relevant effect on aphid abundance given our data and our model. With the simulations from the posterior distributions of the model parameters (stored in the object bsim
) we obtained samples of the posterior distributions of fitted values for each of 100 x-values along the x-axis and we have drawn the 95% compatibility interval of the regression line in Figure 14.7.
Many count data are measured in relation to a reference, such as an area or a time period or a population. For example, when we count animals on plots of different sizes, the most important predictor variable will likely be the size of the plot. Or, in other words, the absolute counts do not make much sense when they are not corrected for plot size: the relevant measure is animal density. Similarly, when we count how many times a specific behavior occurs and we follow the focal animals during time periods of different lengths, then the interest is in the rate of occurrence rather than in the absolute number counted. One way to analyze rates and densities is to divide the counts by the reference value and assume that this rate (or a transformation thereof) is normally distributed. However, it is usually hard to obtain normally distributed residuals using rates or densities as dependent variables.
+A more natural approach to describe rates and densities is to use a Poisson model that takes the reference into account within the model. This is called an offset. To do so, \(\lambda\) is multiplied by the reference \(T\) (e.g., time interval, area, population). Therefore, \(log(T)\) has to be added to the linear predictor. Adding \(log(T)\) to the linear predictor is like adding a new predictor variable (the log of \(T\)) to the model with its model parameter (the slope) fixed to 1. The term “offset” says that we add a predictor but do not estimate its effect because it is fixed to 1.
+\[y_i \sim Poisson(\lambda_i T_i)\]
+\[log(\boldsymbol \lambda \boldsymbol T) = log(\boldsymbol \lambda) + log(\boldsymbol T) = \boldsymbol X \boldsymbol \beta + log(\boldsymbol T)\]
+In R, we can use the argument “offset” within the function glm
to specify an offset. We illustrate this using a breeding bird census on wildflower fields in Switzerland in 2007 conducted by Zollinger et al. (2013). We focus on the common whitethroat *Sylvia communis*, a bird of field margins and fallow lands that has become rare in the intensively used agricultural landscape. Wildflower fields are an ecological compensation measure to provide food and nesting grounds for species such as the common whitethroat. Such fields are sown and then left unmanaged for several years except for the control of potentially problematic species (e.g., some thistle species, Carduus spp.). The plant composition and the vegetation structure in the field gradually changes over the years, hence the interest in this study was to determine the optimal age of a wildflower field for use by the common whitethroat.
We use the number of breeding pairs (bp) as the outcome variable and field size as an offset, which means that we model breeding pair density. We include the age of the field (age) as a linear and quadratic term because we expect there to be an optimal age of the field (i.e., a curvilinear relationship between the breeding pair density and age). We also include field size as a covariate (in addition to using it as the offset) because the size of the field may have an effect on the density; for example, small fields may have a higher density if the whitethroat can also use surrounding areas but uses the field to breed. Size (in hectares) was z-transformed before the model fit.
+data(wildflowerfields) # in the package blmeco
+dat <- wildflowerfields[wildflowerfields$year==2007,] # select data
+dat$size.ha <- dat$size/100 # change unit to ha
+dat$size.ha.z <- scale(dat$size.ha)
+mod <- glm(bp ~ age + I(age^2) + size.ha.z, offset=log(size.ha),
+data=dat, family=poisson)
+mod
##
+## Call: glm(formula = bp ~ age + I(age^2) + size.ha.z, family = poisson,
+## data = dat, offset = log(size.ha))
+##
+## Coefficients:
+## (Intercept) age I(age^2) size.ha.z
+## -4.2294 1.5241 -0.1408 -0.5397
+##
+## Degrees of Freedom: 40 Total (i.e. Null); 37 Residual
+## Null Deviance: 48.5
+## Residual Deviance: 27.75 AIC: 70.2
+For the residual analysis and for drawing conclusions, we can proceed in the same way we did in the Poisson model. From the model output we see that the residual deviance is smaller than the corresponding degrees of freedom, thus we have some degree of underdispersion. But the degree of underdispersion is not very extreme, so we accept that the compatibility intervals will be a bit larger than “necessary” and proceed in this case. After the residual analyses, we can produce an effect plot of the estimated whitethroat density against the age of the wildflower field (Figure 14.9). And we see that the expected whitethroat density is largest on wildflower fields of age 4 to 7 years.
+n.sim <- 5000
+bsim <- sim(mod, n.sim=n.sim)
+apply(bsim@coef, 2, quantile, prob=c(0.025,0.5,0.975))
## (Intercept) age I(age^2) size.ha.z
+## 2.5% -7.006715 0.3158791 -0.26708865 -1.14757192
+## 50% -4.196504 1.5118620 -0.14034083 -0.54749587
+## 97.5% -1.445242 2.7196036 -0.01837473 0.02976658
+par(mar=c(4,4,1,1))
+plot(jitter(dat$age,amount=0.1),jitter(dat$bp/dat$size.ha,amount=0.1), pch=16, las=1, cex.lab=1.2, cex.axis=1, cex=0.7,
+ xlab="Age of wildflower field [yrs]", ylab="Density of Whitethroat [bp/ha]")
+
+# add credible/compatibility interval
+newdat <- data.frame(age=seq(1, 9, length=100), size.ha.z=0)
+Xmat <- model.matrix(~age + I(age^2) + size.ha.z, data=newdat)
+b <- coef(mod)
+newdat$fit <- exp(Xmat%*%b)
+fitmat <- matrix(ncol=n.sim, nrow=nrow(newdat))
+for(i in 1:n.sim) fitmat[,i] <- exp(Xmat%*%bsim@coef[i,])
+newdat$lwr <- apply(fitmat, 1, quantile, prob=0.025)
+newdat$upr <- apply(fitmat, 1, quantile, prob=0.975)
+lines(newdat$age, newdat$fit, lwd=2)
+lines(newdat$age, newdat$lwr, lty=3)
+lines(newdat$age, newdat$upr, lty=3)
+Figure 14.9: Whitethroat densities are highest in wildflower fields that are around 4 to 6 years old. Dots are the raw data, the bold line gives the fitted values (with the 95% compatibility interval given by the dotted lines) for wildflower fields of different ages (years). The fitted values are given for an average field size of 1.4 ha.
+In chapter 13 on linear mixed effect models we introduced how to analyze metric outcome variables for which a normal error distribution can be assumed (potentially after transformation), when the data have a hierarchical structure and, as a consequence, observations are not independent. In chapter 14 on generalized linear models we introduced how to analyze outcome variables for which a normal error distribution cannot be assumed, such as binary outcomes or count data. More precisely, we extended modelling outcomes with normal error to modelling outcomes with error distributions from the exponential family (e.g., binomial or Poisson). Generalized linear mixed models (GLMM) combine the two complexities and are used to analyze outcomes with a non-normal error distribution when the data have a hierarchical structure. In this chapter, we will show how to analyze such data. Remember, a hierarchical structure of the data means that the data are collected at different levels, for example smaller and larger spatial units, or include repeated measurements in time on a specific subject. Typically, the outcome variable is measured or observed at the lowest level, but other variables may be measured at different levels. A first example is introduced in the next section.
+To illustrate the binomial mixed model we use a subset of a data set used by Grüebler, Korner-Nievergelt, and Von Hirschheydt (2010) on barn swallow Hirundo rustica nestling survival (we selected a nonrandom sample to be able to fit a simple model; hence, the results do not add unbiased knowledge about the swallow biology!). For 63 swallow broods, we know the clutch size and the number of nestlings that fledged. The broods came from 51 farms (larger unit), thus some of the farms had more than one brood. Note that each farm can harbor one or several broods, and the broods are nested within farms (as opposed to crossed, see chapter 13), i.e., each brood belongs to only one farm. There are three predictors measured at the level of the farm: colony size (the number of swallow broods on that farm), cow (whether there are cows on the farm or not), and dung heap (the number of dung heaps, piles of cow dung, within 500 m of the farm). The aim was to assess how swallows profit from insects that are attracted by livestock on the farm and by dung heaps. Broods from the same farm are not independent of each other because they belong to the same larger unit (farm), and thus share the characteristics of the farm (measured or unmeasured). Predictor variables were measured at the level of the farm, and are thus the same for all broods from a farm. In the model described and fitted below, we account for the non-independence of these clutches by including a random intercept per farm to model random variation between farms. The outcome variable is a proportion (proportion fledged from clutch) and thus consists of two values for each observation, as seen with the binomial model without random factors (Section 14.2.2): the number of chicks that fledged (successes) and the number of chicks that died (failures), i.e., the clutch size minus the number that fledged.
+The random factor “farm” adds a farm-specific deviation \(b_g\) to the intercept in the linear predictor. These deviations are modeled as normally distributed with mean \(0\) and standard deviation \(\sigma_g\).
+\[
+y_i \sim binomial\left(p_i, n_i\right)\\
+logit\left(p_i\right) = \beta_0 + b_{g[i]} + \beta_1\;colonysize_i + \beta_2\;I\left(cow_i = 1\right) + \beta_3\;dungheap_i\\
+b_g \sim normal\left(0, \sigma_g\right)
+\]
+ +# Data on Barn Swallow (Hirundo rustica) nestling survival on farms
+# (a part of the data published in Grüebler et al. 2010, J Appl Ecol 47:1340-1347)
+
+
+library(blmeco)
+data(swallowfarms)
+#?swallowfarms # to see the documentation of the data set
+dat <- swallowfarms
+str(dat)
## 'data.frame': 63 obs. of 6 variables:
+## $ farm : int 1001 1002 1002 1002 1004 1008 1008 1008 1010 1016 ...
+## $ colsize: int 1 4 4 4 1 11 11 11 3 3 ...
+## $ cow : int 1 1 1 1 1 1 1 1 0 1 ...
+## $ dung : int 0 0 0 0 1 1 1 1 2 2 ...
+## $ clutch : int 8 9 8 7 13 7 9 16 10 8 ...
+## $ fledge : int 8 0 6 5 9 3 7 4 9 8 ...
+
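+# (hedged) the call producing the next output line is not shown in this extract;
+# the number of farms can be obtained, for example, with:
+length(unique(dat$farm))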
+## [1] 51
+dat$colsize.z <- scale(dat$colsize) # z-transform values for better model convergence
+dat$dung.z <- scale(dat$dung)
+dat$die <- dat$clutch - dat$fledge
+dat$farm.f <- factor(dat$farm) # for clarity we define farm as a factor
The glmer
function uses the standard way to formulate a statistical model in R, with the outcome on the left, followed by the ~
symbol, meaning “explained by”, followed by the predictors, which are separated by +
. The notation for the random factor with only a random intercept was introduced in chapter 13 and is (1|farm.f)
here.
Remember that for fitting a binomial model we have to provide the number of successful events (number of fledglings that survived) and the number of failures (those that died) within a two-column matrix that we create using the function cbind
.
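+The call that fits this model is not shown in the extracted text; a sketch consistent with the model formula reported in the summary output further below would be:
+library(lme4)
+mod.glmer <- glmer(cbind(fledge, die) ~ colsize.z + cow + dung.z + (1|farm.f),
+                   data=dat, family=binomial)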
The residuals of the model look fairly normal (top left panel of Figure 15.1), with slightly wider tails. The random intercepts for the farms look perfectly normal, as they should. The plot of the residuals vs. fitted values (bottom left panel) shows a slight increase in the residuals with increasing fitted values. Positive correlations between the residuals and the fitted values are common in mixed models due to the shrinkage effect (chapter 13). For the same reason, the fitted proportions slightly overestimate the observed proportions when these are large, but underestimate them when they are small (bottom right panel). What looks like a lack of fit here can be seen as preventing an overestimation of the among-farm variance, based on the assumption that the farms in the data are a random sample of farms belonging to the same population.
+The mean of the random effects is close to zero, as it should be. We check this because sometimes the glmer
function fails to correctly separate the farm-specific intercepts from the overall intercept. A non-zero mean of random effects does not mean a lack of fit, but a failure of the model fitting algorithm. In such a case, we recommend using a different fitting algorithm, e.g. brm
(see below) or stan_glmer
from the rstanarm
package.
A slight overdispersion (approximate dispersion parameter > 1) seems to be present, but it is nothing to worry about.
+par(mfrow=c(2,2), mar=c(3,5,1,1))
+# check normal distribution of residuals
+qqnorm(resid(mod.glmer), main="qq-plot residuals")
+qqline(resid(mod.glmer))
+
+# check normal distribution of random intercepts
+qqnorm(ranef(mod.glmer)$farm.f[,1], main="qq-plot, farm")
+qqline(ranef(mod.glmer)$farm.f[,1])
+
+# residuals vs fitted values to check homoscedasticity
+plot(fitted(mod.glmer), resid(mod.glmer))
+abline(h=0)
+
+# plot data vs. predicted values
+dat$fitted <- fitted(mod.glmer)
+plot(dat$fitted,dat$fledge/dat$clutch)
+abline(0,1)
+Figure 15.1: Diagnostic plots to assess model assumptions for mod.glmer. Upper left: quantile-quantile plot of the residuals vs. theoretical quantiles of the normal distribution. Upper right: quantile-quantile plot of the random effects “farm”. Lower left: residuals vs. fitted values. Lower right: observed vs. fitted values.
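+The two values below are the mean of the farm random effects and an approximate dispersion parameter; hedged calls that yield such values (assuming the helper dispersion_glmer from the package blmeco) are:
+mean(ranef(mod.glmer)$farm.f[,1]) # mean of the random intercepts, should be close to zero
+dispersion_glmer(mod.glmer)       # approximate dispersion parameter, should be close to one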
+## [1] -0.001690303
+
+## [1] 1.192931
+
+Now we fit the same model using the function brm
from the R package brms
. This function allows fitting Bayesian generalized (non-)linear multivariate multilevel models using Stan (Betancourt 2013) for full Bayesian inference. We shortly introduce the fitting algorithm used by Stan, Hamiltonian Monte Carlo, in chapter 18. When using the function brm
there is no need to install rstan
or write the model in Stan-language. A wide range of distributions and link functions are supported, and the function offers many things more. Here we use it to fit the model as specified by the formula object above.
+Note that brm requires that a binomial outcome is specified in the format successes|trials()
, which is the number of fledged nestlings out of the total clutch size in our case. In contrast, the glmer
+function requires specifying the number of nestlings that fledged and the number that died (which together sum up to clutch size), in the format cbind(successes, failures)
.
+The family is also called binomial
in brm
, but would be bernoulli
for a binary outcome, whereas glmer
would use binomial in both situations (Bernoulli distribution is a special case of the binomial). However, it is slightly confusing that (at the time of writing this chapter) the documentation for brmsfamily
did not mention the binomial family under Usage, where it probably went missing, but it is mentioned under Arguments for the argument family.
+Prior distributions are an integral part of a Bayesian model, therefore we need to specify prior distributions. We can see what default prior distributions brm
is using by applying the get_prior
+function to the model formula. The default prior for the effect sizes is a flat prior which gives a density of 1 for any value between minus and plus infinity. Because this is not a proper probability distribution it is also called an improper distribution. The intercept gets a t-distribution with mean of 0, standard deviation of 2.5 and 3 degrees of freedom. Transformed to the proportion scale (using the inverse-logit function), this t-distribution becomes something similar to a uniform distribution between 0 and 1 that can be seen as non-informative for a probability. For the among-farm standard deviation, it uses the same t-distribution as for the intercept. However, because variance parameters such as standard deviations can only take on positive numbers, it will use only the positive half of the t-distribution (this is not seen in the output of get_prior
). When we have no prior information on any parameter, or if we would like to base the results solely on the information in the data, we specify weakly informative prior distributions that do not noticeably affect the results but they will facilitate the fitting algorithm. This is true for the priors of the intercept and among-farm standard deviation. However, for the effect sizes, we prefer specifying more narrow distributions (see chapter 10). To do so, we use the function prior
.
+
To apply MCMC sampling we need some more arguments: warmup
specifies the number of iterations during which we allow the algorithm to be adapted to our specific model and to converge to the posterior distribution. These iterations should be discarded (similar to the burn-in period when using, e.g., Gibbs sampling); iter
specifies the total number of iterations (including those discarded); chains
specifies the number of chains; init
specifies the starting values of the iterations. By default (init=NULL
) or by setting init="random"
the initial values are randomly chosen which is recommended because then different initial values are chosen for each chain which helps to identify non-convergence. However, sometimes random initial values cause the Markov chains to behave badly.
+
+Then you can either use the maximum likelihood estimates of the parameters as starting values, or simply ask the algorithm to start with zeros. thin
specifies the thinning of the chain, i.e., whether all iterations should be kept (thin=1) or for example every 4th only (thin=4); cores
specifies the number of cores used for the algorithm; seed
specifies the random seed, allowing for replication of results.
library(brms)
+
+# check which parameters need a prior
+get_prior(fledge|trials(clutch) ~ colsize.z + cow + dung.z + (1|farm.f),
+ data=dat,
+ family=binomial(link="logit"))
## prior class coef group resp dpar nlpar lb ub
+## (flat) b
+## (flat) b colsize.z
+## (flat) b cow
+## (flat) b dung.z
+## student_t(3, 0, 2.5) Intercept
+## student_t(3, 0, 2.5) sd 0
+## student_t(3, 0, 2.5) sd farm.f 0
+## student_t(3, 0, 2.5) sd Intercept farm.f 0
+## source
+## default
+## (vectorized)
+## (vectorized)
+## (vectorized)
+## default
+## default
+## (vectorized)
+## (vectorized)
+# specify own priors
+myprior <- prior(normal(0,5), class="b")
+
+
+mod.brm <- brm(fledge|trials(clutch) ~ colsize.z + cow + dung.z + (1|farm.f) ,
+ data=dat, family=binomial(link="logit"),
+ prior=myprior,
+ warmup = 500,
+ iter = 2000,
+ chains = 2,
+ init = "random",
+ cores = 2,
+ seed = 123)
+# note: thin=1 is default and we did not change this here.
We first check whether we find warnings in the R console about problems of the fitting algorithm. Warnings should be taken seriously. Often, we find help in the Stan online documentation (or when typing launch_shinystan(mod.brm)
into the R-console) what to change when calling the brm
function to get a fit that is running smoothly. Once, we get rid of all warnings, we need to check how well the Markov chains mixed. We can either do that by scanning through the many diagnostic plots given by launch_shinystan(mod)
or create the most important plots ourselves such as the traceplot (Figure 15.2).
+Figure 15.2: Traceplot of the Markov chains. After convergence, both Markov chains should sample from the same stationary distribution. Indications of non-convergence would be if the two chains diverge or vary around different means.
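+A traceplot like the one in Figure 15.2 can be drawn, for example, with the mcmc_plot function from brms (one option among several):
+mcmc_plot(mod.brm, type="trace")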
+To assess how well the model fits the data we do posterior predictive model checking (Chapter 16). For binomial as well as for Poisson models, comparing the standard deviation of the data with those of replicated data from the model is particularly important. If the standard deviation of the real data were much higher than that of the replicated data from the model, overdispersion would be an issue. However, here, the model is able to capture the variance in the data correctly (Figure 15.3). The fitted vs. observed plot also shows a good fit.
+yrep <- posterior_predict(mod.brm)
+sdyrep <- apply(yrep, 1, sd)
+
+par(mfrow=c(1,3), mar=c(3,4,1,1))
+hist(yrep, freq=FALSE, main=NA, xlab="Number of fledglings")
+hist(dat$fledge, add=TRUE, col=rgb(1,0,0,0.3), freq=FALSE)
+legend(10, 0.15, fill=c("grey",rgb(1,0,0,0.3)), legend=c("yrep", "y"))
+
+hist(sdyrep)
+abline(v=sd(dat$fledge), col="red", lwd=2)
+
+plot(fitted(mod.brm)[,1], dat$fledge, pch=16, cex=0.6)
+abline(0,1)
+Figure 15.3: Posterior predictive model checking: Histogram of the number of fledglings simulated from the model together with a histogram of the real data, and a histogram of the standard deviations of replicated data from the model together with the standard deviation of the data (vertical line in red). The third plot gives the fitted vs. observed values.
+After checking the diagnostic plots, the posterior predictive model checking and the general model fit, we assume that the model describes the data generating process reasonably well, so that we can proceed to drawing conclusions.
+The generic summary
function gives us the results for the model object containing the fitted model, and works for both the model fitted with glmer
and brm
. Let’s start having a look at the summary from mod.glmer
.
+The summary provides the fitting method, the model formula, statistics for the model fit including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the scaled residuals, the random effects variance and information about observations and groups, a table with coefficient estimates for the fixed effects (with standard errors and a z-test for the coefficient) and correlations between fixed effects. We recommend to always check if the number of observations and groups, i.e., 63 barn swallow nests from 51 farms here, is correct. This information shows if the glmer
function has correctly recognized the hierarchical structure in the data. Here, this is correct. To assess the associations between the predictor variables and the outcome analyzed, we need to look at the column “Estimate” in the table of fixed effects. This column contains the estimated model coefficients, and the standard error for these estimates is given in the column “Std. Error”, along with a z-test for the null hypothesis of a coefficient of zero.
+In the random effects table, the among farm variance and standard deviation (square root of the variance) are given.
+The function confint
shows the 95% confidence intervals for the random effects (.sig01
) and fixed effects estimates.
In the summary
output from mod.brm
+we see the model formula and some information on the Markov chains after the warm-up. In the group-level effects (between-group standard deviations) and population-level effects (effect sizes, model coefficients) tables some summary statistics of the posterior distribution of each parameter are given. The “Estimate” is the mean of the posterior distribution, the “Est.Error” is the standard deviation of the posterior distribution (which is the standard error of the parameter estimate). Then we see the lower and upper limit of the 95% credible interval. Also, some statistics for measuring how well the Markov chains converged are given: the “Rhat” and the effective sample size (ESS). The bulk ESS tells us how many independent samples we have to describe the posterior distribution, and the tail ESS tells us how many samples the limits of the 95% credible interval are based on.
Because we used the logit link function, the coefficients are actually on the logit scale and are a bit difficult to interpret. What we can say is that positive coefficients indicate an increase and negative coefficients indicate a decrease in the proportion of nestlings fledged. For continuous predictors, such as colsize.z and dung.z, the coefficient refers to the change in the logit of the outcome with a change of one in the predictor (e.g., for colsize.z an increase of one corresponds to an increase of one standard deviation of colsize). For categorical predictors, the coefficients represent the difference between one category and another (the reference category is the one not shown in the table). To visualize the coefficients we could draw effect plots.
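+The calls producing the three output blocks below are not shown in this extract; presumably they correspond to:
+summary(mod.glmer) # first block: summary of the glmer fit
+confint(mod.glmer) # second block: 95% confidence intervals
+summary(mod.brm)   # third block: summary of the brm fit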
+ +## Generalized linear mixed model fit by maximum likelihood (Laplace
+## Approximation) [glmerMod]
+## Family: binomial ( logit )
+## Formula: cbind(fledge, die) ~ colsize.z + cow + dung.z + (1 | farm.f)
+## Data: dat
+##
+## AIC BIC logLik deviance df.resid
+## 282.5 293.2 -136.3 272.5 58
+##
+## Scaled residuals:
+## Min 1Q Median 3Q Max
+## -3.2071 -0.4868 0.0812 0.6210 1.8905
+##
+## Random effects:
+## Groups Name Variance Std.Dev.
+## farm.f (Intercept) 0.2058 0.4536
+## Number of obs: 63, groups: farm.f, 51
+##
+## Fixed effects:
+## Estimate Std. Error z value Pr(>|z|)
+## (Intercept) -0.09533 0.19068 -0.500 0.6171
+## colsize.z 0.05087 0.11735 0.434 0.6646
+## cow 0.39370 0.22692 1.735 0.0827 .
+## dung.z -0.14236 0.10862 -1.311 0.1900
+## ---
+## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+##
+## Correlation of Fixed Effects:
+## (Intr) clsz.z cow
+## colsize.z 0.129
+## cow -0.828 -0.075
+## dung.z 0.033 0.139 -0.091
+
+## 2.5 % 97.5 %
+## .sig01 0.16809483 0.7385238
+## (Intercept) -0.48398346 0.2863200
+## colsize.z -0.18428769 0.2950063
+## cow -0.05360035 0.8588134
+## dung.z -0.36296714 0.0733620
+
+## Family: binomial
+## Links: mu = logit
+## Formula: fledge | trials(clutch) ~ colsize.z + cow + dung.z + (1 | farm.f)
+## Data: dat (Number of observations: 63)
+## Draws: 2 chains, each with iter = 2000; warmup = 500; thin = 1;
+## total post-warmup draws = 3000
+##
+## Group-Level Effects:
+## ~farm.f (Number of levels: 51)
+## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
+## sd(Intercept) 0.55 0.16 0.26 0.86 1.00 910 1284
+##
+## Population-Level Effects:
+## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
+## Intercept -0.10 0.21 -0.52 0.32 1.00 2863 2165
+## colsize.z 0.05 0.14 -0.21 0.34 1.00 2266 1794
+## cow 0.41 0.25 -0.06 0.90 1.00 3069 2117
+## dung.z -0.15 0.12 -0.38 0.09 1.00 3254 2241
+##
+## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
+## and Tail_ESS are effective sample size measures, and Rhat is the potential
+## scale reduction factor on split chains (at convergence, Rhat = 1).
+From the results we conclude that in farms without cows (cow=0), with average colony size (colsize.z=0) and an average number of dung heaps (dung.z=0), the average nestling survival of Barn Swallows is the inverse logit of the intercept, thus plogis(-0.1) = 0.47, with a 95% uncertainty interval of 0.37 to 0.58. We further see that colony size and the number of dung heaps are less important than whether cows are present or not. Their estimated partial effects are small and their uncertainty intervals include only values close to zero. However, whether cows are present or not may be important for the survival of nestlings. The average nestling survival in farms with cows is plogis(-0.1 + 0.41) = 0.58. For getting the uncertainty interval of this survival estimate, we need to do the calculation for every draw from the posterior distribution of both parameters.
bsim <- posterior_samples(mod.brm)
+# survival of nestlings on farms with cows:
+survivalest <- plogis(bsim$b_Intercept + bsim$b_cow)
+quantile(survivalest, probs=c(0.025, 0.975)) # 95% uncertainty interval
## 2.5% 97.5%
+## 0.5126716 0.6412675
+In medical research, it is standard to report the fixed-effects coefficients from a GLMM with binomial or Bernoulli error as odds ratios by taking the exponent (R function exp for \(e^{()}\)) of the coefficient on the logit scale. For example, the coefficient for cow from mod.glmer, 0.39 (95% CI from -0.05 to 0.86), represents an odds ratio of exp(0.39) = 1.48 (95% CI from 0.95 to 2.36). This means that the odds of fledging (vs. not fledging) for a clutch from a farm with livestock present are about 1.5 times larger than the odds of fledging if no livestock is present (relative effect).
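+A hedged sketch of how these numbers can be computed from the fitted glmer model (using the confidence limits shown above):
+exp(fixef(mod.glmer)["cow"])     # odds ratio for cow, approx. 1.48
+exp(confint(mod.glmer)["cow", ]) # 95% CI of the odds ratio, approx. 0.95 to 2.36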
+In 2015, we wrote a statistics book for Master/PhD level Bayesian data analyses in ecology (Korner-Nievergelt et al. 2015). You can order it here. People seemed to like it (e.g., Harju 2016). Since then, two parallel processes have been happening. First, we learn more and become more confident in what we do, or what we do not do, and why we do what we do. Second, several really clever people have developed software that broadens the spectrum of ecological models that can now easily be applied by ecologists used to working with R. With this e-book, we open the possibility to add new or substantially revised material. Most of the time, it should be in a state that it can be printed and used together with the book as a handout for our stats courses.
+We do not copy text from the book into the e-book. Therefore, we refer to the book (Korner-Nievergelt et al. 2015) for reading about the basic theory on doing Bayesian data analyses using linear models. However, Chapters 1 to 17 of this dynamic e-book correspond to the book chapters. In each chapter, we may provide updated R-codes and/or additional material. The following chapters contain completely new material that we think may be useful for ecologists.
+While we show the R code behind most of the analyses, we sometimes choose not to show all the code in the html version of the book. This is particularly the case for some of the illustrations. An interested reader can always consult the public GitHub repository with the rmarkdown files that were used to generate the book.
+It is open so that everybody with a GitHub account can make comments and suggestions for improvement. Readers can contribute in two ways. One way is to add an issue. The second way is to contribute content directly through the edit button at the top of the page (i.e. a symbol showing a pencil in a square). That button is linked to the rmarkdown source file of each page. You can correct typos or add new text and then submit a GitHub pull request. We try to respond to you as quickly as possible. We are looking forward to your contribution!
+We thank Yihui Xie for providing bookdown, which makes it great fun to write open books such as ours. We thank many anonymous students and collaborators who searched for information on new software, reported updates and gave feedback on earlier versions of the book. Specifically, we thank Carole Niffenegger for looking up the difference between the bulk and tail ESS in the brm output, Martin Küblbeck for using the conditional logistic regression in rstanarm, …