---
title: "A Fully Bayesian Way of Estimating Insurance Relativities"
author: "Brayden Tang"
date: "27/11/2020"
output:
  prettydoc::html_pretty:
    theme: tactile
    highlight: github
    toc: true
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(rstan)
library(tidyverse)
library(recipes)
library(DT)
library(kableExtra)
library(prettydoc)
library(statmod)
library(patchwork)
# CHANGE THIS BEFORE KNITTING IF YOU WANT TO RERUN THE MCMC CHAINS.
refit <- FALSE
```
## Some Context
I am not an actuarial student (anymore), but I recently completed a
typical pricing assignment as part of an evaluation for a job
application. I didn't end up applying; however, I enjoyed working with a
familiar (and fake) kind of dataset that I used to see all the time in
my past life. I wanted to see whether I could get around some of the
issues I had with these kinds of datasets three years ago.
The objective of actuarial relativity analysis is, at a high level, to
predict the pure premium as accurately as possible for all combinations
of rating variables. The pure premium is the amount of money needed, on
average, to cover the cost of claims only (so no profit allocations or
expenses). Rating variables are simply policyholder characteristics
(such as the number of kilometres on a vehicle, the age of the
policyholder, the color of the vehicle, engine size, etc.).
Naturally, the pure premium is expressed as the rate at which a risk
makes any kind of claim multiplied by the average amount per claim.
Thus, the pure premium is defined as:
$$ \text{frequency} \times \text{severity} = \frac{\text{total claim counts}}{\text{total risks = exposures}} \times \frac{\text{total claim amounts (losses)}}{\text{total claim counts}} $$
$$ = \frac{\text{total claim amounts (losses)}}{\text{total risks}}.$$
We can either model the frequency and severity distributions separately,
or model the pure premium directly. The former approach is often
preferred as it offers more flexibility and is, at least logically, more
robust.
## Relativities
Relativities are simply the **marginal** impact a particular rating variable
has on the response. They have the same interpretation as regression
coefficients in a multiple linear regression model: holding all other
variables constant, the coefficient represents the change in some
quantity (say, the log of severity or the log of frequency) per unit
change in said rating variable.
Relativities are typically calculated by predicting pure premiums across
all of the levels of a particular rating variable, while simultaneously
holding all of the other rating variables constant at the levels of some
chosen base class. The base class represents one specific combination of
rating variables that all other classes are compared against. Other
combinations of rating variables are more risky (higher pure premium) or
less risky (lower pure premium), **relative** to this base class.
Thus, each predicted pure premium is divided by the predicted pure
premium of the relevant level from the base class to obtain these
relativities.
```{r relativities, echo = FALSE}
example_city_rels <- tibble(
City = c("Small City", "Medium City", "Large City"),
Relativity = c(0.95, 1, 1.50)
) %>%
kable() %>%
kable_styling()
example_km_rels <- tibble(
Kilometres = c(
"Less Than 10,000km",
"10,000-50,000km",
"50,001-100,000km",
"More Than 100,000km"
),
Relativity = c(0.35, 0.76, 1, 1.378)
) %>%
kable() %>%
kable_styling()
knitr::kables(
list(example_city_rels, example_km_rels),
caption = "Figure 1: Example relativities with two rating variables") %>%
kable_styling()
```
<br> Note that the vehicle described by the levels of each rating
variable with relativities 1 (medium city & 50,001-100,000km) is the
base class (in a linear model, this is the class described by the
intercept). Relative to the base class, other vehicles have higher or
lower premiums. The base class premium is multiplied by the
corresponding relativities to derive the premiums for the other vehicle
classes.
For example, suppose the base class pure premium (the premium for a
policyholder who drives in the medium city and has a vehicle with
50,001-100,000km) is \$1000. Then, a policyholder who drives in the
small city with a vehicle that has more than 100,000km of mileage has a
pure premium of \$1000 \* 0.95 \* 1.378 = \$1309.10.
## The Dataset
I import the dataset below:
```{r Import the data, echo = FALSE}
data <- read_csv("data/pricingdat.csv")
data %>%
datatable(
rownames = FALSE,
options = list(scrollY = 300, searching = FALSE, scrollX = 300)) %>%
formatRound(columns = 6:7, digits = 0)
```
<br> As is typical, one row represents one unique combination of rating
variables, containing the total number of claims observed, the number of
exposures (i.e. the number of risks, typically defined as the amount of
policy time earned over a period of one year), and the total amount of
claim payments.
While this dataset seems straightforward, it is deceptively difficult to
work with compared to other, more standard datasets. This is primarily
due to the aggregation, which causes the rows and columns to be
dependent. If we add more rating variables (columns) to the dataset, we
naturally get more rows since there will be more unique combinations of
rating variables. In fact, the number of rows grows exponentially (see
the curse of dimensionality).
The consequences of this are dire. For one, model validation becomes
very difficult. We cannot naively split the dataset above because each
row is unique. If we did, the model would be evaluated against
combinations of rating variables that would be completely unseen in the
training set, forcing the model to extrapolate (likely leading to poor
predictions that aren't reflective of how the model would perform in
reality).
Second, explicit feature selection (which is emphasized a lot in pricing,
since there are many business reasons to keep the rating algorithm as
simple as possible) is impossible. We cannot simply drop columns
without affecting the number of rows (and therefore completely changing
the dataset in the process). Thus, it is unclear how to fairly compare
models that have different underlying training datasets.
Third, each row becomes more sparse as well due to the curse of
dimensionality, with some combinations of rating variables simply having
zero claims or exposures. Naturally, this leads to ill-defined
models - the traditional gamma won't work, for example, since we will
have losses that are exactly zero.
One solution to the sparsity problem is to group the sparse level of
the rating variable with another level. This seems
harmless, but in reality it corresponds to a deceptively strong prior,
namely that the effects of each level on the response are identical. In
math, this corresponds to setting
$\beta_{1} - \beta_{2} \sim N(0, 2\epsilon)$, where $\epsilon$ is some
number very close to 0 and $\beta_{1}, \beta_{2}$ are the effects of
the two separate levels being grouped. See [this
video](https://www.youtube.com/watch?v=BKumW2RfSoQ&ab_channel=Stan) for
more details. Regardless, grouping levels of rating variables together
is a very strong prior that is not at all transparent. Rather than doing
this, we can use mixed effects/hierarchical models to achieve partial
pooling, which is far more flexible and robust.
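To make the partial pooling alternative concrete, here is a minimal sketch (not code used in this analysis) of a hierarchical Poisson frequency model fit with `rstanarm`, where the levels of `make` get their own intercepts that are shrunk toward a common mean instead of being manually merged. The column names match the dataset prepared later in this report; treat the call as illustrative only.
```{r, eval = FALSE}
library(rstanarm)

# Partial pooling: each level of `make` receives its own intercept, but all
# intercepts share a common (estimated) distribution, so sparse levels are
# shrunk toward the overall mean rather than being grouped by hand.
fit_partial_pool <- stan_glmer(
  claim_count ~ kilometres + zone + bonus + (1 | make) +
    offset(log(vehicle_exposure_years)),
  data = data_with_severity,
  family = poisson(link = "log"),
  chains = 4,
  iter = 2000
)
```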
### Aside: The Ideal Dataset
The ideal dataset, at least hypothetically, would be a dataset where one
row is equal to one policyholder. The total number of earned exposures
for the policyholder would be one column, and the total claim counts and
total claim costs (for that policyholder) would be additional columns
(from which we could model). The rating variables relevant for that
policyholder would then be represented as additional columns.
This would secure independence between the number of rating variables
used and the number of rows. To obtain relativities, the aggregated
dataset could be recreated (that is, one row per unique combination of
rating variables), from which we could then make our predictions of
frequency and severity using the model trained on the policyholder-level
data. The model predicts its best guess of
$E[Y | X = \text{rating variables}]$ learned over all observed
policyholders.
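As a rough sketch (assuming a hypothetical policyholder-level table called `policyholder_data` with an `earned_exposure` column; neither exists in this report), the aggregation back down to one row per rating combination could look like this:
```{r, eval = FALSE}
# Hypothetical policyholder-level data: one row per policyholder, carrying the
# rating variables, earned exposure, claim counts, and claim payments.
aggregated <- policyholder_data %>%
  group_by(kilometres, zone, bonus, make) %>%
  summarise(
    vehicle_exposure_years = sum(earned_exposure),
    claim_count = sum(claim_count),
    claim_payments = sum(claim_payments),
    .groups = "drop"
  )
```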
## Post-Adjustments
Traditionally, generalized linear models (Poisson and gamma linear models in particular)
are used to predict the frequency and severity of claims. These models produce the expected frequency
and severity of a particular rating combination, and assuming independence, these expectations
are multiplied together to get the expected pure premium.
However, any model used to predict the frequency and severity of claims is
susceptible to overfitting, leading to relativities that may be overstated or understated.
This often leads to actuaries manually adjusting relativities to be more
in line with their a priori beliefs. For example, a vehicle with a
larger engine would be expected to be of higher risk than a vehicle with a
smaller engine, since presumably vehicles with larger engines are more
expensive on average. Sometimes the relativities produced do not reflect this due to
uncertainty/variance, and so an actuary may decide to "manually" adjust the values to smooth said estimates.
This is often done out of respect for the customer, who may not appreciate being
charged more in premium when they perceive themselves as lower risk.
Depending on the actuary, they may only adjust relativities that are not
monotonic.
However, these adjustments are arbitrary if we do not actually know how much
uncertainty there is in the estimates. In the case of traditional pricing
modeling, it is difficult to actually quantify how much uncertainty we
are dealing with because the frequency and severity models are estimated
completely independently of each other. The expectations produced by
each individual model are used to produce the pure premiums, which are
then used to calculate relativities.
However, it is clear that if we view frequency and severity as random
variables, they are not independent. If we have a zero frequency, we
must have a zero severity (or equivalently, if we have observed claims
we must have non-zero severity, assuming that claims below a deductible
are ignored). Estimating both models separately ignores this dependence,
which inevitably affects the variance of the pure premium.
In addition, other sources of variation, such as parameter and residual uncertainty,
are ignored.
Bayesian modeling will allow us to quantify the vast majority of
uncertainty in our data. By modeling the data as a generative process instead, we can account for nearly all sources of variation that are ignored in the traditional, frequentist approach. As a result, actuaries will have the ability to choose
reasonable values that are in line with the amount of uncertainty that exists. In addition, it allows actuaries
to quantify their confidence in allowing for a non-monotonic relativity,
giving them the ability to defend their decisions.
## The Bayesian/Generative Model
We describe the fully generative model below:
Let $X$ be the severity random variable, $R_{i}$ be the rating
variables, and $N$ be the claim counts. Assume that exposures are a
known constant.
Then, for the $i$th unique combination of rating variables (for all $i$
with $X_{i} > 0$), let:
$$X_{i} \big|N_{i}, \phi, \eta_{i} \sim Gamma\left(\frac{N_{i}}{\phi}, \frac{N_{i}}{\eta_{i} \times \phi}\right),$$
$$\eta_{i} \big| \beta, R_{i} = exp(R_{i} \times \beta),$$
$$N_{i} \big|\lambda_{i} \sim Poisson(\lambda_{i}),$$
$$\lambda_{i} \big|\psi, R_{i} = exp(R_{i} \times \psi + log(\text{exposure}_{i})),$$
$$\beta \sim Normal(0, \tau),$$
$$\psi \sim Normal(0, \omega),$$
$$\phi \sim \text{Half-Cauchy}(0, \alpha).$$ Note that
$\alpha, \tau, \omega$ are constants that must be specified by the user.
In addition, note that
$$E[X_{i} \big |N_{i}, \phi, \eta_{i}] = \frac{N_{i}}{\phi} \times \frac{\eta_{i} \times \phi}{N_{i}} = \eta_{i}$$
but
$$Var[X_{i} \big |N_{i}, \phi, \eta_{i}] = \frac{N_{i}}{\phi} \times \frac{\eta_{i}^2 \times \phi^2}{N_{i}^2} = \frac{\eta_{i}^2 \times \phi}{N_{i}}.$$
That is, as the number of claims making up a severity calculation
increases, the variance of the resulting severity distribution for that
combination of rating variables decreases. Equivalently, the
more claims a particular combination of rating variables has, the more
influence that particular combination of rating variables has on the
overall model fit. This is essentially the same as using weights
in a gamma generalized linear model; however, in this case the weights
are also random variables and everything is modeled simultaneously.
It follows that
$$X_{i} \times \frac{N_{i}}{\text{number of exposures for ith combination}} = P_{i},$$
where $P_{i}$ is the pure premium for the ith combination of rating
variables, is also a random variable. Since we will have joint posterior
draws of the two random variables that the pure premium is a function
of, the posterior distribution of the pure premium (and relativities) is
also known.
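To illustrate (a hypothetical sketch with made-up object names, not the exact code used later in this report): given a matrix of posterior draws of the expected claim rate per exposure and a matrix of expected severity draws, with rows indexing posterior samples and columns indexing rating combinations, the pure premium and relativity posteriors follow directly.
```{r, eval = FALSE}
# Joint posterior draws of the pure premium for every rating combination.
pp_draws <- freq_per_exposure_draws * sev_draws

# Relativities: divide each draw by that same draw's base class pure premium,
# preserving the posterior dependence between the two quantities.
rel_draws <- sweep(pp_draws, 1, pp_draws[, base_class_index], "/")

# Posterior summaries (e.g. medians and 95% intervals) of the relativities.
apply(rel_draws, 2, quantile, probs = c(0.025, 0.5, 0.975))
```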
The above formulation should also give insight into which combination of
rating variables should, in general, represent the base class pure
premium. Conditional on $N_{i}$, the variance of the pure premium for the
$i$th combination is
$\frac{N_{i}^2 Var[X_{i}]}{\text{exposures}_{i}^2} = \frac{N_{i} \eta_{i}^2 \phi}{\text{exposures}_{i}^2}$,
and since claim counts grow roughly in proportion to exposures, this
variance shrinks as exposures grow. Thus, choosing the base class with
the largest amount of exposures will often yield a low-varying (i.e.
stable) pure premium (assuming the severity variance is not extremely
large).
In addition, this Bayesian approach helps solve problems related to the
aggregation of data.
### Feature Selection
It was mentioned previously that feature selection is very difficult to
carry out on an aggregated dataset where rows and columns are not
independent. This is true if we are using explicit feature selection
where rating variables are literally removed from the design
matrix/training dataset.
However, we can also achieve "feature selection" through regularization.
It turns out that L1 (LASSO) and L2 (ridge) regularization are equivalent
to zero-mean Laplace and Normal priors on the regression coefficients,
respectively. L1 regularization has the potential to explicitly
feature select, allowing for exactly zero coefficients (albeit only at
the maximum; since Bayesian approaches marginalize rather than maximize,
we won't ever get exact zeros in the Bayesian approach). The cost parameter typically
associated with these regularization techniques is inversely related to the
variance of the Laplace/Normal prior: the larger the variance,
the lower the cost/regularization. Bayesian approaches also allow for
even more specific priors depending on the context, such as the
horseshoe or spike-and-slab priors.
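As a quick sketch of this correspondence: with independent Laplace$(0, b)$ priors on the coefficients, the negative log posterior is

$$-\log p(\beta \mid y) = -\log p(y \mid \beta) + \frac{1}{b}\sum_{j}|\beta_{j}| + \text{constant},$$

so maximizing the posterior is the same as minimizing the usual loss with an L1 penalty whose cost is $\lambda = 1/b$: the smaller the prior scale (and hence variance), the stronger the regularization.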
In the Bayesian context, the variance is chosen based on our confidence,
a priori, that the (predictive) effects of each level are non-zero.
Alternatively, one can use an empirical Bayes approach and
tune this variance parameter using something like cross validation. This
is equivalent to the typical machine learning approach, but in the
Bayesian context it isn't as feasible since Bayesian models fit using
MCMC are slow. An alternative is to estimate the expected out-of-sample
log pointwise predictive density (ELPD) using importance sampling
(counterfactuals of the log-likelihood), which is faster than true cross
validation (since there is no actual data splitting) but still potentially too slow.
Regardless, the Bayesian model allows us to use regularization while
still remaining completely probabilistic. This will pay off when we look
at relativities.
### Predictive Performance
The other main issue was model validation in terms of
predictive performance. We cannot just naively split the dataset,
because each row represents a unique combination of rating variables.
We can, however, use likelihood based evaluation metrics if we are
comparing models fit with the same underlying dataset (among other
caveats like having the same raw target variable and avoiding
comparisons between discrete and continuous distributions). One such
metric is the aforementioned leave one out estimate of ELPD, which
produces counterfactuals of the loglikelihood under scenarios where one
observation at a time is excluded from the dataset (so that each
resulting loglikelihood can be evaluated at the held out data point,
simulating cross validation). However, just like true data splitting
we still suffer from the same problem of each row being a unique
combination of rating variables. Thus, the ELPD on the held out rating
combination will be estimated through extrapolation, which is not true
to how the model will behave in reality.
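For completeness, here is a hedged sketch of how the importance-sampling (PSIS-LOO) estimate of ELPD could be computed with the `loo` package. It assumes the Stan programs were extended with a `generated quantities` block that stores a pointwise `log_lik` vector, which the models in this report do not currently have.
```{r, eval = FALSE}
library(loo)

# Pointwise log-likelihood matrices (draws x observations), assuming each
# fitted model exposes `log_lik` in generated quantities.
log_lik_pg <- extract_log_lik(fit_poisson_gamma, parameter_name = "log_lik")
log_lik_nb <- extract_log_lik(fit_nb_gamma, parameter_name = "log_lik")

# PSIS approximation to leave-one-out ELPD for each model.
loo_pg <- loo(log_lik_pg)
loo_nb <- loo(log_lik_nb)

# Relative comparison of estimated out-of-sample ELPD.
loo_compare(loo_pg, loo_nb)
```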
Unfortunately, it is unclear how to obtain absolute measures of predictive performance.
The only alternative may be to perform posterior predictive
checks, but these are known to be biased for predictive performance
because this involves in-sample data. Despite this, [some do suggest the use of 50% credible intervals to at least detect overfitting.](https://stats.stackexchange.com/questions/343420/bayesian-thinking-about-overfitting)
Some argue that Bayesian methods cannot really overfit provided that uninformative priors
are avoided and, in general, that marginalization rather than maximization is done as much as possible. Informative priors encourage estimators not to rely fully on the underlying training dataset by introducing bias via the prior. Marginalization is preferable to maximization since marginalization weighs each outcome by its posterior probability.
Regardless, none of this allows us to quantify how well the model actually predicts in absolute terms.
## Stan Code
Below is the same model described above, written in Stan. In this case, we set
both $\tau$ and $\omega$ to 1, which gives "weakly informative" priors:
no strong a priori beliefs regarding the effects exist (there is no context regarding the data),
but regularization to shrink effects toward 0, if justifiable, is desired.
For the dispersion parameter $\phi$, we use a half-Cauchy(0, 3) distribution, which is again weakly informative.
```{stan output.var = "poisson_gamma_model"}
data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  matrix[Nobs, Nvar] X;
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector<lower=0>[Nobs] severity;
}
parameters {
  vector[Nvar] beta_frequency;
  vector[Nvar] beta_severity;
  real<lower=0> dispersion_severity;
}
model {
  beta_frequency ~ std_normal();
  beta_severity ~ std_normal();
  dispersion_severity ~ cauchy(0, 3);
  claim_count ~ poisson_log(X * beta_frequency + log(exposure));
  for (i in 1:Nobs) {
    if (claim_count[i] != 0) {
      severity[i] ~ gamma(
        claim_count[i] / dispersion_severity,
        claim_count[i] / (exp(X[i, ] * beta_severity) * dispersion_severity)
      );
    }
  }
}
```
Next, preprocess the data into a format suitable for Stan.
```{r Preprocessing}
data_with_severity <- data %>%
mutate(
severity = ifelse(claim_count != 0, claim_payments / claim_count, 0),
kilometres = as.factor(kilometres),
zone = as.factor(zone),
bonus = as.factor(bonus),
make = as.factor(make),
observed_pp = claim_payments / vehicle_exposure_years
)
preprocessing <- recipe(claim_count ~ ., data = data_with_severity) %>%
step_dummy(all_nominal())
prepped_recipe <- prep(preprocessing, training = data_with_severity)
X <- juice(prepped_recipe) %>%
mutate(intercept = rep(1, nrow(.))) %>%
relocate(intercept) %>%
select(
-vehicle_exposure_years,
-claim_payments,
-severity,
-claim_count,
-observed_pp
)
data_list <- list(
Nobs = nrow(X),
Nvar = ncol(X),
X = as.matrix.data.frame(X),
claim_count = data_with_severity$claim_count,
exposure = data_with_severity$vehicle_exposure_years,
severity = data_with_severity$severity
)
```
Now we sample from the posterior distribution as required:
```{r Sample from Posterior - Poisson/Gamma}
if (refit == TRUE) {
fit_poisson_gamma <- sampling(
poisson_gamma_model,
data = data_list,
chains = 6,
iter = 2000,
seed = 200350623,
cores = 6,
verbose = TRUE
)
saveRDS(fit_poisson_gamma, "rds/fittedPGmodel.rds")
} else {
fit_poisson_gamma <- readRDS("rds/fittedPGmodel.rds")
}
```
This function extracts the required samples from the MCMC sampler.
```{r Helper Function For Extraction of Posterior Parameters}
#' Extract the required mean/dispersion parameters for both candidate frequency
#' and severity distributions.
#'
#' @param fitted_model A Stan fitted model (the result of rstan::sampling).
#'   This model MUST have named parameters beta_frequency and beta_severity,
#'   and can optionally have parameters dispersion_frequency and
#'   dispersion_severity.
#' @param exposures A vector of length N containing the exposures
#' @param X The design matrix of rating variables
#'
#' @return A list containing matrices/vectors of posterior samples of the
#'   required probability distributions. Both frequency and severity parameter
#'   samples are returned on the predictor (not linked) scale.
#' @export
#'
#' @examples
#' \dontrun{
#' extract_and_get_posterior_samples(
#'   my_stan_fitted,
#'   vehicle_exposure_years,
#'   X_design
#' )
#' }
extract_and_get_posterior_samples <- function(fitted_model, exposures, X) {
  samples <- rstan::extract(fitted_model)
  exposure <- matrix(
    log(exposures),
    nrow = nrow(samples$beta_frequency),
    ncol = length(exposures),
    byrow = TRUE
  )
  if (is.null(samples$dispersion_frequency)) {
    dispersion_frequency_samples <- NULL
  } else {
    dispersion_frequency_samples <- samples$dispersion_frequency
  }
  if (is.null(samples$dispersion_severity)) {
    dispersion_severity_samples <- NULL
  } else {
    dispersion_severity_samples <- samples$dispersion_severity
  }
  return(list(
    frequency_samples = samples$beta_frequency %*% t(as.matrix.data.frame(X)) +
      exposure,
    severity_samples = samples$beta_severity %*% t(as.matrix.data.frame(X)),
    dispersion_frequency_samples = dispersion_frequency_samples,
    dispersion_severity_samples = dispersion_severity_samples,
    exposures = exposure
  ))
}
```
We now extract the posterior samples.
```{r Extract Posterior Samples}
# Extract the samples for the Poisson-Gamma model.
pg_posterior_samples <- extract_and_get_posterior_samples(
fitted_model = fit_poisson_gamma,
exposures = data_with_severity$vehicle_exposure_years,
X = X
)
# Apply the inverse link functions (exponentiate the linear predictors).
pg_posterior_samples$frequency_samples <- exp(pg_posterior_samples$frequency_samples)
pg_posterior_samples$severity_samples <- exp(pg_posterior_samples$severity_samples)
```
## Posterior Predictive Checks
Does the underlying generative model fit the dataset well?
```{r Posterior Predictive Check function, echo = FALSE}
#' Plots posterior predictive samples against the observed data.
#'
#' @param posterior_predictive_samples Matrix of posterior predictive samples.
#' @param y Vector of observations representing the actuals.
#' @param n Number of samples to plot. Must be less than the number of rows in
#'   posterior_predictive_samples.
#' @param variable Character string for graphing purposes.
#' @param limits Vector of lower and upper limits to plot (zooming in on the plot).
#'
#' @return None.
#' @export
#'
#' @examples
#' posterior_pred_check(
#'   my_posterior_samples, n = 1000, variable = "claims"
#' )
posterior_pred_check <- function(posterior_predictive_samples, y, n, variable,
                                 limits) {
  data_tibble <- as_tibble(posterior_predictive_samples[1:n, ]) %>%
    mutate(iter = 1:nrow(.)) %>%
    gather(key = "obs", value = "var", -iter) %>%
    select(-obs)
  ggplot(data_tibble, aes(x = var)) +
    geom_density(aes(group = as.factor(iter)), color = alpha("blue", 0.7)) +
    theme_bw() +
    scale_x_continuous(trans = "sqrt") +
    geom_density(
      data = tibble(actual = y),
      aes(x = actual),
      size = 1,
      alpha = 0.9
    ) +
    coord_cartesian(xlim = limits, expand = FALSE) +
    labs(x = variable)
}
```
### Posterior Predictive Check - Frequency
First, we generate samples from the posterior predictive distribution. Then, we perform posterior predictive checks for the frequency, severity, and pure premium distributions (i.e. compare the generated data against the observed data).
```{r Generate Samples}
# rpois is vectorized: treat the matrix of posterior means as one long vector,
# draw one Poisson count per element, and then restore the original matrix
# dimensions.
set.seed(200350623)
claim_count_posterior_pred <- rpois(
n = length(pg_posterior_samples$frequency_samples),
lambda = pg_posterior_samples$frequency_samples
)
dim(claim_count_posterior_pred) <- dim(pg_posterior_samples$frequency_samples)
```
```{r Posterior Predictive Check - Frequency, fig.align="center", fig.height=6, fig.width=10}
posterior_pred_check(
claim_count_posterior_pred,
y = data$claim_count,
n = 50,
"claims (sqrt scaled)",
limits = c(0, 700)
)
```
The blue curves represent 50 simulated datasets generated from the
posterior predictive. The black curve is the actual, observed dataset
(in this case, claim counts). Ideally, we want the black curve to fall
**well** within the blue region as this implies that the actual,
observed dataset is a likely realization under our generative model.
The posterior predictive graphs are all zoomed in on the region where
the density is not trivially zero for easier inspection.
Overall, we can see that the frequency model looks decent at most of the
claim count ranges (especially the lower claim counts), but the rating
combinations with larger claim counts seem to be more variable than what
the model can explain.
That is, the distribution of claim counts exhibits possible
overdispersion. Thus, a possible improvement to the model might be to
use the negative binomial distribution rather than the Poisson.
### Posterior Predictive Check - Severity
Similarly, we get posterior predictive draws of severities for each
combination of rating variables.
```{r}
#' Computes the posterior parameters for the weighted Gamma distribution.
#'
#' @param posterior_samples Posterior parameter samples from the Stan model.
#' @param posterior_predictive_claim_counts Posterior predictive claim counts.
#'   Must be a matrix.
#'
#' @return A list with posterior shape and rate parameters.
#' @export
#'
#' @examples
#' sample_weighted_gamma(posterior_samples, claim_counts_nb)
sample_weighted_gamma <- function(
  posterior_samples,
  posterior_predictive_claim_counts) {
  dispersion_severity_posterior <- matrix(
    posterior_samples$dispersion_severity_samples,
    nrow = nrow(posterior_samples$frequency_samples),
    ncol = ncol(posterior_samples$frequency_samples)
  )
  shapes_posterior <- posterior_predictive_claim_counts /
    dispersion_severity_posterior
  rates_posterior <- posterior_predictive_claim_counts /
    (posterior_samples$severity_samples * dispersion_severity_posterior)
  return(list(shapes = shapes_posterior, rates = rates_posterior))
}
posterior_sev_samples_pg <- sample_weighted_gamma(
pg_posterior_samples,
claim_count_posterior_pred
)
set.seed(200350623)
simulate_severity <- rgamma(
n = length(pg_posterior_samples$severity_samples),
shape = posterior_sev_samples_pg$shapes,
rate = posterior_sev_samples_pg$rates
)
dim(simulate_severity) <- dim(pg_posterior_samples$severity_samples)
```
```{r, fig.align="center", fig.height=6, fig.width=10}
posterior_pred_check(
simulate_severity,
y = data_with_severity$severity,
n = 50,
variable = "severity (sqrt scaled)",
limits = c(0, 40000)
)
```
As we can see, the generative model looks off here. The
model underfits the most important areas of largest density (the large
spike) and fails to account for uncertainty at higher and lower levels
of severity.
A major contributor to this is the unusual amount of density in the
31,000 region for severity. It turns out that this corresponds to a
number of rating combinations with one or two claims and total payments
in multiples of 31,442. This is likely an artifact that has been
inserted into the data, which is incredibly suspicious. However, these
rating combinations receive so little weight in the fit, due to the
weighting scheme employed, that the overall impact of these large
severity amounts should be mitigated.
Ultimately, however, the goal is the pure premium, since our relativities
are based on that quantity.
### Posterior Predictive Check - Pure Premium
```{r Posterior Predictive Check - PP}
simulate_pure_premium <- (
claim_count_posterior_pred / exp(pg_posterior_samples$exposures)
) * simulate_severity
```
```{r, fig.align="center", fig.height=6, fig.width=10}
posterior_pred_check(
simulate_pure_premium,
y = data_with_severity$observed_pp,
n = 50,
variable = "pure premium (sqrt scaled)",
limits = c(0, 4000)
)
```
The "peaks" in the distribution are not as extreme as what is observed
in the data. The black curve does not fall in the blue region in
important areas of the distribution, namely the single peak. Thus, under
the generative model the observed data is simply not probable in some
regions of pure premium, which is an issue.
Otherwise, the generative model at least has the right shape, for the
most part.
```{r remove objects, include = FALSE}
rm(
claim_count_posterior_pred, fit_poisson_gamma, pg_posterior_samples,
posterior_sev_samples_pg, simulate_pure_premium, simulate_severity
)
```
## Model Revision - Switch Frequency to Negative Binomial
We noticed in the previous section that the Poisson distribution does
not account for as much variance as we would like. Indeed, the variance
of the Poisson distribution is equal to its mean.
A negative binomial has variance greater than its mean and therefore
could address issues where uncertainty is understated for particular
rating variable combinations with large claim counts.
The generative model is quite similar to the previous model, but we
switch the frequency distribution:
$$X_{i} \big|N_{i}, \phi, \eta_{i} \sim Gamma\left(\frac{N_{i}}{\phi}, \frac{N_{i}}{\eta_{i} \times \phi}\right),$$
$$\eta_{i} \big| \beta, R_{i} = exp(R_{i} \times \beta),$$
$$N_{i} \big|\xi_{i}, \nu \sim \text{Negative Binomial}(\xi_{i}, \nu),$$
$$\xi_{i} \big|\psi, R_{i} = exp(R_{i} \times \psi + log(\text{exposure}_{i})),$$
$$\beta \sim Normal(0, 1),$$
$$\psi \sim Normal(0, 1),$$
$$\phi \sim \text{Half-Cauchy}(0, 3),$$
$$\nu \sim \text{Half-Cauchy}(0, 10).$$ Note that the Negative Binomial
uses the mean/dispersion parameterization in Stan (i.e.
`neg_binomial_2`). The half-Cauchy prior on $\nu$ is set to be very wide, since tighter priors (which concentrate $\nu$ near zero) heavily favor models with significant amounts of overdispersion, leading to sampling issues if the data does not exhibit strong overdispersion.
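Under this parameterization, the mean and variance of the claim counts are

$$E[N_{i}] = \xi_{i}, \qquad Var[N_{i}] = \xi_{i} + \frac{\xi_{i}^2}{\nu},$$

so small values of $\nu$ correspond to strong overdispersion, while $\nu \to \infty$ recovers the Poisson.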
Compiling the model above in Stan:
```{stan output.var = "nb_gamma"}
data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  matrix[Nobs, Nvar] X;
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector<lower=0>[Nobs] severity;
}
parameters {
  vector[Nvar] beta_frequency;
  vector[Nvar] beta_severity;
  real<lower=0> dispersion_severity;
  real<lower=0> dispersion_frequency;
}
model {
  beta_frequency ~ std_normal();
  beta_severity ~ std_normal();
  dispersion_severity ~ cauchy(0, 3);
  /* Larger variance here for the negative binomial because the data did not
     appear to be that overdispersed and we need to allow for larger
     dispersion parameters to consider these cases. */
  dispersion_frequency ~ cauchy(0, 10);
  claim_count ~ neg_binomial_2_log(
    X * beta_frequency + log(exposure), dispersion_frequency
  );
  for (i in 1:Nobs) {
    if (claim_count[i] != 0) {
      severity[i] ~ gamma(
        claim_count[i] / dispersion_severity,
        claim_count[i] / (exp(X[i, ] * beta_severity) * dispersion_severity)
      );
    }
  }
}
```
Now we sample from the posterior distribution as required:
```{r Sample from Posterior - NB/Gamma}
if (refit == TRUE) {
fit_nb_gamma <- sampling(
nb_gamma,
data = data_list,
chains = 6,
iter = 2000,
seed = 200350623,
cores = 6,
verbose = TRUE
)
saveRDS(fit_nb_gamma, "rds/nb-weighted_gamma.rds")
} else {
fit_nb_gamma <- readRDS("rds/nb-weighted_gamma.rds")
}
nb_posterior_samples <- extract_and_get_posterior_samples(
fitted_model = fit_nb_gamma,
exposures = data_with_severity$vehicle_exposure_years,
X = X
)
nb_posterior_samples$frequency_samples <- exp(nb_posterior_samples$frequency_samples)
nb_posterior_samples$severity_samples <- exp(nb_posterior_samples$severity_samples)
dispersion_frequency_nb <- matrix(
nb_posterior_samples$dispersion_frequency_samples,
nrow = length(nb_posterior_samples$dispersion_frequency_samples),
ncol = nrow(data)
)
```
## Posterior Predictive Check - Revision I
We again simulate observations from the posterior predictive.
```{r}
# Frequency
set.seed(200350623)
claim_count_posterior_pred_nb <- rnbinom(
n = length(nb_posterior_samples$frequency_samples),
mu = nb_posterior_samples$frequency_samples,
size = dispersion_frequency_nb
)
dim(claim_count_posterior_pred_nb) <- dim(dispersion_frequency_nb)
parameters_nb <- sample_weighted_gamma(
posterior_samples = nb_posterior_samples,
posterior_predictive_claim_counts = claim_count_posterior_pred_nb
)
# Severity
set.seed(200350623)
simulate_severity_nb <- rgamma(
n = length(parameters_nb$shapes),
shape = parameters_nb$shapes,
rate = parameters_nb$rates
)
dim(simulate_severity_nb) <- dim(nb_posterior_samples$severity_samples)
# Pure Premium
simulate_pure_premium_nb <- (
claim_count_posterior_pred_nb / exp(nb_posterior_samples$exposures)
) * simulate_severity_nb
```
Next, complete the posterior predictive checks as before:
```{r}
results <- mapply(
posterior_pred_check,
posterior_predictive_samples = list(
claim_count_posterior_pred_nb,
simulate_severity_nb,
simulate_pure_premium_nb
),
y = list(
data_with_severity$claim_count,
data_with_severity$severity,
data_with_severity$observed_pp
),
variable = c(
"claims (sqrt scaled)",
"severity (sqrt scaled)",
"pure premium (sqrt scaled)"
),
limits = list(c(0, 700), c(0, 40000), c(0, 4000)),
MoreArgs = list(n = 50),
SIMPLIFY = FALSE
)
```
Finally:
```{r, fig.align="center", fig.height=6, fig.width=10}
(results[[1]] | results[[2]]) / results[[3]] + plot_annotation(
title = "Posterior Predictive Check - Negative Binomial Frequency"
)
```
Overall, the frequency distribution (claims) looks to be better modeled
with a Negative Binomial, albeit marginally.
The severity distribution still looks off, and as a result, the pure
premium looks only marginally (if that) better than before.
```{r include = FALSE}
rm(
claim_count_posterior_pred_nb,
nb_posterior_samples,
parameters_nb,
simulate_severity_nb
)
gc()
```
## Model Revision - Switch Severity to a Heavier-Tailed Distribution
It looks like the current generative model gives too much density to
small severity amounts, but also too much density to larger amounts.
Therefore, if we switch to a heavier-tailed distribution we might make the
tails even heavier (and they are already too heavy), even though we may
improve the fit for smaller severity amounts.
Regardless, we can still explore these distributions and see how they
compare.
Many distributions have heavier tails than the gamma.
### Log-normal
The log-normal distribution is known to have heavier tails than the
gamma distribution. This can be shown by calculating the limit of the ratio of survival
functions.
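As a brief sketch of that argument (using L'Hôpital's rule, since both survival functions tend to 0, the limit of the ratio of survival functions equals the limit of the ratio of densities):

$$\lim_{x \to \infty} \frac{S_{\text{LN}}(x)}{S_{\text{Gamma}}(x)} = \lim_{x \to \infty} \frac{f_{\text{LN}}(x)}{f_{\text{Gamma}}(x)} = \lim_{x \to \infty} \frac{\frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)}{\frac{b^{a}}{\Gamma(a)} x^{a - 1} e^{-bx}} = \infty,$$

since $e^{bx}$ eventually dominates any term of the form $x^{a} e^{c (\ln x)^2}$.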
Stan actually has a log-normal distribution built in. However, it is
difficult to work with the log-normal distribution when linking the
expected value of the log-normal with predictors (this is because the
log-normal does not belong to the exponential dispersion family).
Instead of using the log-normal distribution directly, we can log the
severity random variable to get a Normal($\mu$, $\sigma$) random
variable (assuming that the severity random variable is log-normal). We
can then simulate directly from the log-normal distribution using the
posterior parameters to get the posterior predictive of the original
(assumed) log-normal severity random variable.
```{r}
data_list$severity <- ifelse(
data_with_severity$claim_count != 0,
log(data_with_severity$severity),
0
)
```
The generative model is exactly the same as before, except we now
replace $X_{i} \big | N_{i}, \phi, \eta_{i}$ with a log-normal whose
(log-scale) parameters are $R_{i}\beta$ and $\frac{\sigma}{\sqrt{N_{i}}}$.
```{stan output.var="sm_nb_lognormal"}
data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;