---
title: "A Fully Bayesian Way of Estimating Insurance Relativities"
author: "Brayden Tang"
date: "27/11/2020"
output:
  prettydoc::html_pretty:
    theme: tactile
    highlight: github
    toc: true
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(rstan)
library(tidyverse)
library(recipes)
library(DT)
library(kableExtra)
library(prettydoc)
library(statmod)
library(patchwork)
# CHANGE THIS BEFORE KNITTING IF YOU WANT TO RERUN THE MCMC CHAINS.
refit <- FALSE
```
## Some Context
I am not an actuarial student (anymore), but I recently completed a
typical pricing assignment as part of an evaluation for a job
application. I didn't end up applying; however, I enjoyed working with a
familiar (and fake) kind of dataset that I used to see all the time in
my past life. I wanted to see whether I could get around some of the
issues I had with these kinds of datasets three years ago.
The objective of actuarial relativity analysis is, at a high level, to
predict the pure premium as accurately as possible for all combinations
of rating variables. The pure premium is the amount of money needed, on
average, to cover the cost of claims only (so no profit allocations or
expenses). Rating variables are simply policyholder characteristics
(such as the number of kilometres on a vehicle, the age of the
policyholder, the color of the vehicle, engine size, etc.).
Naturally, the pure premium is expressed as the rate at which a risk
makes any kind of claim multiplied by the average amount per claim.
Thus, the pure premium is defined as:
$$ \text{frequency} \times \text{severity} = \frac{\text{total claim counts}}{\text{total risks = exposures}} \times \frac{\text{total claim amounts (losses)}}{\text{total claim counts}} $$
$$ = \frac{\text{total claim amounts (losses)}}{\text{total risks}}.$$
We can either model the frequency and severity distributions separately,
or model the pure premium directly. The former approach is often
preferred as it offers more flexibility and is, at least logically, more
robust.
## Relativities
Relativities are simply the **marginal** impact a particular rating variable
has on the response. They have the same interpretation as regression
coefficients in a multiple linear regression model: holding all other
variables constant, the coefficient represents the change in some
quantity (say, the log of severity or the log of frequency) per unit
change in said rating variable.
Relativities are typically calculated by predicting pure premiums across
all of the levels of a particular rating variable, while simultaneously
holding all of the other rating variables constant at the levels of some
chosen base class. The base class represents one specific combination of
rating variables that all other classes are compared against. Other
combinations of rating variables are more risky (higher pure premium) or
less risky (lower pure premium), **relative** to this base class.
Thus, each predicted pure premium is divided by the predicted pure
premium of the relevant level from the base class to obtain these
relativities.
```{r relativities, echo = FALSE}
example_city_rels <- tibble(
City = c("Small City", "Medium City", "Large City"),
Relativity = c(0.95, 1, 1.50)
) %>%
kable() %>%
kable_styling()
example_km_rels <- tibble(
Kilometres = c(
"Less Than 10,000km",
"10,000-50,000km",
"50,001-100,000km",
"More Than 100,000km"
),
Relativity = c(0.35, 0.76, 1, 1.378)
) %>%
kable() %>%
kable_styling()
knitr::kables(
list(example_city_rels, example_km_rels),
caption = "Figure 1: Example relativities with two rating variables") %>%
kable_styling()
```
<br> Note that the vehicle described by the levels of each rating
variable with relativities 1 (medium city & 50,001-100,000km) is the
base class (in a linear model, this is the class described by the
intercept). Relative to the base class, other vehicles have higher or
lower premiums. The base class premium is multiplied by the
corresponding relativities to derive the premiums for the other vehicle
classes.
For example, suppose the base class pure premium (the premium for a
policyholder who drives in the medium city and has a vehicle with
50,001-100,000km) is \$1000. Then, a policyholder who drives in the
small city with a vehicle that has more than 100,000km of mileage has a
pure premium of \$1000 \* 0.95 \* 1.378 = \$1309.10.
## The Dataset
I import the dataset below:
```{r Import the data, echo = FALSE}
data <- read_csv("data/pricingdat.csv")
data %>%
datatable(
rownames = FALSE,
options = list(scrollY = 300, searching = FALSE, scrollX = 300)) %>%
formatRound(columns = 6:7, digits = 0)
```
<br> As is typical, one row represents one unique combination of rating
variables, containing the total number of claims observed, the number of
exposures (i.e. the number of risks, typically defined as the amount of
policy time earned over a period of one year), and the total amount of
claim payments.
While this dataset seems straightforward, it is deceptively difficult to
work with compared to other, more standard datasets. This is primarily
due to the aggregation, which causes the rows and columns to be
dependent. If we add more rating variables (columns) to the dataset, we
naturally get more rows since there will be more unique combinations of
rating variables. In fact, the number of rows grows exponentially (see
the curse of dimensionality).
The consequences of this are dire. For one, model validation becomes
very difficult. We cannot naively split the dataset above because each
row is unique. If we did, the model would be evaluated against
combinations of rating variables that would be completely unseen in the
training set, forcing the model to extrapolate (likely leading to poor
predictions that aren't reflective of how the model would perform in
reality).
Second, explicit feature selection (which is emphasized a lot in pricing,
since there are many business reasons to keep the rating algorithm as
simple as possible) is impossible. We cannot simply drop columns
without affecting the number of rows (and therefore completely changing
the dataset in the process). Thus, it is unclear how to fairly compare
models that have different underlying training datasets.
Third, each row becomes more sparse as well due to the curse of
dimensionality, with some combinations of rating variables simply having
zero claims or exposures. Naturally, this leads to ill-defined
models - the traditional gamma won't work, for example, since we will
have losses that are exactly zero.
One solution to the sparsity problem is to group the sparse level of
the rating variable with another level. This seems
harmless, but in reality it corresponds to a deceptively strong prior,
namely that the effects of each level on the response are identical. In
math, this corresponds to setting
$\beta_{1} - \beta_{2} \sim N(0, 2\epsilon)$, where $\epsilon$ is some
number very close to 0 and $\beta_{1}, \beta_{2}$ are the effects of
the two separate levels being grouped. See [this
video](https://www.youtube.com/watch?v=BKumW2RfSoQ&ab_channel=Stan) for
more details. Regardless, grouping levels of rating variables together
is a very strong prior that is not at all transparent. Rather than doing
this, we can use mixed effects/hierarchical models to achieve partial
pooling, which is far more flexible and robust.
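To make the partial pooling alternative concrete, here is a minimal sketch (not code used in this analysis) of a hierarchical Poisson frequency model fit with `rstanarm`, where the levels of `make` get their own intercepts that are shrunk toward a common mean instead of being manually merged. The column names match the dataset prepared later in this report; treat the call as illustrative only.
```{r, eval = FALSE}
library(rstanarm)

# Partial pooling: each level of `make` receives its own intercept, but all
# intercepts share a common (estimated) distribution, so sparse levels are
# shrunk toward the overall mean rather than being grouped by hand.
fit_partial_pool <- stan_glmer(
  claim_count ~ kilometres + zone + bonus + (1 | make) +
    offset(log(vehicle_exposure_years)),
  data = data_with_severity,
  family = poisson(link = "log"),
  chains = 4,
  iter = 2000
)
```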
### Aside: The Ideal Dataset
The ideal dataset, at least hypothetically, would be a dataset where one
row is equal to one policyholder. The total number of earned exposures
for the policyholder would be one column, and the total claim counts and
total claim costs (for that policyholder) would be additional columns
(from which we could model). The rating variables relevant for that
policyholder would then be represented as additional columns.
This would secure independence between the number of rating variables
used and the number of rows. To obtain relativities, the aggregated
dataset could be recreated (that is, one row per unique combination of
rating variables), from which we could then make our predictions of
frequency and severity using the model trained on the policyholder-level
data. The model predicts its best guess of
$E[Y | X = \text{rating variables}]$ learned over all observed
policyholders.
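As a rough sketch (assuming a hypothetical policyholder-level table called `policyholder_data` with an `earned_exposure` column; neither exists in this report), the aggregation back down to one row per rating combination could look like this:
```{r, eval = FALSE}
# Hypothetical policyholder-level data: one row per policyholder, carrying the
# rating variables, earned exposure, claim counts, and claim payments.
aggregated <- policyholder_data %>%
  group_by(kilometres, zone, bonus, make) %>%
  summarise(
    vehicle_exposure_years = sum(earned_exposure),
    claim_count = sum(claim_count),
    claim_payments = sum(claim_payments),
    .groups = "drop"
  )
```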
## Post-Adjustments
Traditionally, generalized linear models (Poisson and gamma linear models in particular)
are used to predict the frequency and severity of claims. These models produce the expected frequency
and severity of a particular rating combination, and assuming independence, these expectations
are multiplied together to get the expected pure premium.
However, any model used to predict the frequency and severity of claims is
susceptible to overfitting, leading to relativities that may be overstated or understated.
This often leads to actuaries manually adjusting relativities to be more
in line with their a priori beliefs. For example, a vehicle with a
larger engine would be expected to be of higher risk than a vehicle with a
smaller engine, since presumably vehicles with larger engines are more
expensive on average. Sometimes the relativities produced do not reflect this due to
uncertainty/variance, and so an actuary may decide to "manually" adjust the values to smooth said estimates.
This is often done out of respect for the customer, who may not appreciate being
charged more in premium when they perceive themselves as lower risk.
Depending on the actuary, they may only adjust relativities that are not
monotonic.
However, these adjustments are arbitrary if we do not actually know how much
uncertainty there is in the estimates. In the case of traditional pricing
modeling, it is difficult to actually quantify how much uncertainty we
are dealing with because the frequency and severity models are estimated
completely independently of each other. The expectations produced by
each individual model are used to produce the pure premiums, which are
then used to calculate relativities.
However, it is clear that if we view frequency and severity as random
variables, they are not independent. If we have a zero frequency, we
must have a zero severity (or equivalently, if we have observed claims
we must have non-zero severity, assuming that claims below a deductible
are ignored). Estimating both models separately ignores this dependence,
which inevitably affects the variance of the pure premium.
In addition, other sources of variation, such as parameter and residual uncertainty,
are ignored.
Bayesian modeling will allow us to quantify the vast majority of
uncertainty in our data. By modeling the data as a generative process instead, we can account for nearly all sources of variation that are ignored in the traditional, frequentist approach. As a result, actuaries will have the ability to choose
reasonable values that are in line with the amount of uncertainty that exists. In addition, it allows actuaries
to quantify their confidence in allowing for a non-monotonic relativity,
giving them the ability to defend their decisions.
## The Bayesian/Generative Model
We describe the fully generative model below:
Let $X$ be the severity random variable, $R_{i}$ be the rating
variables, and $N$ be the claim counts. Assume that exposures are a
known constant.
Then, for the $i$th unique combination of rating variables (for all $i$
with $X_{i} > 0$), let:
$$X_{i} \big|N_{i}, \phi, \eta_{i} \sim Gamma\left(\frac{N_{i}}{\phi}, \frac{N_{i}}{\eta_{i} \times \phi}\right),$$
$$\eta_{i} \big| \beta, R_{i} = exp(R_{i} \times \beta),$$
$$N_{i} \big|\lambda_{i} \sim Poisson(\lambda_{i}),$$
$$\lambda_{i} \big|\psi, R_{i} = exp(R_{i} \times \psi + log(\text{exposure}_{i})),$$
$$\beta \sim Normal(0, \tau),$$
$$\psi \sim Normal(0, \omega),$$
$$\phi \sim \text{Half-Cauchy}(0, \alpha).$$ Note that
$\alpha, \tau, \omega$ are constants that must be specified by the user.
In addition, note that
$$E[X_{i} \big |N_{i}, \phi, \eta_{i}] = \frac{N_{i}}{\phi} \times \frac{\eta_{i} \times \phi}{N_{i}} = \eta_{i}$$
but
$$Var[X_{i} \big |N_{i}, \phi, \eta_{i}] = \frac{N_{i}}{\phi} \times \frac{\eta_{i}^2 \times \phi^2}{N_{i}^2} = \frac{\eta_{i}^2 \times \phi}{N_{i}}.$$
That is, as the number of claims making up a severity calculation
increases, the variance of the resulting severity distribution for that
combination of rating variables decreases. Equivalently, the
more claims a particular combination of rating variables has, the more
influence that particular combination of rating variables has on the
overall model fit. This is essentially the same as using weights
in a gamma generalized linear model; however, in this case the weights
are also random variables and everything is modeled simultaneously.
It follows that
$$X_{i} \times \frac{N_{i}}{\text{number of exposures for ith combination}} = P_{i},$$
where $P_{i}$ is the pure premium for the ith combination of rating
variables, is also a random variable. Since we will have joint posterior
draws of the two random variables that the pure premium is a function
of, the posterior distribution of the pure premium (and relativities) is
also known.
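To illustrate (a hypothetical sketch with made-up object names, not the exact code used later in this report): given a matrix of posterior draws of the expected claim rate per exposure and a matrix of expected severity draws, with rows indexing posterior samples and columns indexing rating combinations, the pure premium and relativity posteriors follow directly.
```{r, eval = FALSE}
# Joint posterior draws of the pure premium for every rating combination.
pp_draws <- freq_per_exposure_draws * sev_draws

# Relativities: divide each draw by that same draw's base class pure premium,
# preserving the posterior dependence between the two quantities.
rel_draws <- sweep(pp_draws, 1, pp_draws[, base_class_index], "/")

# Posterior summaries (e.g. medians and 95% intervals) of the relativities.
apply(rel_draws, 2, quantile, probs = c(0.025, 0.5, 0.975))
```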
The above formulation should also give insight into which combination of
rating variables should, in general, represent the base class pure
premium. Conditional on $N_{i}$, the variance of the pure premium for the
$i$th combination is
$\frac{N_{i}^2 Var[X_{i}]}{\text{exposures}_{i}^2} = \frac{N_{i} \eta_{i}^2 \phi}{\text{exposures}_{i}^2}$,
and since claim counts grow roughly in proportion to exposures, this
variance shrinks as exposures grow. Thus, choosing the base class with
the largest amount of exposures will often yield a low-varying (i.e.
stable) pure premium (assuming the severity variance is not extremely
large).
In addition, this Bayesian approach helps solve problems related to the
aggregation of data.
### Feature Selection
It was mentioned previously that feature selection is very difficult to
carry out on an aggregated dataset where rows and columns are not
independent. This is true if we are using explicit feature selection
where rating variables are literally removed from the design
matrix/training dataset.
However, we can also achieve "feature selection" through regularization.
It turns out that L1 (LASSO) and L2 (ridge) regularization are equivalent
to zero-mean Laplace and Normal priors on the regression coefficients,
respectively. L1 regularization has the potential to explicitly
feature select, allowing for exactly zero coefficients (albeit only at
the maximum; since Bayesian approaches marginalize rather than maximize,
we won't ever get exact zeros in the Bayesian approach). The cost parameter typically
associated with these regularization techniques is inversely related to the
variance of the Laplace/Normal prior: the larger the variance,
the lower the cost/regularization. Bayesian approaches also allow for
even more specific priors depending on the context, such as the
horseshoe or spike-and-slab priors.
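As a quick sketch of this correspondence: with independent Laplace$(0, b)$ priors on the coefficients, the negative log posterior is

$$-\log p(\beta \mid y) = -\log p(y \mid \beta) + \frac{1}{b}\sum_{j}|\beta_{j}| + \text{constant},$$

so maximizing the posterior is the same as minimizing the usual loss with an L1 penalty whose cost is $\lambda = 1/b$: the smaller the prior scale (and hence variance), the stronger the regularization.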
In the Bayesian context, the variance is chosen based on our confidence,
a priori, that the (predictive) effects of each level are non-zero.
Alternatively, one can use an empirical Bayes approach and
tune this variance parameter using something like cross validation. This
is equivalent to the typical machine learning approach, but in the
Bayesian context it isn't as feasible since Bayesian models fit using
MCMC are slow. An alternative is to estimate the expected out-of-sample
log pointwise predictive density (ELPD) using importance sampling
(counterfactuals of the log-likelihood), which is faster than true cross
validation (since there is no actual data splitting) but still potentially too slow.
Regardless, the Bayesian model allows us to use regularization while
still remaining completely probabilistic. This will pay off when we look
at relativities.
### Predictive Performance
The other main issue was model validation in terms of
predictive performance. We cannot just naively split the dataset,
because each row represents a unique combination of rating variables.
We can, however, use likelihood based evaluation metrics if we are
comparing models fit with the same underlying dataset (among other
caveats like having the same raw target variable and avoiding
comparisons between discrete and continuous distributions). One such
metric is the aforementioned leave one out estimate of ELPD, which
produces counterfactuals of the loglikelihood under scenarios where one
observation at a time is excluded from the dataset (so that each
resulting loglikelihood can be evaluated at the held out data point,
simulating cross validation). However, just like true data splitting
we still suffer from the same problem of each row being a unique
combination of rating variables. Thus, the ELPD on the held out rating
combination will be estimated through extrapolation, which is not true
to how the model will behave in reality.
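For completeness, here is a hedged sketch of how the importance-sampling (PSIS-LOO) estimate of ELPD could be computed with the `loo` package. It assumes the Stan programs were extended with a `generated quantities` block that stores a pointwise `log_lik` vector, which the models in this report do not currently have.
```{r, eval = FALSE}
library(loo)

# Pointwise log-likelihood matrices (draws x observations), assuming each
# fitted model exposes `log_lik` in generated quantities.
log_lik_pg <- extract_log_lik(fit_poisson_gamma, parameter_name = "log_lik")
log_lik_nb <- extract_log_lik(fit_nb_gamma, parameter_name = "log_lik")

# PSIS approximation to leave-one-out ELPD for each model.
loo_pg <- loo(log_lik_pg)
loo_nb <- loo(log_lik_nb)

# Relative comparison of estimated out-of-sample ELPD.
loo_compare(loo_pg, loo_nb)
```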
Unfortunately, it is unclear how to obtain absolute measures of predictive performance.
The only alternative may be to perform posterior predictive
checks, but these are known to be biased for predictive performance
because this involves in-sample data. Despite this, [some do suggest the use of 50% credible intervals to at least detect overfitting.](https://stats.stackexchange.com/questions/343420/bayesian-thinking-about-overfitting)
Some argue that Bayesian methods cannot really overfit provided that uninformative priors
are avoided and, in general, that marginalization rather than maximization is done as much as possible. Informative priors encourage estimators not to rely fully on the underlying training dataset by introducing bias via the prior. Marginalization is preferable to maximization since marginalization weighs each outcome by its posterior probability.
Regardless, none of this allows us to quantify how well the model actually predicts in absolute terms.
## Stan Code
Below is the same model described above, written in Stan. In this case, we set
both $\tau$ and $\omega$ to 1, which gives "weakly informative" priors:
no strong a priori beliefs regarding the effects exist (there is no context regarding the data),
but regularization to shrink effects toward 0, if justifiable, is desired.
For the dispersion parameter $\phi$, we use a half-Cauchy(0, 3) distribution, which is again weakly informative.
```{stan output.var = "poisson_gamma_model"}
data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  matrix[Nobs, Nvar] X;
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector<lower=0>[Nobs] severity;
}
parameters {
  vector[Nvar] beta_frequency;
  vector[Nvar] beta_severity;
  real<lower=0> dispersion_severity;
}
model {
  beta_frequency ~ std_normal();
  beta_severity ~ std_normal();
  dispersion_severity ~ cauchy(0, 3);
  claim_count ~ poisson_log(X * beta_frequency + log(exposure));
  for (i in 1:Nobs) {
    if (claim_count[i] != 0) {
      severity[i] ~ gamma(
        claim_count[i] / dispersion_severity,
        claim_count[i] / (exp(X[i, ] * beta_severity) * dispersion_severity)
      );
    }
  }
}
```
Next, preprocess the data into a format suitable for Stan.
```{r Preprocessing}
data_with_severity <- data %>%
mutate(
severity = ifelse(claim_count != 0, claim_payments / claim_count, 0),
kilometres = as.factor(kilometres),
zone = as.factor(zone),
bonus = as.factor(bonus),
make = as.factor(make),
observed_pp = claim_payments / vehicle_exposure_years
)
preprocessing <- recipe(claim_count ~ ., data = data_with_severity) %>%
step_dummy(all_nominal())
prepped_recipe <- prep(preprocessing, training = data_with_severity)
X <- juice(prepped_recipe) %>%
mutate(intercept = rep(1, nrow(.))) %>%
relocate(intercept) %>%
select(
-vehicle_exposure_years,
-claim_payments,
-severity,
-claim_count,
-observed_pp
)
data_list <- list(
Nobs = nrow(X),
Nvar = ncol(X),
X = as.matrix.data.frame(X),
claim_count = data_with_severity$claim_count,
exposure = data_with_severity$vehicle_exposure_years,
severity = data_with_severity$severity
)
```
Now we sample from the posterior distribution as required:
```{r Sample from Posterior - Poisson/Gamma}
if (refit == TRUE) {
fit_poisson_gamma <- sampling(
poisson_gamma_model,
data = data_list,
chains = 6,
iter = 2000,
seed = 200350623,
cores = 6,
verbose = TRUE
)
saveRDS(fit_poisson_gamma, "rds/fittedPGmodel.rds")
} else {
fit_poisson_gamma <- readRDS("rds/fittedPGmodel.rds")
}
```
This function extracts the required samples from the MCMC sampler.
```{r Helper Function For Extraction of Posterior Parameters}
#' Extract the required mean/dispersion parameters for both candidate frequency
#' and severity distributions.
#'
#' @param fitted_model A Stan fitted model (the result of rstan::sampling).
#'   This model MUST have named parameters beta_frequency and beta_severity,
#'   and can optionally have parameters dispersion_frequency and
#'   dispersion_severity.
#' @param exposures A vector of length N containing the exposures
#' @param X The design matrix of rating variables
#'
#' @return A list containing matrices/vectors of posterior samples of the
#'   required probability distributions. Both frequency and severity parameter
#'   samples are returned on the predictor (not linked) scale.
#' @export
#'
#' @examples
#' \dontrun{
#' extract_and_get_posterior_samples(
#'   my_stan_fitted,
#'   vehicle_exposure_years,
#'   X_design
#' )
#' }
extract_and_get_posterior_samples <- function(fitted_model, exposures, X) {
  samples <- rstan::extract(fitted_model)
  exposure <- matrix(
    log(exposures),
    nrow = nrow(samples$beta_frequency),
    ncol = length(exposures),
    byrow = TRUE
  )
  if (is.null(samples$dispersion_frequency)) {
    dispersion_frequency_samples <- NULL
  } else {
    dispersion_frequency_samples <- samples$dispersion_frequency
  }
  if (is.null(samples$dispersion_severity)) {
    dispersion_severity_samples <- NULL
  } else {
    dispersion_severity_samples <- samples$dispersion_severity
  }
  return(list(
    frequency_samples = samples$beta_frequency %*% t(as.matrix.data.frame(X)) +
      exposure,
    severity_samples = samples$beta_severity %*% t(as.matrix.data.frame(X)),
    dispersion_frequency_samples = dispersion_frequency_samples,
    dispersion_severity_samples = dispersion_severity_samples,
    exposures = exposure
  ))
}
```
We now extract the posterior samples.
```{r Extract Posterior Samples}
# Extract the samples for the Poisson-Gamma model.
pg_posterior_samples <- extract_and_get_posterior_samples(
fitted_model = fit_poisson_gamma,
exposures = data_with_severity$vehicle_exposure_years,
X = X
)
# Apply the inverse link functions (exponentiate the linear predictors).
pg_posterior_samples$frequency_samples <- exp(pg_posterior_samples$frequency_samples)
pg_posterior_samples$severity_samples <- exp(pg_posterior_samples$severity_samples)
```
## Posterior Predictive Checks
Does the underlying generative model fit the dataset well?
```{r Posterior Predictive Check function, echo = FALSE}
#' Plots posterior predictive samples against the observed data.
#'
#' @param posterior_predictive_samples Matrix of posterior predictive samples.
#' @param y Vector of observations representing the actuals.
#' @param n Number of samples to plot. Must be less than the number of rows in
#'   posterior_predictive_samples.
#' @param variable Character string for graphing purposes.
#' @param limits Vector of lower and upper limits to plot (zooming in on the plot).
#'
#' @return None.
#' @export
#'
#' @examples
#' posterior_pred_check(
#'   my_posterior_samples, n = 1000, variable = "claims"
#' )
posterior_pred_check <- function(posterior_predictive_samples, y, n, variable,
                                 limits) {
  data_tibble <- as_tibble(posterior_predictive_samples[1:n, ]) %>%
    mutate(iter = 1:nrow(.)) %>%
    gather(key = "obs", value = "var", -iter) %>%
    select(-obs)
  ggplot(data_tibble, aes(x = var)) +
    geom_density(aes(group = as.factor(iter)), color = alpha("blue", 0.7)) +
    theme_bw() +
    scale_x_continuous(trans = "sqrt") +
    geom_density(
      data = tibble(actual = y),
      aes(x = actual),
      size = 1,
      alpha = 0.9
    ) +
    coord_cartesian(xlim = limits, expand = FALSE) +
    labs(x = variable)
}
```
### Posterior Predictive Check - Frequency
First, we generate samples from the posterior predictive distribution. Then, we perform posterior predictive checks for the frequency, severity, and pure premium distributions (i.e. compare the generated data against the observed data).
```{r Generate Samples}
# rpois is vectorized: treat the matrix of posterior means as one long vector,
# draw one Poisson count per element, and then restore the original matrix
# dimensions.
set.seed(200350623)
claim_count_posterior_pred <- rpois(
n = length(pg_posterior_samples$frequency_samples),
lambda = pg_posterior_samples$frequency_samples
)
dim(claim_count_posterior_pred) <- dim(pg_posterior_samples$frequency_samples)
```
```{r Posterior Predictive Check - Frequency, fig.align="center", fig.height=6, fig.width=10}
posterior_pred_check(
claim_count_posterior_pred,
y = data$claim_count,
n = 50,
"claims (sqrt scaled)",
limits = c(0, 700)
)
```
The blue curves represent 50 simulated datasets generated from the
posterior predictive. The black curve is the actual, observed dataset
(in this case, claim counts). Ideally, we want the black curve to fall
**well** within the blue region as this implies that the actual,
observed dataset is a likely realization under our generative model.
The posterior predictive graphs are all zoomed in on the region where
the density is not trivially zero for easier inspection.
Overall, we can see that the frequency model looks decent at most of the
claim count ranges (especially the lower claim counts), but the rating
combinations with larger claim counts seem to be more variable than what
the model can explain.
That is, the distribution of claim counts exhibits possible
overdispersion. Thus, a possible improvement to the model might be to
use the negative binomial distribution rather than the Poisson.
### Posterior Predictive Check - Severity
Similarly, we get posterior predictive draws of severities for each
combination of rating variables.
```{r}
#' Computes the posterior parameters for the weighted Gamma distribution.
#'
#' @param posterior_samples Posterior parameter samples from the Stan model.
#' @param posterior_predictive_claim_counts Posterior predictive claim counts.
#'   Must be a matrix.
#'
#' @return A list with posterior shape and rate parameters.
#' @export
#'
#' @examples
#' sample_weighted_gamma(posterior_samples, claim_counts_nb)
sample_weighted_gamma <- function(
  posterior_samples,
  posterior_predictive_claim_counts) {
  dispersion_severity_posterior <- matrix(
    posterior_samples$dispersion_severity_samples,
    nrow = nrow(posterior_samples$frequency_samples),
    ncol = ncol(posterior_samples$frequency_samples)
  )
  shapes_posterior <- posterior_predictive_claim_counts /
    dispersion_severity_posterior
  rates_posterior <- posterior_predictive_claim_counts /
    (posterior_samples$severity_samples * dispersion_severity_posterior)
  return(list(shapes = shapes_posterior, rates = rates_posterior))
}
posterior_sev_samples_pg <- sample_weighted_gamma(
pg_posterior_samples,
claim_count_posterior_pred
)
set.seed(200350623)
simulate_severity <- rgamma(
n = length(pg_posterior_samples$severity_samples),
shape = posterior_sev_samples_pg$shapes,
rate = posterior_sev_samples_pg$rates
)
dim(simulate_severity) <- dim(pg_posterior_samples$severity_samples)
```
```{r, fig.align="center", fig.height=6, fig.width=10}
posterior_pred_check(
simulate_severity,
y = data_with_severity$severity,
n = 50,
variable = "severity (sqrt scaled)",
limits = c(0, 40000)
)
```
As we can see, the generative model looks off here. The
model underfits the most important areas of largest density (the large
spike) and fails to account for uncertainty at higher and lower levels
of severity.
A major contributor to this is the unusual amount of density in the
31,000 region for severity. It turns out that this corresponds to a
number of rating combinations with one or two claims and total payments
in multiples of 31,442. This is likely an artifact that has been
inserted into the data, which is incredibly suspicious. However, these
rating combinations receive so little weight in the fit, due to the
weighting scheme employed, that the overall impact of these large
severity amounts should be mitigated.
Ultimately, however, the goal is the pure premium, since our relativities
are based on that quantity.
### Posterior Predictive Check - Pure Premium
```{r Posterior Predictive Check - PP}
simulate_pure_premium <- (
claim_count_posterior_pred / exp(pg_posterior_samples$exposures)
) * simulate_severity
```
```{r, fig.align="center", fig.height=6, fig.width=10}
posterior_pred_check(
simulate_pure_premium,
y = data_with_severity$observed_pp,
n = 50,
variable = "pure premium (sqrt scaled)",
limits = c(0, 4000)
)
```
The "peaks" in the distribution are not as extreme as what is observed
in the data. The black curve does not fall in the blue region in
important areas of the distribution, namely the single peak. Thus, under
the generative model the observed data is simply not probable in some
regions of pure premium, which is an issue.
Otherwise, the generative model at least has the right shape, for the
most part.
```{r remove objects, include = FALSE}
rm(
claim_count_posterior_pred, fit_poisson_gamma, pg_posterior_samples,
posterior_sev_samples_pg, simulate_pure_premium, simulate_severity
)
```
## Model Revision - Switch Frequency to Negative Binomial
We noticed in the previous section that the Poisson distribution does
not account for as much variance as we would like. Indeed, the variance
of the Poisson distribution is equal to its mean.
A negative binomial has variance greater than its mean and therefore
could address issues where uncertainty is understated for particular
rating variable combinations with large claim counts.
The generative model is quite similar to the previous model, but we
switch the frequency distribution:
$$X_{i} \big|N_{i}, \phi, \eta_{i} \sim Gamma\left(\frac{N_{i}}{\phi}, \frac{N_{i}}{\eta_{i} \times \phi}\right),$$
$$\eta_{i} \big| \beta, R_{i} = exp(R_{i} \times \beta),$$
$$N_{i} \big|\xi_{i}, \nu \sim \text{Negative Binomial}(\xi_{i}, \nu),$$
$$\xi_{i} \big|\psi, R_{i} = exp(R_{i} \times \psi + log(\text{exposure}_{i})),$$
$$\beta \sim Normal(0, 1),$$
$$\psi \sim Normal(0, 1),$$
$$\phi \sim \text{Half-Cauchy}(0, 3),$$
$$\nu \sim \text{Half-Cauchy}(0, 10).$$ Note that the Negative Binomial
uses the mean/dispersion parameterization in Stan (i.e.
`neg_binomial_2`). The half-Cauchy prior on $\nu$ is set to be very wide, since tighter priors (which concentrate $\nu$ near zero) heavily favor models with significant amounts of overdispersion, leading to sampling issues if the data does not exhibit strong overdispersion.
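Under this parameterization, the mean and variance of the claim counts are

$$E[N_{i}] = \xi_{i}, \qquad Var[N_{i}] = \xi_{i} + \frac{\xi_{i}^2}{\nu},$$

so small values of $\nu$ correspond to strong overdispersion, while $\nu \to \infty$ recovers the Poisson.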
Compiling the model above in Stan:
```{stan output.var = "nb_gamma"}
data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  matrix[Nobs, Nvar] X;
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector<lower=0>[Nobs] severity;
}
parameters {
  vector[Nvar] beta_frequency;
  vector[Nvar] beta_severity;
  real<lower=0> dispersion_severity;
  real<lower=0> dispersion_frequency;
}
model {
  beta_frequency ~ std_normal();
  beta_severity ~ std_normal();
  dispersion_severity ~ cauchy(0, 3);
  /* Larger variance here for the negative binomial because the data did not
     appear to be that overdispersed and we need to allow for larger
     dispersion parameters to consider these cases. */
  dispersion_frequency ~ cauchy(0, 10);
  claim_count ~ neg_binomial_2_log(
    X * beta_frequency + log(exposure), dispersion_frequency
  );
  for (i in 1:Nobs) {
    if (claim_count[i] != 0) {
      severity[i] ~ gamma(
        claim_count[i] / dispersion_severity,
        claim_count[i] / (exp(X[i, ] * beta_severity) * dispersion_severity)
      );
    }
  }
}
```
Now we sample from the posterior distribution as required:
```{r Sample from Posterior - NB/Gamma}
if (refit == TRUE) {
fit_nb_gamma <- sampling(
nb_gamma,
data = data_list,
chains = 6,
iter = 2000,
seed = 200350623,
cores = 6,
verbose = TRUE
)
saveRDS(fit_nb_gamma, "rds/nb-weighted_gamma.rds")
} else {
fit_nb_gamma <- readRDS("rds/nb-weighted_gamma.rds")
}
nb_posterior_samples <- extract_and_get_posterior_samples(
fitted_model = fit_nb_gamma,
exposures = data_with_severity$vehicle_exposure_years,
X = X
)
nb_posterior_samples$frequency_samples <- exp(nb_posterior_samples$frequency_samples)
nb_posterior_samples$severity_samples <- exp(nb_posterior_samples$severity_samples)
dispersion_frequency_nb <- matrix(
nb_posterior_samples$dispersion_frequency_samples,
nrow = length(nb_posterior_samples$dispersion_frequency_samples),
ncol = nrow(data)
)
```
## Posterior Predictive Check - Revision I
We again simulate observations from the posterior predictive.
```{r}
# Frequency
set.seed(200350623)
claim_count_posterior_pred_nb <- rnbinom(
n = length(nb_posterior_samples$frequency_samples),
mu = nb_posterior_samples$frequency_samples,
size = dispersion_frequency_nb
)
dim(claim_count_posterior_pred_nb) <- dim(dispersion_frequency_nb)
parameters_nb <- sample_weighted_gamma(
posterior_samples = nb_posterior_samples,
posterior_predictive_claim_counts = claim_count_posterior_pred_nb
)
# Severity
set.seed(200350623)
simulate_severity_nb <- rgamma(
n = length(parameters_nb$shapes),
shape = parameters_nb$shapes,
rate = parameters_nb$rates
)
dim(simulate_severity_nb) <- dim(nb_posterior_samples$severity_samples)
# Pure Premium
simulate_pure_premium_nb <- (
claim_count_posterior_pred_nb / exp(nb_posterior_samples$exposures)
) * simulate_severity_nb
```
Next, complete the posterior predictive checks as before:
```{r}
results <- mapply(
posterior_pred_check,
posterior_predictive_samples = list(
claim_count_posterior_pred_nb,
simulate_severity_nb,
simulate_pure_premium_nb
),
y = list(
data_with_severity$claim_count,
data_with_severity$severity,
data_with_severity$observed_pp
),
variable = c(
"claims (sqrt scaled)",
"severity (sqrt scaled)",
"pure premium (sqrt scaled)"
),
limits = list(c(0, 700), c(0, 40000), c(0, 4000)),
MoreArgs = list(n = 50),
SIMPLIFY = FALSE
)
```
Finally:
```{r, fig.align="center", fig.height=6, fig.width=10}
(results[[1]] | results[[2]]) / results[[3]] + plot_annotation(
title = "Posterior Predictive Check - Negative Binomial Frequency"
)
```
Overall, the frequency distribution (claims) looks to be better modeled
with a Negative Binomial, albeit marginally.
The severity distribution still looks off, and as a result, the pure
premium looks only marginally (if that) better than before.
```{r include = FALSE}
rm(
claim_count_posterior_pred_nb,
nb_posterior_samples,
parameters_nb,
simulate_severity_nb
)
gc()
```
## Model Revision - Switch Severity to a Heavier-Tailed Distribution
It looks like the current generative model gives too much density to
small severity amounts, but also too much density to larger amounts.
Therefore, if we switch to a heavier-tailed distribution we might make the
tails even heavier (and they are already too heavy), even though we may
improve the fit for smaller severity amounts.
Regardless, we can still explore these distributions and see how they
compare.
Many distributions have heavier tails than the gamma.
### Log-normal
The log-normal distribution is known to have heavier tails than the
gamma distribution. This can be shown by calculating the limit of the ratio of survival
functions.
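As a brief sketch of that argument (using L'Hôpital's rule, since both survival functions tend to 0, the limit of the ratio of survival functions equals the limit of the ratio of densities):

$$\lim_{x \to \infty} \frac{S_{\text{LN}}(x)}{S_{\text{Gamma}}(x)} = \lim_{x \to \infty} \frac{f_{\text{LN}}(x)}{f_{\text{Gamma}}(x)} = \lim_{x \to \infty} \frac{\frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)}{\frac{b^{a}}{\Gamma(a)} x^{a - 1} e^{-bx}} = \infty,$$

since $e^{bx}$ eventually dominates any term of the form $x^{a} e^{c (\ln x)^2}$.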
Stan actually has a log-normal distribution built in. However, it is
difficult to work with the log-normal distribution when linking the
expected value of the log-normal with predictors (this is because the
log-normal does not belong to the exponential dispersion family).
Instead of using the log-normal distribution directly, we can log the
severity random variable to get a Normal($\mu$, $\sigma$) random
variable (assuming that the severity random variable is log-normal). We
can then simulate directly from the log-normal distribution using the
posterior parameters to get the posterior predictive of the original
(assumed) log-normal severity random variable.
```{r}
data_list$severity <- ifelse(
data_with_severity$claim_count != 0,
log(data_with_severity$severity),
0
)
```
The generative model is exactly the same as before, except we now
replace $X_{i} \big | N_{i}, \phi, \eta_{i}$ with a log-normal whose
(log-scale) parameters are $R_{i}\beta$ and $\frac{\sigma}{\sqrt{N_{i}}}$.
```{stan output.var="sm_nb_lognormal"}
data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;