-
-
Notifications
You must be signed in to change notification settings - Fork 39
/
Copy path11-Ceteris-paribus-Oscillations.Rmd
executable file
·246 lines (182 loc) · 21.4 KB
/
11-Ceteris-paribus-Oscillations.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
# Ceteris-paribus Oscillations {#ceterisParibusOscillations}
```{r, echo=FALSE, warning=FALSE}
source("code_snippets/ema_init.R")
```
## Introduction {#CPOscIntro}
Visual examination of ceteris-paribus (CP) profiles, as illustrated in the previous chapter, is insightful. However, in case of a model with a large number of explanatory variables, we may end up with a large number of plots that may be overwhelming. In such a situation, it might be useful to select the most interesting or important profiles. In this chapter, we describe a measure that can be used for such a purpose and that is directly linked to CP profiles. It can be seen as an instance-level variable-importance measure alternative to the measures discussed in Chapters \@ref(breakDown)--\@ref(LIME).
## Intuition {#CPOscIntuition}
To assign importance to CP profiles, we can use the concept of profile oscillations. It is worth noting that the larger influence of an explanatory variable on prediction for a particular instance, the larger the fluctuations of the corresponding CP profile. For a variable that exercises little or no influence on a model's prediction, the profile will be flat or will barely change. In other words, the values of the CP profile should be close to the value of the model's prediction for a particular instance. Consequently, the sum of differences between the profile and the value of the prediction, taken across all possible values of the explanatory variable, should be close to zero. The sum can be graphically depicted by the area between the profile and the horizontal line representing the value of the single-instance prediction. On the other hand, for an explanatory variable with a large influence on the prediction, the area should be large. Figure \@ref(fig:CPVIPprofiles) illustrates the concept based on CP profiles presented in Figure \@ref(fig:profileV4Rf). The larger the highlighted area in Figure \@ref(fig:CPVIPprofiles), the more important is the variable for the particular prediction.
(ref:CPVIPprofilesDesc) The value of the coloured area summarizes the oscillations of a ceteris-paribus (CP) profile and provides the mean of the absolute deviations between the CP profile and the single-instance prediction. The CP profiles are constructed for the `titanic_rf` random forest model for the Titanic data and passenger Henry.
```{r CPVIPprofiles, echo=FALSE, fig.cap='(ref:CPVIPprofilesDesc)', out.width = '99%', fig.align='center'}
knitr::include_graphics("figure/profile_v4_rf2.png")
```
## Method {#CPOscMethod}
Let us formalize this concept now. Denote by $g^j(z)$ the probability density function of the distribution of the $j$-th explanatory variable. The summary measure of the variable's importance for model $f()$'s prediction at $\underline{x}_*$, $vip_{CP}^{j}(\underline{x}_*)$, is defined as follows:
\begin{equation}
vip_{CP}^j(\underline{x}_*) = \int_{\mathcal R} |h^{j}_{\underline{x}_*}(z) - f(\underline{x}_*)| g^j(z)dz=E_{X^j}\left\{|h^{j}_{\underline{x}_*}(X^j) - f(\underline{x}_*)|\right\}.
(\#eq:VIPCPdef)
\end{equation}
Thus, $vip_{CP}^j(\underline{x}_*)$ is the expected absolute deviation of the CP profile $h^{j}_{\underline{x}_*}()$, defined in \@ref(eq:CPPdef), from the model's prediction at $\underline{x}_*$, computed over the distribution $g^j(z)$ of the $j$-th explanatory variable.
The true distribution of $j$-th explanatory variable is, in most cases, unknown. There are several possible approaches to construct an estimator of \@ref(eq:VIPCPdef).
One is to calculate the area under the CP curve, i.e., to assume that $g^j(z)$ is a uniform distribution over the range of variable $X^j$. It follows that a straightforward estimator of $vip_{CP}^{j}(\underline{x}_*)$ is
\begin{equation}
\widehat{vip}_{CP}^{j,uni}(\underline{x}_*) = \frac 1k \sum_{l=1}^k |h^{j}_{x_*}(z_l) - f(\underline{x}_*)|,
(\#eq:VIPCPuni)
\end{equation}
where $z_l$ ($l=1, \ldots, k$) are selected values of the $j$-th explanatory variable. For instance, one can consider all unique values of $X^{j}$ in a dataset. Alternatively, for a continuous variable, one can use an equidistant grid of values.
Another approach is to use the empirical distribution of $X^{j}$. This leads to the estimator defined as follows:
\begin{equation}
\widehat{vip}_{CP}^{j,emp}(\underline{x}_*) = \frac 1n \sum_{i=1}^n |h^{j}_{\underline{x}_*}(x^{j}_i) - f(\underline{x}_*)|,
(\#eq:VIPCPemp)
\end{equation}
where index $i$ runs through all observations in a dataset.
The use of $\widehat{vip}_{CP}^{j,emp}(\underline{x}_*)$ is preferred when there are enough data to accurately estimate the empirical distribution and when the distribution is not uniform. On the other hand, $\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)$ is in most cases quicker to compute and, therefore, it is preferred if we look for fast approximations.
Note that the local evaluation of the variables' importance can be very different from the global evaluation. This is well illustrated by the following example. Consider the model
$$
f(x^1, x^2) = x^1 * x^2,
$$
where variables $X^1$ and $X^2$ take values in $[0,1]$. Furthermore, consider prediction for an observation described by vector $\underline{x}_* = (0,1)$. In that case, the importance of $X^1$ is larger than $X^2$. This is because the CP profile $h^1_{x_*}(z) = z$, while $h^2_{x_*}(z) = 0$. Thus, there are oscillations for the first variable, but no oscillations for the second one. Hence, at $\underline{x}_* = (0,1)$, the first variable is more important than the second. Globally, however, both variables are equally important, because the model is symmetrical.
<!-- for the first variable, given by the values of function $f(z,1)=z$, will have oscillations. On the other hand, the profile for the second variable will show no oscillations, because the profile is given by function $f(0,z)=0$. Obviously, the situation is reversed for $\underline{x}_*=(1,0)$.
It is worth noting that the importance of an explanatory variable for instance prediction may be very different for different values of $\underline{x}_*$. For example, c -->
## Example: Titanic data {#CPOscExample}
Figure \@ref(fig:CPVIP1) shows bar plots summarizing the size of oscillations for explanatory variables for the random forest model `titanic_rf` (see Section \@ref(model-titanic-rf)) for Henry, a 47-year-old man who travelled in the first class (see Section \@ref(predictions-titanic)). The longer the bar, the larger the CP-profile oscillations for the particular explanatory variable. The left-hand-side panel presents the variable-importance measures computed by applying estimator $\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)$, given in \@ref(eq:VIPCPuni), to an equidistant grid of values. The right-hand-side panel shows the results obtained by applying estimator $\widehat{vip}_{CP}^{j,emp}(\underline{x}_*)$, given in \@ref(eq:VIPCPemp), with an empirical distribution for explanatory variables.
The plots presented in Figure \@ref(fig:CPVIP1) indicate that both estimators consistently suggest that the most important variables for the model's prediction for Henry are *gender* and *age*, followed by *class*. However, a remarkable difference can be observed for the *sibsp* variable, which gains in relative importance for estimator $\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)$. In this respect, it is worth recalling that this variable has a very skewed distribution (see Figure \@ref(fig:titanicExplorationParch)). In particular, a significant mass of the distribution is concentrated at zero, but there have been a few high values observed for the variable. As a result, the of empirical density is very different from a uniform distribution. Hence the difference in the relative importance noted in Figure \@ref(fig:CPVIP1).
It is worth noting that, while the variable-importance plot in Figure \@ref(fig:CPVIP1) does indicate which explanatory variables are important, it does not describe how do the variables influence the prediction. In that respect, the CP profile for *age* for Henry (see Figure \@ref(fig:profileV4Rf)) suggested that, if Henry were older, this would significantly lower his probability of survival. One the other hand, the CP profile for *sibsp* (see Figure \@ref(fig:profileV4Rf)) indicated that, were Henry not travelling alone, this would increase his chances of survival. Thus, the variable-importance plots should always be accompanied by plots of the relevant CP profiles.
(ref:CPVIP1Desc) Variable-importance measures based on ceteris-paribus oscillations estimated by using (left-hand-side panel) a uniform grid of explanatory-variable values and (right-hand-side panel) empirical distribution of explanatory-variables for the random forest model and passenger Henry for the Titanic data.
```{r CPVIP1, warning=FALSE, message=FALSE, echo=FALSE, fig.cap='(ref:CPVIP1Desc)', fig.width=9, fig.height=4, out.width = '100%', fig.align='center'}
library("ggplot2")
library("patchwork")
library("randomForest")
library("DALEX")
explain_rf <- DALEX::explain(model = titanic_rf,
data = titanic_imputed[, -9],
y = titanic_imputed$survived == "yes",
label = "Random Forest",
verbose = FALSE)
oscillations_equi <- predict_parts(explain_rf, henry, type = "oscillations_uni")
oscillations_emp <- predict_parts(explain_rf, henry, type = "oscillations_emp", N = 1000)
oscillations_equi$`_ids_` <- "Henry"
oscillations_emp$`_ids_` <- "Henry"
pl1 <- plot(oscillations_equi) +
ggtitle("CP Oscillations for uniform distribution", "") + theme_ema
pl2 <- plot(oscillations_emp) +
ggtitle("CP Oscillations for empirical distribution", "") + theme_ema
pl1 + pl2
# oscillations_equi <- variable_attribution(explain_rf, henry,
# variable_splits = list(age = seq(0, 65, 0.1),
# fare = seq(0, 200, 0.1),
# sibsp = seq(0, 8, 0.1),
# parch = seq(0, 8, 0.1),
# gender = unique(titanic_imputed$gender),
# embarked = unique(titanic_imputed$embarked),
# class = unique(titanic_imputed$class)),
# type = "oscillations")
# oscillations_equi$`_ids_` <- "Henry"
# plot(oscillations_equi) +
# ggtitle("Ceteris-paribus Oscillations",
# "Expectation over uniform distribution (equidistant grid)")
#knitr::include_graphics("figure/oscillations_all_rf_plot.png")
```
## Pros and cons {#CPOscProsCons}
Oscillations of CP profiles are easy to interpret and understand. By using the average of oscillations, it is possible to select the most important variables for an instance prediction. This method can easily be extended to two or more variables. <!---In such cases, one needs to integrate the equation \@ref(eq:VIPCPdef) over a larger number of variables.[TOMASZ: PROBLEMATIC STATEMENT; THE EQUATION IS FOR A SINGLE VARIABLE.]--->
There are several issues related to the use of the CP oscillations, though. For example, the oscillations may not be of help in situations when the use of CP profiles may itself be problematic (e.g., in the case of correlated explanatory variables or interactions -- see Section \@ref(CPProsCons)). An important issue is that the CP-based variable-importance measures \@ref(eq:VIPCPdef) do not fulfil the local accuracy condition (see Section \@ref(SHAPMethod)), i.e., they do not sum up to the instance prediction for which they are calculated, unlike the break-down attributions (see Chapter \@ref(breakDown)) or Shapley values (see Chapter \@ref(shapley)).
## Code snippets for R {#CPOscR}
In this section, we present analysis of CP-profile oscillations as implemented in the `DALEX` package for R. For illustration, we use the random forest model `titanic_rf` (Section \@ref(model-titanic-rf)). The model was developed to predict the probability of survival after the sinking of the Titanic. Instance-level explanations are calculated for Henry, a 47-year-old male passenger that travelled in the first class (see Section \@ref(predictions-titanic)).
We first retrieve the `titanic_rf` model-object and the data frame for Henry via the `archivist` hooks, as listed in Section \@ref(ListOfModelsTitanic). We also retrieve the version of the `titanic` data with imputed missing values.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
titanic_imputed <- archivist::aread("pbiecek/models/27e5c")
titanic_rf <- archivist:: aread("pbiecek/models/4e0fc")
(henry <- archivist::aread("pbiecek/models/a6538"))
```
```
class gender age sibsp parch fare embarked
1 1st male 47 0 0 25 Cherbourg
```
Then we construct the explainer for the model by using the function `explain()` from the `DALEX` package (see Section \@ref(ExplainersTitanicRCode)). We also load the `randomForest` package, as the model was fitted by using function `randomForest()` from this package (see Section \@ref(model-titanic-rf)) and it is important to have the corresponding `predict()` function available. The model's prediction for Henry is obtained with the help of that function.
```{r, warning=FALSE, message=FALSE, eval = FALSE}
library("randomForest")
library("DALEX")
explain_rf <- DALEX::explain(model = titanic_rf,
data = titanic_imputed[, -9],
y = titanic_imputed$survived == "yes",
label = "Random Forest")
predict(explain_rf, henry)
```
```
[1] 0.246
```
### Basic use of the `predict_parts()` function
To calculate CP-profile oscillations, we use the `predict_parts()` function, already introduced in Section \@ref(BDR). In particular, to use estimator $\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)$, defined in \@ref(eq:VIPCPuni), we specify argument `type="oscillations_uni"`, whereas for estimator $\widehat{vip}_{CP}^{j,emp}(\underline{x}_*)$, defined in \@ref(eq:VIPCPemp), we specify argument `type="oscillations_emp"`. By default, oscillations are calculated for all explanatory variables. To perform calcualtions only for a subset of variables, one can use the `variables` argument.
In the code below, we apply the function to the explainer-object for the random forest model `titanic_rf` and the data frame for the instance of interest, i.e., `henry`. Additionally, we specify the `type="oscillations_uni"` argument to indicate that we want to compute CP-profile oscillations and the estimated value of the variable-importance measure as defined in \@ref(eq:VIPCPuni). <!--- Note that, one nesd to specify `type="oscillations_emp"` to get oscillations estimated with empirical density as defined in Equation \@ref(eq:VIPCPuni).--->
```{r titanicCeterisProfile02C, warning=FALSE, message=FALSE}
oscillations_uniform <- predict_parts(explainer = explain_rf,
new_observation = henry,
type = "oscillations_uni")
oscillations_uniform
```
The resulting object is of class `ceteris_paribus_oscillations`, which is a data frame with three variables: `_vname_`, `_ids_`, and `oscillations` that provide, respectively, the name of the variable, the value of the identifier of the instance, and the estimated value of the variable-importance measure. Additionally, the object has also got an overloaded `plot()` function. We can use the latter function to plot the estimated values of the variable-importance measure for the instance of interest. In the code below, before creating the plot, we make the identifier for Henry more explicit. The resulting graph is shown in Figure \@ref(fig:CPoscDefForHenry).
(ref:CPoscDefForHenryDesc) Variable-importance measures based on ceteris-paribus oscillations estimated by the `oscillations_uni` method of the `predict_parts()` function for the random forest model and passenger Henry for the Titanic data.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
oscillations_uniform$`_ids_` <- "Henry"
plot(oscillations_uniform) +
ggtitle("Ceteris-paribus Oscillations",
"Expectation over uniform distribution (unique values)")
```
```{r CPoscDefForHenry, warning=FALSE, message=FALSE, fig.width=6, echo=FALSE, fig.height=4, fig.cap='(ref:CPoscDefForHenryDesc)', out.width = '70%', fig.align='center'}
oscillations_uniform$`_ids_` <- "Henry"
plot(oscillations_uniform) +
ggtitle("Ceteris-paribus Oscillations",
"Expectation over uniform distribution (unique values)") + theme_ema
```
### Advanced use of the `predict_parts()` function
As mentioned in the previous section, the `predict_parts()` function with argument `type = "oscillations_uni"` computes estimator $\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)$, defined in \@ref(eq:VIPCPuni), while for argument `type="oscillations_emp"` it provides estimator $\widehat{vip}_{CP}^{j,emp}(\underline{x}_*)$, defined in \@ref(eq:VIPCPemp). However, one could also consider applying estimator $\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)$ but using a pre-defined grid of values for a continuous explanatory variable. Toward this aim, we can use the `variable_splits` argument to explicitly specify values for the density estimation. Its application is illustrated in the code below for variables *age* and *fare*. Note that, in this case, we use argument `type = "oscillations"`. It is also worth noting that the use of the `variable_splits` argument limits the computations to the variables specified in the argument.
```{r titanicCeterisProfile02F, warning=FALSE, message=FALSE, eval=TRUE}
oscillations_equidist <- predict_parts(explain_rf, henry,
variable_splits = list(age = seq(0, 65, 0.1),
fare = seq(0, 200, 0.1),
gender = unique(titanic_imputed$gender),
class = unique(titanic_imputed$class)),
type = "oscillations")
oscillations_equidist
```
<!-- The obtained estimates of the variable-importance measure are slightly different from the values obtained by using all unique values of the explanatory variables. In particular, the grid-based estimates are now almost equal for the *sibsp* and *age* variables. A more substantial change, from 0.054 to 0.104, can be seen for *fare*. However, the ordering of the variables is the same as in the case of the estimates obtained by using all unique values of the continuous explanatory variables. -->
Subsequently, we can use the `plot()` function to construct a bar plot of the estimated values. In the code below, before creating the plot, we make the identifier for Henry more explicit. The resulting graph is shown in Figure \@ref(fig:CPoscGridForHenry).
(ref:CPoscGridForHenryDesc) Variable-importance measures based on ceteris-paribus oscillations estimated by using a specified grid of points for the random forest model and passenger Henry for the Titanic data.
```{r, warning=FALSE, message=FALSE, eval = FALSE}
oscillations_equidist$`_ids_` <- "Henry"
plot(oscillations_equidist) +
ggtitle("Ceteris-paribus Oscillations",
"Expectation over specified grid of points")
```
```{r CPoscGridForHenry, warning=FALSE, message=FALSE, eval = TRUE, fig.width=6, fig.height=3, fig.cap='(ref:CPoscGridForHenryDesc)', out.width = '70%', fig.align='center', echo=FALSE}
oscillations_equidist$`_ids_` <- "Henry"
plot(oscillations_equidist) +
ggtitle("Ceteris-paribus Oscillations",
"Expectation over specified grid of points") + theme_ema
```
<!--
Another approach is to calculate the expectation \@ref(eq:VIPCPdef) over the empirical distribution of an explanatory variable from the data used to fit the model, i.e., to use $\widehat{vip}_{CP}^{j,emp}(x_*)$, given in \@ref(eq:VIPCPemp). Toward this aim, we apply the `variable_splits` argument to explicitly use values from the data used to fit the model.
```{r titanicCeterisProfile02H, warning=FALSE, message=FALSE}
titanic <- na.omit(titanic)
oscillations_empirical <- variable_attribution(explain_rf, henry,
variable_splits = list(age = titanic_imputed$age,
fare = titanic_imputed$fare,
sibsp = titanic_imputed$sibsp,
parch = titanic_imputed$parch,
gender = titanic_imputed$gender,
embarked = titanic_imputed$embarked,
class = titanic_imputed$class),
type = "oscillations")
oscillations_empirical
```
The obtained estimates of the variable-importance measure for *gender* and *sibsp* are now markedly different from the values obtained by using all unique values of the explanatory variables. The changes result in a different ordering of the variables. By using the `plot()` function, we create a barplot of the estimated variable-importance measures. The resulting graph is shown in Figure \@ref(fig:CPoscEmpForHenry).
(ref:CPoscEmpForHenryDesc) Variable-importance measures based on ceteris-paribus oscillations estimated by using an empirical distribution of explanatory-variable values in the `variable_attribution()` function for the `titanic_rf` model and passenger Henry for the Titanic data.
```{r CPoscEmpForHenry, warning=FALSE, message=FALSE, eval = TRUE, fig.width=6, fig.height=4, fig.cap='(ref:CPoscEmpForHenryDesc)', out.width = '70%', fig.align='center'}
oscillations_empirical$`_ids_` <- "Henry"
plot(oscillations_empirical) +
ggtitle("Ceteris-paribus Oscillations",
"Expectation over empirical distribution")
```
-->
## Code snippets for Python {#CPOscPython}
At this point we are not aware about any Python libraries that would implement the methods presented in the current chapter.