---
title: Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width
author: |
  Andrey Akinshin

  JetBrains, andrey.akinshin@gmail.com
abstract: |
  Traditional quantile estimators that are based on one or two order statistics are a common way to estimate
  distribution quantiles based on the given samples.
  These estimators are robust, but their statistical efficiency is not always good enough.
  A more efficient alternative is the Harrell-Davis quantile estimator which uses
  a weighted sum of all order statistics.
  Whereas this approach provides more accurate estimations for the light-tailed distributions, it's not robust.
  To be able to customize the trade-off between statistical efficiency and robustness,
  we could consider *a trimmed modification of the Harrell-Davis quantile estimator*.
  In this approach, we discard order statistics with low weights according to
  the highest density interval of the beta distribution.

  **Keywords:** quantile estimation, robust statistics, Harrell-Davis quantile estimator.
author-meta: Andrey Akinshin
lang: en-US
bibliography: references.bib
biblio-style: alphabetic
biblatexoptions: [sorting=anyt]
link-citations: true
colorlinks: true
linkcolor: linkcolor
citecolor: citecolor
urlcolor: urlcolor
output:
  bookdown::pdf_document2:
    citation_package: biblatex
    keep_tex: true
    extra_dependencies: ["amsmath", "float", "csquotes"]
    toc: false
    number_sections: false
    includes:
      in_header: "preamble.tex"
      before_body: "before_body.tex"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.align = "center", fig.pos = "ht!")
library(kableExtra)
source("utils.R", local = knitr::knit_global())
```
# Introduction
We consider the problem of quantile estimation for a given sample.
Let $x$ be a sample with $n$ elements: $x = \{ x_1, x_2, \ldots, x_n \}$.
We assume that all sample elements are sorted ($x_1 \leq x_2 \leq \ldots \leq x_n$) so that
we could treat the $i^\textrm{th}$ element $x_i$ as the $i^\textrm{th}$ order statistic $x_{(i)}$.
Based on the given sample, we want to build an estimation of the $p^\textrm{th}$ quantile $Q(p)$.
The traditional way to do this is to use a single order statistic
or a linear combination of two subsequent order statistics.
This approach could be implemented in various ways.
A classification of the most popular implementations can be found in [@hyndman1996],
where Rob J. Hyndman and Yanan Fan describe nine types of traditional quantile estimators
used in statistical computer packages.
The most popular approach in this taxonomy is Type 7, which is used by default in R, Julia, NumPy, and Excel:
$$
Q_{\operatorname{HF7}}(p) = x_{\lfloor h \rfloor}+(h-\lfloor h \rfloor)(x_{\lceil h \rceil}-x_{\lfloor h \rfloor}),
\quad h = (n-1)p+1.
$$
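A quick R sanity check of this formula against the built-in `quantile` function
(Type 7 is its default interpolation; the sample here is arbitrary illustration data):
```r
x <- c(1, 2, 4, 8, 16)                    # sorted sample, n = 5
p <- 0.3
h <- (length(x) - 1) * p + 1              # h = 2.2
x[floor(h)] + (h - floor(h)) * (x[ceiling(h)] - x[floor(h)])  # 2.4
quantile(x, p, type = 7, names = FALSE)                       # 2.4, same value
```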
Traditional quantile estimators have simple implementations and a good robustness level.
However, their statistical efficiency is not always good enough:
the obtained estimations could noticeably differ from the true distribution quantile values.
The gap between the estimated and true values can be decreased by increasing the number of order statistics involved.
In [@harrell1982], Frank E. Harrell and C. E. Davis suggest estimating quantiles using
a weighted sum of all order statistics:
$$
Q_{\operatorname{HD}}(p) = \sum_{i=1}^{n} W_{\operatorname{HD},i} \cdot x_i,\quad
W_{\operatorname{HD},i} = I_{i/n}(\alpha, \beta) - I_{(i-1)/n}(\alpha, \beta),
$$
where $I_x(\alpha, \beta)$ is the regularized incomplete beta function,
$\alpha = (n+1)p$, $\;\beta = (n+1)(1-p)$.
To get a better understanding of this approach,
we could look at the probability density function of the beta distribution $\operatorname{Beta}(\alpha, \beta)$ (see Figure \@ref(fig:beta)).
If we split the $[0;1]$ interval into $n$ segments of equal width,
we can define $W_{\operatorname{HD},i}$ as the area under curve in the $i^\textrm{th}$ segment.
Since $I_x(\alpha, \beta)$ is the cumulative distribution function of $\operatorname{Beta}(\alpha, \beta)$,
we can express $W_{\operatorname{HD},i}$ as $I_{i/n}(\alpha, \beta) - I_{(i-1)/n}(\alpha, \beta)$.
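A minimal R sketch of this estimator
(`pbeta` is exactly $I_x(\alpha, \beta)$, the $\operatorname{Beta}(\alpha, \beta)$ CDF;
the `Hmisc` package provides a production implementation):
```r
# Harrell-Davis estimate of the p-th quantile (a sketch).
hdquantile.sketch <- function(x, p) {
  n <- length(x)
  a <- (n + 1) * p
  b <- (n + 1) * (1 - p)
  # W_i = I_{i/n}(a, b) - I_{(i-1)/n}(a, b)
  W <- pbeta(1:n / n, a, b) - pbeta(0:(n - 1) / n, a, b)
  sum(sort(x) * W)
}
```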
```{r beta, fig.cap="Beta distribution B(5.5, 5.5) probability density function", fig.height=3.0}
draw.beta(F)
```
The Harrell-Davis quantile estimator is suggested in [@david2003], [@grissom2005], [@wilcox2016], and [@gibbons2020]
as an efficient alternative to the traditional estimators.
In [@yoshizawa1985], the asymptotic equivalence of the traditional sample median and the Harrell-Davis median estimator is shown.
The Harrell-Davis quantile estimator shows decent statistical efficiency in the case of light-tailed distributions:
its estimations are much more accurate than estimations of the traditional quantile estimators.
However, the improved efficiency has a price: $Q_{\operatorname{HD}}$ is not robust.
Since the estimation is a weighted sum of all order statistics with positive weights,
a single corrupted element may spoil all the quantile estimations, including the median.
It may become a severe drawback in the case of heavy-tailed distributions,
where a few extremely large outliers are a typical situation.
In such cases, we use the median instead of the mean as a measure of central tendency
because of its robustness.
Indeed, if we estimate the median using the traditional quantile estimators like $Q_{\operatorname{HF7}}$,
its asymptotic breakdown point is 0.5.
Unfortunately, if we switch to $Q_{\operatorname{HD}}$, the breakdown point becomes zero
so that we completely lose the median robustness.
Another severe drawback of $Q_{\operatorname{HD}}$ is its computational complexity.
If we have a sorted array of numbers,
a traditional quantile estimation could be computed using $O(1)$ simple operations.
For an unsorted sample,
there are approaches that allow getting an order statistic
using $O(n)$ operations (e.g., see [@alexandrescu2017]).
If we estimate the quantiles using $Q_{\operatorname{HD}}$, we need $O(n)$ operations for a sorted array.
Moreover, these operations involve computation of $I_x(\alpha, \beta)$ values
which are pretty expensive from the computational point of view.
Alternatively, we could consider
the Sfakianakis-Verginis quantile estimator (see [@sfakianakis2008]) or
the Navruz-Özdemir quantile estimator (see [@navruz2020]) which
are also based on a weighted sum of all order statistics.
However, they also have the disadvantages listed above.
Neither $Q_{\operatorname{HF7}}$ nor $Q_{\operatorname{HD}}$ fits all kinds of problems.
$Q_{\operatorname{HF7}}$ is simple, robust, and computationally fast,
but its statistical efficiency doesn't always satisfy the business requirements.
$Q_{\operatorname{HD}}$ could provide better statistical efficiency,
but it's computationally slow and not robust.
To get a reasonable trade-off between $Q_{\operatorname{HF7}}$ and $Q_{\operatorname{HD}}$,
we consider a trimmed modification of the Harrell-Davis quantile estimator.
The core idea is simple:
we take the classic Harrell-Davis quantile estimator,
find the highest density interval of the underlying beta distribution,
discard all the order statistics outside the interval,
and calculate a weighted sum of the order statistics within the interval.
The obtained quantile estimation is more robust than $Q_{\operatorname{HD}}$ (because it doesn't use extreme values)
and typically more statistically efficient than $Q_{\operatorname{HF7}}$
(because it uses more than only two order statistics).
Let's discuss this approach in detail.
# The trimmed Harrell-Davis quantile estimator
The estimators based on one or two order statistics are not efficient enough because they use too few sample elements.
The estimators based on all order statistics are not robust enough because they use too many sample elements.
It looks reasonable to consider a quantile estimator based on a variable number of order statistics.
This number should be large enough to ensure decent statistical efficiency
but not too large to exclude possible extreme outliers.
A robust alternative to the mean is the trimmed mean.
The idea behind it is simple: we should discard some sample elements at both ends
and use only the middle order statistics.
With this approach, we can customize the trade-off between robustness and statistical efficiency
by controlling the number of the discarded elements.
If we apply the same idea to $Q_{\operatorname{HD}}$,
we can build a trimmed modification of the Harrell-Davis quantile estimator.
Let's denote it as $Q_{\operatorname{THD}}$.
In the case of the trimmed mean, we typically discard the same number of elements on each side.
We can't do the same for $Q_{\operatorname{THD}}$
because the array of order statistic weights $\{ W_{\operatorname{HD},i} \}$ is asymmetric.
It looks reasonable to drop the elements with the lowest weights and keep the elements with the highest weights.
Since the weights are assigned according to the beta distribution,
the range of order statistics with the highest weight concentration could be found
using the beta distribution highest density interval.
Thus, once we fix the proportion of dropped/kept elements,
we should find the highest density interval of the given width.
Let's denote the interval as $[L;R]$ where $R-L=D$.
The order statistics weights for $Q_{\operatorname{THD}}$ should be defined
using a part of the beta distribution within this interval.
It gives us the truncated beta distribution $\operatorname{TBeta}(\alpha, \beta, L, R)$ (see Figure \@ref(fig:tbeta)).
```{r tbeta, fig.cap="Truncated beta distribution B(5.5, 5.5) probability density function", fig.height=4.7}
draw.beta(T)
```
\clearpage
We know the CDF for $\operatorname{Beta}(\alpha, \beta)$ which is used in $Q_{\operatorname{HD}}$:
$F_{\operatorname{HD}}(x) = I_x(\alpha, \beta)$.
For $Q_{\operatorname{THD}}$,
we need the CDF for $\operatorname{TBeta}(\alpha, \beta, L, R)$ which could be easily found:
$$
F_{\operatorname{THD}}(x) = \begin{cases}
0 & \textrm{for }\, x < L,\\
\big( F_{\operatorname{HD}}(x) - F_{\operatorname{HD}}(L) \big) /
\big( F_{\operatorname{HD}}(R) - F_{\operatorname{HD}}(L) \big)
& \textrm{for }\, L \leq x \leq R,\\
1 & \textrm{for }\, R < x.
\end{cases}
$$
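In R, this piecewise CDF could look like the following sketch
(`a`, `b` are the beta distribution parameters; `L`, `R` are the interval borders):
```r
# CDF of TBeta(a, b, L, R): the Beta(a, b) CDF rescaled to [L; R].
ptbeta <- function(x, a, b, L, R) {
  x <- pmin(pmax(x, L), R)  # values below L map to 0, values above R to 1
  (pbeta(x, a, b) - pbeta(L, a, b)) / (pbeta(R, a, b) - pbeta(L, a, b))
}
```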
The final $Q_{\operatorname{THD}}$ equation has the same form as $Q_{\operatorname{HD}}$:
$$
Q_{\operatorname{THD}} = \sum_{i=1}^{n} W_{\operatorname{THD},i} \cdot x_i, \quad
W_{\operatorname{THD},i} = F_{\operatorname{THD}}(i / n) - F_{\operatorname{THD}}((i - 1) / n).
$$
There is only one thing left to do:
we should choose an appropriate width $D$ of the beta distribution highest density interval.
In practical applications, this value should be chosen based on the given problem:
researchers should *carefully* analyze business requirements,
describe desired robustness level via setting the breakdown point,
and come up with a $D$ value that satisfies the initial requirements.
However, if we have absolutely no information about the problem, the underlying distribution,
and the robustness requirements, we can use the following rule of thumb which gives the starting point:
$D=1/\sqrt{n}$.
We denote $Q_{\operatorname{THD}}$ with such a $D$ value as $Q_{\operatorname{THD-SQRT}}$.
In most cases, it gives an acceptable trade-off between the statistical efficiency and the robustness level.
Also, $Q_{\operatorname{THD-SQRT}}$ has a practically reasonable computational complexity:
$O(\sqrt{n})$ instead of $O(n)$ for $Q_{\operatorname{HD}}$.
For example, if $n=10\,000$, we have to process only 100 sample elements and calculate 101 values of $I_x(\alpha, \beta)$.
**An example**
Let's say we have the following sample:
$$
x = \{ -0.565, -0.106, -0.095, 0.363, 0.404, 0.633, 1.371, 1.512, 2.018, 100\,000 \}.
$$
Nine elements were randomly taken from the standard normal distribution $\mathcal{N}(0, 1)$.
The last element $x_{10}$ is an outlier.
The weight coefficients for $Q_{\operatorname{HD}}$ and $Q_{\operatorname{THD-SQRT}}$
are presented in Table \@ref(tab:t1).
Table: (\#tab:t1) Weight coefficients of $Q_{\operatorname{HD}}$ and $Q_{\operatorname{THD-SQRT}}$ for $n=10$
| $i$ | $x_i$ | $W_{\operatorname{HD},i}$ | $W_{\operatorname{THD-SQRT},i}$ |
|----:|---------------:|--------------------------:|--------------------------------:|
| 1 | -0.565 | 0.0005 | 0 |
| 2 | -0.106 | 0.0146 | 0 |
| 3 | -0.095 | 0.0727 | 0 |
| 4 | 0.363 | 0.1684 | 0.1554 |
| 5 | 0.404 | 0.2438 | 0.3446 |
| 6 | 0.633 | 0.2438 | 0.3446 |
| 7 | 1.371 | 0.1684 | 0.1554 |
| 8 | 1.512 | 0.0727 | 0 |
| 9 | 2.018 | 0.0146 | 0 |
| 10 | $100\,000.000$ | 0.0005 | 0 |
Here are the corresponding quantile estimations:
$$
Q_{\operatorname{HD}}(0.5) \approx 51.9169, \quad Q_{\operatorname{THD}}(0.5) \approx 0.6268.
$$
As we can see, $Q_{\operatorname{HD}}$ is heavily affected by the outlier $x_{10}$.
Meanwhile, $Q_{\operatorname{THD}}$ gives a reasonable median estimation
because it uses a weighted sum of four middle order statistics.
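This example can be reproduced with the reference implementation from the appendix
(and, for the untrimmed estimate, with `Hmisc::hdquantile`, see [@hmisc]):
```r
x <- c(-0.565, -0.106, -0.095, 0.363, 0.404, 0.633, 1.371, 1.512, 2.018, 100000)
thdquantile(x, 0.5)        # ~0.6268: only the four middle order statistics are used
Hmisc::hdquantile(x, 0.5)  # ~51.9169: the outlier spoils the estimation
```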
# Beta distribution highest density interval of the given width
In order to build the truncated beta distribution for $Q_{\operatorname{THD}}$,
we have to find the $\operatorname{Beta}(\alpha, \beta)$ highest density interval of the required width $D$.
Thus, for the given $\alpha, \beta, D$, we should provide an interval $[L;R]$:
$$
\operatorname{BetaHDI}(\alpha, \beta, D) = [L; R].
$$
Let's briefly discuss how to do this.
First of all, we should calculate the mode $M$ of $\operatorname{Beta}(\alpha, \beta)$
(non-degenerate cases are presented in Figure \@ref(fig:hdi)):
$$
M = \operatorname{Mode}_{\alpha, \beta} =
\begin{cases}
\{0, 1 \} \textrm{ or any value in } (0, 1) & \textrm{for }\, \alpha \leq 1,\, \beta \leq 1 \textit{ (Degenerate case),} \\
0 & \textrm{for }\, \alpha \leq 1,\, \beta > 1 \textit{ (Left border case),} \\
1 & \textrm{for }\, \alpha > 1,\, \beta \leq 1 \textit{ (Right border case),} \\
\frac{\alpha - 1}{\alpha + \beta - 2} & \textrm{for }\, \alpha > 1,\, \beta > 1 \textit{ (Middle case).}
\end{cases}
$$
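A direct R transcription of these cases (a sketch; in the degenerate case the mode is not unique, so `NA` is returned):
```r
betaMode <- function(a, b) {
  if (a <= 1 && b <= 1) return(NA) # Degenerate case: any value in [0; 1] works
  if (a <= 1 && b > 1) return(0)   # Left border case
  if (a > 1 && b <= 1) return(1)   # Right border case
  (a - 1) / (a + b - 2)            # Middle case
}
```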
```{r hdi, fig.cap="Highest density interval of the beta distribution based on the mode location"}
draw.hdi()
```
\clearpage
The actual value of $\operatorname{BetaHDI}(\alpha, \beta, D)$ depends on the specific case from the above list
which defines the mode location.
Three of these cases are easy to handle:
* Degenerate case $\alpha \leq 1, \beta \leq 1$:
  There is only one way to get such a situation: $n = 1, p = 0.5$.
  Since such a sample contains a single element, it doesn't matter how we choose the interval.
* Left border case $\alpha \leq 1, \, \beta > 1$:
  The mode equals zero, so the interval should be "attached to the left border":
  $\operatorname{BetaHDI}(\alpha, \beta, D) = [0; D]$.
* Right border case $\alpha > 1, \, \beta \leq 1$:
  The mode equals one, so the interval should be "attached to the right border":
  $\operatorname{BetaHDI}(\alpha, \beta, D) = [1 - D; 1]$.

In the fourth case, the middle case ($\alpha > 1,\, \beta > 1$),
the HDI should be inside $(0;1)$.
Since the density function of the beta distribution is a unimodal function, it consists of two segments:
a monotonically increasing segment $[0, M]$ and
a monotonically decreasing segment $[M, 1]$.
The HDI $[L;R]$ should contain the mode, so
$$
L \in [0; M], \quad
R \in [M; 1].
$$
Since $R - L = D$, we could also conclude that
$$
L = R - D \in [M - D; 1 - D], \quad
R = L + D \in [D; M + D].
$$
Thus,
$$
L \in [\max(0, M - D);\; \min(M, 1 - D)], \quad
R \in [\max(M, D);\; \min(1, M + D)].
$$
The density function of the beta distribution is also known (see [@sematech]):
$$
f(x) = \dfrac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{\textrm{B}(\alpha, \beta)}, \quad
\textrm{B}(\alpha, \beta) = \dfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.
$$
It's easy to see that for the highest density interval $[L; R]$, the following condition is true:
$$
f(L) = f(R).
$$
The left border $L$ of this interval could be found as a solution of the following equation:
$$
f(t) = f(t + D), \quad \textrm{where }\, t \in [\max(0, M - D);\; \min(M, 1 - D)].
$$
The left side of the equation is monotonically increasing, while the right side is monotonically decreasing.
Therefore, the equation has exactly one solution, which can easily be found numerically using the binary search algorithm.
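A sketch of this search for the middle case, using R's `uniroot` (a bracketing root finder) in place of a hand-written binary search; this mirrors the middle-case branch of the reference implementation in the appendix:
```r
# Beta(a, b) highest density interval of width D for the middle case (a, b > 1).
betaHdiMiddle <- function(a, b, D) {
  mode <- (a - 1) / (a + b - 2)
  f <- function(t) dbeta(t, a, b) - dbeta(t + D, a, b)  # f(L) = 0 at the HDI
  L <- uniroot(f, lower = max(0, mode - D), upper = min(mode, 1 - D))$root
  c(L, L + D)
}
hdi <- betaHdiMiddle(5.5, 5.5, 0.3)
dbeta(hdi[1], 5.5, 5.5) - dbeta(hdi[2], 5.5, 5.5)  # ~0, i.e. f(L) = f(R)
```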
\clearpage
# Simulation study
Let's perform a few numerical simulations and see how $Q_{\operatorname{THD}}$ works in action.^[The source code of all simulations is available on GitHub: https://github.com/AndreyAkinshin/paper-thdqe]
## Simulation 1
Let's explore the distribution of estimations for
$Q_{\operatorname{HF7}}$, $Q_{\operatorname{HD}}$, and $Q_{\operatorname{THD-SQRT}}$.
We consider a contaminated normal distribution which is a mixture of two normal distributions:
$(1 - \varepsilon)\mathcal{N}(0, \sigma^2) + \varepsilon\mathcal{N}(0, c\sigma^2)$.
For our simulation, we use $\varepsilon = 0.01,\; \sigma = 1,\; c = 1\,000\,000$.
We generate $10\,000$ samples of size 7 randomly taken from the considered distribution.
For each sample, we estimate the median using
$Q_{\operatorname{HF7}}$, $Q_{\operatorname{HD}}$, and $Q_{\operatorname{THD-SQRT}}$.
Thus, we have $10\,000$ median estimations for each estimator.
Next, we evaluate lower and higher percentiles for each group of estimations.
The results are presented in Table \@ref(tab:t2).
Table: (\#tab:t2) Distribution of median estimations for a contaminated normal distribution
| quantile| HF7| HD| THD-SQRT|
|---------:|----------:|-----------:|----------:|
| 0.00| -1.6921648| -87.6286082| -1.6041220|
| 0.01| -1.1054591| -9.8771723| -1.0261234|
| 0.02| -0.9832125| -5.2690083| -0.9067884|
| 0.03| -0.9037046| -1.7742334| -0.8298706|
| 0.04| -0.8346268| -0.9921591| -0.7586603|
| 0.96| 0.8172518| 0.8964743| 0.7540437|
| 0.97| 0.8789283| 1.1240294| 0.8052421|
| 0.98| 0.9518048| 4.3675475| 0.8824462|
| 0.99| 1.0806293| 10.4132583| 0.9900912|
| 1.00| 2.0596785| 140.5802861| 1.7060750|
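A condensed sketch of the sampling scheme behind this table
(assuming `thdquantile` from the appendix is defined; exact numbers depend on the random seed):
```r
eps <- 0.01                   # contamination parameters from above:
rcnorm <- function(n)         # (1 - eps) * N(0, 1) + eps * N(0, 10^6)
  ifelse(runif(n) < eps, rnorm(n, sd = 1000), rnorm(n))
medians <- replicate(10000, thdquantile(rcnorm(7), 0.5))
quantile(medians, c(0:4, 96:100) / 100)
```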
We also perform the same experiment using the Fréchet distribution (shape=1)
as an example of a right-skewed heavy-tailed distribution.
The results are presented in Table \@ref(tab:t3).
Table: (\#tab:t3) Distribution of median estimations for the Fréchet distribution (shape=1)
| quantile| HF7| HD| THD-SQRT|
|--------:|----------:|----------------:|----------:|
| 0.00| 0.3365648| 0.4121860| 0.3720898|
| 0.01| 0.5161896| 0.6684699| 0.5810966|
| 0.02| 0.5703807| 0.7578653| 0.6369594|
| 0.03| 0.6082605| 0.8058995| 0.6834209|
| 0.04| 0.6433384| 0.8460783| 0.7187727|
| 0.96| 4.2510264| 7.2021571| 4.6591661|
| 0.97| 4.6202217| 8.3669085| 5.0186522|
| 0.98| 5.2815341| 10.0274664| 5.6965864|
| 0.99| 6.5037105| 14.3159366| 7.1671722|
| 1.00| 42.0799646| $6\,501.9425729$| 35.3494053|
In Table \@ref(tab:t2), approximately 2\% of all $Q_{\operatorname{HD}}$ results exceed 10 in absolute value
(while the true median value is zero).
Meanwhile, the maximum absolute value of the $Q_{\operatorname{THD-SQRT}}$ median estimations is approximately $1.7$.
In Table \@ref(tab:t3), the higher percentiles of $Q_{\operatorname{HD}}$ estimations are also noticeably higher
than the corresponding values of $Q_{\operatorname{THD-SQRT}}$.
Thus, $Q_{\operatorname{THD-SQRT}}$ is much more resistant to outliers than $Q_{\operatorname{HD}}$.
\clearpage
## Simulation 2
Let's compare the statistical efficiency of $Q_{\operatorname{HD}}$ and $Q_{\operatorname{THD}}$.
We evaluate the relative efficiency of these estimators against $Q_{\operatorname{HF7}}$
which is a conventional baseline in such experiments.
For the $p^\textrm{th}$ quantile, the classic relative efficiency can be calculated
as the ratio of the estimator mean squared errors ($\operatorname{MSE}$) (see [@dekking2005]):
$$
\operatorname{Efficiency}(p) =
\dfrac{\operatorname{MSE}(Q_{\operatorname{HF7}}, p)}{\operatorname{MSE}(Q_{\textrm{Target}}, p)} =
\dfrac{\operatorname{E}[(Q_{\operatorname{HF7}}(p) - \theta(p))^2]}{\operatorname{E}[(Q_{\textrm{Target}}(p) - \theta(p))^2]},
$$
where $\theta(p)$ is the true quantile value.
We conduct this simulation according to the following scheme:
* We consider a bunch of different symmetric and asymmetric, light-tailed and heavy-tailed distributions
listed in Table \@ref(tab:t4).
* We enumerate all the percentile values $p$ from 0.01 to 0.99.
* For each distribution, we generate 200 random samples of the given size.
For each sample, we estimate the $p^\textrm{th}$ percentile using
$Q_{\operatorname{HF7}}$, $Q_{\operatorname{HD}}$, and $Q_{\operatorname{THD-SQRT}}$.
For each estimator, we calculate the arithmetic average of $(Q(p) - \theta(p))^2$.
* $\operatorname{MSE}$ is not a robust metric, so we wouldn't get reproducible output in such an experiment.
To achieve more stable results, we repeat the previous step 101 times and take the median across
$\operatorname{E}[(Q(p) - \theta(p))^2]$ values for each estimator.
This median is our estimation of $\operatorname{MSE}(Q, p)$.
* We evaluate the relative efficiency of $Q_{\operatorname{HD}}$ and $Q_{\operatorname{THD-SQRT}}$
against $Q_{\operatorname{HF7}}$.
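A condensed sketch of one cell of this experiment
(standard normal distribution, $n = 10$, $p = 0.5$; `Hmisc::hdquantile` stands in for $Q_{\operatorname{HD}}$,
and the loop counts are taken from the scheme above):
```r
n <- 10; p <- 0.5
theta <- qnorm(p)  # true quantile of N(0, 1)
mse <- function(est)                     # median over 101 repetitions of
  median(replicate(101,                  # the average squared error over
    mean(replicate(200, (est(rnorm(n)) - theta)^2))))  # 200 random samples
mse(function(x) quantile(x, p, type = 7, names = FALSE)) /
  mse(function(x) Hmisc::hdquantile(x, p))  # relative efficiency of Q_HD
```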
Table: (\#tab:t4) Distributions from Simulation 2
| Distribution | Support | Skewness | Tailness |
|:------------------------------|:--------------------|:-------------|:-------------|
| `Uniform(a=0, b=1)` | $[0;1]$ | Symmetric | Light-tailed |
| `Triangular(a=0, b=2, c=1)` | $[0;2]$ | Symmetric | Light-tailed |
| `Triangular(a=0, b=2, c=0.2)` | $[0;2]$ | Right-skewed | Light-tailed |
| `Beta(a=2, b=4)` | $[0;1]$ | Right-skewed | Light-tailed |
| `Beta(a=2, b=10)` | $[0;1]$ | Right-skewed | Light-tailed |
| `Normal(m=0, sd=1)` | $(-\infty;+\infty)$ | Symmetric | Light-tailed |
| `Weibull(scale=1, shape=2)` | $[0;+\infty)$ | Right-skewed | Light-tailed |
| `Student(df=3)` | $(-\infty;+\infty)$ | Symmetric | Light-tailed |
| `Gumbel(loc=0, scale=1)` | $(-\infty;+\infty)$ | Right-skewed | Light-tailed |
| `Exp(rate=1)` | $[0;+\infty)$ | Right-skewed | Light-tailed |
| `Cauchy(x0=0, gamma=1)` | $(-\infty;+\infty)$ | Symmetric | Heavy-tailed |
| `Pareto(loc=1, shape=0.5)` | $[1;+\infty)$ | Right-skewed | Heavy-tailed |
| `Pareto(loc=1, shape=2)` | $[1;+\infty)$ | Right-skewed | Heavy-tailed |
| `LogNormal(mlog=0, sdlog=1)` | $(0;+\infty)$ | Right-skewed | Heavy-tailed |
| `LogNormal(mlog=0, sdlog=2)` | $(0;+\infty)$ | Right-skewed | Heavy-tailed |
| `LogNormal(mlog=0, sdlog=3)` | $(0;+\infty)$ | Right-skewed | Heavy-tailed |
| `Weibull(shape=0.3)` | $[0;+\infty)$ | Right-skewed | Heavy-tailed |
| `Weibull(shape=0.5)` | $[0;+\infty)$ | Right-skewed | Heavy-tailed |
| `Frechet(shape=1)` | $(0;+\infty)$ | Right-skewed | Heavy-tailed |
| `Frechet(shape=3)` | $(0;+\infty)$ | Right-skewed | Heavy-tailed |
The results of this simulation for $n = \{ 5, 10, 20 \}$ are presented in
Figure \@ref(fig:eff5), Figure \@ref(fig:eff10), and Figure \@ref(fig:eff20).
\clearpage
```{r eff5, fig.cap="Relative statistical efficiency of quantile estimators for n=5", fig.height=3.95}
sim2 <- simulation2()
sim2(5)
```
```{r eff10, fig.cap="Relative statistical efficiency of quantile estimators for n=10", fig.height=3.95}
sim2(10)
```
\clearpage
```{r eff20, fig.cap="Relative statistical efficiency of quantile estimators for n=20", fig.height=3.95}
sim2(20)
```
As we can see, $Q_{\operatorname{THD-SQRT}}$ is not as efficient as $Q_{\operatorname{HD}}$
in the case of light-tailed distributions.
However, in the case of heavy-tailed distributions,
$Q_{\operatorname{THD-SQRT}}$ has better efficiency than $Q_{\operatorname{HD}}$
because the estimations of $Q_{\operatorname{HD}}$ are corrupted by outliers.
# Conclusion
There is no perfect quantile estimator that fits all kinds of problems.
The choice of a specific estimator has to be made
based on knowledge of the domain and the properties of the target distributions.
$Q_{\operatorname{HD}}$ is a good alternative to $Q_{\operatorname{HF7}}$ in the case of light-tailed distributions
because it has higher statistical efficiency.
However, if extreme outliers may appear, estimations of $Q_{\operatorname{HD}}$ could be heavily corrupted.
$Q_{\operatorname{THD}}$ could be used as
a reasonable trade-off between $Q_{\operatorname{HF7}}$ and $Q_{\operatorname{HD}}$.
In most cases, $Q_{\operatorname{THD}}$ has better efficiency than $Q_{\operatorname{HF7}}$
and it's also more resistant to outliers than $Q_{\operatorname{HD}}$.
By customizing the width $D$ of the highest density interval, we could set the desired breakdown point
according to the research goals.
Also, $Q_{\operatorname{THD}}$ has better computational efficiency than $Q_{\operatorname{HD}}$
which makes it a faster option in practical applications.
# Disclosure statement
The author reports there are no competing interests to declare.
# Acknowledgments
The author thanks Ivan Pashchenko for valuable discussions.
\newpage
# Reference implementation
Here is an R implementation of the suggested estimator:
```r
# Highest density interval of Beta(a, b) of the given width.
getBetaHdi <- function(a, b, width) {
  eps <- 1e-9
  if (a < 1 + eps & b < 1 + eps) # Degenerate case
    return(c(NA, NA))
  if (a < 1 + eps & b > 1) # Left border case
    return(c(0, width))
  if (a > 1 & b < 1 + eps) # Right border case
    return(c(1 - width, 1))
  if (width > 1 - eps)
    return(c(0, 1))

  # Middle case: the left border is the root of f(l) = f(l + width)
  mode <- (a - 1) / (a + b - 2)
  pdf <- function(x) dbeta(x, a, b)
  l <- uniroot(
    f = function(x) pdf(x) - pdf(x + width),
    lower = max(0, mode - width),
    upper = min(mode, 1 - width),
    tol = 1e-9
  )$root
  r <- l + width
  return(c(l, r))
}

# Trimmed Harrell-Davis quantile estimator with the HDI of the given width.
thdquantile <- function(x, probs, width = 1 / sqrt(length(x)))
  sapply(probs, function(p) {
    n <- length(x)
    if (n == 0) return(NA)
    if (n == 1) return(x)

    x <- sort(x)
    a <- (n + 1) * p
    b <- (n + 1) * (1 - p)
    hdi <- getBetaHdi(a, b, width)
    hdiCdf <- pbeta(hdi, a, b)
    cdf <- function(xs) { # CDF of the truncated beta distribution
      xs[xs <= hdi[1]] <- hdi[1]
      xs[xs >= hdi[2]] <- hdi[2]
      (pbeta(xs, a, b) - hdiCdf[1]) / (hdiCdf[2] - hdiCdf[1])
    }
    iL <- floor(hdi[1] * n)   # indexes of the order statistics within the HDI
    iR <- ceiling(hdi[2] * n)
    cdfs <- cdf(iL:iR / n)
    W <- tail(cdfs, -1) - head(cdfs, -1)
    sum(x[(iL + 1):iR] * W)
  })
```
An implementation of the original Harrell-Davis quantile estimator could be found in the `Hmisc` package (see [@hmisc]).
\newpage