# -*- coding: utf-8 -*-

"""

Motivation
----------
|
The motivation behind this re-implementation of some clustering metrics is to
avoid the high memory usage of the equivalent methods in Scikit-Learn. Using
sparse dictionary maps avoids storing co-incidence matrices in memory, which
leads to more acceptable performance in multiprocessing environments or on
very large data sets.

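As a rough illustration of the idea (a sketch only; the function name and the
return layout below are made up for the example and need not match this
module's API), such a table can be kept as a plain dictionary keyed by label
pairs::

    from collections import Counter

    def sparse_contingency(labels_true, labels_pred):
        # Only non-zero cells are stored, so memory grows with the number of
        # distinct (true, predicted) label pairs rather than with the product
        # of the number of classes and the number of clusters.
        table = Counter(zip(labels_true, labels_pred))
        row_totals = Counter(labels_true)
        col_totals = Counter(labels_pred)
        return table, row_totals, col_totals, sum(table.values())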
|
A side goal was to investigate different association metrics with the aim of
applying them to the evaluation of clusterings in semi-supervised learning and
to feature selection in supervised learning.
|
Finally, I was interested in the applicability of different association metrics
to different types of experimental design. At present, there seems to be both
(1) a lot of confusion about the appropriateness of different metrics, and (2)
relatively little attention paid to the type of experimental design used. I
believe that, at least partially, (1) stems from (2), and that different types
of experiments call for different categories of metrics.
|
Contingency Tables and Experimental Design
-------------------------------------------
|
Consider studies that deal with two variables whose respective realizations can
be represented as rows and columns in a table. Roughly adhering to the
terminology proposed in [1]_, we distinguish four types of experimental design,
all involving contingency tables.
|
========= =================================================
Model O   all margins and totals are variable
Model I   only the grand total is fixed
Model II  one margin (either row or column totals) is fixed
Model III both margins are fixed
========= =================================================
|
Model O is rarely employed in practice because researchers almost always have
some rough total number of samples in mind that they would like to measure
before they begin the actual measuring. However, a Model O situation might
occur when the grand total is not up to the researchers to fix, and so they
are forced to treat it as a random variable. An example of this would be
astronomy research that tests a hypothesis about a generalizable property such
as dark matter content by looking at all galaxies in the Local Group, and the
researchers obviously don't get to choose ahead of time how many galaxies
there are near ours.
|
Model I and Model II studies are the most common, and usually the most
confusion arises from mistaking one for the other. In psychology, interrater
agreement is an example of a Model I approach. A replication study, if
performed by the original author, is a Model I study, but if performed by
another group of researchers, becomes a Model II study.
|
Fisher's classic example of tea tasting is an example of a Model III study
[2]_. The key difference from a Model II study here is that the subject was
asked to call four cups as prepared by one method and four by the other. The
subject was not free to say, for example, that none of the cups were prepared
by adding milk first. The hypergeometric distribution used in the subsequent
Fisher's exact test shares the assumption of the experiment that both row and
column counts are fixed.

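For instance, the one-sided p-value for the outcome in which the subject
identifies all four milk-first cups can be obtained either from Fisher's exact
test or directly from the hypergeometric distribution (a sketch assuming SciPy
is available)::

    from scipy.stats import fisher_exact, hypergeom

    # 2x2 table for the outcome where all four milk-first cups are called
    # correctly (rows: true preparation, columns: subject's call).
    table = [[4, 0],
             [0, 4]]

    # One-sided p-value under the fixed-margins (Model III) assumption.
    _, p_value = fisher_exact(table, alternative='greater')

    # The same probability straight from the hypergeometric distribution:
    # drawing k=4 milk-first cups in N=4 picks from M=8 cups of which n=4
    # are milk-first; both values equal 1/70, roughly 0.0143.
    p_direct = hypergeom.pmf(4, M=8, n=4, N=4)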
|
Choosing an Association Metric
------------------------------
|
Given the types of experimental design listed above, some metrics seem to be
more appropriate than others. For example, two-way correlation coefficients
appear to be inappropriate for Model II studies, where their respective
regression components seem better suited to judging association.
|
Additionally, if there is an implied causal relationship, one-sided measures
might be preferred. For example, when performing feature selection, it seems
logical to measure the influence of features on the class label, not the other
way around.

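One well-known example of such a one-sided measure is the uncertainty
coefficient (Theil's U), which is not symmetric in its arguments. A
self-contained sketch (the helper below is illustrative, not this module's
implementation)::

    from collections import Counter
    from math import log

    def uncertainty_coefficient(x, y):
        # U(y | x): the fraction of the entropy of y explained by knowing x,
        # i.e. mutual information I(x; y) divided by the entropy H(y).
        # Asymmetric: swapping x and y generally gives a different value.
        n = float(len(x))
        joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
        h_y = -sum(c / n * log(c / n) for c in py.values())
        mi = sum(c / n * log(c * n / (px[a] * py[b]))
                 for (a, b), c in joint.items())
        return mi / h_y if h_y else 0.0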
|
Using Monte Carlo methods, it should be possible to test the validity of the
above two propositions as well as to visualize the effect of the assumptions
made.

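One way such a simulation could be set up is to draw many tables under each
design and compare the resulting sampling distributions of a metric (a sketch
assuming NumPy; the cell probabilities and margins are arbitrary illustration
values)::

    import numpy as np

    rng = np.random.default_rng(0)
    probs = np.array([[0.2, 0.1],     # hypothetical joint cell probabilities
                      [0.1, 0.6]])
    n_total, n_tables = 100, 10000

    # Model I: only the grand total is fixed; every table is one draw from a
    # single multinomial over all four cells.
    model_i = rng.multinomial(n_total, probs.ravel(),
                              size=n_tables).reshape(n_tables, 2, 2)

    # Model II: row margins are fixed; each row is drawn independently from
    # its conditional distribution.
    row_totals = (30, 70)
    row_probs = probs / probs.sum(axis=1, keepdims=True)
    model_ii = np.stack([rng.multinomial(t, p, size=n_tables)
                         for t, p in zip(row_totals, row_probs)], axis=1)

    # Feeding both sets of tables to the same association metric shows how
    # its sampling distribution depends on the assumed design.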
|
References
----------

.. [1] `Sokal, R. R., & Rohlf, F. J. (2012). Biometry (4th edn). pp. 742-744.
       <http://www.amazon.com/dp/0716786044>`_

.. [2] `Wikipedia entry on Fisher's "Lady Tasting Tea" experiment
       <https://en.wikipedia.org/wiki/Lady_tasting_tea>`_
|
"""

import warnings