# UMLAUT 1.0 - Little snail
--------------------------------------------------
Zenodo record: https://zenodo.org/record/5772786#.YbN7tLvSLiQ
DOI badge: https://zenodo.org/badge/323649071.svg
(IDL VERSION)
Unsupervised Machine Learning Algorithm based on Unbiased Topology
By Ivano Baronchelli, May/2019 - Jan/2021
Baronchelli I. et al. (2021) describes UMLAUT and one example of use.
Additional simple examples are included in this distribution.
UMLAUT is a variant of the KNN (K-nearest neighbors) algorithm.
- Given a set of reference data points (training set), for which
the value of N+1 parameters is known,
- Given one analysis data point with N parameters known and the
(N+1)-th parameter unknown,
--> UMLAUT estimates the value of the (N+1)-th parameter for the
analysis data point. To this end, UMLAUT finds the closest
data points of the training set, in an N-dimensional space
"associated" (see NOTE 1 below) with the parameter space.
After finding the closest data points, the unknown parameter
is obtained as a combination (e.g. the average) of the values assumed
by the closest reference data points along the (N+1)-th dimension.
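The basic idea described above can be sketched in a few lines of
Python. This is only an illustration of a plain nearest-neighbor
average, not the IDL implementation: it works directly in the
parameter space (no "associated" space), and the function name
knn_estimate and the toy data are assumptions made for this example.

```python
import math

def knn_estimate(train_x, train_b, query, k):
    """Hypothetical helper: average B over the k training points closest to `query`."""
    # Euclidean distance from the query to every training point.
    dists = [math.dist(row, query) for row in train_x]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(train_x)), key=dists.__getitem__)[:k]
    # Combine the known (N+1)-th values of the neighbors (here: plain mean).
    return sum(train_b[i] for i in nearest) / k

# Toy training set: M=4 points, N=2 known parameters, one known B each.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_b = [1.0, 2.0, 3.0, 10.0]

# The far-away point (5, 5) is not among the 3 closest, so it is ignored.
estimate = knn_estimate(train_x, train_b, (0.1, 0.1), k=3)
```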
NOTES:
1) the "associated" N-dimensional space is NOT the parameter
space itself. In fact, during the training phase,
- every dimension is "ordinalized": the actual value assumed
by each of the M data points of the reference sample along
the N dimensions is replaced by the position (1,2,...,M)
of the data point itself on an ordered scale.
- The N ordinalized dimensions are scaled following a weighting
process that tries to minimize the dispersion along the
estimated (N+1)-th parameter.
2) The simplest configuration, with only one unknown
parameter (the [N+1]-th), is described above. However,
UMLAUT can be used to determine any number of unknown
parameters. Obviously, the values of the same parameters must
be known for the data points of the reference sample.
3) UMLAUT is originally designed for REGRESSION purposes, but
it can also be used for CLASSIFICATION. However, the current
version of UMLAUT does not support the weighting of the input
parameters (dimensions) when UMLAUT is used for
classification.
4) UMLAUT can be trained and tested using the same sample (the
keyword "test" must be set in this case). As demonstrated in
Baronchelli et al. (2021), this configuration does not
introduce overfitting problems, as the training is performed
using a "leave one out" strategy, where the data point left
out is precisely the data point being tested.
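The "ordinalization" step of NOTE 1 is a rank transform applied to
each dimension separately. A minimal Python sketch follows; the
function names are hypothetical, and the handling of tied values
(kept in their original order) is an assumption, since it is not
specified here. The subsequent weighting of the dimensions is not
shown.

```python
def ordinalize(values):
    """Hypothetical helper: replace each value by its 1-based position in sorted order."""
    # sorted() is stable, so tied values keep their original relative order
    # (an assumption; the README does not say how ties are handled).
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def ordinalize_columns(points):
    """Apply the rank transform independently to each of the N dimensions."""
    columns = list(zip(*points))               # N columns of M values each
    ranked = [ordinalize(col) for col in columns]
    return [tuple(r) for r in zip(*ranked)]    # back to M points of N ranks

# Two dimensions with very different scales end up on the same 1..M scale.
points = [(0.5, 1000.0), (0.1, 10.0), (0.9, 100.0)]
ranked_points = ordinalize_columns(points)
```

After the transform, distances no longer depend on the units or on
the dynamic range of the original parameters.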
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
INPUT PARAMETERS:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
AI= input array AI[N,M] characterized by:
--> N dimensions (independent input parameters)
--> M elements (elements or data points used to train the
algorithm).
The NxM values in AI are those assumed by the N input
parameters for each of the M data points of the training set.
For every data point in this array, the value of the
dependent variable B (the [N+1]-th parameter) is known.
BI= input vector BI[M] containing the values assumed by the
M elements of the training sample along the (N+1)-th (dependent)
dimension (B).
If there is more than one dependent variable that the
user wants to estimate, then BI[M,O] is an MxO array. Here, "O"
is the number of parameters that the user wants to estimate for
the analysis data point (they are known for the data points
in the training sample). When UMLAUT is used for classification
purposes, then BI contains the labels associated with the
training data points
AO= Input array AO[N,L] similar to AI, characterized by:
--> N dimensions (number of independent input parameters)
--> L elements (number of analysis data points for which the
user wants to compute the dependent variable B).
The parameter B will be estimated for the L data points.
NVV= Input N dimensional vector specifying, for each of the N
dimensions considered, the Not Valid Values that should not
be considered (example: -99., 0, etc...). When one of the
dimensions, for a certain datum, assumes the value specified
in NVV, that dimension is not considered.
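One way to picture the effect of NVV is a distance that skips the
flagged dimensions. The Python sketch below is an assumption about
the behavior, not the IDL code: it skips a dimension when either of
the two points carries the not-valid value, and it does not rescale
the result for the number of dimensions skipped. The name
masked_distance is hypothetical.

```python
import math

def masked_distance(a, b, nvv):
    """Hypothetical helper: Euclidean distance over dimensions valid in both points."""
    total = 0.0
    for x, y, invalid in zip(a, b, nvv):
        if x == invalid or y == invalid:
            continue  # this dimension is ignored for this pair of points
        total += (x - y) ** 2
    return math.sqrt(total)

# One not-valid value per dimension (N=3).
nvv = (-99.0, -99.0, 0.0)

# The second dimension of the first point is flagged (-99.), so only the
# first and third dimensions contribute to the distance.
d = masked_distance((1.0, -99.0, 4.0), (4.0, 2.0, 8.0), nvv)
```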
CLN= Number of closest elements to be considered for evaluating
the dependent variable B. If not set, a warning message is
issued and a default value corresponding to (M/50)+1 (i.e. ~2%
of the available training data points) is assumed.
CLN_MIN= when TYPE_CLN is set to "min_max", CLN_MIN sets the minimum
value of CLN to be considered (whereas CLN is considered
the maximum, in this case). This parameter is not taken
into account when the scope parameter is set to
"classification".
TYPE_CLN= if this keyword is not set, or if it is set to the default
option 'fixed', then CLN represents the amount of closest
datapoints that will be considered by UMLAUT to determine
BO. With TYPE_CLN='fixed' (or not set), the CLN_MIN option
is not taken into account. If TYPE_CLN is set to
"min_max", then the algorithm automatically finds, for
each element in AO, the best value of CLN that should be
used, ranging from CLN_MIN to CLN (both set by the
user). TYPE_CLN is forced to 'fixed' when
scope='classification'.
AVERAGE= Type of average to be considered. Options are: "median",
"mean", "mode", "weighted", "fit".
- Default option when scope="regression" is "mean".
- Default option when scope="classification" is "mode".
Besides "median" and "mean", whose meaning is clear, the
"weighted" option assigns weights following a normal distribution, with
CLN taken as the sigma of that distribution. In order of
N-dimensional distance, the i-th element is weighted as:
W = exp(-(i^2) / (2*CLN^2))
Using the option "fit", B (the [N+1]-th parameter) is
obtained from a N-dimensional linear fit of the closest
reference data points. This option is particularly useful
when the values assumed by some of the N parameters of the
analysis data point(s) are located at the border of (or
outside) the range of values expressed by the training
sample. After selecting the closest reference elements,
the data points whose (N+1)-th parameter differs
from the average by more than "fit_thresh" times the
dispersion are excluded from the fit. The fit_thresh
parameter is optional.
The "mode" option is valid only for "classification"
purposes (scope="classification"), and it represents
the default option in this configuration. However, in this
case, also the "weighted" option is available. When UMLAUT
is used for classification purposes, then the options
"median", "mean" and "fit" make no sense, as the output is
not a real number but a label. In this case, the
output label can be obtained as the "mode" (default) of the
closest data points of the training set or as the
"weighted" mode. The weighting factor is computed as
described above and it takes into account the distances
from the analysis data point.
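The rank-based weights of the "weighted" option,
W = exp(-(i^2)/(2*CLN^2)), can be sketched in Python as follows.
The starting index is an assumption (i=1 for the closest neighbor),
as it is not stated here, and the function names are hypothetical.

```python
import math

def rank_weights(cln):
    """Hypothetical helper: W_i = exp(-(i^2) / (2*CLN^2)) for i = 1..CLN.

    Assumes i = 1 for the closest neighbor; the README does not
    state the starting index.
    """
    return [math.exp(-(i ** 2) / (2.0 * cln ** 2)) for i in range(1, cln + 1)]

def weighted_average(values_by_distance, cln):
    """Weighted mean of B over the CLN closest points (closest first)."""
    weights = rank_weights(cln)
    nearest = values_by_distance[:cln]
    return sum(w * v for w, v in zip(weights, nearest)) / sum(weights)

# Weights decrease with the rank of the neighbor.
weights = rank_weights(3)

# B values already sorted by increasing distance; the 4th is beyond CLN=3.
estimate = weighted_average([2.0, 2.0, 2.0, 50.0], cln=3)
```

Because the weight depends on the rank i and not on the distance
value itself, the weighting is insensitive to the absolute scale of
the distances.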
BAL= when this keyword is set (/BAL), (only for scope=classification
configuration), the estimated output label of the analysis data
point is obtained by weighting the probabilities associated to
each possible label (CLAS_PROB output) taking into account the
fraction of training data points that are labelled with the
same label, with respect to the total.
test= setting this keyword, the closest datapoint of the training
set is not taken into account when computing the output
parameter (BO). This configuration should be used when
UMLAUT is trained and tested on overlapping or even identical
datasets. Notice that training UMLAUT in this way does not
introduce overfitting problems, as the "leave one out"
testing strategy is adopted (See Baronchelli et al. 2021).
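The effect of the "test" keyword can be illustrated with a short
Python sketch: when a point of the training set is also the point
being analysed, its closest neighbor is itself (distance zero), and
dropping the closest neighbor amounts to leaving that point out. The
function below is a hypothetical illustration, not the IDL code.

```python
import math

def nearest_indices(train_x, query, k, test=False):
    """Hypothetical helper: indices of the k closest training points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    if test:
        # Leave one out: discard the single closest training point,
        # which is the query itself when the two samples coincide.
        order = order[1:]
    return order[:k]

train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]

# The query (1.0,) is also the training point with index 1.
with_self = nearest_indices(train_x, (1.0,), k=2)
without_self = nearest_indices(train_x, (1.0,), k=2, test=True)
```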
scope= set this keyword to "regression" or to "classification".
Default is "regression".
- If set to "regression", BI must contain real numbers
(example BI[0]=3.45, BI[1]=2.21, BI[2]=1.5, BI[3]=5.7
etc...). In this case, UMLAUT provides the best estimation
of the output parameter (BO array) as a real number.
- If set to "classification", BI must contain a discrete
label for each data point of the
training set (example BI[0]='OII', BI[1]='Ha',
BI[2]='Hb', BI[3]='Ha', etc...). Moreover,
> in the input vector "CLAS_VECT", the user must specify
the different possibilities (labels);
> in the output array "CLAS_PROB", UMLAUT provides the
probabilities associated to each of the possible input
labels.
> in the output array "CLAS_UNC", UMLAUT provides the
Poissonian uncertainties associated with each of the
probabilities indicated in "CLAS_PROB".
Example:
> CLAS_VECT=["Ha", "Hb", "OIII", "OII"] (input vector);
> CLAS_PROB=[0.6,0.1,0.2,0.1] (output vector for one
single analysis data point).
> CLAS_UNC=[0.12,0.02,0.03,0.01] (output vector for one
single analysis data point).
Under the "classification" configuration, TYPE_CLN is set to
"fixed" and the AVERAGE keyword is not considered (the
output is not an average).
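A possible reading of CLAS_PROB and CLAS_UNC is sketched below in
Python. It assumes that the probability of a label is the fraction
of the closest training points carrying that label, and that the
Poissonian uncertainty is sqrt(count)/CLN. Both formulas are
assumptions made for illustration (the example values listed above
are not reproduced by them), and the function name is hypothetical.

```python
import math
from collections import Counter

def class_probabilities(neighbor_labels, clas_vect):
    """Hypothetical helper: per-label fraction and sqrt(count)/k uncertainty."""
    counts = Counter(neighbor_labels)
    k = len(neighbor_labels)
    # Fraction of the k closest training points carrying each label.
    prob = [counts[label] / k for label in clas_vect]
    # Assumed Poissonian uncertainty on each count, scaled like the fraction.
    unc = [math.sqrt(counts[label]) / k for label in clas_vect]
    # "mode" label: the one with the highest probability.
    best = clas_vect[prob.index(max(prob))]
    return best, prob, unc

clas_vect = ["Ha", "Hb", "OIII", "OII"]
# Labels of the CLN=10 closest training points (toy data).
neighbors = ["Ha"] * 6 + ["Hb"] + ["OIII"] * 2 + ["OII"]
label, clas_prob, clas_unc = class_probabilities(neighbors, clas_vect)
```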
GETPDF=when this keyword is set (/GETPDF), UMLAUT provides an output
PDF for the output parameter BO (see also the "X_PDF", "PDF",
and "PSM" parameters). The PDF is not provided under the
scope="classification" configuration.
def_x_PDF=setting this keyword (/def_x_PDF) "X_PDF" (see below) is
not considered as an input. Instead, x_PDF will be
overwritten by a default scale automatically selected by
UMLAUT. If scope="classification", this parameter is not
taken into account.
PSM=smoothing factor applied to the output PDF. A good compromise
for this parameter is PSM ~1/1000 - 1/200 of the total number
of data points in the training set. If scope="classification",
this parameter is not taken into account.
OPTIMIZE_DIM=set to "yes" (default) to let the algorithm
weight the input dimensions after their
ordinalization. If scope="classification", dimensions
are not optimized.
SO_TYPE=Type of uncertainties that will be included in the output
vector SO. These uncertainties are associated with the
estimated values of the output parameter B (BO). OPTIONS:
sigma --> sigma(default),
perc --> percentiles 16%-84%,
aver --> average between sigma and symmetrized percentiles
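The three SO_TYPE options can be compared with a small Python
sketch. Several choices here are assumptions made for illustration:
the population standard deviation for "sigma", a
linear-interpolation percentile, the half-width (p84-p16)/2 as the
"symmetrized" percentile value, and the function names themselves.

```python
import statistics

def percentile(sorted_values, q):
    """Hypothetical helper: linear-interpolation percentile, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def dispersion(values, so_type="sigma"):
    """Dispersion of the neighbors' B values under the three options."""
    sigma = statistics.pstdev(values)
    ordered = sorted(values)
    # Symmetrized 16%-84% interval: half of its full width (an assumption).
    half_width = (percentile(ordered, 84) - percentile(ordered, 16)) / 2.0
    if so_type == "sigma":
        return sigma
    if so_type == "perc":
        return half_width
    if so_type == "aver":
        return (sigma + half_width) / 2.0
    raise ValueError(f"unknown so_type: {so_type}")

# B values of the closest training points (toy data).
neighbors_b = [1.0, 2.0, 3.0, 4.0, 5.0]
so_sigma = dispersion(neighbors_b, "sigma")
so_perc = dispersion(neighbors_b, "perc")
so_aver = dispersion(neighbors_b, "aver")
```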
IN_SCALINGS=optional NxO input array containing the weights
associated with each of the N input parameters
(dimensions). If more than one output parameter B
has to be estimated (O>1), then the user can provide a
different set of weights, one for each of the output
parameters to estimate. Notice that this array can
be obtained from previous runs of UMLAUT. In fact, the
output vector OUT_SCALINGS provides the list of weights
used for each of the analysis data points. Hence,
each of the N elements of IN_SCALINGS can be obtained
as the average weights reported in OUT_SCALINGS (along
each of the N dimensions), computed in previous
runs. In an iterative strategy, at each run of UMLAUT,
the IN_SCALINGS values of the previous iterations can be
multiplied by the weights obtained from the newer
iterations, in order to obtain increasingly precise
weights at the end.
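The bookkeeping of the iterative strategy can be sketched in Python
for the case O=1. The OUT_SCALINGS tables below are invented toy
numbers standing in for the output of two runs, and the function
names are hypothetical; the sketch only shows the averaging along
the L analysis points and the multiplication between iterations.

```python
def average_scalings(out_scalings):
    """Hypothetical helper: mean weight per dimension over the L analysis points.

    `out_scalings` is an L x N table (one row per analysis data point).
    """
    n_points = len(out_scalings)
    return [sum(col) / n_points for col in zip(*out_scalings)]

def update_scalings(previous, new):
    """Multiply the previous weights by the newly obtained ones."""
    return [p * w for p, w in zip(previous, new)]

# Toy OUT_SCALINGS from a first run (L=2 analysis points, N=2 dimensions).
out_scalings_run1 = [[1.0, 2.0], [3.0, 2.0]]
in_scalings = average_scalings(out_scalings_run1)   # input weights for run 2

# Toy OUT_SCALINGS from the second run, started with `in_scalings`.
out_scalings_run2 = [[0.5, 1.0], [0.5, 2.0]]
in_scalings_next = update_scalings(in_scalings,
                                   average_scalings(out_scalings_run2))
```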
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
OUTPUT PARAMETERS:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
BO=output vector BO[L] with L elements. It contains the output
estimations of the (N+1)th parameter (B) for all the analysis
data points. For each analysis data point, the values of
the N parameters are specified in the input array AO (see
above). If there is more than one dependent variable
that the user wants to estimate, then BO is an LxO array. Here,
"O" is the number of parameters that the user wants to estimate
for the analysis data point (they must be known for the data
points in the training sample).
SO=output vector SO[L] containing the values of dispersion
associated with the estimations of the dependent parameter B
(specified in BO). The type of uncertainty that the user wants
to use (sigma/percentiles) can be specified with the "SO_TYPE"
keyword.
CLAS_VECT=see (input parameter "scope" )
CLAS_PROB=see (input parameter "scope" )
CLAS_UNC=see (input parameter "scope" )
PDF= Output Probability Distribution Functions. One PDF for every
datum is given in output. The PDF is computed on the binning
chosen by the user if X_PDF is set to a particular vector.
OUT_SCALINGS=LxNxO vector specifying, for each of the L analysis
elements and for each of the N input dimensions
(parameters), the weight used to compute the output
value of the unknown parameter B. If more than one
output parameter has to be estimated (O>1), then the
weights are estimated for each of the output
parameters computed.
-------------------------------------------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
INPUT/OUTPUT PARAMETERS:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
X_PDF= x values for the output Probability Distribution Functions.
> IF the "def_x_PDF" keyword is set, X_PDF is considered an
output and it corresponds to the vector BI.
> if the "def_x_PDF" keyword is NOT set, X_PDF is an input
vector that should be defined by the user.
When the user wants to estimate a single output
parameter (i.e., BI is an MxO array and BO an LxO array, with O=1), then
X_PDF is a one-dimensional vector made of H elements (the
number of elements H depends on how the user sets "X_PDF"
itself and "def_x_PDF"). Instead, when O>1, X_PDF is an HxO
array of elements.
--------------------------------------------------------------------
ADDITIONAL NOTES:
- When UMLAUT evaluates one of the input dimensions, if there are
too few elements in the training set with a valid value along the
same dimension (fewer than 3*CLN), then the dimension is not
considered at all.
- In the classification configuration, the output array BO is always
considered as an array of strings.
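The first note above can be illustrated with a Python sketch that
counts, for each dimension, the training points with a valid value
and keeps the dimension only if there are at least 3*CLN of them.
Validity is assumed here to mean "different from the NVV entry of
that dimension", and the function name is hypothetical.

```python
def usable_dimensions(train_x, nvv, cln):
    """Hypothetical helper: indices of dimensions with >= 3*CLN valid values."""
    usable = []
    for dim, invalid in enumerate(nvv):
        # Count training points whose value along `dim` is not flagged.
        n_valid = sum(1 for row in train_x if row[dim] != invalid)
        if n_valid >= 3 * cln:
            usable.append(dim)
    return usable

# M=6 training points, N=2 dimensions; -99. marks a not-valid value.
train_x = [(1.0, -99.0), (2.0, -99.0), (3.0, 5.0),
           (4.0, -99.0), (5.0, 6.0), (6.0, 7.0)]

# With CLN=2 at least 6 valid values are required: the second
# dimension has only 3, so it is dropped.
dims = usable_dimensions(train_x, nvv=(-99.0, -99.0), cln=2)
```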
--------------------------------------------------------------------