---
title: Variable Selection Analysis
author: Amanda Skarlupka
date: "`r Sys.Date()`"
output:
  html_document:
    toc_depth: 3
---
# Overview
This document will guide you through some data analysis tasks with a focus on performing variable selection. For this exercise, we consider a categorical outcome.
While this is in some sense a stand-alone analysis, I assume that you have worked through the _Data Analysis_ exercise and are familiar with the dataset and everything we discovered during the cleaning process. We'll use the same dataset here but focus on a different outcome. Other than that, you work through this exercise the same way as the _Data Analysis_ exercise, namely by writing/completing the missing code.
# Project setup
We need a variety of different packages, which are loaded here. Install them as needed.
If you use others, load them here as well.
```{r setup, warning=FALSE}
library('tidyr')
library('dplyr')
library('readr')
library('forcats')
library('ggplot2')
library('knitr')
library('mlr') #for model fitting.
library('parallelMap') #for using multiple processors when running models through mlr
library('tidyverse')
```
# Data loading and cleaning
We will again use the Norovirus dataset.
```{r loaddata}
#Write code that loads the dataset
#You can of course re-use code you wrote in the other file.
d <- read_csv("norodata.csv")
```
# Looking at the outcome
For this analysis, our main outcome of interest is whether an outbreak was caused by the G2.4 strain of norovirus or not, and how other factors might be correlated with that strain.
```{r checkoutcome}
#write code to take a look at the outcome variable (gg2c4)
d %>%
  ggplot(aes(x = gg2c4)) +
  geom_bar()
```
Overall, it looks ok, with a decent amount in each category (i.e. the data are not badly unbalanced). However, we see an odd coding: there are only 2 types of entries, either "Yes" or "", i.e. an empty space. The codebook tells us that this is how things were coded: if it was G2.4 it got a "Yes", if it wasn't it got an empty space instead of a "No". That's somewhat strange, and while the analysis should work, it is better to re-code the empty slots with "No".
```{r clean-outcome}
#write code to change the empty values in gg2c4 to "No"
d$gg2c4[is.na(d$gg2c4)]<-"No"
d$gg2c4 <- as.factor(d$gg2c4)
d$gg2c4
```
# Selecting predictors
We will pick similar variables as previously, with some adjustments based on what we learned before. Keep the following variables: `Action1`, `CasesAll`, `Country`, `Deaths`, `gg2c4`, `Hemisphere`, `Hospitalizations`, `MeanD1`, `MeanI1`, `MedianD1`, `MedianI1`, `OBYear`, `Path1`, `RiskAll`, `season`, `Setting_1`, `Trans1`, `Vomit`.
```{r reducepredictors}
d <- d %>%
dplyr::select(Action1, CasesAll, Country, Deaths, gg2c4, Hemisphere, Hospitalizations, MeanD1, MeanI1, MedianD1, MedianI1, OBYear, Path1, RiskAll, season, Setting_1, Trans1, Vomit)
dim(d)
```
# Cleaning predictors
We'll likely have to perform similar cleaning steps as before. A difference is that earlier we had to drop about half of the observations because no outcome was available. Here, we have outcome for every outbreak. Thus, there is more data to clean, which - as we will see - introduces a few issues we didn't have before, since we dropped the troublesome observations early in the process.
Let's first check for missing values.
```{r check-reduced-data}
# write code that looks at missing values and converts the character variables to factors
d <- as.data.frame(unclass(d))
print(colSums(is.na(d)))
visdat::vis_dat(d)
visdat::vis_miss(d)
```
Looks like we again have some missing values in `Hospitalizations` and `Deaths`, and some more in `RiskAll`. Since the missingness is not excessive, and to make our life easier, we'll drop those observations for now. Note, however, the 'blocks' of missing values for `RiskAll`. Given that these outbreaks should be entered fairly randomly into the spreadsheet, it is strange to see the NAs show up in such blocks. For a real data analysis, it would be worth looking closer and checking why there is a clustering like that.
```{r further-reduce-data}
# write code to remove any observations with NA
d <- d %>%
  filter(!is.na(Hospitalizations), !is.na(Deaths), !is.na(RiskAll))
visdat::vis_dat(d)
visdat::vis_miss(d)
```
Let's make sure everything has the right format (numeric/integer/factor). Adjust/recode variables as needed.
You will likely find that as you convert `OBYear` to numeric, something doesn't quite work. Take a look. Fix it by removing the observation with the troublesome entry, then convert to numeric. Finally, remove the observations that have 0 as `OBYear` - there is more than one now.
```{r adjust}
#write code that cleans OBYear, convert it to numeric. Remove observations with OBYear = 0.
#also convert any other variables as needed
str(d)
d$Vomit <- as.factor(d$Vomit)
#convert the OBYear factor to numeric via its levels; the troublesome non-numeric entry becomes NA
d$OBYear <- as.numeric(levels(d$OBYear))[d$OBYear]
#filter() drops the OBYear == 0 rows as well as the NA created above
d <- d %>%
  filter(OBYear != 0)
d %>%
ggplot(aes(x = OBYear)) +
geom_bar()
str(d)
```
Next, we remove the `Unspecified` entry in `Hemisphere` and recode `Action1` and `Path1` as described in the Data Analysis script, i.e. from `Unknown` to `Unspecified`. Also do the same grouping into just `Restaurant` and `Other` with the `Setting_1` variable. Again, remember that there are `restaurant` and `Restaurant` values, so you need to fix that too.
```{r clean-more-predictors}
# write code that performs the actions described above
# at the end, use the droplevels() command to remove empty factor levels
#remove the Unspecified entry in Hemisphere:
unique(d$Hemisphere)
d <- d %>%
filter(Hemisphere != "Unspecified") %>%
droplevels()
unique(d$Hemisphere)
#recode Action1 and Path1:
levels(d$Action1)
d$Action1 <- fct_collapse(d$Action1, Unspecified = c("Unknown", "Unspecified"), Yes = "Yes", No = "No")
levels(d$Path1)
d$Path1 <- fct_collapse(d$Path1, Unspecified = c("Unknown", "Unspecified"), Yes = "Yes", No = "No")
levels(d$Action1)
levels(d$Path1)
#Group Setting_1 into only two levels, Restaurant and Other
#(the regex catches both "Restaurant" and "restaurant")
d <- d %>%
  mutate(Setting = ifelse(str_detect(Setting_1, "[Rr]est"), "Restaurant", "Other")) %>%
  select(-Setting_1)
d$Setting <- as.factor(d$Setting)
str(d$Setting)
```
# Data visualization
Next, let's create a few plots showing the outcome and the predictors. For the continuous predictors, I suggest scatter-, box-, or violin plots with the outcome on the x-axis.
```{r plots-1}
#write code that produces plots showing our outcome of interest on the x-axis and each numeric predictor on the y-axis.
#you can use the facet_wrap functionality in ggplot for it, or do it some other way.
#move the outcome gg2c4 into the first column
d <- d[, c(5, 1:4, 6:18)]
d %>%
gather(c(CasesAll, Deaths, Hospitalizations, MeanD1, MeanI1, MedianD1, MedianI1, RiskAll, OBYear), key = "var", value = "value") %>%
ggplot(aes(x = gg2c4, y = value)) +
geom_violin() +
facet_wrap(~ var, scales = 'free')
```
Things look ok, apart from the skew in the predictors we discussed previously.
Next, let's create plots for the categorical variables. You can use for instance `geom_count` for it, or some other representation. If you prefer lots of tables, that's ok too.
```{r plots-2}
#write code that produces plots or tables showing our outcome of interest and each categorical predictor.
d %>%
gather(c(Action1, Country, Hemisphere, Path1, season, Trans1, Vomit, Setting), key = "var", value = "value") %>%
ggplot(aes(x = gg2c4, y = value)) +
geom_count() +
facet_wrap(~ var, scales = 'free')
```
You should see from the plots or tables that some of the categories are small, e.g. for `Action1`, the "No" category is very small. Very few entries for a given factor level create problems during cross-validation (a level that shows up in the holdout set without having been part of the fitting set causes errors). So let's look at those factor variables a bit closer and fix them as needed.
I didn't initially get the results described here, because I had interpreted the earlier step where we dropped NAs to mean that we drop ALL NAs, not just the ones in the three variables.
```{r, more-factor-check}
#write code that looks at tables/summaries of factors
table(d$Hemisphere)
table(d$Path1)
table(d$season)
table(d$Setting)
table(d$Trans1)
table(d$Action1)
table(d$Country)
str(d$season)
levels(d$season)
```
You should see from your explorations above that there is only a single entry for _No_ in `Action1`, and small entries for _Multiple_ and _Unspecified_ in `Country` and _Environmental_ in `Trans1`. The single _No_ should be fixed; the other somewhat small groupings might be ok. Whether you should modify/group those or not depends on the scientific rationale and the method of analysis. We'll do that here.
Change things as follows: remove the observation for which `Action1` is _No_, and combine the countries into 3 groups, Japan/USA/Other. Since I don't even know what the biological difference is between _Environmental_ and _Waterborne_ transmission (they seem like the same route to me based on my norovirus knowledge), we'll move the Waterborne entries into Environmental. I also noticed there is a blank category for season; that likely means it wasn't stated in the paper, so let's recode it as _Unknown_. Finally, re-order the data such that the outcome is the first column and again remove empty factor levels. Then look at your resulting data frame.
```{r, more-factor-cleaning}
#write code that does the actions described above
#Remove the observation with "No" for Action1. (I don't have that observation anymore; I think I dropped it when I dropped the NAs.)
d <- d %>%
filter(Action1 != "No")
#Combine "Countries into 3 groups":
levels(d$Country)
d$Country <- fct_collapse(d$Country, Other = c("Australia", "Austria", "Brazil", "Ca da", "Chi", "Croatia", "Denmark", "France", "Iraq", "Israel", "Italy", "Multiple", "Netherlands", "New Zealand", "Norway", "Other", "Scotland", "Spain", "UK", "Unspecified"), USA = "USA", Japan = "Japan")
levels(d$Country)
#Combine Environmental and Waterborne transmission. I also moved Unknown and Unspecified into the same level (Unspecified).
levels(d$Trans1)
d$Trans1 <- fct_collapse(d$Trans1, Environmental = c("Environmental", "Waterborne"), Unspecified = c("Unknown", "Unspecified"))
levels(d$Trans1)
#Recode the blanks for season into Unknown
levels(d$season)
unique(d$season)
d$season <- forcats::fct_explicit_na(d$season, "Unknown")
unique(d$season)
levels(d$season)
d <- droplevels(d)
#I still have one NA in Vomit, so I'm going to drop that observation to get to the 850 observations
d <- d %>%
filter(is.na(Vomit) == FALSE)
visdat::vis_dat(d)
```
At this step, you should have a dataframe containing 850 observations, and 18 variables: 1 outcome, 9 numeric/integer predictors, and 8 factor variables. There should be no missing values. The outcome, `gg2c4`, should be in the 1st slot.
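If you want to verify this programmatically, a quick sanity check along the following lines can help (a minimal sketch, not evaluated here; adjust if your counts differ).
```{r check-data-sketch, eval=FALSE}
#not run: quick sanity check of the expected data shape
stopifnot(nrow(d) == 850, ncol(d) == 18)  #850 observations, 18 variables
stopifnot(names(d)[1] == "gg2c4")         #outcome in the first slot
stopifnot(sum(is.na(d)) == 0)             #no missing values left
```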
# Data splitting
We could do data splitting again as we did in the previous exercise, to have a final test set. But since you saw how it works there, we'll skip it here and use the full data to fit our models. We'll still use cross-validation to get a more honest estimate of model performance. For a real data analysis, the choice of whether to keep some data for a final test depends on your goals: if your focus is predictive performance, you should consider such a split; if your focus is inference or exploratory analysis, you might skip it.
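In case you do want a final test set, here is a minimal sketch of how such a split could look using base R (not evaluated here; the 70/30 proportion is an arbitrary choice, and the remaining analysis would then use `d_train` only).
```{r split-sketch, eval=FALSE}
#not run: sketch of a train/test split with base R
set.seed(1111)
train_idx <- sample(seq_len(nrow(d)), size = floor(0.7 * nrow(d))) #70% for training
d_train <- d[train_idx, ]  #used for model fitting and cross-validation
d_test  <- d[-train_idx, ] #held back until the very end for a final performance check
```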
# Model fitting
So I had planned to use `caret` exclusively for this course. But the last time I tried feature/subset selection with `caret`, I found it buggy and not too well documented, and I had hoped this had improved since. Unfortunately, I was again not really able to get things to work. So even though I said we likely won't use the `mlr` package, I decided that to nicely practice feature/subset selection, we need to do so. It's not a bad idea to get familiar with that package anyway; at times `caret` can do things `mlr` can't, and the reverse, so knowing how to use both is good. We'll thus use `mlr` to do our model fitting in this exercise.
# Parallelization
`mlr` allows you to run things in parallel by using multiple cores (`caret` does too). For instance, if you do 5-fold cross-validation repeated 5 times, you essentially run a very similar piece of code 25 times. Normally, you would run one at a time. But if you have a machine with 25 cores, all 25 could run at the same time, giving you a speed-up of 25 (or close to it; there is usually some overhead related to doing things in parallel). This kind of parallel computing is sometimes called _embarrassingly parallel_, because it's so embarrassingly simple to split the task into parallel streams.
Since the subset selection below starts getting slow, and because it's a good topic to know about, we are going to use parallelization here. `mlr` uses the package `parallelMap` for this. All you need to do is specify the number of cores/processors you want to use and start the parallel system, and `mlr` then automatically does things in parallel where feasible.
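If you are not sure how many cores/logical processors your machine has, the `parallel` package (which ships with R) can tell you; a quick check, not evaluated here:
```{r core-check-sketch, eval=FALSE}
#not run: check how many cores/logical processors are available
parallel::detectCores(logical = FALSE) #physical cores
parallel::detectCores(logical = TRUE)  #logical processors (often 2x the physical cores)
```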
```{r, start-parallel}
#set the number of cores you want to use. Note that what your system reports is often the number of
#'logical processors', which can be 2x the number of physical cores (e.g. with hyperthreading).
#On a Windows machine, the number of cores/logical processors can be found in the task manager.
#Only set this to what your computer has (or less). So if your computer has say 4 or 6 cores,
#you can set it to that, or some lower number. Setting it to a number higher than what your
#computer has doesn't speed things up further; in fact, it slows things down.
ncpu=4;
#if you don't want to run things in parallel, or don't have multiple cores (unlikely nowadays),
#just comment out the line below.
parallelStartSocket(ncpu, show.info=FALSE)
```
# Setup
Some setup settings that are used in various code chunks below.
```{r mlr-setup}
outcome <- d$gg2c4
outcomename = "gg2c4"
predictors <- d[,-1]
npred=ncol(predictors)
#set sampling method for performance evaluation
#here, we use 5-fold cross-validation, 5-times repeated
sampling_choice = makeResampleDesc("RepCV", reps = 5, folds = 5)
```
## A null model
To define a null model, we need to determine what performance measure we want to track. As mentioned in the course materials, there are different performance measures. Accuracy or misclassification error is simple, it just counts the number of times the model got it right/wrong. We'll start with that one, and then try another one later. `mlr` allows for a lot of different performance measures for both categorical and continuous outcomes, see [here](https://mlr.mlr-org.com/articles/tutorial/performance.html) and [here](https://mlr.mlr-org.com/articles/tutorial/measures.html).
For accuracy, the simplest null model always predicts the most frequent category. We can use that as baseline performance.
```{r nullmodel-acc}
#write code that computes accuracy for a null model
table(d$gg2c4)
#the null model always predicts "No", because this is the most frequent category of our outcome.
measureACC(d$gg2c4, "No")
```
You should find that the null model has an accuracy of around 0.69.
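As a quick check, this null accuracy is simply the proportion of the most frequent category; a one-liner like the following (not evaluated here) should give roughly the same 0.69.
```{r nullacc-check-sketch, eval=FALSE}
#not run: null-model accuracy = proportion of the most frequent outcome category
max(table(d$gg2c4)) / nrow(d)
```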
## Single predictor models
Now let's consider single predictor models, i.e. we'll fit the outcome to each predictor one at a time to get an idea of the importance of individual predictors. To evaluate our model performance, we will use cross-validation. Since our outcome is categorical, we'll use a logistic model.
```{r, unifit, warning=FALSE}
set.seed(1111) #makes each code block reproducible
#set learner/model. this corresponds to a logistic model.
#mlr calls different models different "learners"
learner_name = "classif.binomial";
mylearner = makeLearner(learner_name, predict.type = "prob")
# this will contain the results
unifmat=data.frame(variable = rep(0,npred), Accuracy = rep(0,npred))
# loop over each predictor, build simple dataset with just outcome and that predictor, fit it to a glm/logistic model
for (nn in 1:npred)
{
unidata = data.frame(gg2c4 = outcome, d[,nn+1] )
## Generate the task, i.e. define outcome and predictors to be fit
mytask = makeClassifTask(id='unianalysis', data = unidata, target = outcomename, positive = "Yes")
model = resample(mylearner, task = mytask, resampling = sampling_choice, show.info = FALSE, measures = acc )
unifmat[nn,1] = names(predictors)[nn]
unifmat[nn,2] = model$aggr
}
kable(unifmat)
```
So it looks like none of the single-predictor models has a higher accuracy than the null model. Maybe surprising. We'll get back to that below.
# Full model
Now let's fit a full logistic model with all predictors.
```{r fullfit}
set.seed(1111) #makes each code block reproducible
#do full model with Cross-Validation - to get an idea for the amount of over-fitting a full model does
mytask = makeClassifTask(id='fullanalysis', data = d, target = outcomename, positive = "Yes")
fullmodel = resample(mylearner, task = mytask, resampling = sampling_choice, show.info = FALSE, measures = acc )
ACC_fullmodel = fullmodel$aggr[1]
print(ACC_fullmodel)
```
So a very small improvement over the simple models.
Now let's do subset selection. The code below does it several ways. It does regular forward and backward selection and floating versions of those. It also uses a genetic algorithm for selection. See the `mlr` website [here](https://mlr.mlr-org.com/articles/tutorial/feature_selection.html) for details.
Also note that I included a kind of timer in the code, to see how long things take. That's a good idea if you run bigger models. You first run a few iterations on maybe a few cores, then you can compute how long it would take if you doubled the iterations, or doubled the cores, etc. This prevents bad surprises of trying to _quickly_ run a piece of code and waiting hours. You should always use short runs to make sure everything works in principle, and only at the end do the real, long "production" runs. Otherwise you might waste hours/days/weeks waiting for results only to realize that you made a basic mistake and you have to do it all over.
```{r mlr-subset-selection}
set.seed(1111)
tstart=proc.time(); #capture current CPU time for timing how long things take
#do 2 forms each of forward and backward selection, just to compare
select_methods=c("sbs","sfbs","sfs","sffs")
resmat=data.frame(method = rep(0,4), Accuracy = rep(0,4), Model = rep(0,4))
ct=1;
for (select_method in select_methods) #loop over all stepwise selection methods
{
ctrl = makeFeatSelControlSequential(method = select_method)
print(sprintf('doing subset selection with method %s ',select_method))
sfeat_res = selectFeatures(learner = mylearner,
task = mytask,
resampling = sampling_choice,
control = ctrl,
show.info = FALSE,
measures = acc)
resmat[ct,1] = select_methods[ct]
resmat[ct,2] = sfeat_res$y
resmat[ct,3] = paste(as.vector(sfeat_res$x), collapse= ', ')
ct=ct+1;
}
# do feature selection with genetic algorithm
maxit = 100 #number of iterations - should be large for 'production run'
ctrl_GA =makeFeatSelControlGA(maxit = maxit)
print(sprintf('doing subset selection with genetic algorithm'))
sfeatga_res = selectFeatures(learner = mylearner,
task = mytask,
resampling = sampling_choice,
control = ctrl_GA,
show.info = FALSE,
measures = acc)
resmat[5,1] = "GA"
resmat[5,2] = sfeatga_res$y
resmat[5,3] = paste(as.vector(sfeatga_res$x), collapse= ', ')
runtime.minutes_SS=(proc.time()-tstart)[[3]]/60; #total time in minutes the optimization took
print(sprintf('subset selection took %f minutes',runtime.minutes_SS));
kable(resmat)
```
So we find that some of the sub-models perform somewhat better. Note that the different methods find different sub-models, and the forward selection methods seem to fail (since none of the single-predictor models is better than the null model, they stop right there). Also note that the genetic algorithm was only allowed to run for a short time (the number of iterations is small). To get potentially better results, that should be increased to a larger number, e.g. 1000 or more. But this will take longer, unless you have many cores available when you parallelize.
# Different performance measures
So far, we found that some of the sub-models provide a minor improvement over the simple null model, the single-predictor models, and the full model. However, the improvement is not large.
We could try tweaking things further, e.g. by pre-processing the predictors (which is a good idea to try, but which we won't do here) or by testing a different model (which we'll do below).
For now, I want to discuss performance measures a bit more. The problem is that logistic regression, as well as many (though not all) algorithms used for classification (predicting categorical outcomes), returns probabilities for an outcome to belong to a certain category. Those are then turned into Yes/No predictions (in the case of 2 outcomes, like we have here) based on some cut-off. By default, a probability value of 0.5 is chosen: if the model predicts a probability of 0.5 or greater for the outcome being "Yes", it is scored as "Yes" in the prediction, otherwise as "No". However, this 0.5 threshold is not always good; we might get a much better model if we use another threshold. There is a technique that scans over different thresholds and evaluates model performance at each one, generally summarized by the _receiver operating characteristic (ROC) curve_. If this concept is new to you, check out [this article on Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) or [this tutorial on the mlr website](https://mlr.mlr-org.com/articles/tutorial/roc_analysis.html).
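To make the role of the cut-off concrete, here is a tiny illustration (not evaluated here) with a few hypothetical predicted probabilities: lowering the cut-off turns more predictions into "Yes".
```{r threshold-toy-sketch, eval=FALSE}
#not run: toy example of turning predicted probabilities into class labels
p_yes <- c(0.10, 0.35, 0.55, 0.80)   #hypothetical predicted probabilities of "Yes"
ifelse(p_yes >= 0.5, "Yes", "No")    #default cut-off of 0.5: "No"  "No"  "Yes" "Yes"
ifelse(p_yes >= 0.3, "Yes", "No")    #lower cut-off of 0.3:   "No"  "Yes" "Yes" "Yes"
```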
Using this approach, a model is evaluated based on the Area under the curve (AUC), with an AUC=0.5 meaning a model is not better than chance, and AUC=1 meaning a perfect model. Let's re-do the analysis above, but replace accuracy with AUC.
# Model performance with AUC
```{r nullmodel-auc}
#write code that computes AUC for a null model
#the null model always predicts "No"
#the auc function in the pROC package is useful
#or you can use any other package/function you want to use to compute AUC/ROC
library(pROC)
null_pred <- rep(1, nrow(d)) #a constant 'no information' prediction for every outbreak
roc_null <- roc(d$gg2c4, null_pred)
auc(roc_null)
```
You should find that the null model has an AUC of 0.5, as expected for a "no information" model.
Now we run the single-predictor models again.
```{r, unifit-auc, warning=FALSE}
#copy and paste the code from above for single predictor fits, but now switch to AUC instead of accuracy.
set.seed(1111) #makes each code block reproducible
#set learner/model. this corresponds to a logistic model.
#mlr calls different models different "learners"
learner_name = "classif.binomial";
mylearner = makeLearner(learner_name, predict.type = "prob")
# this will contain the results
unifmat=data.frame(variable = rep(0,npred), AUC = rep(0,npred))
# loop over each predictor, build simple dataset with just outcome and that predictor, fit it to a glm/logistic model
for (nn in 1:npred){
unidata = data.frame(gg2c4 = outcome, d[,nn+1] )
## Generate the task, i.e. define outcome and predictors to be fit
mytask = makeClassifTask(id='unianalysis', data = unidata, target = outcomename, positive = "Yes")
model = resample(mylearner, task = mytask, resampling = sampling_choice, show.info = FALSE, measures = mlr::auc )
unifmat[nn,1] = names(predictors)[nn]
unifmat[nn,2] = model$aggr
}
kable(unifmat)
```
Looks like there are a few single-predictor models with better performance than the null model.
Now let's again fit a full model with all predictors.
```{r fullfit-auc}
#copy and paste the code from above for the full model fit, but now switch to AUC instead of accuracy.
set.seed(1111) #makes each code block reproducible
#do full model with Cross-Validation - to get an idea for the amount of over-fitting a full model does
mytask = makeClassifTask(id='fullanalysis', data = d, target = outcomename, positive = "Yes")
fullmodel = resample(mylearner, task = mytask, resampling = sampling_choice, show.info = FALSE, measures = mlr::auc )
AUC_fullmodel = fullmodel$aggr[1]
print(AUC_fullmodel)
```
This is better than any of the single-predictor models. Let's see if a sub-model can do even better.
```{r mlr-subset-auc}
#copy and paste the code from above for the subset selection, but now switch to AUC instead of accuracy.
set.seed(1111)
tstart=proc.time(); #capture current CPU time for timing how long things take
#do 2 forms each of forward and backward selection, just to compare
select_methods=c("sbs","sfbs","sfs","sffs")
resmat=data.frame(method = rep(0,4), AUC = rep(0,4), Model = rep(0,4))
ct=1;
for (select_method in select_methods) #loop over all stepwise selection methods
{
ctrl = makeFeatSelControlSequential(method = select_method)
print(sprintf('doing subset selection with method %s ',select_method))
sfeat_res = selectFeatures(learner = mylearner,
task = mytask,
resampling = sampling_choice,
control = ctrl,
show.info = FALSE,
measures = mlr::auc)
resmat[ct,1] = select_methods[ct]
resmat[ct,2] = sfeat_res$y
resmat[ct,3] = paste(as.vector(sfeat_res$x), collapse= ', ')
ct=ct+1;
}
# do feature selection with genetic algorithm
maxit = 100 #number of iterations - should be large for 'production run'
ctrl_GA =makeFeatSelControlGA(maxit = maxit)
print(sprintf('doing subset selection with genetic algorithm'))
sfeatga_res = selectFeatures(learner = mylearner,
task = mytask,
resampling = sampling_choice,
control = ctrl_GA,
show.info = FALSE,
measures = mlr::auc)
resmat[5,1] = "GA"
resmat[5,2] = sfeatga_res$y
resmat[5,3] = paste(as.vector(sfeatga_res$x), collapse= ', ')
runtime.minutes_SS=(proc.time()-tstart)[[3]]/60; #total time in minutes the optimization took
print(sprintf('subset selection took %f minutes',runtime.minutes_SS));
kable(resmat)
```
Looks like we get some sub-models that are slightly better than the full model, at least as evaluated using the (cross-validated) AUC measure.
# More on ROC/AUC
While this is a common measure and useful to evaluate how models perform at different cut-off levels, keep in mind that it doesn't really measure the performance of a single model, but rather the combined performance of all models with different cut-offs. In practice, if we want to use a model, we have to pick one cut-off. For that purpose, one can look at the whole ROC curve and usually choose the point in the "top-left" corner of the curve, which gives the best overall performance with a good balance between the false positive rate (FPR) and the true positive rate (TPR). Let's plot that ROC curve for the model found by the GA.
```{r roc-plot}
set.seed(1111) #makes each code block reproducible
d2 <- d %>% dplyr::select(gg2c4, sfeatga_res$x )
mytask = makeClassifTask(id='rocanalysis', data = d2, target = outcomename, positive = "Yes")
log_mod = train(mylearner, mytask)
log_pred = predict(log_mod, task = mytask)
df = generateThreshVsPerfData(list(logistic = log_pred), measures = list(fpr, tpr))
plotROCCurves(df)
```
Based on this plot, an FPR of around 0.5 and a TPR of around 0.75 seem to produce a good model. Digging into the `df` object, you can find the raw data underlying the curve in `df$data`:
```{r roc-best}
print(df$data[29:31,])
```
This means that if we were to use this model, we might want to set a threshold of around 0.3, i.e. everything with a predicted probability above that value would be classified as positive/yes.
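A sketch of how one could pick and apply such a threshold with `mlr` is shown below (not evaluated here). It assumes the `df` and `log_pred` objects created above; the 'best' point is chosen by maximizing TPR minus FPR (Youden's J), which is one common, but not the only, way to pick a cut-off.
```{r apply-threshold-sketch, eval=FALSE}
#not run: pick a threshold from the ROC data and apply it to the predictions
#find the threshold that maximizes TPR - FPR (Youden's J)
best <- df$data[which.max(df$data$tpr - df$data$fpr), ]
print(best)
#re-threshold the stored predictions and inspect the resulting performance
log_pred_thr <- setThreshold(log_pred, best$threshold)
calculateConfusionMatrix(log_pred_thr)
performance(log_pred_thr, measures = list(mlr::acc, mlr::tpr, mlr::fpr))
```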
# Trying a different model
Let's see if we can find another model that might perform even better. We'll look at a linear discriminant analysis (LDA) model; see e.g. chapter 4 of ISLR for more on that model. We again use AUC as our measure. The null-model performance is the same, so we only need to re-do the remainder.
Let's re-run the single-predictor models, followed by the full model and subset selection.
```{r, unifit-auc-lda, warning=FALSE}
#copy and paste the code from above for single predictor fits. Switch the learner from the logistic model to an LDA model. Check the mlr website to figure out the name for that learner (see the 'Integrated Learners' section).
set.seed(1111) #makes each code block reproducible
#set learner/model. this now corresponds to an LDA model.
#mlr calls different models different "learners"
learner_name = "classif.lda";
mylearner = makeLearner(learner_name, predict.type = "prob")
# this will contain the results
unifmat=data.frame(variable = rep(0,npred), AUC = rep(0,npred))
# loop over each predictor, build simple dataset with just outcome and that predictor, fit it to a LDA model
for (nn in 1:npred)
{
unidata = data.frame(gg2c4 = outcome, d[,nn+1] )
## Generate the task, i.e. define outcome and predictors to be fit
mytask = makeClassifTask(id='unianalysis', data = unidata, target = outcomename, positive = "Yes")
model = resample(mylearner, task = mytask, resampling = sampling_choice, show.info = FALSE, measures = mlr::auc )
unifmat[nn,1] = names(predictors)[nn]
unifmat[nn,2] = model$aggr
}
kable(unifmat)
```
```{r fullfit-auc-lda}
#copy and paste the code from above for the full model fit, but now for LDA. Since you already switched the learner/model above, no further adjustment should be needed
set.seed(1111) #makes each code block reproducible
#do full model with Cross-Validation - to get an idea for the amount of over-fitting a full model does
mytask = makeClassifTask(id='fullanalysis', data = d, target = outcomename, positive = "Yes")
fullmodel = resample(mylearner, task = mytask, resampling = sampling_choice, show.info = FALSE, measures = mlr::auc )
AUC_fullmodel = fullmodel$aggr[1]
print(AUC_fullmodel)
```
```{r mlr-subset-auc-lda}
#copy and paste the code from above for subset selection, now for LDA. Since you already switched the learner/model above, no further adjustment should be needed
set.seed(1111)
tstart=proc.time(); #capture current CPU time for timing how long things take
#do 2 forms each of forward and backward selection, just to compare
select_methods=c("sbs","sfbs","sfs","sffs")
resmat=data.frame(method = rep(0,4), AUC = rep(0,4), Model = rep(0,4))
ct=1;
for (select_method in select_methods) #loop over all stepwise selection methods
{
ctrl = makeFeatSelControlSequential(method = select_method)
print(sprintf('doing subset selection with method %s ',select_method))
sfeat_res = selectFeatures(learner = mylearner,
task = mytask,
resampling = sampling_choice,
control = ctrl,
show.info = FALSE,
measures = mlr::auc)
resmat[ct,1] = select_methods[ct]
resmat[ct,2] = sfeat_res$y
resmat[ct,3] = paste(as.vector(sfeat_res$x), collapse= ', ')
ct=ct+1;
}
# do feature selection with genetic algorithm
maxit = 100 #number of iterations - should be large for 'production run'
ctrl_GA =makeFeatSelControlGA(maxit = maxit)
print(sprintf('doing subset selection with genetic algorithm'))
sfeatga_res = selectFeatures(learner = mylearner,
task = mytask,
resampling = sampling_choice,
control = ctrl_GA,
show.info = FALSE,
measures = mlr::auc)
resmat[5,1] = "GA"
resmat[5,2] = sfeatga_res$y
resmat[5,3] = paste(as.vector(sfeatga_res$x), collapse= ', ')
runtime.minutes_SS=(proc.time()-tstart)[[3]]/60; #total time in minutes the optimization took
print(sprintf('subset selection took %f minutes',runtime.minutes_SS));
kable(resmat)
```
It looks like we again get some sub-models that are slightly better than the full model, at least as evaluated using the (cross-validated) AUC measure. There seems to be a small, but not large, difference in performance compared to the logistic model. We can take the GA model again (because it's conveniently at the end, even if it's not the best) and look at its ROC curve, comparing the curves for the LDA and logistic models.
```{r roc-plot-2}
#copy the code from roc-plot above and re-do train/predict but now save it as lda_mod and lda_pred
#then use generateThreshVsPerfData() with both lda_pred and log_pred (computed above) to create a structure that contains
#best model curves for both logistic and LDA, then plot the curves.
set.seed(1111) #makes each code block reproducible
d2 <- d %>% dplyr::select(gg2c4, sfeatga_res$x )
mytask = makeClassifTask(id='rocanalysis', data = d2, target = outcomename, positive = "Yes")
lda_mod = train(mylearner, mytask)
lda_pred = predict(lda_mod, task = mytask)
df = generateThreshVsPerfData(list(logistic = log_pred, lda = lda_pred), measures = list(fpr, tpr))
plotROCCurves(df)
```
You should see from the plot that the LDA model performs very similarly to the logistic model.
```{r parallelization-end}
#close down the parallelization started at the beginning to free up resources.
parallelStop()
```
# Wrapping up
Based on the above, we could choose a model simply by best performance. However, we likely also still want to look at model prediction uncertainty and do some residual plots and other diagnostics for our chosen model before we finalize our choice. As was the case for `caret`, `mlr` also comes with lots of functions that make these - and many other - tasks easier. We'll leave it at this for now, and might revisit some of those topics in further exercises.
Also, none of the models here are that great. We might want to think more about our initial premise, i.e. looking for associations between virus strain type and other variables, and what the scientific rationale is for expecting such associations. Here, we just ran the models and looked at what we might find. That approach, sometimes called _data exploration_ or _data mining_ or - less charitably - a _fishing expedition_, is ok, but we need to be careful about how we interpret and use the results. Going in with a clear hypothesis is usually better.
# A few more comments
Things are getting a bit slow now; even with multiple cores, we need minutes to run things. Also, the code chunks are getting larger. At this point, it might be worth considering a structure whereby most of the code lives in one or several **well documented** R scripts, which can be run independently. Those R scripts should save their results, and those results are then loaded into this document. The advantage is that if you want to change a small part of the analysis, you can modify and run just that part and update the whole document by re-knitting, without having to re-run all the code. For any big/serious analysis, I suggest such a setup that splits R code from R Markdown files, at least for the computationally intensive parts.
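A minimal sketch of that setup, with hypothetical file names (not evaluated here): the heavy lifting happens in a standalone script that saves its results, and the R Markdown document only loads and displays them.
```{r load-results-sketch, eval=FALSE}
#not run: sketch of loading pre-computed results
#in a standalone script (e.g. subset_selection.R), run the slow code and save the results:
#  saveRDS(resmat, "results/subset_selection_results.rds")
#then in this Rmd, load and display them without re-running the computation:
resmat <- readRDS("results/subset_selection_results.rds")
kable(resmat)
```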