---
output: github_document
---
```{r include=FALSE}
for (f in list.files(path = here::here("R"), pattern = "\\.R$", full.names = TRUE)) {
source(f)
}
```
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
<!-- README.md is generated from README.Rmd. Please edit that file -->
# About the playOmics package
<!-- badges: start -->
**playOmics** is a comprehensive R package tailored to meet the specific demands of multi-omics experiment analyses, providing robust tools for biomarker discovery and disease prognosis.
* **Feature Identification**: Conducts feature selection across datasets to pinpoint top prognostic markers and creates predictive models using logistic regression.
* **Model Superiority**: Constructs multi-marker models that often surpass single-variable models, with scores to highlight the best feature combinations.
* **Interpretability**: Offers a wealth of statistical data, visuals, and explainers, promoting a clear understanding of each marker's role.
* **Tailored Design**: Especially useful for studies with smaller sample sizes, such as research on rare diseases.
* **Practicality**: Ensures readiness with adaptable models, capable of working around missing datasets by employing alternative models.
In essence, **playOmics** stands out as a user-friendly, efficient tool that significantly elevates the potential of multi-omics data exploration, positioning it as a leading choice for both research and clinical applications.
<!-- badges: end -->
# Installation
## Development Version
You can install the development version of playOmics from GitHub with:
``` r
remotes::install_github("JagGlo/playOmics")
```
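If the remotes package is not yet available, install it first from CRAN:

``` r
install.packages("remotes")
```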
## Docker Installation
As an alternative installation method, the [playomics_env](https://github.com/JagGlo/playOmics_env) repository provides a setup for running playOmics within a Docker container. This setup includes testing data and scripts, offering a comprehensive R environment for users who prefer containerized solutions.
To get started with playOmics using Docker:
1. **Installation Requirements:** Ensure you have Docker and Docker Compose installed on your system.
2. **Repository Cloning:** Clone the [playomics_env](https://github.com/JagGlo/playOmics_env) repository to your local machine or server.
3. **Container Setup:** Follow the instructions within the repository to build and start the Docker container.
This method is recommended for users looking for an easy and consistent setup process, enabling a quick start to the modelling process.
# Exploring the BRCA Data
## Introduction to BRCA Data
To explore the **playOmics** package, you will work with the BRCA dataset from the TCGA data collection. You can delve deeper into the dataset's details [here](https://portal.gdc.cancer.gov/projects/TCGA-BRCA).
This dataset is openly available for research and can be downloaded from the [LinkedOmics portal](https://www.linkedomics.org/data_download/TCGA-BRCA/).
## Setting Up the Environment
To begin, load the necessary libraries. Ensure that **playOmics** and the additional libraries are available in your workspace.
```{r setup, message=FALSE, warning=FALSE}
# load playOmics
library(playOmics)
```
```{r}
# Load the additional required libraries
library(tidyverse)
library(readxl)
```
## Establishing Working Directory & Naming the Experiment
Next, establish a working directory. It's an essential step, allowing for streamlined logging and organization. You can achieve this using the **setwd()** function or the **here** package.
```{r message=FALSE, warning=FALSE}
here::set_here()
my_experiment_name <- "final_top5_cv5"
```
## Preparing the Data
Now, let's read each data frame and assign it a descriptive name. At this stage, it's crucial to unify identifier column names, correct any improper column types (e.g., character to numeric), eliminate unnecessary variables, and perform any other needed cleaning.
**Remember**: Structure each dataset with variables in columns, observations in rows, and importantly, the first column should contain the observation ID, uniformly named across datasets.
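For instance, a minimal sketch of such cleanup (the column names `sample_barcode` and `age` here are hypothetical placeholders, not part of the BRCA files) might look like:

``` r
# Hypothetical cleanup: unify the identifier column name and fix column types
my_omics <- my_omics %>%
  rename(ID = sample_barcode) %>%    # same ID name across all datasets
  mutate(age = as.numeric(age)) %>%  # character -> numeric where needed
  select(ID, everything())           # ID as the first column
```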
Let's start with the clinical data:
```{r}
clinical_data <-
read_delim("TCGA-BRCA/Human__TCGA_BRCA__MS__Clinical__Clinical__01_28_2016__BI__Clinical__Firehose.tsi", na = c("NA", "NA,NA")) %>%
data.table::transpose(keep.names = "ID", make.names = "attrib_name") %>%
select(-"overallsurvival") %>%
mutate_at(.vars = c("years_to_birth", "overall_survival", "number_of_lymph_nodes", "Tumor_purity"), as.numeric)
clinical_data %>%
count(histological_type)
```
In this exploration, your focus will be on distinguishing between two histological types: ductal and lobular cancer. So, let’s filter the data to include only these types and make the type names more readable.
```{r}
clinical_data <-
clinical_data %>%
filter(histological_type %in% c("infiltratingductalcarcinoma", "infiltratinglobularcarcinoma")) %>%
# increase readability
mutate(histological_type = case_when(
histological_type == "infiltratinglobularcarcinoma" ~ "lobular",
histological_type == "infiltratingductalcarcinoma" ~ "ductal"
))
```
As you move forward, you will integrate various omics datasets into your experiment, taking careful steps to ensure accurate and meaningful insights.
Omics datasets that will be incorporated in the experiment:
* clinical data (20 features),
* proteome (175 features),
* methylation (20107 features),
* miRNA (824 features),
* mutation (7967 features),
* RNASeq (20156 features),
* SCNV (24777 features).
```{r}
proteome <-
read_delim("TCGA-BRCA/Human__TCGA_BRCA__MDA__RPPA__MDA_RPPA__01_28_2016__BI__Gene__Firehose_RPPA.cct", na = c("NA", "NA,NA"), show_col_types = F) %>%
data.table::transpose(keep.names = "ID", make.names = "attrib_name") %>%
mutate_at(vars(-ID), as.numeric)
methylation <-
read_delim("TCGA-BRCA/Human__TCGA_BRCA__JHU_USC__Methylation__Meth450__01_28_2016__BI__Gene__Firehose_Methylation_Prepocessor.cct", na = c("NA", "NA,NA"), show_col_types = F) %>%
data.table::transpose(keep.names = "ID", make.names = "attrib_name") %>%
mutate_at(vars(-ID), as.numeric)
miRNA <-
read_delim("TCGA-BRCA/Human__TCGA_BRCA__BDGSC__miRNASeq__HS_miR__01_28_2016__BI__Gene__Firehose_RPKM_log2.cct", na = c("NA", "NA,NA"), show_col_types = F) %>%
data.table::transpose(keep.names = "ID", make.names = "attrib_name") %>%
mutate_at(vars(-ID), as.numeric)
mutation <-
read_delim("TCGA-BRCA/Human__TCGA_BRCA__WUSM__Mutation__GAIIx__01_28_2016__BI__Gene__Firehose_MutSig2CV.cbt", na = c("NA", "NA,NA"), show_col_types = F) %>%
data.table::transpose(keep.names = "ID", make.names = "attrib_name") %>%
mutate_at(vars(-ID), as.numeric)
RNAseq <-
read_delim("TCGA-BRCA/Human__TCGA_BRCA__UNC__RNAseq__HiSeq_RNA__01_28_2016__BI__Gene__Firehose_RSEM_log2.cct", na = c("NA", "NA,NA"), show_col_types = F) %>%
data.table::transpose(keep.names = "ID", make.names = "attrib_name") %>%
mutate_at(vars(-ID), as.numeric)
SCNV_log_ratio <-
read_delim("TCGA-BRCA/Human__TCGA_BRCA__BI__SCNA__SNP_6.0__01_28_2016__BI__Gene__Firehose_GISTIC2.cct", na = c("NA", "NA,NA"), show_col_types = F) %>%
data.table::transpose(keep.names = "ID", make.names = "attrib_name") %>%
mutate_at(vars(-ID), as.numeric)
```
## Connecting the datasets
First off, create a list of dataframes. This step lets you work with each dataframe on its own while still keeping track of the overall structure. Use **connect_datasets()** to create a list in which each dataframe gets its name. The additional parameter `remove_original_data` indicates whether the original dataframes should be removed. This is often needed, as omics data can become quite heavy due to its dimensionality (tens to hundreds of thousands of features).
```{r}
BRCA_data <- connect_datasets(clinical_data, proteome, methylation, miRNA, mutation, RNAseq, SCNV_log_ratio,
remove_original_data = TRUE
)
BRCA_data %>% summary()
```
This gives you a list with seven elements, each named after its original dataframe and each starting with the shared "ID" variable.
## Exploring and Refining the Data
### Evaluating Data Availability
In omics experiments, data availability commonly varies between sets for a number of reasons (e.g. detection limits, samples missing between laboratories, material unsuitable for other types of analysis). It is therefore essential to check data coverage between the different sets early on.
This can easily be done with the **plot_coverage()** function:
```{r}
plot_coverage(BRCA_data)
```
For the BRCA data, you'll notice various patterns of data availability, with the largest group (379 subjects) having complete data.
### Understanding Your Dataset
The **data_summary()** function can be used to discover the data structure. It presents the number of samples together with the number of variables and describes the data content (the number of numeric/character/factor columns):
```{r}
data_summary(BRCA_data)
```
It's helpful for discovering at a glance whether the data has the required structure. This can be especially important when reading data from text files (e.g. for a proteomics experiment).
Next, you can explore each dataframe separately using the **check_data()** function. It returns basic statistics for numerical and non-numerical variables separately, offering a simple way to spot suspicious variables (e.g. a low number of unique values):
```{r}
check_data(BRCA_data$clinical_data)
```
### Quality check
This operation should be conducted separately for each dataframe, as each omic has its own gold standards for data preprocessing.
Two handy functions have been implemented in the **playOmics** package for this purpose:
- **filter_below_threshold** - lets you define a numeric threshold for valid data together with the percentage of samples in which a variable must meet it.
To illustrate how this function works, let's filter the RNA data to keep only variables with more than 3 reads in more than 50% of samples:
```{r}
BRCA_data[["RNAseq"]] <- filter_below_threshold(
data = BRCA_data[["RNAseq"]],
numeric_threshold = 3,
pcent_of_samples = 0.5
)
BRCA_data[["RNAseq"]] %>% dim()
```
After applying this filter, the data were reduced from ~20k variables to around 15k.
- **filter_missing** - similar to the above, but variables are removed based on the percentage of missing values (e.g. variables missing in more than 50% of samples will be removed):
```{r}
# let's replace all zeros with NA to simulate missing data
BRCA_data[["miRNA"]][BRCA_data[["miRNA"]] == 0] <- NA
# apply filter
BRCA_data[["miRNA"]] <- filter_missing(
data = BRCA_data[["miRNA"]],
pcent_of_samples = 0.5
)
BRCA_data[["miRNA"]] %>% dim()
```
Initially, we had 824 columns in the miRNA data. After keeping only variables that are non-missing in at least 50% of samples, we ended up with 755 columns.
## Classification
Finally, the heart of our package!
### Define analysis target
Right now, the playOmics package supports only **supervised binary classification experiments**. It is therefore crucial to define the analysis target, which will be propagated to the classification algorithm.
If the data is structured as described in the previous sections, you will most likely have one dataset containing phenotype data, e.g., whether the patient survived or died. With the **define_target()** function, you pass the name of this dataframe (e.g. "clinical_data"), the name of the column that contains the desired status, and an indication of the "positive" class (the one you want to predict with your analysis). Another argument, `id_variable`, indicates the name of the column containing sample identifiers. As discussed previously, it should be common to all datasets to allow data merging.
```{r}
my_target <-
define_target(
phenotype_df_name = "clinical_data",
id_variable_name = "ID",
target_variable_name = "histological_type",
positive_class_name = "lobular"
)
my_target
```
### Dividing the Data: Train/Test Split
The **split_data_into_train_test()** function comes in handy when you need to divide your dataset into training and test subsets. By using the `target` argument, you can achieve a stratified split based on a specific column. This stratification occurs on a phenotype dataframe, and the IDs are accordingly assigned to other datasets. However, due to potential missing data, the split might not always respect the predefined proportion.
To illustrate, you can use this function to split the data randomly with a 90%/10% proportion; the remaining 10% will serve as a simulation of real data at the end of this experiment for demonstration purposes:
```{r}
splitted_data <- split_data_into_train_test(BRCA_data, prop = 9 / 10, target = my_target)
```
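Because missing data can shift the achieved proportion, a quick sanity check is to compare it against the requested 90% (a sketch assuming the split preserves the list-of-dataframes structure shown above):

``` r
# Achieved train fraction on the phenotype dataframe (should be close to 0.9)
n_train <- nrow(splitted_data$train_data$clinical_data)
n_test <- nrow(splitted_data$test_data$clinical_data)
n_train / (n_train + n_test)
```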
Let's call the mentioned 10% data a "validation set" and leave it for later:
```{r}
validation_set <- splitted_data$test_data
```
We will treat `splitted_data$train_data` as the basis for the modeling experiment; let's call it "modelling_set" for now:
```{r}
modelling_set <- splitted_data$train_data
rm(splitted_data)
```
In preparation for the modeling process, we use the **split_data_into_train_test()** function once more:
```{r}
BRCA_data_splitted <-
split_data_into_train_test(modelling_set, prop = 8 / 10, target = my_target)
```
### Prepare dataset for modeling
In the next step, you can continue preparing the data for modeling with the **prepare_data_for_modelling()** function. It transforms all character and factor variables into "dummy" columns (e.g. a column `sex` with two values, `male` and `female`, is transformed into two columns, `sex_male` and `sex_female`, filled with 1/0 values) and translates logical columns into numbers. It also adds the dataset name to each variable to distinguish the data.
After the initial preprocessing, all the datasets are combined into one comprehensive dataframe using early concatenation.
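As a toy illustration of the dummy encoding described above (base R's model.matrix shown here; playOmics performs this step internally):

``` r
# A character column with two values becomes two 0/1 indicator columns
sex <- data.frame(sex = c("male", "female", "female"))
model.matrix(~ sex - 1, data = sex)
# yields indicator columns sexfemale and sexmale filled with 0/1
```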
```{r warning=FALSE}
data_prepared <-
prepare_data_for_modelling(data = BRCA_data_splitted$train_data, target = my_target, remove_correlated_features = F)
```
Also, a test set needs to be prepared so the models can be validated on corresponding data:
```{r warning=FALSE}
test_data_prepared <-
prepare_data_for_modelling(BRCA_data_splitted$test_data, target = my_target, remove_correlated_features = FALSE)
rm(BRCA_data_splitted)
```
### Selecting features
In omics experiments, the number of features is often many times larger than the number of observations. Data like these are prone to overfitting; therefore, before attempting to predict the target, we need to reduce the data size. For efficiency, a filter method is chosen over wrapper or embedded methods.
The **rank_and_select_features()** function is designed to rank and select features in a dataset. It provides two modes of operation: features can be ranked either separately for each dataset or in a combined manner across all datasets. This flexibility allows the function to adapt to different analytical scenarios, whether you need independent feature rankings for individual datasets or a unified ranking for the entire dataset.
Once the features are ranked, the function allows you to select the most relevant ones using one of three methods:
- by selecting defined "top n" features,
- by selecting % of variables from each dataset,
- or by defining a cut-off threshold for a selected metric.
The function supports all feature ranking methods available in the mlr3filters package (see more under <https://mlr3filters.mlr-org.com/>).
```{r}
data_filtered <-
rank_and_select_features(
data = data_prepared,
target = my_target,
filter_name = "auc",
cutoff_method = "top_n",
cutoff_treshold = 5,
selection_type = "separate",
return_ranking_list = TRUE,
n_fold = 5,
n_threads = 5)
rm(data_prepared)
```
If the `return_ranking_list` parameter is set to TRUE, the function returns a list containing the filtered datasets and the ranking list. The ranking can be visualized using the **visualize_ranking()** function:
```{r}
visualize_ranking(data_filtered$ranking_list)
```
### Feature selection method consideration
To support you in making an informed decision about the optimal feature selection strategy for your data, we've prepared a notebook designed to fit the playOmics data structure. It allows you to experiment with various feature selection methods and evaluate which approach works best for your analysis. You can find it in the playomics_env GitHub repository, where it is ready to be used within the dockerized environment: [FS_simulations.Rmd](https://github.com/JagGlo/playOmics_env/feature%20selection%20simulations/FS_simulations.Rmd)
It includes a range of filter methods provided by the mlr3filters package (<https://mlr3filters.mlr-org.com/>), and it also supports two embedded methods: the Lasso and Random Forest (RF). This setup gives you flexibility in testing different feature selection strategies, helping you identify the optimal approach for your specific dataset and problem.
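For orientation, here is a minimal sketch of embedded selection with the Lasso using glmnet; `train_df` and its target column are hypothetical placeholders, not playOmics objects:

``` r
library(glmnet)

# Design matrix and binary outcome (placeholder names)
x <- as.matrix(train_df[, setdiff(names(train_df), "histological_type")])
y <- factor(train_df$histological_type)

# Cross-validated Lasso; features with nonzero coefficients are retained
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coefs <- coef(cv_fit, s = "lambda.min")
selected <- setdiff(rownames(coefs)[as.vector(coefs) != 0], "(Intercept)")
selected
```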
### Modelling
The **create_multiple_models()** function is a comprehensive tool to build multiple machine learning models using varying combinations of predictor variables to discover which yields the best results.
By default, the function examines combinations of up to three predictor variables. Still, if you're feeling adventurous, you can instruct it to consider larger combinations by adjusting the `n_max` parameter.
The logistic regression algorithm is at the heart of the function—a well-established method for predicting binary outcomes. The choice of logistic regression ensures that the models produced are both interpretable and reliable, making them an excellent fit for a wide range of practical applications.
You then feed in your training and testing data using the `train_data` and `test_data` parameters, respectively. Additionally, the function needs clarity about which column in your data is the `target` (or dependent) variable and which serves as an identifier. This is essential for the function to focus on the modeling process properly.
Missing data are removed only when each single logistic regression model is formed. Delaying this step as long as possible is crucial because it ensures we extract the most value from your dataset. This is especially vital for omics data: different omics don't always align, so a data point might be missing in one omic but present in another. And if you're dealing with rare diseases, this step is a lifesaver, since it maximizes the usable sample size for every model, making the most of every single data point you have.
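A toy example of why late removal pays off: each candidate model only needs rows that are complete for its own predictors, not for every feature at once.

``` r
# Three features with scattered missingness
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(1, 2, NA, 4),
  c = c(NA, 2, 3, 4)
)
sum(complete.cases(df[, c("a", "b")])) # rows usable by a model on a + b: 2
sum(complete.cases(df))                # rows left if filtering on all: 1
```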
```{r eval=FALSE, echo = TRUE}
my_models <-
create_multiple_models(
experiment_name = my_experiment_name,
train_data = data_filtered,
test_data = test_data_prepared,
target = my_target,
n_max = 2,
trim_models = TRUE,
trim_metric = "train_mcc",
trim_threshold = 0.3,
# single model settings
validation_method = "cv",
n_prop = NULL,
n_repeats = 5,
log_experiment = TRUE,
explain = TRUE,
# configuration
n_cores = 5,
directory = here::here()
)
```
When you're running **create_multiple_models()**, you can trim away models that aren't up to the mark. By setting `trim_models` to TRUE, you're instructing the function to be watchful and eliminate models that don't meet a certain standard. This "standard" or "benchmark" is defined by you through the `trim_metric` and `trim_threshold` parameters. For instance, with the default settings, if a model's performance (measured using the Matthews Correlation Coefficient on the training data) is below 0.3, it gets deleted. This trimming process ensures you're left with the best-performing models, making your analysis more efficient and insightful, especially when handling complex datasets.
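For reference, the Matthews Correlation Coefficient underlying `train_mcc` can be computed from confusion-matrix counts as follows (a standalone sketch; playOmics computes it internally):

``` r
# MCC from confusion-matrix counts
mcc <- function(tp, tn, fp, fn) {
  (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
}
mcc(tp = 40, tn = 35, fp = 10, fn = 15) # approximately 0.50
```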
Lastly, for those who value robustness in their models, the function offers an option to validate the significance of your results using permutation tests. By turning on the `validate_with_permutation` parameter and specifying the number of permutations, the function will shuffle your data to ascertain the models' significance.
### Performance
The **create_multiple_models()** function operates with considerable efficiency, even as it juggles multiple computations in parallel. However, it’s important to note that the computational resources required might be substantial, depending on the specific experimental conditions.
Two key components significantly influence the performance and computational demands of this function. The first is `n_max`. This parameter determines the maximum number of variables considered in one model; consequently, the higher its value, the greater the number of combinations the function needs to assess. The computational load grows steeply because, when building models with multiple predictors, the function systematically sifts through all possible combinations of fewer elements before assessing the more complex, multifaceted combinations.
The second component is the actual number of variables selected for the experiment. The selection of variables is critical as the function generates combinations without repetition, increasing the computational demands with each added variable. This means that a thoughtful selection of variables and a calibrated setting of the `n_max` parameter are crucial to managing the function's computational intensity efficiently, allowing us to optimize the experiment's performance without compromising the accuracy and reliability of the results obtained.
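For intuition, the number of candidate models with exactly k of n selected features is the binomial coefficient, which R computes directly:

``` r
choose(35, 2)        # models with exactly 2 of 35 features: 595
sum(choose(35, 1:3)) # all models of size 1 to 3: 7175
```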
```{r echo=FALSE}
# number of combinations
no_of_combinations <- function(n_max, n_var) {
factorial(n_var) / (factorial(n_max) * factorial(n_var - n_max))
}
# no_of_combinations(2, 35)
# size of combinations
# lapply(c(35, 40), function(n_var){
# lapply(1:5, function(n_max) no_of_combinations(n_max, n_var)) %>% cumsum()
# })
col1 <- seq(10, 50, by = 5)
col2 <- rep(2:5, times = length(col1))
col1_repeated <- rep(col1, each = length(2:5))
df <- data.frame(n_var = col1_repeated, n_max = col2)
df %>%
mutate(no_of_combinations = no_of_combinations(n_max, n_var)) %>%
group_by(n_var) %>%
mutate(
cumulated = cumsum(no_of_combinations),
n_max = as_factor(n_max)
) %>%
ggplot(aes(x = n_var, y = cumulated, color = n_max)) +
geom_line() +
scale_y_continuous(
trans = "log10",
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))
) +
labs(x = "no. of variables in final set", y = "cumulated no. of combinations", color = "max no. of predictors in single model") +
theme_classic() +
theme(legend.position = "top") +
guides(color = guide_legend(
title.position = "top",
title.hjust = 0.5
))
```
### Logging Models and Details
Each model and its associated details are logged within the **create_multiple_models()** function. This not only preserves the model's structure and results but also ensures that it can be easily retrieved and deployed in production environments, ready for real-world applications. When you provide a unique name through the `experiment_name` parameter, a directory is created in which all detailed and summarized results are saved.
### Understanding the Results
The overall model results can be collected with the **read_model_data()** function. The additional `directory` parameter can be passed if the results directory differs from the working directory.
```{r eval = F}
results <- read_model_data(
experiment_name = my_experiment_name,
directory = here::here()
)
```
If, at any point in the analysis, something goes wrong and you do not obtain the complete results, or if you wish to examine the detailed model logs, you can use **read_logged_data()** to obtain one of the following: "metrics", "model_logs", "model_coef", or "params".
```{r eval = F}
results <- read_logged_data(
experiment_name = my_experiment_name,
file_type = "metrics",
directory = here::here()
) %>%
bind_rows()
```
The Shiny application has been developed to facilitate user interaction with the data. Due to data missingness, we exclude from the final assessment models where the number of lobular samples in the test data is 10 or fewer. You can access the Shiny application with the **results_GUI()** function:
```{r eval=FALSE}
results %>%
filter(!is.na(train_mcc), test_n_lobular > 10) %>%
results_GUI(target = my_target)
```
![GUI overview](man/shiny_screenshots/GUI.png)
The primary results table in the panel's upper section showcases all models' statistics. It includes two additional columns: the "Data" column, which reveals the training data, and the "Predict" column, which enables predictions directly from the selected model. Subsequent sections elucidate the exact usage and functionalities of the bottom panel.
Through these panels and functionalities, you gain a comprehensive understanding of the models and their performance, enabling you to interact with and analyze the data effectively and intuitively.
#### Exploring the Experiment Overview Panel
In this panel, you have the option to create histogram plots of the selected experiment’s metric. By clicking the "Add New Plot" button, you can generate multiple plots to better visualize the outcomes:
![Experiment overview - MCC metric among the experiment for train and test data](man/shiny_screenshots/experiment_overview.png)
#### Diving into the Analytes Overview Panel
This panel discloses statistics for individual molecules evaluated during model creation, along with their average metrics across all models in which they were included:
![Analytes overview](man/shiny_screenshots/analytes_overview.png)
#### Single model overview panel
By activating the "show data" button next to the desired model, you are redirected to this panel. Here, you can observe the training data and either a 2D or 3D plot representing the data, accompanied by the predictor's statistics (median values + Q1-Q3).
![Single model overview](man/shiny_screenshots/single_model_overview.png)
#### Prediction panel
When you turn on the `explain` option in the **create_multiple_models()** function, DALEX looks closely at the model. This gives a clearer picture of how the model works and helps anyone who wants to know why a model makes a particular choice.
It uses local model explanations to understand how different features affect a prediction. SHAP values, presented on boxplots, measure a variable's contribution to a final prediction.
For models with a binary outcome, such as classifying breast cancer types as "ductal" or "lobular", SHAP values help to interpret the influence of features on the model's predicted probabilities, rather than the explicit "ductal"/"lobular" statuses. In this setting, where the "lobular" status has been set as the positive class in the `target` object, a positive SHAP value for a specific feature means that it increases the predicted probability of an instance being classified as "lobular". Conversely, a negative SHAP value means the feature decreases that likelihood, making the instance more likely to be classified as "ductal". Suppose a model's average predicted probability of "lobular" (the baseline) is 0.6, and a particular sample's SHAP value for a specific feature (say, an RNA species) is 0.25. This indicates that the feature increases that sample's "lobular" probability by 0.25 (25 percentage points). Adding the sum of all of a sample's SHAP values to the baseline yields the model's predicted probability that this specific sample is "lobular".
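In R terms, this bookkeeping is simply (a toy sketch with made-up numbers and hypothetical feature names):

``` r
# Toy SHAP arithmetic: baseline plus per-feature contributions
baseline <- 0.60                           # average predicted P(lobular)
shap <- c(rna_x = 0.25, protein_y = -0.10) # hypothetical SHAP values
baseline + sum(shap)                       # this sample's predicted P(lobular)
#> [1] 0.75
```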
For more information, check out: https://dalex.drwhy.ai/
![Prediction screen](man/shiny_screenshots/prediction.png)
## Validation
The validation dataset, which remains unused during the model training phase, can be utilized to generate predictions on novel data. This approach mirrors the practical deployment of the developed models.
The **prepare_data_for_modelling()** function is designed to ready newly acquired data for the modeling process. Subsequently, the Shiny application can be initiated through **results_GUI()**. To engage this functionality, activate the "Make prediction" button corresponding to the desired model and choose the "Select data from your env" option. This action will import the validation dataset into the application, facilitating the generation of predictions on new data. Additionally, the application offers the feature to dissect specific predictions associated with a selected ID, which can be chosen from a provided list.
```{r warning=FALSE}
validation_target <-
define_target(phenotype_df_name = "clinical_data",
id_variable_name = "ID",
target_variable_name = NULL,
positive_class_name = NULL
)
validation_data_prepared <-
prepare_data_for_modelling(data = validation_set, target = validation_target)
```
```{r eval=FALSE}
results_GUI(results, target = validation_target)
```
### Ensemble Learning Guide
At the end of the pipeline, ensemble learning is performed to identify a minimal number of models that can collectively maximize prediction quality. By combining the predictions of multiple models, you can achieve improved accuracy, robustness, and stability in your results. Ensemble learning helps balance the trade-off between using fewer models and retaining high-quality predictions. Below, we walk you through the steps of implementing ensemble learning using your dataset and predictions.
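As a toy illustration of the soft voting used later in this guide, an ensemble averages the per-model class probabilities before thresholding:

``` r
# Soft vote over two models' predicted P(lobular) for three samples
p_model1 <- c(0.9, 0.2, 0.6)
p_model2 <- c(0.7, 0.4, 0.3)
p_soft <- (p_model1 + p_model2) / 2
ifelse(p_soft > 0.5, "lobular", "ductal")
#> [1] "lobular" "ductal"  "ductal"
```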
You can find the complete notebook in the playomics_env GitHub repository, where it is ready to be used within the dockerized environment: [ensembling_notebook.Rmd](https://github.com/JagGlo/playOmics_env/ensembling/ensembling_notebook.Rmd)
Step 1: **Read Model Data**
First, load the results of your models into memory. This data contains essential information about the models, including their directories and names, which will be used throughout the ensemble learning process.
```r
results <- read_model_data(
experiment_name = my_experiment_name,
directory = here::here()
)
selected_models <- results
```
Step 2: **Read Predictions and Prepare for Ensembling**
Next, import the logged predictions from the models. These predictions are merged with the model metadata (like model names) and formatted into a structure suitable for ensemble learning.
```r
predictions <- read_logged_predictions(
experiment_name = my_experiment_name,
prediction_type = "test",
directory = here::here()
) %>%
left_join(
results %>%
transmute(dir = model_dir, model_name),
by = "dir"
) %>%
select(-dir)
```
Step 3: **Define the Target Variable**
Define the target variable using the `playOmics::define_target()` function. This ensures that your predictions align correctly with the target data structure.
```r
my_target <- playOmics::define_target(
phenotype_df_name = "clinical",
id_variable_name = "ID",
target_variable_name = "histological_type"
)
my_target$target_levels <- levels(pull(predictions, !!sym(my_target$target_variable)))
```
If a positive class is defined, determine its predicted probability column name. Otherwise, use the first target level as the predicted positive class.
```r
predicted_as_positive <- if(!is.null(my_target$positive_class)){
paste0(".pred_", my_target$positive_class)
} else {
paste0(".pred_", my_target$target_levels[1])
}
```
Step 4: **Prepare the Data for Ensembling**
Spread the predictions from individual models into a wide format for ensembling. Each column represents the predictions of a single model.
```r
pred_spread <- predictions %>%
select(model_name, !!predicted_as_positive, ID) %>%
spread(model_name, !!predicted_as_positive) %>%
select(-ID) %>%
mutate_all(as.numeric)
```
Load the training, testing, and exemplary datasets. Prepare the exemplary dataset to align with the training data structure.
```r
load(paste(here::here(), my_experiment_name, "train_data.RData", sep = "/"))
load(paste(here::here(), my_experiment_name, "test_data.RData", sep = "/"))
exemplary_set <- readRDS(paste(here::here(), my_experiment_name, "exemplary_set.RDS", sep = "/"))
exemplary_set <- exemplary_set %>%
prepare_data_for_modelling(target = my_target) %>%
select(names(train_data))
```
Step 5: **Stepwise Metric Calculation**
Source the necessary helper functions for ensembling, then calculate stepwise metrics using different values of the regularization parameter (`lambda`). This step allows you to evaluate ensemble size and voting strategies (e.g., soft voting) under varying conditions.
```r
source("ensembling_helpers.R")
```
Run the stepwise metric calculation. Here, you evaluate the performance of ensembles containing different numbers of models (e.g., 1, 3, 5, 7, 9) while varying the regularization parameter `lambda` between 0 and 1.
```r
ensembling_results <-
lapply(seq(0, 1, 0.25), function(lambda){
metrics_test_stepwise(
selected_models,
exemplary_set,
pred_spread,
"miss",
"test_accuracy",
lambda = lambda,
vec_of_ensemble_size = c(1, 3, 5, 7, 9),
voting = "soft",
my_target
)
})
```
Step 6: **Visualize Results**
Visualize the stepwise ensemble performance using the `show_plots()` function. You can save the resulting plots for future reference.
```r
ensembling_results %>% show_plots()
```
Step 7: **Analyze Ensemble Performance**
Extract and analyze the predictions generated by the ensembles, focusing on ensemble size, accuracy, and other metrics of interest.
```r
ensembling_results %>%
lapply(function(x){
x$predictions
}) %>%
bind_rows() %>%
select(n_ensemble, lambda, `TRUE`, `FALSE`, n_not_classified, adjusted_acc, accuracy) %>%
filter(lambda == 0)
```
You can also examine the models included in each ensemble for different values of `lambda`.
```r
ensembling_results %>%
lapply(function(x){
x$models
}) %>%
bind_rows() %>%
filter(lambda == 0)
```
# Package scheme
![playOmics scheme](man/shiny_screenshots/playOmics_new_scheme.png)