Commit

[README] add information from supplementary material
jcapels committed Apr 4, 2024
1 parent 327ee3d commit b836dee
Showing 4 changed files with 57 additions and 0 deletions.
57 changes: 57 additions & 0 deletions README.md
@@ -11,6 +11,12 @@ A ML pipeline for the prediction of specialised metabolites starting substances.
- [Data](#data)
- [AutoML](#automl)
- [Analysis of the results](#analysis-of-the-results)
- [Metrics](#metrics)
- [Similarity matrix and t-SNE generation](#similarity-matrix-and-t-sne-generation)
- [TPE algorithm](#tpe-algorithm)
- [Statistical methods](#statistical-methods)
- [Results](#results)
- [AutoML results](#automl-results)

## Installation

@@ -141,8 +147,59 @@ For the analysis of the results refer to the following files:

The results for the MGCNN can be found at [this link](https://github.com/jcapels/mgcnn_alkaloid.git).

### Metrics

The formula for mF1 is defined as:

$$
\text{mF1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}
$$

The formula for mRecall is defined as:

$$
\text{mRecall} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{True Positives}_i}{\text{True Positives}_i + \text{False Negatives}_i}
$$

The formula for mPrecision is defined as:

$$
\text{mPrecision} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{True Positives}_i}{\text{True Positives}_i + \text{False Positives}_i}
$$

where $N$ denotes the total number of classes, with $\text{Precision}_i$ and $\text{Recall}_i$ corresponding to the precision and recall for class $i$, respectively. $\text{True Positives}_i$ are the correct positive predictions for class $i$, $\text{False Negatives}_i$ are the missed predictions for class $i$, and $\text{False Positives}_i$ are the incorrect positive predictions for class $i$.
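The three macro-averaged definitions above can be sketched in plain Python; the toy labels below are invented for illustration only.

```python
# Pure-Python sketch of the macro-averaged metrics defined above.
# y_true / y_pred are made-up example labels, not pipeline data.
def macro_metrics(y_true, y_pred, classes):
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)  # N in the formulas above
    return sum(f1s) / n, sum(precisions) / n, sum(recalls) / n

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 2]
mF1, mPrecision, mRecall = macro_metrics(y_true, y_pred, classes=[0, 1, 2])
```

The same values are produced by scikit-learn's `f1_score`, `precision_score`, and `recall_score` with `average="macro"`.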

### Similarity matrix and t-SNE generation

A similarity matrix between all the Morgan fingerprints of the compounds in the whole dataset was generated to assess their similarity. The similarity function was the Tanimoto similarity index. A t-distributed Stochastic Neighbor Embedding (t-SNE) was created from this matrix to reduce dimensionality and for visualization.
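A minimal sketch of this step, using toy binary bit vectors as stand-ins for the Morgan fingerprints (which the pipeline would compute with a cheminformatics toolkit such as RDKit):

```python
# Tanimoto similarity matrix over binary fingerprints, then t-SNE on the
# corresponding distance matrix. The bit vectors are invented examples.
import numpy as np
from sklearn.manifold import TSNE

def tanimoto(a, b):
    """Tanimoto index: |intersection| / |union| of the on-bits."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

fps = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
])
n = len(fps)
sim = np.array([[tanimoto(fps[i], fps[j]) for j in range(n)] for i in range(n)])

# t-SNE takes a precomputed *distance* matrix, so use 1 - similarity.
emb = TSNE(n_components=2, metric="precomputed", init="random",
           perplexity=2.0).fit_transform(1.0 - sim)
```

Note that `init="random"` is required when `metric="precomputed"`, and `perplexity` must be smaller than the number of samples.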

### TPE algorithm

The TPE algorithm optimizes hyperparameter selection by modelling the probability that hyperparameters are effective, prioritizing regions that show promise according to an objective function $f(x)$, where $x$ represents the hyperparameters; here the objective is maximized. Based on a quantile threshold $\gamma$, the algorithm splits the observed hyperparameters into two densities: $l(x)$, fitted to those yielding higher (better) objective values, and $g(x)$, fitted to those yielding lower (worse) values. New hyperparameters are then sampled preferentially from $l(x)$, favouring candidates that maximize the ratio $l(x)/g(x)$.
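The split-and-resample loop can be sketched in one dimension; this is a deliberately simplified toy (the objective and all constants are invented), whereas real TPE implementations such as Optuna's or Hyperopt's are considerably more elaborate.

```python
# Minimal 1-D sketch of the TPE idea: split trials into "good" (top gamma
# fraction) and "bad", model each set with a kernel density estimate, and
# propose the candidate maximizing l(x) / g(x). Toy objective only.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def f(x):
    return -(x - 2.0) ** 2  # objective to maximize, peak at x = 2

xs = list(rng.uniform(-10, 10, size=20))  # random start-up trials
ys = [f(x) for x in xs]
gamma = 0.25

for _ in range(30):
    order = np.argsort(ys)[::-1]                  # best trials first
    cut = max(2, int(gamma * len(xs)))            # gamma quantile split
    good = np.asarray(xs)[order[:cut]]
    bad = np.asarray(xs)[order[cut:]]
    l_kde, g_kde = gaussian_kde(good), gaussian_kde(bad)
    cand = l_kde.resample(64, seed=rng).ravel()   # sample candidates from l(x)
    score = l_kde(cand) / (g_kde(cand) + 1e-12)   # prefer high l(x)/g(x)
    x_new = float(cand[int(np.argmax(score))])
    xs.append(x_new)
    ys.append(f(x_new))

best_x = xs[int(np.argmax(ys))]
```

Sampling candidates from $l(x)$ and ranking them by $l(x)/g(x)$ is what steers the search toward the promising region around the optimum.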


### Statistical methods

Given metric values for two models across $n$ tasks, $m_{1i}$ and $m_{2i}$, calculate the differences $d_i = m_{1i} - m_{2i}$ for each task $i$. Discard differences with $d_i = 0$, rank the absolute differences $|d_i|$ to obtain ranks $R_i$, and compute $W^+ = \sum_{d_i > 0} R_i$ and $W^- = \sum_{d_i < 0} R_i$; the test statistic is $W = \min(W^+, W^-)$. The p-value is the probability, under a reference distribution for the null hypothesis, of observing a value as extreme as or more extreme than the observed $W$. The null hypothesis is that there is no significant difference between the metric values of the two models; a p-value below 0.05 is considered sufficient to reject it.

In the context of cross-validation, given two models evaluated across $n$ tasks and $r$ folds, resulting in performance metrics $m_{Aij}$ and $m_{Bij}$ for models $A$ and $B$ respectively, for each task $i$ and fold $j$, perform the following steps: calculate the differences $d_{ij} = m_{Aij} - m_{Bij}$, rank the absolute differences $|d_{ij}|$, and apply the Wilcoxon Signed-Rank test as explained above.
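SciPy's `scipy.stats.wilcoxon` implements exactly this paired procedure; the per-task metric values below are invented for illustration.

```python
# Wilcoxon signed-rank test on paired per-task metrics for two models.
# The numbers are made-up examples, not results from the pipeline.
from scipy.stats import wilcoxon

m1 = [0.82, 0.75, 0.91, 0.68, 0.77, 0.85, 0.730, 0.800]  # model 1, per task
m2 = [0.78, 0.74, 0.88, 0.70, 0.71, 0.80, 0.695, 0.785]  # model 2, per task

# Zero differences are dropped by default; the returned statistic is
# min(W+, W-), matching the definition above.
stat, p = wilcoxon(m1, m2)
```

For cross-validation results, the same call applies after flattening the per-task, per-fold differences $d_{ij}$ into one paired sample.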


## Results

### AutoML results

The figures below show the results of the automated machine learning runs. The first shows the features used during the optimization and the mF1 score on the validation set for each trial. Morgan and layered fingerprints (FP) stood out as the best features.

![Fingerprints](fingerprints.png)

The figure below shows the models trained and the mF1 scores obtained by each model on the validation set. The ridge classifiers stood out unequivocally.

![Models](models.png)

The figure below shows the F1 scores for each precursor and model.

![label_f1_score](label_f1_score.png)



Binary file added fingerprints.png
Binary file added label_f1_score.png
Binary file added models.png
