This repository contains code for predicting the aqueous solubility of organic molecules using machine learning models. The models and dataset are based on the research paper: Predicting Aqueous Solubility of Organic Molecules Using Deep Learning Models with Varied Molecular Representations.
# pull the original PNNL codebase
git submodule init
git submodule update
# make the patch set
# this is not necessary since 2/19/2025 as the patch set is
# preserved in the patches directory
#./make_patches
# apply the patch set
./apply_patches
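Before applying the patch set, it can help to confirm that the submodule was actually checked out. The sketch below reads the output of `git submodule status`, whose leading character is documented by git as `-` (not initialized), `+` (checked-out commit differs from the recorded one), or `U` (merge conflicts). The function names `describe_submodules` and `report_submodules` are hypothetical helpers and not part of this repository, and the parsing assumes submodule paths contain no spaces.

```python
import subprocess

# Leading status characters documented for `git submodule status`.
STATUS_MEANING = {
    "-": "not initialized (run `git submodule init` and `git submodule update`)",
    "+": "checked-out commit differs from the one recorded in this repository",
    "U": "has merge conflicts",
    " ": "initialized and at the recorded commit",
}


def describe_submodules(status_output):
    """Map each submodule path to a description of its status.

    Hypothetical helper: assumes paths contain no whitespace.
    """
    report = {}
    for line in status_output.splitlines():
        if not line.strip():
            continue
        prefix = line[0]
        fields = line[1:].split()
        # fields[0] is the commit hash, fields[1] is the submodule path
        report[fields[1]] = STATUS_MEANING.get(prefix, "unknown status")
    return report


def report_submodules(repo_dir="."):
    """Run `git submodule status` in repo_dir and describe the result."""
    out = subprocess.run(
        ["git", "submodule", "status"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return describe_submodules(out)
```

Calling `report_submodules()` from the repository root returns a dictionary keyed by submodule path. Any entry reported as not initialized means the two `git submodule` commands above still need to be run.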
1. Download the data as described in the old instructions below.
- Download Data: Download the dataset from this link and save it as data.csv in the ./data folder.
- Generate Features:
  - Generate Pybel coordinates and Molecular Dynamics (MDM) features by running create_data.py in the ./data folder:

        cd ./data
        python create_data.py

- Train Models:
  - To train the MDM model, run train.py in the ./mdm folder:

        cd ../mdm
        python train.py

  - To train the GNN model, run train.py in the ./gnn folder:

        cd ../gnn
        python train.py

  - To train the SMI model, run train.py in the ./smi folder:

        cd ../smi
        python train.py

- Make Predictions:
  - Use the predict.ipynb files in each model's folder to make predictions:

        cd ../mdm
        jupyter notebook predict.ipynb

    Repeat the above steps for the gnn and smi folders.

- Ensemble Models:
  - To ensemble the models, run the following scripts:

        cd ../ensemble
        python CV.py
        python Optuna.py
        python KNN.py

- Compare Predictions:
  - To compare predictions from individual models with ensemble methods, use the ensemble_prediction.ipynb notebook:

        jupyter notebook ensemble_prediction.ipynb

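The script-based steps above (feature generation, training, ensembling) can be chained from the repository root instead of changing directories by hand. The following is a minimal sketch, not part of this repository: `STEPS`, `build_commands`, and `run_pipeline` are hypothetical names, and the folder and script names are taken from the list above. The notebook steps are left out because they are interactive. This README does not say whether the ensemble scripts need output from the predict.ipynb notebooks, so those may have to be run manually before the ensemble steps.

```python
import subprocess
import sys
from pathlib import Path

# (folder, script) pairs in the order the steps are listed above.
STEPS = [
    ("data", "create_data.py"),  # feature generation
    ("mdm", "train.py"),
    ("gnn", "train.py"),
    ("smi", "train.py"),
    ("ensemble", "CV.py"),
    ("ensemble", "Optuna.py"),
    ("ensemble", "KNN.py"),
]


def build_commands(repo_root, steps=STEPS, python=sys.executable):
    """Return (working_directory, argv) pairs without running anything."""
    root = Path(repo_root)
    return [(root / folder, [python, script]) for folder, script in steps]


def run_pipeline(repo_root=".", dry_run=True):
    """Print each step in order; execute it only when dry_run is False.

    With check=True, the first failing script raises CalledProcessError
    and the remaining steps are skipped.
    """
    for cwd, argv in build_commands(repo_root):
        print(f"[{cwd}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_pipeline()` only prints the planned commands, and `run_pipeline(dry_run=False)` executes them. The sketch uses `sys.executable` rather than a bare `python` so that every script runs in the same environment as the driver.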
For detailed instructions on how to run the models, featurize the data, and other specifics, please refer to the original research paper linked above. The methods and techniques described in the paper are critical for understanding and effectively using this repository.
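Since the feature-generation step reads ./data/data.csv, a quick check that the file is in place can save a failed run. The sketch below uses only the Python standard library and assumes the file is a comma-separated table with a header row. That is an assumption, since this README does not describe the columns. The helper name `summarize_csv` is hypothetical.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file.

    Assumes the first row is a header; that is not confirmed by the README.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; download the dataset first")
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty list if the file is empty
        row_count = sum(1 for _ in reader)  # remaining rows after the header
    return header, row_count
```

For example, `summarize_csv("data/data.csv")` run from the repository root returns the column names and the number of data rows, or raises `FileNotFoundError` if the download step was skipped.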