Machine Learning models for pKa predictions of amino acid side-chains.
All training and test data splits as well as pretrained KaML-CBtree models.
KaML-CBTrees/train_test_splits contains the train/test sets for each of the 20 splits.
KaML-CBTrees/models contains the 20 models for evaluation as well as the finalized models (catboost_acid_finalized and catboost_base_finalized) trained on the whole dataset.
KaML-CBTrees/KaML-CBtree.py is the end-to-end prediction script. It takes a PDB file as input, finds all Asp, Glu, His, Cys, Lys, and Tyr residues, calculates the tree model features for each residue, predicts their pKa values using the finalized KaML-CBtree models, and saves the results in a csv file.
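The first step of that pipeline, locating the titratable residues, can be sketched in plain Python. This is a minimal illustration that reads the fixed-column ATOM records of the PDB format; it is not the code used in KaML-CBtree.py, and the names find_titratable_residues and atom_line are hypothetical helpers introduced only for this example.

```python
# Three-letter codes of the residue types named above.
TITRATABLE = {"ASP", "GLU", "HIS", "CYS", "LYS", "TYR"}


def find_titratable_residues(pdb_text):
    """Return (resname, chain, resseq) for each titratable residue, in file order."""
    seen = set()
    residues = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        resname = line[17:20].strip()   # PDB columns 18-20
        chain = line[21]                # PDB column 22
        resseq = int(line[22:26])       # PDB columns 23-26
        key = (resname, chain, resseq)
        if resname in TITRATABLE and key not in seen:
            seen.add(key)
            residues.append(key)
    return residues


def atom_line(serial, name, resname, chain, resseq):
    """Hypothetical helper: build the leading columns of an ATOM record."""
    return f"ATOM  {serial:>5}  {name:<3} {resname:>3} {chain}{resseq:>4}"


# Mock input for illustration only (coordinates omitted).
example_pdb = "\n".join([
    atom_line(1, "N", "ASP", "A", 1),
    atom_line(2, "N", "GLY", "A", 2),
    atom_line(3, "N", "HIS", "A", 3),
    atom_line(4, "CA", "HIS", "A", 3),
    atom_line(5, "N", "LYS", "A", 4),
])
titratable = find_titratable_residues(example_pdb)
```

Feature calculation and prediction are left out here because they depend on RIDA, DSSP, and the trained model files.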
- Clone repository.
- Install requirements:
  - 3.10.0 < Python < 3.12 (at the time of writing, pycaret does not work with Python 3.12)
  - pip install pycaret
  - pip install Biopandas
  - pip install catboost
- Set permissions for RIDA and DSSP
chmod +x tools/*
Run the script on a PDB file:
python KaML-CBtree.py path/to/input.pdb
This generates a new file, input.csv, in the current working directory, with residues in the first column and predicted pKa values in the second.
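The resulting csv file can be loaded with the Python standard library. The sketch below assumes only the two-column layout described above (residue, then predicted pKa); whether the file has a header row and how residues are labelled are not stated here, so the example contents and the function name read_pka_table are hypothetical.

```python
import csv
import io


def read_pka_table(handle):
    """Map the residue label (column 1) to the predicted pKa (column 2)."""
    predictions = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue
        try:
            predictions[row[0]] = float(row[1])
        except ValueError:
            # Non-numeric second column: treat it as a header row and skip it.
            continue
    return predictions


# Hypothetical file contents; the real labels and header may differ.
example = io.StringIO("Residue,pKa\nASP-26,3.42\nHIS-57,6.91\n")
table = read_pka_table(example)
```

For a real output file, open it with `open("input.csv", newline="")` and pass the handle to the same function.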
- At the moment, the code only works with single-chain PDB files without missing atoms.
- KaML-CBtree.py depends on features.py. Relative paths to rida and dssp are hard-coded in features.py. Relative paths to the model files are hard-coded in KaML-CBtree.py.
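Given the constraints above, a quick pre-flight check before running the script may save some debugging. The following is a sketch only: the function names are hypothetical, the chain check looks only at ATOM records, and it does not detect missing atoms.

```python
import sys


def python_version_supported(version):
    """True if 3.10.0 < version < 3.12, the range listed in the requirements."""
    return (3, 10, 0) < tuple(version[:3]) < (3, 12)


def chain_ids(pdb_text):
    """Return the set of chain identifiers (PDB column 22) found in ATOM records."""
    return {
        line[21]
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and len(line) > 21
    }


def _mock_atom(chain):
    """Hypothetical helper: a minimal ATOM record carrying only a chain ID."""
    return f"{'ATOM':<21}{chain}"


single = "\n".join(_mock_atom(c) for c in "AAA")
multi = "\n".join(_mock_atom(c) for c in "AAB")

print("interpreter supported:", python_version_supported(sys.version_info))
print("single-chain input:", len(chain_ids(single)) == 1)
print("multi-chain input:", len(chain_ids(multi)) == 1)
```

A file that yields more than one chain identifier would need to be split into single-chain files first.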
KaML-CBtrees uses the following software to calculate the features:
- RIDA: Dayhoff GW II, Uversky VN. Rapid prediction and analysis of protein intrinsic disorder. Protein Science. 2022; 31(12):e4496. https://doi.org/10.1002/pro.4496
- DSSP: Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983; 22(12):2577–2637. https://doi.org/10.1002/bip.360221211
KaML-GAT/train_val_test contains the training, validation, and test datasets for the 20 independent splits. Files are named exptAAfB_train/validation.csv, where AA is the split ID and B is the fold.
KaML-GAT/model_inputs contains the input files (a constructed graph for each residue) for training, validation, and test for the 20 independent splits.
KaML-GAT/train_model.sh calls train_model.py to train the model. Usage: bash train_model.sh
KaML-GAT/models contains two archives. trained_models.tar.gz contains all the trained models. model_training_results.tar.gz contains the training records:
- *_traindtl.csv contains the training predictions.
- *_valdtl.csv contains the validation predictions.
- *_predictions contains the test predictions.
- *.training contains the loss for each epoch.
- ana_split0-19.ipynb is the analysis notebook (it converts the normalized dpka predictions back to dpka before normalization, converts dpka to pKa, creates the ensemble, etc.).
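The post-processing steps named for the analysis notebook can be sketched as follows. Everything specific here is an assumption made for illustration: that the normalization is a z-score (mean and standard deviation of the training dpka values), that dpka is the shift relative to a reference pKa, and that the ensemble is a plain average. The notebook itself is the authority on what was actually done, and all numbers below are made up.

```python
from statistics import mean


def denormalize(z, train_mean, train_std):
    """Undo a z-score normalization (assumed scheme): dpka = z * std + mean."""
    return z * train_std + train_mean


def to_pka(dpka, reference_pka):
    """Convert a shift to an absolute value, assuming dpka = pKa - reference pKa."""
    return reference_pka + dpka


def ensemble_pka(normalized_predictions, train_mean, train_std, reference_pka):
    """Average the pKa values obtained from several models (assumed ensemble rule)."""
    pkas = [
        to_pka(denormalize(z, train_mean, train_std), reference_pka)
        for z in normalized_predictions
    ]
    return mean(pkas)


# Made-up numbers: three models' normalized outputs for one residue.
result = ensemble_pka([0.5, -0.5, 1.5], train_mean=0.0, train_std=2.0, reference_pka=4.0)
print(result)
```

Passing the normalization statistics and the reference pKa in as arguments keeps the sketch independent of any particular split or residue type.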
If you use KaML models in your research, please cite: https://doi.org/10.1021/acs.jctc.4c01602
Mingzhe Shen, Daniel Kortzak, Simon Ambrozak, Shubham Bhatnagar, Ian Buchanan, Ruibin Liu, Jana Shen