-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathREADME
executable file
·58 lines (36 loc) · 2.48 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
*******************************************************************************************
Feature tables and main scripts of the paper
"Prediction and characterization of human ageing-related proteins by using machine learning"
********************************************************************************************
Citation of the paper:
Kerepesi C, Daróczy B, Sturm Á, Vellai T, Benczúr A
Prediction and characterization of human ageing-related proteins by using machine learning
Scientific Reports, Vol 8, 4094 (2018).
Requirements (versions used by us in parenthesis):
- Linux (Ubuntu 16.04.3 LTS)
- Python 3.5.2
- Python packages: pandas (0.20.3), numpy (1.13.1), sklearn (0.19.0)
Running commands:
- Running 20 experiments of XGBoost with 5 fold CV (predictions are averaged, parameters: max_depth=1, n_est=50, n_exp=20)
$ python XGBoost_CV.py 1 50 20 Final_32_features.csv
- Dumping trees of the XGBoost model (max_d=1, n_est=50, inputfile=Final_feature_table.csv):
$ python TreeDumper.py 1 50 Final_32_features.csv
Description of the files (for more information please see Methods of the paper):
- aging_labels.csv:
Labels of the classification ('1' if the given protein included in GenAge database, 0 otherwise)
- uniprot_sprot_human.dat-GO_Digger_Sparse.py.txt-TableGen.py.csv.zip:
Gene Ontology features with ancestors and with ageing GOs.
- uniprot_sprot_human.dat-GO_Digger_Sparse.py.txt-TableGen.py.csv-woAgingGOs.csv.zip
Gene Ontology features with ancestors and without ageing GOs ( called 'GO' in Table 4 of the paper).
- uniprot_sprot_human.dat-GO_Digger_Sparse.py.txt-TableGen.py.csv-woAgingGOs.csv-f_sel_based_on_GO_stats.py-thr1000.csv.zip
Feature set containing only the GO features that occur in least 1000 proteins (called 'Frequent GOs' in Table 5 of the paper).
- uniprot_sprot_human.dat-InteractionDigger.py.txt-CreateNetworkFT_paired.py.csv-join_CytoscapeStats.csv:
PPI Network features.
- RNA_CoEx_vs_GenAge_FT.csv:
Co-expression features.
- Final_32_features.csv:
Table of the final 32 features (selected by XGBoost).
- Final_32_features.csv-XGBoost_CV_preds-n_est50-exp20.csv:
Output file of the command 'python XGBoost_CV.py 1 50 20 Final_32_features.csv'.
- Final_32_features.csv_Trees-n_est50-max_d1.txt:
Output file of the command 'python TreeDumper.py 1 50 Final_32_features.csv'.