Continuous Integration Tests #129

Merged: 30 commits, Nov 29, 2024
Changes shown are from 20 of the 30 commits.

Commits
78fbcea
add continuous integration
jcharkow Oct 30, 2024
b34300b
preinstall numpy
jcharkow Oct 30, 2024
955e1f1
remove numpy from setup
jcharkow Oct 30, 2024
a684136
install numpy in setup script
jcharkow Oct 30, 2024
ff5804c
convert to .toml setup
jcharkow Oct 30, 2024
4b6750a
remove numpy from requirement
jcharkow Oct 30, 2024
122753d
just ubuntu for now
jcharkow Oct 30, 2024
b9f35af
fix setup.py and .toml
jcharkow Oct 30, 2024
be62dfb
add line to build extension
jcharkow Oct 30, 2024
065dc28
Merge branch 'master' into ci
jcharkow Oct 31, 2024
15dd65b
fix: stats tests
jcharkow Oct 31, 2024
c99db26
add pytest-regtest to workflow
jcharkow Oct 31, 2024
fdb5513
update autotuning so it does not fail
jcharkow Nov 1, 2024
3c8bfbd
fix: fix level context tests
jcharkow Nov 1, 2024
d320945
update export-parquet tests and fix tests
jcharkow Nov 1, 2024
71804e4
raise error if standard deviation computed is 0
jcharkow Nov 20, 2024
516389e
test: set tree method as exact for tests
jcharkow Nov 21, 2024
ca4a14e
update snapshot tests
jcharkow Nov 21, 2024
0704475
update actions to depend on requirements file
jcharkow Nov 21, 2024
ce075f3
add dependabot
jcharkow Nov 21, 2024
a0235e0
add tests for windows and mac
jcharkow Nov 21, 2024
88291fa
remove default_rng
jcharkow Nov 21, 2024
08df7cd
remove mac tests
jcharkow Nov 21, 2024
0c342e9
remove copy of np arrays
jcharkow Nov 21, 2024
527c365
remove windows tests
jcharkow Nov 21, 2024
9c64665
refactor: new function for normalizing score to decoys
jcharkow Nov 21, 2024
9be0a9a
replace semi-supervised learning normalization with sklearn
jcharkow Nov 21, 2024
aa70dbe
revert to numpy std
jcharkow Nov 21, 2024
005af6c
minor updates to pyprophet.toml
jcharkow Nov 21, 2024
81f78bf
fix: ValueError: Buffer dtype mismatch, expected 'DATA_TYPE' but got …
jcharkow Nov 27, 2024
32 changes: 32 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,32 @@
name: continuous-integration

on: [push]

jobs:
test:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest]
#os: [ubuntu-latest, windows-latest, macos-latest] # remove mac tests
Contributor:

Is it possible to run the CI on windows and mac as well, or does it not work?

# Requirements file generated with python=3.11
python-version: ["3.11"]
steps:
- uses: actions/checkout@v4

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt # test with requirements file so can easily bump with dependabot
pip install .

- name: Compile cython module
run: python setup.py build_ext --inplace

- name: Test
run: |
python -m pytest tests/
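
The workflow installs pytest-regtest (see requirements.txt below), and several commits update snapshot tests. As a rough, hypothetical sketch of what such a regression test looks like, assuming the standard pytest-regtest fixture: anything written to the regtest fixture is compared against a previously recorded snapshot, and snapshots are (re)recorded with pytest --regtest-reset.

import numpy as np

def test_score_summary(regtest):
    # Stand-in for real scoring output; output printed to the file-like
    # 'regtest' fixture is diffed against the stored snapshot on later runs.
    scores = np.round(np.linspace(0.0, 1.0, 5), 3)
    print(scores.tolist(), file=regtest)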
9 changes: 9 additions & 0 deletions .github/workflows/dependabot.yml
@@ -0,0 +1,9 @@
version: 2
updates:
- package-ecosystem: "pip"
directory: "/" # Location of your pyproject.toml or requirements.txt
schedule:
interval: "weekly" # Checks for updates every week
commit-message:
prefix: "deps" # Prefix for pull request titles
open-pull-requests-limit: 5 # Limit the number of open PRs at a time
51 changes: 51 additions & 0 deletions pyproject.toml
@@ -0,0 +1,51 @@
[build-system]
requires = ["setuptools", "wheel", "numpy", "cython"] # Dependencies needed to build the package
build-backend = "setuptools.build_meta"

[project]
name = "pyprophet"
version = "2.2.8"
description = "PyProphet: Semi-supervised learning and scoring of OpenSWATH results."
readme = { file = "README.md", content-type = "text/markdown" }
license = { text = "BSD" }
authors = [{ name = "The PyProphet Developers", email = "rocksportrocker@gmail.com" }]
classifiers = [
"Development Status :: 3 - Alpha",
"Environment :: Console",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: BSD License",
"Operating System :: OS Independent",
"Topic :: Scientific/Engineering :: Bio-Informatics",
"Topic :: Scientific/Engineering :: Chemistry"
]
keywords = ["bioinformatics", "openSWATH", "mass spectrometry"]

# Dependencies required for runtime
dependencies = [
"Click",
"duckdb",
"duckdb-extensions",
"duckdb-extension-sqlite-scanner",
Comment on lines +26 to +28
Contributor:

duckdb is currently only used for OSW to parquet exporting, right? I'm wondering whether we could make it an optional dependency, so that anyone who wants parquet export installs pyprophet[parquet] or something like that. That would reduce the number of dependencies for the main library when just performing regular scoring and tsv exporting. What do you think?

Contributor Author:

From my initial tests, duckdb tends to speed up sqlite statements with many table joins, so I was thinking of extending its usage to scoring and tsv exporting, as this requires only minimal changes.

"numpy >= 1.9.0",
"scipy",
"pandas >= 0.17",
"cython",
"numexpr >= 2.10.1",
"scikit-learn >= 0.17",
"xgboost",
"hyperopt",
"statsmodels >= 0.8.0",
"matplotlib",
"tabulate",
"pyarrow",
"pypdf"
]

# Define console entry points
[project.scripts]
pyprophet = "pyprophet.main:cli"

[tool.setuptools]
packages = { find = { exclude = ["ez_setup", "examples", "tests"] } }
include-package-data = true
zip-safe = false
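
The duckdb discussion above suggests an optional pyprophet[parquet] extra. On the packaging side that would mean moving the three duckdb entries from dependencies into a [project.optional-dependencies] table; on the consumer side, a hedged, hypothetical sketch of how the parquet export code could fail gracefully when the extra is not installed:

try:
    import duckdb  # only present when the hypothetical 'parquet' extra is installed
except ImportError as err:
    raise ImportError(
        "OSW to parquet export requires the optional dependencies; "
        "install them with: pip install pyprophet[parquet]"
    ) from err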
14 changes: 8 additions & 6 deletions pyprophet/classifiers.py
@@ -110,7 +110,8 @@ def objective(params):

clf = xgb.XGBClassifier(random_state=42, verbosity=0, objective='binary:logitraw', eval_metric='auc', **params)

score = cross_val_score(clf, X, y, scoring='roc_auc', n_jobs=self.threads, cv=KFold(n_splits=3, shuffle=True, random_state=np.random.RandomState(42))).mean()
rng = np.random.default_rng(42)
score = cross_val_score(clf, X, y, scoring='roc_auc', n_jobs=self.threads, cv=KFold(n_splits=3, shuffle=True, random_state=42)).mean()
# click.echo("Info: AUC: {:.3f} hyperparameters: {}".format(score, params))
return score

@@ -129,7 +130,8 @@ def objective(params):
xgb_params_complexity = self.xgb_params_tuned
xgb_params_complexity.update({k: self.xgb_params_space[k] for k in ('max_depth', 'min_child_weight')})

best_complexity = fmin(fn=objective, space=xgb_params_complexity, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
rng = np.random.default_rng(42)
best_complexity = fmin(fn=objective, space=xgb_params_complexity, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)
best_complexity['max_depth'] = int(best_complexity['max_depth'])
best_complexity['min_child_weight'] = int(best_complexity['min_child_weight'])

@@ -139,31 +141,31 @@ def objective(params):
xgb_params_gamma = self.xgb_params_tuned
xgb_params_gamma['gamma'] = self.xgb_params_space['gamma']

best_gamma = fmin(fn=objective, space=xgb_params_gamma, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
best_gamma = fmin(fn=objective, space=xgb_params_gamma, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)

self.xgb_params_tuned.update(best_gamma)

# Tune subsampling hyperparameters
xgb_params_subsampling = self.xgb_params_tuned
xgb_params_subsampling.update({k: self.xgb_params_space[k] for k in ('subsample', 'colsample_bytree', 'colsample_bylevel', 'colsample_bynode')})

best_subsampling = fmin(fn=objective, space=xgb_params_subsampling, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
best_subsampling = fmin(fn=objective, space=xgb_params_subsampling, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)

self.xgb_params_tuned.update(best_subsampling)

# Tune regularization hyperparameters
xgb_params_regularization = self.xgb_params_tuned
xgb_params_regularization.update({k: self.xgb_params_space[k] for k in ('lambda', 'alpha')})

best_regularization = fmin(fn=objective, space=xgb_params_regularization, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
best_regularization = fmin(fn=objective, space=xgb_params_regularization, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)

self.xgb_params_tuned.update(best_regularization)

# Tune learning rate
xgb_params_learning = self.xgb_params_tuned
xgb_params_learning['eta'] = self.xgb_params_space['eta']

best_learning = fmin(fn=objective, space=xgb_params_learning, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=np.random.RandomState(42))
best_learning = fmin(fn=objective, space=xgb_params_learning, algo=tpe.suggest, max_evals=self.xgb_hyperparams['autotune_num_rounds'], rstate=rng)

self.xgb_params_tuned.update(best_learning)
click.echo("Info: Optimal hyperparameters: {}".format(self.xgb_params_tuned))
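
For context on the rstate change above: newer hyperopt releases expect a numpy Generator (np.random.default_rng) rather than the legacy np.random.RandomState for fmin's rstate argument, which appears to be the motivation here. A minimal, self-contained sketch of the seeding pattern, with a toy objective and search space (not pyprophet's):

import numpy as np
from hyperopt import fmin, tpe, hp

# Seed the optimizer with a Generator, mirroring np.random.default_rng(42) above.
rng = np.random.default_rng(42)

best = fmin(
    fn=lambda params: (params["x"] - 1.0) ** 2,  # toy loss to minimize
    space={"x": hp.uniform("x", -2.0, 2.0)},
    algo=tpe.suggest,
    max_evals=10,
    rstate=rng,
)
print(best)  # e.g. {'x': 0.97...}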
2 changes: 1 addition & 1 deletion pyprophet/export_parquet.py
@@ -172,7 +172,7 @@ def export_to_parquet(infile, outfile, transitionLevel, onlyFeatures=False):

# transition level
if transitionLevel:
columns['FEATURE_TRANSITION'] = ['AREA_INTENSITY', 'TOTAL_AREA_INTENSITY', 'APEX_INTENSITY', 'TOTAL_MI'] + getVarColumnNames(condb, 'FEATURE_TRANSITION')
columns['FEATURE_TRANSITION'] = ['AREA_INTENSITY', 'TOTAL_AREA_INTENSITY', 'APEX_INTENSITY', 'TOTAL_MI'] + getVarColumnNames(con, 'FEATURE_TRANSITION')
columns['TRANSITION'] = ['TRAML_ID', 'PRODUCT_MZ', 'CHARGE', 'TYPE', 'ORDINAL', 'DETECTING', 'IDENTIFYING', 'QUANTIFYING', 'LIBRARY_INTENSITY']
columns['TRANSITION_PRECURSOR_MAPPING'] = ['TRANSITION_ID']

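
The one-line change above appears to correct a stale variable name (condb should be the existing connection con). As background for the duckdb dependency discussed earlier, a hedged sketch of the duckdb + sqlite scanner pattern that querying an OSW (SQLite) file relies on; the file name is hypothetical, and the INSTALL/LOAD step may be unnecessary when the bundled duckdb-extension packages are installed or the extension is autoloaded:

import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite")   # sqlite scanner extension; may already be available
con.execute("LOAD sqlite")
con.execute("ATTACH 'merged.osw' AS osw (TYPE sqlite)")  # hypothetical OSW file
# FEATURE_TRANSITION is one of the OSW tables referenced in export_to_parquet.
rows = con.execute("SELECT COUNT(*) FROM osw.FEATURE_TRANSITION").fetchall()
print(rows)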
11 changes: 8 additions & 3 deletions pyprophet/levels_contexts.py
@@ -33,7 +33,12 @@ def statistics_report(data, outfile, context, analyte, parametric, pfdr, pi0_lam
outfile = outfile + "_" + str(data['run_id'].unique()[0])

# export PDF report
save_report(outfile + "_" + context + "_" + analyte + ".pdf", outfile + ": " + context + " " + analyte + "-level error-rate control", data[data.decoy==1]["score"], data[data.decoy==0]["score"], stat_table["cutoff"], stat_table["svalue"], stat_table["qvalue"], data[data.decoy==0]["p_value"], pi0, color_palette)
save_report(outfile + "_" + context + "_" + analyte + ".pdf",
outfile + ": " + context + " " + analyte + "-level error-rate control",
data[data.decoy==1]["score"].values, data[data.decoy==0]["score"].values, stat_table["cutoff"].values,
stat_table["svalue"].values, stat_table["qvalue"].values, data[data.decoy==0]["p_value"].values,
pi0,
color_palette)

return(data)

@@ -184,7 +189,7 @@ def infer_proteins(infile, outfile, context, parametric, pfdr, pi0_lambda, pi0_m
con.close()

if context == 'run-specific':
data = data.groupby('run_id').apply(statistics_report, outfile, context, "protein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette).reset_index()
data = data.groupby('run_id').apply(statistics_report, outfile, context, "protein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette)
Contributor:

Is reset_index no longer needed?

Contributor Author:

Removing reset_index is required to prevent the error:

  File "/home/joshua/mambaforge/envs/pyprophet_dev/lib/python3.11/site-packages/pandas/core/frame.py", line 5158, in insert
    raise ValueError(f"cannot insert {column}, already exists")
ValueError: cannot insert run_id, already exists

Same as below. Must be a change to pandas groupby functionality

Contributor (singjc, Nov 28, 2024):

Yeah, seems like it; some groupby deprecations occurred for Pandas v2.2.0:

Deprecated the Grouping attributes group_index, result_index, and group_arraylike; these will be removed in a future version of pandas (GH 56148)

If you don't mind, would you be able to test with a version prior to pandas v2.2.0, to see whether the old code with .reset_index() still works, just so we know for sure that this is the change.


elif context in ['global', 'experiment-wide']:
data = statistics_report(data, outfile, context, "protein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette)
@@ -257,7 +262,7 @@ def infer_peptides(infile, outfile, context, parametric, pfdr, pi0_lambda, pi0_m
con.close()

if context == 'run-specific':
data = data.groupby('run_id').apply(statistics_report, outfile, context, "peptide", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette).reset_index()
data = data.groupby('run_id').apply(statistics_report, outfile, context, "peptide", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette)

elif context in ['global', 'experiment-wide']:
data = statistics_report(data, outfile, context, "peptide", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps, color_palette)
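
To make the reset_index discussion above concrete, here is a minimal sketch, with made-up data, of the index/column collision; it does not depend on pyprophet:

import pandas as pd

df = pd.DataFrame({"run_id": ["r1", "r1", "r2"], "score": [0.1, 0.2, 0.3]})

# When the applied function returns a frame whose index differs from the group's
# index, pandas prepends the group label as an index level, while 'run_id' is
# still present as an ordinary column in the returned frame.
# (pandas >= 2.2 also warns that apply() operating on the grouping columns is deprecated.)
out = df.groupby("run_id").apply(lambda g: g.reset_index(drop=True))
print(out.index.names)    # ['run_id', None]
print(list(out.columns))  # ['run_id', 'score']

# reset_index() then tries to move the 'run_id' level back into the columns
# and collides with the existing 'run_id' column:
out.reset_index()         # ValueError: cannot insert run_id, already exists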
2 changes: 2 additions & 0 deletions pyprophet/main.py
@@ -106,6 +106,8 @@ def score(infile, outfile, classifier, xgb_autotune, apply_weights, xeval_fracti
xgb_hyperparams = {'autotune': xgb_autotune, 'autotune_num_rounds': 10, 'num_boost_round': 100, 'early_stopping_rounds': 10, 'test_size': 0.33}

xgb_params = {'eta': 0.3, 'gamma': 0, 'max_depth': 6, 'min_child_weight': 1, 'subsample': 1, 'colsample_bytree': 1, 'colsample_bylevel': 1, 'colsample_bynode': 1, 'lambda': 1, 'alpha': 0, 'scale_pos_weight': 1, 'verbosity': 0, 'objective': 'binary:logitraw', 'nthread': 1, 'eval_metric': 'auc'}
if test:
xgb_params['tree_method'] = 'exact'

xgb_params_space = {'eta': hp.uniform('eta', 0.0, 0.3), 'gamma': hp.uniform('gamma', 0.0, 0.5), 'max_depth': hp.quniform('max_depth', 2, 8, 1), 'min_child_weight': hp.quniform('min_child_weight', 1, 5, 1), 'subsample': 1, 'colsample_bytree': 1, 'colsample_bylevel': 1, 'colsample_bynode': 1, 'lambda': hp.uniform('lambda', 0.0, 1.0), 'alpha': hp.uniform('alpha', 0.0, 1.0), 'scale_pos_weight': 1.0, 'verbosity': 0, 'objective': 'binary:logitraw', 'nthread': 1, 'eval_metric': 'auc'}

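
A short sketch of the determinism knob added above: tree_method='exact' forces XGBoost to enumerate all split candidates instead of using the histogram/approximate heuristics, which presumably keeps the snapshot tests stable. The other values mirror the classifier settings in classifiers.py; the test flag below is a stand-in for the test parameter handled in main.py:

import xgboost as xgb

params = {
    "random_state": 42,
    "verbosity": 0,
    "objective": "binary:logitraw",
    "eval_metric": "auc",
}

test = True  # stand-in for the 'test' flag handled in main.py
if test:
    params["tree_method"] = "exact"  # exhaustive, deterministic split finding

clf = xgb.XGBClassifier(**params)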
9 changes: 6 additions & 3 deletions pyprophet/stats.py
@@ -118,7 +118,10 @@ def posterior_chromatogram_hypotheses_fast(experiment, prior_chrom_null):


def mean_and_std_dev(values):
return np.mean(values), np.std(values, ddof=1)
std = np.std(values, ddof=1)
if std == 0:
raise RuntimeError("Computed standard deviation is 0, cannot perform normalization")
return np.mean(values), std


def pnorm(stat, stat0):
@@ -233,7 +236,7 @@ def pi0est(p_values, lambda_ = np.arange(0.05,1.0,0.05), pi0_method = "smoother"

@profile
def qvalue(p_values, pi0, pfdr = False):
p = np.array(p_values)
p = np.array(p_values).copy()

qvals_out = p
rm_na = np.isfinite(p)
@@ -277,7 +280,7 @@ def bw_nrd0(x):
@profile
def lfdr(p_values, pi0, trunc = True, monotone = True, transf = "probit", adj = 1.5, eps = np.power(10.0,-8)):
""" Estimate local FDR / posterior error probability from p-values according to bioconductor/qvalue """
p = np.array(p_values)
p = np.array(p_values).copy()

# Compare to bioconductor/qvalue reference implementation
# import rpy2
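
A small illustration of the failure mode the new guard in mean_and_std_dev protects against; the (score - mean) / std normalization mentioned in the comment is a generic sketch, not the exact pyprophet call site:

import numpy as np

def mean_and_std_dev(values):
    std = np.std(values, ddof=1)
    if std == 0:
        raise RuntimeError("Computed standard deviation is 0, cannot perform normalization")
    return np.mean(values), std

decoy_scores = np.array([0.5, 0.5, 0.5])  # degenerate input: all values identical
# Without the guard, (scores - mu) / sd would silently turn every score into nan;
# with it, the problem is reported where it originates.
mu, sd = mean_and_std_dev(decoy_scores)   # raises RuntimeError here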
118 changes: 118 additions & 0 deletions requirements.txt
@@ -0,0 +1,118 @@
#
# This file is autogenerated by pip-compile with Python 3.11
# by the following command:
#
# pip-compile --all-extras --output-file=requirements.txt
#
click==8.1.7
# via pyprophet (setup.py)
cloudpickle==3.1.0
# via hyperopt
contourpy==1.3.0
# via matplotlib
cycler==0.12.1
# via matplotlib
cython==3.0.11
# via pyprophet (setup.py)
duckdb==1.1.3
# via
# duckdb-extension-sqlite-scanner
# duckdb-extensions
# pyprophet (setup.py)
duckdb-extension-sqlite-scanner==1.1.3
# via pyprophet (setup.py)
duckdb-extensions==1.1.3
# via pyprophet (setup.py)
fonttools==4.55.0
# via matplotlib
future==1.0.0
# via hyperopt
hyperopt==0.2.7
# via pyprophet (setup.py)
iniconfig==2.0.0
# via pytest
joblib==1.4.2
# via scikit-learn
kiwisolver==1.4.7
# via matplotlib
matplotlib==3.9.2
# via pyprophet (setup.py)
networkx==3.2.1
# via hyperopt
numexpr==2.10.1
# via pyprophet (setup.py)
numpy==2.0.2
# via
# contourpy
# hyperopt
# matplotlib
# numexpr
# pandas
# patsy
# pyprophet (setup.py)
# scikit-learn
# scipy
# statsmodels
# xgboost
nvidia-nccl-cu12==2.23.4
# via xgboost
packaging==24.2
# via
# matplotlib
# pytest
# statsmodels
pandas==2.2.3
# via
# pyprophet (setup.py)
# statsmodels
patsy==1.0.1
# via statsmodels
pillow==11.0.0
# via matplotlib
pluggy==1.5.0
# via pytest
py4j==0.10.9.7
# via hyperopt
pyarrow==18.0.0
# via pyprophet (setup.py)
pyparsing==3.2.0
# via matplotlib
pypdf==5.1.0
# via pyprophet (setup.py)
pytest==8.3.3
# via
# pyprophet (setup.py)
# pytest-regtest
pytest-regtest==2.3.3
# via pyprophet (setup.py)
python-dateutil==2.9.0.post0
# via
# matplotlib
# pandas
pytz==2024.2
# via pandas
scikit-learn==1.5.2
# via pyprophet (setup.py)
scipy==1.13.1
# via
# hyperopt
# pyprophet (setup.py)
# scikit-learn
# statsmodels
# xgboost
six==1.16.0
# via
# hyperopt
# python-dateutil
statsmodels==0.14.4
# via pyprophet (setup.py)
tabulate==0.9.0
# via pyprophet (setup.py)
threadpoolctl==3.5.0
# via scikit-learn
tqdm==4.67.0
# via hyperopt
tzdata==2024.2
# via pandas
xgboost==2.1.2
# via pyprophet (setup.py)