
Add parquet and sqlite support; add NNLS-based PEP and Q-value calculation #119

Merged
merged 269 commits into wfondrie:main
Sep 6, 2024

Conversation

gessulat
Contributor

No description provided.

Siegfried Gessulat and others added 30 commits February 22, 2024 09:36
dev to main

See merge request msaid/inferys/mokapot!36
- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api
- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api
Improve speed and limit memory consumption

- stream input files for inference
- add feature: skip deduplication
- add feature: ensemble model
- add feature: rescale input before inference with pre-trained models
💄 fix linting
- fix bug: member variables not assigned when model is not trained
- allow throwing an error when the input file is malformed: remove skip on bad lines from the pandas read function
- Create new object of OnDiskPsmDataset to use for brew tests
- Update brew function outputs and assert statements
- remove assign confidence tests because datasets don't have assign confidence methods anymore
- add eval_fdr value to the _update_labels function
* Fix test confidence:
- fix bugs for grouped confidence
- fix test_one_group : create file using psm_df_1000 to create OnDiskPsmDataset.
- remove test_pickle because confidence does not return dataframe results anymore.
- add test_multi_groups to test that different group results are saved correctly.

* fix bugs:
- overwrite default fdr for update_labels function
- return dataframe for psm_df_1000 to use with LinearPsmDataset
- Remove test_cli_pepxml because xml files don't work with streaming
- Replace old output file names
- Add random generator 'rng' variable to confidence since it is required for proteins
- Remove subset_max_train from PluginModel
- Fix bug: convert targets column after reading in chunks
- Fix peptide column name for confidence
- Fix test cli plugins: replace DecisionTreeClassifier with LinearSVC because DecisionTreeClassifier returns scores as 0 or 1
- Refactor test structure : Separate brew and confidence functions, read results from output files.
- Fix bugs: fix output columns for proteins, sort proteins data by score
- Add label value to initial direction because it has to be a numerical value
- Read pin does not return dataframe anymore
- Compare output of read_pin function to example dataframe
- Add skip_deduplication flag test
- Add ensemble flag test
- Add rescale flag test
- Fix bug: remove target_column variable from read file for read_data_for_rescale
- Remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence
…alue then raise error that model performed worse (#33)
* Create new executable to aggregate psms to peptides.
* Fix bugs:
- fix error no psms found during training : if no psms passed the fdr value then raise error that model performed worse
- raise error when pep values are all equal to 1
- prefixes paths to dest_dir to not pollute the workdir
- catch error to prevent traces logged: Catch all errors to not break structured logging by error traces
- fixes parallelism in parse_in_chunks to max_workers
- fix indeterminism
- fixed small column chunk bug
- fix bug when using multiple input files
* Fix and add tests:
- remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence
- add test for the new function "get_unique_peptides_from_psms"
- add cli test for aggregatePsmsToPeptides
- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api
- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api
@wfondrie
Owner

It looks like we're adding a lot of files to the test directory. At the very least, I think we need to document what they are, say by adding a README within the data directory.

The better alternative would be to move toward all of these being programmatically generated (i.e., simulated results) when possible. Alas, adding a bunch of real files was one of the mistakes I made early on.
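
For what it's worth, a minimal sketch of what such programmatically generated test data could look like; the helper name (`make_fake_pin`) and the feature set are entirely hypothetical, and this is not code from this PR:

```python
# Sketch only: generate a small simulated PIN file for tests instead of
# shipping real search-engine output. All names here are made up.
from pathlib import Path

import numpy as np
import pandas as pd


def make_fake_pin(path: Path, n_psms: int = 1000, seed: int = 42) -> Path:
    """Write a small, randomly generated PIN file for tests."""
    rng = np.random.default_rng(seed)
    labels = rng.choice([1, -1], size=n_psms)
    df = pd.DataFrame(
        {
            "SpecId": [f"spec_{i}" for i in range(n_psms)],
            "Label": labels,
            "ScanNr": np.arange(n_psms),
            # Give targets slightly higher scores so a model has signal to learn.
            "feature_1": rng.normal(loc=(labels == 1) * 0.5, scale=1.0),
            "feature_2": rng.normal(size=n_psms),
            "Peptide": [f"K.PEPTIDE{i}K.A" for i in range(n_psms)],
            "Proteins": [f"protein_{i % 50}" for i in range(n_psms)],
        }
    )
    df.to_csv(path, sep="\t", index=False)
    return path
```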

Collaborator

Since the trailing tab is the standard way in which Comet generates its output, I think it is critical to support that, so this change of the tabs to colons needs to be reverted (or the file can be duplicated with the other option).

Contributor Author

Oh, I wasn't aware that this is due to Comet's output format. The use of tabs as a separator within the protein column of a tab-separated file breaks some of the pandas functionality that we use, I believe. We need to specify the column names, their types, and a separator; if the pandas read function then encounters such a row, it will throw an error.

I think that pandas behaviour makes sense: if \t is the column-separation character, then it must not be used inside columns...

I would be in favour of a design that explicitly specifies how the input is formatted and has a single reader function per specified format. If other software does not adhere to the given specification, one would need converters that could run as a preprocessing step.

We would probably need a follow-up on this as well. Let me know your thoughts!
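
To illustrate the failure mode described above (a sketch with made-up data, not mokapot code): with a fixed set of columns and "\t" as the separator, a Comet-style row whose protein field contains extra tabs has "too many" fields, and pandas raises a ParserError unless bad lines are skipped.

```python
import io

import pandas as pd

pin = (
    "SpecId\tLabel\tPeptide\tProteins\n"
    "s1\t1\tK.PEPTIDEK.A\tprotA\n"
    "s2\t-1\tK.DECOYK.A\tprotB\tprotC\n"  # two proteins, tab-separated
)

try:
    pd.read_csv(io.StringIO(pin), sep="\t")  # default: on_bad_lines="error"
except pd.errors.ParserError as err:
    print(f"malformed row rejected: {err}")

# Skipping bad lines instead silently drops such PSMs:
df = pd.read_csv(io.StringIO(pin), sep="\t", on_bad_lines="skip")
print(len(df))  # 1 -- the multi-protein PSM is gone
```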

Collaborator

@jspaezp Jul 17, 2024

I also hate that they do it that way, but alas, I think it's widespread, so I think it's critical to support it (we believe this is a real requirement for us).

As an implementation alternative (revision: I tried it and it's harder than I thought...): pandas can read anything that implements a .read() method, so we could do a try/catch approach where it by default tries to read the standard way and, if it errors out due to the non-uniform number of columns, we wrap the input so that it keeps the "right" number of columns and joins the proteins with a new separator.

I think that would be a pretty low-effort way to support it.
LMK what you think!

Related: UWPR/Comet#66
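
A rough sketch of the try/except idea above, assuming a hypothetical helper name (`read_pin_lenient`) and that any overflow fields are trailing protein entries; it buffers the whole file in memory for simplicity, whereas a streaming version would wrap the file object lazily. This is not the implementation that ended up in this PR.

```python
import io

import pandas as pd


def _collapse_extra_columns(text: str, sep: str = "\t", join_with: str = ":") -> str:
    """Rejoin fields beyond the header width into the last (protein) column."""
    lines = text.splitlines()
    n_cols = len(lines[0].split(sep))
    fixed = []
    for line in lines:
        fields = line.split(sep)
        if len(fields) > n_cols:
            fields = fields[: n_cols - 1] + [join_with.join(fields[n_cols - 1:])]
        fixed.append(sep.join(fields))
    return "\n".join(fixed)


def read_pin_lenient(path: str) -> pd.DataFrame:
    try:
        # First attempt: assume a well-formed, strictly tab-separated PIN file.
        return pd.read_csv(path, sep="\t")
    except pd.errors.ParserError:
        # Fallback: Comet-style trailing tabs between proteins. Rewrite the
        # offending rows in memory and hand pandas a file-like object instead.
        with open(path) as f:
            fixed = _collapse_extra_columns(f.read())
        return pd.read_csv(io.StringIO(fixed), sep="\t")
```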

Collaborator

It seems like Jimmy will add that feature, so we would just need to add a warning pointing to a fix!

Collaborator

He also points to the fact that the Percolator-defined PIN format is tab-delimited between proteins: https://github.com/percolator/percolator/wiki/Interface#tab-delimited-file-format

Collaborator

Contributor Author

Wow, this is amazing.

@jspaezp
Collaborator

jspaezp commented Jul 16, 2024

Post-megamerge wishlist (capturing as checklists the future issues detected while discussing this PR):

@gessulat
Contributor Author

gessulat commented Jul 17, 2024

Post-megamerge wishlist (capturing as checklists the future issues detected while discussing this PR):

assert Path(tmp_path, "mokapot.peptides.txt").exists()
params = [scope_files[0], "--dest_dir", tmp_path]
run_mokapot_cli(params)
assert Path(tmp_path, "targets.psms").exists()
Contributor Author

@ezander pointed out that we changed the file suffix from ".psms.txt" to just ".psms". The ".txt" version more clearly informs the user what she can expect, which would be preferable. We would need to investigate which parts of the code base we need to revert.

Collaborator

Since you are making that change, I would vote to make it even clearer. I think of .txt as space-separated; we could go for a .csv/.tsv here (even .parquet would be viable IMO). @wfondrie LMK what you think!
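
For reference, a hedged sketch of how the same results table could be written with a format-bearing extension; the file names are purely illustrative, and writing Parquet requires pyarrow or fastparquet to be installed:

```python
import pandas as pd

# Illustrative results table only, not mokapot's actual output schema.
results = pd.DataFrame(
    {"PSMId": ["s1", "s2"], "score": [3.2, 1.1], "q-value": [0.001, 0.02]}
)
results.to_csv("targets.psms.tsv", sep="\t", index=False)   # tab-separated text
results.to_parquet("targets.psms.parquet", index=False)     # columnar binary format
```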

gessulat and others added 3 commits July 29, 2024 11:32
* ✨ remove patched nnls (using fixed scipy version now)
* 💄 linting; line breaks

Co-authored-by: Elmar Zander <elmar@zandere.de>
* Remove nnls patch and fix scipy version (with fixed nnls)

* ✨ remove patched nnls (using fixed scipy version now)

* 💄 linting; line breaks

* 🚑 fix dtypes to ensure windows ci to pass

---------

Co-authored-by: Elmar Zander <elmar@zandere.de>
* Remove nnls patch and fix scipy version (with fixed nnls)

* ✨ remove patched nnls (using fixed scipy version now)

* 💄 linting; line breaks

* 🚑 fix dtypes to ensure windows ci to pass

* 💄  linting

---------

Co-authored-by: Elmar Zander <elmar@zandere.de>
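
For context, the NNLS routine these commits refer to is scipy.optimize.nnls, which solves min ||Ax - b|| subject to x >= 0. A generic usage sketch (not mokapot's PEP/q-value code) looks like this:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
A = rng.random((50, 5))
x_true = np.array([0.0, 1.5, 0.0, 2.0, 0.5])  # non-negative coefficients
b = A @ x_true + rng.normal(scale=0.01, size=50)

# nnls returns the non-negative solution and the residual norm.
x_hat, residual_norm = nnls(A, b)
print(np.round(x_hat, 2), round(residual_norm, 3))
```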

codecov bot commented Sep 6, 2024

Codecov Report

Attention: Patch coverage is 86.35671% with 179 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@af44229). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| mokapot/tabular_data.py | 84.52% | 54 Missing ⚠️ |
| mokapot/parsers/pin_to_tsv.py | 39.34% | 37 Missing ⚠️ |
| mokapot/streaming.py | 87.50% | 20 Missing ⚠️ |
| mokapot/mokapot.py | 65.85% | 14 Missing ⚠️ |
| mokapot/brew_rollup.py | 92.69% | 13 Missing ⚠️ |
| mokapot/peps.py | 71.11% | 13 Missing ⚠️ |
| mokapot/dataset.py | 85.96% | 8 Missing ⚠️ |
| mokapot/confidence.py | 95.85% | 7 Missing ⚠️ |
| mokapot/confidence_writer.py | 88.67% | 6 Missing ⚠️ |
| mokapot/brew.py | 84.61% | 2 Missing ⚠️ |

... and 5 more
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #119   +/-   ##
=======================================
  Coverage        ?   84.60%           
=======================================
  Files           ?       26           
  Lines           ?     2949           
  Branches        ?        0           
=======================================
  Hits            ?     2495           
  Misses          ?      454           
  Partials        ?        0           


@wfondrie wfondrie merged commit 6b8fbfd into wfondrie:main Sep 6, 2024
3 of 15 checks passed