Add parquet and sqlite support; add NNLS-based PEP and Q-value calculation #119

Merged
merged 269 commits into from
Sep 6, 2024
Changes from 244 commits
58e8481
Merge branch 'develop' into 'main'
Feb 22, 2024
74f91f1
💄 lint mokapot
gessulat Feb 22, 2024
2985a7f
💄 lints tests
gessulat Feb 22, 2024
12ebe26
💄 fixes format with ruff
gessulat Feb 22, 2024
49608e1
💄 fixes format with ruff
gessulat Feb 22, 2024
6ccc88e
Merge branch 'main' of gitlab:msaid/inferys/mokapot into main
gessulat Feb 22, 2024
0b4fdc5
💄 make ruff and black happy together
gessulat Feb 22, 2024
86045d9
Fix problems with nnls
ezander Feb 23, 2024
270efb5
Merge branch 'main' into feature/elmars_algorithms
ezander Feb 23, 2024
f595804
Feature/improve speed and limit memory (#11)
sambenfredj Apr 12, 2023
46fbf6b
:lipstick: linting (#12)
gessulat Apr 17, 2023
ee95fbd
Fix bugs (#17)
gessulat Apr 20, 2023
f3d50c8
fix test model: remove subset_max_train from percolator model (#18)
sambenfredj May 5, 2023
4293410
Fix test brew: (#20)
sambenfredj May 9, 2023
623b7d8
fix test datasets: (#19)
sambenfredj May 11, 2023
8f417dd
Fix test confidence (#22)
sambenfredj May 11, 2023
2e1723e
Fix cli tests: (#28)
sambenfredj May 15, 2023
6355834
Fix system tests: (#29)
sambenfredj May 16, 2023
296fb73
Fix parser pin test: (#30)
sambenfredj May 17, 2023
096b07f
Add tests: (#31)
sambenfredj May 22, 2023
d497fcc
Fix writer tests: (#32)
sambenfredj May 22, 2023
d241adb
fix error no psms found during training : if no psms passed the fdr v…
sambenfredj May 31, 2023
41ed445
Introduce new executable and bug fixes
sambenfredj Aug 4, 2023
ac43547
✨ force ci re-run
gessulat Feb 16, 2024
4a9872f
💄 lint mokapot
gessulat Feb 22, 2024
346a0c0
💄 lints tests
gessulat Feb 22, 2024
f543166
💄 fixes format with ruff
gessulat Feb 22, 2024
0742dc2
💄 fixes format with ruff
gessulat Feb 22, 2024
a2602df
💄 make ruff and black happy together
gessulat Feb 22, 2024
f12a43d
✨ removed deprecated error ignore
gessulat Feb 27, 2024
0fd515b
Merge branch 'main' into 'feature/sync'
Feb 27, 2024
6726dea
Merge branch 'feature/sync' into 'main'
Feb 27, 2024
982de49
Merge branch 'feature/elmars_algorithms' into 'main'
Feb 27, 2024
a823663
Fix two boolean conditions in nnls algorithm
ezander Mar 8, 2024
f094cfe
Set tolerance to fixed value in fit_nnls to avoid non-convergence
ezander Mar 8, 2024
c74074e
Adjust unittest for hist_nnls to new error cases
ezander Mar 8, 2024
fca3a7c
Merge branch 'feature/elmars_algorithms' into fix/nnls_bug2
ezander Mar 8, 2024
46cdd24
Add documentation and test for create_chunks
ezander Mar 11, 2024
3dc71c0
Make cli unit tests for aggregateP2P easier debuggable
ezander Mar 11, 2024
f681d9a
Improve test for peptide_csv in test_utils
ezander Mar 13, 2024
187543f
Improve and test convert_targets_column function
ezander Mar 13, 2024
f6ae3dd
Enable switching in system tests from subprocess to direct calls
ezander Mar 13, 2024
6787feb
Fix cli system and utils tests
ezander Mar 14, 2024
3a0b54d
Fix unit tests
ezander Mar 15, 2024
f2bc2fd
Merge branch 'fix/nnls_bug2' into 'main'
ezander Mar 20, 2024
c184598
Add documentation and test for create_chunks
ezander Mar 11, 2024
8966115
Make cli unit tests for aggregateP2P easier debuggable
ezander Mar 11, 2024
2066b87
Improve test for peptide_csv in test_utils
ezander Mar 13, 2024
60e8bc0
Improve and test convert_targets_column function
ezander Mar 13, 2024
9787f24
Enable switching in system tests from subprocess to direct calls
ezander Mar 13, 2024
e08edf1
Fix cli system and utils tests
ezander Mar 14, 2024
bfe642f
Fix unit tests
ezander Mar 15, 2024
e11b8ff
Merged origin/fix/cleanup into fix/cleanup
ezander Mar 20, 2024
79c0bcf
parquet reader for mokapot
Mar 20, 2024
a1facaf
merge sort function adapted for parquet
Mar 20, 2024
e6390fa
brew function adapted for parquet input
Mar 20, 2024
04e3416
confidence assignment modified for parquet format
Mar 20, 2024
5e5cb2e
merge sort chunk size added as constant
Mar 20, 2024
47e87b1
update label func modified for parquet
Mar 20, 2024
ed35208
main function uses format arg to choose between csv and parquet
Mar 20, 2024
9758b4e
pyarrow added to dependancies
Mar 20, 2024
5688549
Change conversion of target column values
ezander Mar 21, 2024
5cf32a5
fixed failing tests
Mar 22, 2024
217f536
added new tests for parquet
Mar 22, 2024
c70db4b
refactor: Add type hinting to tuplize function
ezander Mar 22, 2024
f170730
refactor: Refactor find_column(s) functions and use it in read_percol…
ezander Mar 22, 2024
d448728
refactor: Insert some newlines for improved readability
ezander Mar 22, 2024
c403c1d
refactor: Insert some newlines for improved readability
ezander Mar 22, 2024
317c5dd
refactor: Insert some newlines for improved readability
ezander Mar 22, 2024
6b4792d
Merged origin/fix/cleanup into fix/cleanup
ezander Mar 22, 2024
4d86c6c
refactor: Remove redundant test case for case-sensitive column matching.
ezander Mar 22, 2024
452abdb
Add typeguard
ezander Mar 22, 2024
831b52f
refactored unchunked file reader for parquet and csv
Mar 26, 2024
3ce5ca1
Merge branch 'feature/parquet_parser' into 'main'
Mar 26, 2024
a3b6f5f
Add map_columns_to_indices function and more type checking
ezander Mar 28, 2024
6d770ab
Make debugging dataframe issues easier in unit test
ezander Mar 28, 2024
b2405b4
Fix test_utils
ezander Mar 28, 2024
083a770
Add level_columns to OnDiskPsmDataset
ezander Mar 28, 2024
386f55f
Rename deduplication to do_rollup
ezander Mar 28, 2024
99c11dc
Change deduplication to do_rollup
ezander Mar 28, 2024
6e472af
Fix pin reading by adding level_columns
ezander Mar 28, 2024
4c97c0c
Revert "Merge branch 'feature/parquet_parser' into 'main'"
Mar 28, 2024
b53aa66
Move peptides.csv to dest_dir and remove it where it's created
ezander Mar 28, 2024
8da9f4e
Save changes
ezander Mar 28, 2024
5d33cd5
Get rid of path manipulation via strings
ezander Mar 28, 2024
3385726
Clean up more path related stuff
ezander Mar 28, 2024
34d3281
Correct documentation of return values of brew function
ezander Mar 29, 2024
e88540b
Simplify and generalize path definitions in confidence.py
ezander Mar 29, 2024
e238b56
Move confidence related functions to confidence
ezander Mar 29, 2024
c8f0a71
Fix problem with parameter lists in cli tests
ezander Mar 29, 2024
28267f4
Add checking of column names for OnDiskPsmDataset
ezander Apr 1, 2024
3e07c55
Fix column index stuff
ezander Apr 1, 2024
5231699
Merge branch 'revert-3ce5ca1f' into 'main'
Apr 2, 2024
3737bc7
Disable parallel unit tests when debugging
ezander Apr 2, 2024
39a3c1d
Merge branch 'fix/cleanup' into 'main'
micgrab Apr 2, 2024
084b3c5
Refactor the confidence.to_txt function
ezander Apr 3, 2024
0f52364
Fix chunked reader to read in column order as passed to the function
ezander Apr 3, 2024
d04746c
Add comments and put temp file naming for merge-sorting in one place
ezander Apr 3, 2024
edef280
Add test for chunked reader
ezander Apr 3, 2024
21c7008
Fix warning in read_file_in_chunks test
ezander Apr 3, 2024
7721322
Remove unnecessary conversion (back and fro) in confidence.py
ezander Apr 3, 2024
3c72fcf
Add pyarrow as a dependency
ezander Apr 3, 2024
c7e80dd
Remove superfluous conversions and some more superfluous stuff
ezander Apr 3, 2024
1d40579
Refactor sorted iterator creation into context manager
ezander Apr 3, 2024
38415ff
Improve find_column* and use consistently in pin parser
ezander Apr 3, 2024
1720dc6
Introduce tabbed reader and writers (for csv for now)
ezander Apr 3, 2024
802a520
Correct buggy import statement
ezander Apr 3, 2024
466838c
Fix bug and add test related to --aggregate flag
ezander Apr 4, 2024
fb6fee9
Remove ignoring of warnings
ezander Apr 4, 2024
25f9d18
Comment regarding to_txt function (can be removed)
ezander Apr 4, 2024
4cf19ce
Add file type detection to readers and writers
ezander Apr 4, 2024
4b99571
Ignore warning in PIN reader (locally now)
ezander Apr 4, 2024
768753a
Improve column ordering/mapping and add unit test
ezander Apr 4, 2024
1cb5ab6
Correct targets conversion function and fix offending unit tests
ezander Apr 4, 2024
5d5048f
Improve on label updating and type safety
ezander Apr 4, 2024
948bf1a
Correct output capturing in test helper
ezander Apr 4, 2024
0d322cd
Use TabbedWriter in save_sorted_metadata_chunks
ezander Apr 4, 2024
3aa7078
Progress towards rollup
ezander Apr 4, 2024
3cf8ab6
Improve map_columns_to_indices for dicts
ezander Apr 4, 2024
0677338
Add stringify methods to tabbed readers
ezander Apr 5, 2024
da38823
Change assign_confidence for rollup
ezander Apr 5, 2024
3c87b4b
Fix column ordering problem (temp)
ezander Apr 5, 2024
1e629b9
Add tests for the rollup
ezander Apr 5, 2024
ced16dd
Fix a bug in the rollup unit test
ezander Apr 5, 2024
024280d
Rename pcms to precursors
ezander Apr 8, 2024
a8582d1
Remove a now superfluous function and comments
ezander Apr 8, 2024
b16cf8d
Merge branch 'feature/rollup' into 'main'
Apr 10, 2024
b6c51da
squashed sqlite writer branch
micgrab Apr 10, 2024
65d6f7e
failing test fixed
Apr 10, 2024
c628a34
unused function get_unique_peptides_from_psms removed
Apr 10, 2024
4ca6442
format interpreted implicitly using filename
Apr 15, 2024
7258909
Confidencewriter class implemented
Apr 15, 2024
3403167
sqlite path changed to Path type
Apr 15, 2024
c717d8b
perquet module deleted and integrated into read_pin
Apr 15, 2024
a8146c0
failing tests fixed
Apr 15, 2024
6a40e7f
instantiation done before returning object
Apr 16, 2024
8918ebf
PSM_PEP column name changed to POSTERIOR_ERROR_PROBABILITY
Apr 16, 2024
8a53bdc
Merge branch 'feature/sqlite_write_cherry_picked' into 'main'
Apr 16, 2024
4e3b8a5
add pipeline status for main branch
Apr 17, 2024
45386c9
✨ fixes #54
gessulat Apr 17, 2024
53c198d
Merge branch 'feature/54-markdown-licenses' into 'main'
Apr 17, 2024
f12eb5c
Do some cosmetics
ezander Apr 17, 2024
43e8f41
Remove rescale stuff
ezander Apr 18, 2024
a6d846c
Renamed TabbedFileReader and Writer to TabularDataReader and Writer
ezander Apr 18, 2024
90f8601
Separate general tabular data and confidence writer stuff
ezander Apr 22, 2024
04dbd44
Fix import bug in confidence
ezander Apr 23, 2024
7c38b1c
Fix another import bug
ezander Apr 23, 2024
4bf1a34
Add better unit test for (chunked) confidence
ezander Apr 24, 2024
64465d5
Make chunked confidence unit test fail with small chunk size
ezander Apr 24, 2024
1c65adf
Fix confidence chunk size bug
ezander Apr 24, 2024
8d71506
test case added for sqlite writer
Apr 24, 2024
a7b567d
test data added for sqlite writer
Apr 24, 2024
4d46608
Add option for suppressing warnings
ezander Apr 24, 2024
1da20a4
Merge branch 'fix/suppress_warnings' into 'main'
Apr 24, 2024
803ccf2
Revert the change in confidence and adapt unit test
ezander Apr 24, 2024
89b5d2a
Merge branch 'fix/chunking' into 'main'
Apr 24, 2024
67d0a84
prepare tables sqlite db added as helper func
Apr 24, 2024
d5d84ef
Merge branch 'feature/sqlite_test' into 'main'
Apr 24, 2024
251a2ba
Merge branch 'main' into feature/new_cli
ezander Apr 25, 2024
0cb411f
Fix problems with sqlit after the merge
ezander Apr 25, 2024
8a6c1c4
Remove all group related stuff
ezander Apr 25, 2024
f7d18e4
Remove crosslink stuff
ezander Apr 26, 2024
367b844
Remove plugins
ezander Apr 26, 2024
1fcc318
Remove skipped tests and skip marks
ezander Apr 26, 2024
54c0341
Do some minor cleanup
ezander Apr 29, 2024
883c0c0
Add type inference and tests for tabular data
ezander Apr 29, 2024
1a06a23
Add reader and tests for in memory dataframe reader
ezander Apr 29, 2024
b3cd5a5
Add streaming module
ezander Apr 29, 2024
fc8427c
Add checks to merged reader
ezander Apr 30, 2024
b068661
Add creation of dataframe reader from series and arrays
ezander Apr 30, 2024
fadbed9
Add JoinedTabularData and tests
ezander May 6, 2024
b140f92
Add column renaming for tabular data readers
ezander May 7, 2024
3624a2c
Add context manager to tabular data writer and make confidence writer…
ezander May 7, 2024
a80fd47
Get rid of all kinds of warnings during tests
ezander May 7, 2024
7a98e59
Fix another warning
ezander May 7, 2024
2d706f0
Add context manager to TabularDataWriter
ezander May 8, 2024
0ee32fc
Fix a problem with indexing in the merged reader
ezander May 8, 2024
672195f
Add buffering to writers
ezander May 8, 2024
4b9784f
Correct problem in unit test with log output and typechecking
ezander May 8, 2024
49bd14c
Add functionality to add computed columns to TabularData
ezander May 8, 2024
1db4df6
Add method to get an associated reader from a writer
ezander May 8, 2024
66a0c38
Add cli and test for the rollup
ezander May 8, 2024
3ede1e3
Fix underscore problem
ezander May 8, 2024
5fd274d
Fix problem with typechecked/contextmanager order
ezander May 8, 2024
6dc328d
Remove typechecking from auto_finalizer for the moment
ezander May 8, 2024
5708f10
Fix path problem in rollup unit test
ezander May 8, 2024
873f30e
Show rollup levels not found only if non-empty
ezander May 14, 2024
30d1bb8
Simplify sqlite connection in unit tests
ezander May 14, 2024
51b1be2
Add new suffixes for csv
ezander May 14, 2024
bc73192
Change options: add src_dir and remove keep_decoys
ezander May 14, 2024
6ed3f7d
Add files for rollup testing
ezander May 14, 2024
73ab5e6
buffered write for parquet intermediary files implemented
May 15, 2024
4dc3e17
tests updated for parquet writing
May 15, 2024
c17c3a2
fixed aggregatePsmstoPeptides cli
May 15, 2024
f5265a2
test data structure changed to list of dicts to match merge sort outp…
May 15, 2024
75a74af
test data updated to be dataframe readable
May 15, 2024
4a6cfe4
Remove aggregatePsmsToPeptides
ezander May 15, 2024
8ae2151
Merge branch 'feature/new_cli' into feature/parquet_buffer
May 16, 2024
4893c81
Merge branch 'feature/parquet_buffer' into 'feature/new_cli'
May 16, 2024
867afb4
Move remove_columns function to tabular_data
ezander May 16, 2024
d5d707c
Fix program name in cli output
ezander May 16, 2024
df67d3d
Let brew_rollup also search for parquet files
ezander May 16, 2024
a83e573
Use csv or parquet suffix also for temp and output files
ezander May 16, 2024
e583f48
Filter a warning in the system tests
ezander May 16, 2024
0d429cd
Make the column types a bit more lenient
ezander May 16, 2024
f2c90f1
fixed rollup app for parquets
May 17, 2024
0f7a78f
Merge branch 'feature/new_cli' into feature/new_cli_parquet
May 17, 2024
931fcbf
Merge branch 'feature/new_cli_parquet' into 'feature/new_cli'
May 17, 2024
f0d1aac
Merge branch 'feature/new_cli' into 'main'
May 17, 2024
d80d899
Fix unclosed files problem
ezander May 28, 2024
721d017
Remove unused parameter target_column from merge_sort
ezander May 28, 2024
1962632
Remove superfluous passing of sep
ezander May 28, 2024
cd49bd9
Change tabs to colons in protein(s) column of pin file
ezander May 28, 2024
09931a8
Test parquet merge_sort more extensively
ezander May 28, 2024
418897b
Unify csv and parquet methods in merge_sort
ezander May 29, 2024
efd1e45
Simplify get_row_iterator
ezander May 30, 2024
c690b80
Make brew rollup faster
ezander May 30, 2024
aa4da8c
Fix bug in MergedTabularReader
ezander May 30, 2024
509d090
Merge branch 'fix/rollup_performance' into 'main'
tkschmidt May 31, 2024
dcfb1e4
Merge branch 'main' into fix/merge_sort_problem
ezander May 31, 2024
4cf872e
Fix problem with last line in buffering
ezander May 31, 2024
1877517
Merge branch 'fix/buffered_writer' into 'main'
tkschmidt May 31, 2024
28a23fc
Merge branch 'fix/merge_sort_problem' into 'main'
tkschmidt Jun 5, 2024
49aad73
Fix problem with type conversions in merge_sort
ezander Jun 14, 2024
2276182
Merge branch 'fix/fail_on_bad_input' into 'main'
Jun 17, 2024
0f58675
✨ addresses @jspaezp suggestions from PR #119
gessulat Jun 20, 2024
2cf8265
✨ addresses review suggestions
gessulat Jun 21, 2024
c021f09
Merge branch 'feature/parquet-sqlite-hist-nnls-review' into 'feature/…
Jun 21, 2024
02c42dd
Add check for length of mokapot output file
ezander Jun 24, 2024
b742c73
Make test for output file length more "elastic"
ezander Jun 24, 2024
88065fd
Revert documentation on psms parameter for brew function.
ezander Jun 24, 2024
3a9a667
Change f-string to normal string where unnecessary
ezander Jun 24, 2024
b9bac1c
Remove types_from_dataframe function, since unnecessary
ezander Jun 24, 2024
719eb0f
(chore) updated cicd,ruff,black and tests (#42)
jspaezp Jul 8, 2024
662b036
✨ fix setting scores when training failed
gessulat Jul 9, 2024
cea95a9
Readd previously commented out check for feature columns
ezander Jul 16, 2024
8d4ffb2
✨ proper docstring for `sqlite_db_path`
gessulat Jul 16, 2024
dc62eff
Add doc strings for write_confidence (and improve types)
ezander Jul 16, 2024
9a2b3ae
Merge branch 'feature/parquet-sqlite-hist-nnls' of gitlab.com:msaid/i…
ezander Jul 16, 2024
f90f389
Improve documentation and type hints of assign_confidence
ezander Jul 16, 2024
3d0e592
Add class and module documentation for the tabular data classes
ezander Jul 16, 2024
653179f
Feature/remove nnls patch (#43)
gessulat Jul 29, 2024
68fd52b
Fix/windows tests (#44)
gessulat Jul 29, 2024
b666ef9
Fix/windows tests (#45)
gessulat Jul 29, 2024
3a3e5ea
✨ draft of pin to tsv converter
gessulat Aug 28, 2024
00a3c64
✨ adds is_valid_tsv
gessulat Aug 28, 2024
3eb62a3
✨ adds tsv verification for pin files and conversion
gessulat Aug 28, 2024
47c6c0b
:pencil: remove print
gessulat Aug 28, 2024
9197db9
🔥 add required default for --dest_dir
gessulat Aug 28, 2024
55ad062
Merge pull request #46 from msaid-de/feature/pin-to-tsv-convert
tkschmidt Sep 3, 2024
8 changes: 6 additions & 2 deletions .github/workflows/tests.yml
gessulat marked this conversation as resolved.
@@ -2,9 +2,13 @@ name: tests

 on:
   push:
-    branches: [ main ]
+    branches:
+      - main
+      - develop
   pull_request:
-    branches: [ main ]
+    branches:
+      - main
+      - develop

 jobs:
   build:
2 changes: 2 additions & 0 deletions .gitignore
@@ -109,3 +109,5 @@ venv.bak/
# idea
.idea/

tests/integration_tests/run*

50 changes: 50 additions & 0 deletions .gitlab-ci.yml
@@ -0,0 +1,50 @@
image: python:3.10.5

variables:
  PIP_CACHE_DIR: "${CI_PROJECT_DIR}/.cache/pip"

stages:
  - publish
  - test

cache:
  key: "${CI_COMMIT_REF_SLUG}"
  paths:
    - .cache/pip
    - .venv

.with_twine:
  before_script:
    - python -m pip install --upgrade pip
    - pip install setuptools wheel twine build pip-licenses


gather-licences:
  extends: .with_twine
  stage: publish
  script:
    - pip-licenses --from=mixed --order=license --with-system -f markdown --output-file LICENSES.md
  artifacts:
    when: always
    paths:
      - LICENSES.md
    expire_in: 4 week


publish:
  extends: .with_twine
  stage: publish
  rules:
    - if: '$CI_COMMIT_TAG != null'
  script:
    - python -m build --sdist --wheel .
    - TWINE_PASSWORD=${CI_JOB_TOKEN} TWINE_USERNAME=gitlab-ci-token python -m twine upload --repository-url https://gitlab.com/api/v4/projects/${CI_PROJECT_ID}/packages/pypi dist/*

unit_test:
  extends: .with_twine
  stage: test
  script:
    - pip install .[dev]
    - pip install pytest
    - pytest tests/

17 changes: 6 additions & 11 deletions README.md
@@ -1,11 +1,5 @@
<img src="https://raw.githubusercontent.com/wfondrie/mokapot/master/static/mokapot_logo_dark.svg" width=300>

---
[![conda](https://img.shields.io/conda/vn/bioconda/mokapot?color=green)](http://bioconda.github.io/recipes/mokapot/README.html)
[![PyPI](https://img.shields.io/pypi/v/mokapot?color=green)](https://pypi.org/project/mokapot/)
[![tests](https://github.com/wfondrie/mokapot/workflows/tests/badge.svg)](https://github.com/wfondrie/mokapot/actions?query=workflow%3Atests)
[![docs](https://readthedocs.org/projects/mokapot/badge/?version=latest)](https://mokapot.readthedocs.io/en/latest/?badge=latest)

[![pipeline status](https://gitlab.com/msaid/inferys/mokapot/badges/main/pipeline.svg)](https://gitlab.com/msaid/inferys/mokapot/-/commits/main)

wfondrie marked this conversation as resolved.

Fast and flexible semi-supervised learning for peptide detection.
@@ -70,11 +64,12 @@ Alternatively, the Python API can be used to perform analyses in the Python
interpreter and affords greater flexibility:

```Python
->>> import mokapot
->>> psms = mokapot.read_pin("psms.pin")
->>> results, models = mokapot.brew(psms)
->>> results.to_txt()
+import mokapot
+psms = mokapot.read_pin("psms.pin")
+results, models = mokapot.brew(psms)
+results.to_txt()
```

Check out our [documentation](https://mokapot.readthedocs.io) for more details
and examples of mokapot in action.

Binary file added data/10k_psms_test.parquet
Collaborator

How hard would it be to have this file generated programmatically? (After the xz vulnerability I am trying to have fewer files that are not plain text in repos ...)

Contributor Author

I guess it depends on how realistic the data should look. If it's just for testing that it runs through, plus some standard cases, this would be possible. For a more realistic integration test, even the 10k sample might be small.
Would it be OK to separate this out from this PR into a separate issue?
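
For the "runs through plus standard cases" scenario, a generator along the following lines could stand in for a checked-in binary. This is only a sketch under stated assumptions: the column layout, the score distributions, and the name `make_synthetic_pin` are hypothetical and not taken from mokapot, and real PIN files carry many more feature columns. It uses only the standard library and emits plain text; writing the same rows to Parquet would additionally need pyarrow or pandas.

```python
import csv
import io
import random

# Hypothetical, heavily reduced column layout (assumption for illustration only).
COLUMNS = ["SpecId", "Label", "ScanNr", "ExpMass", "CalcMass", "score", "Peptide", "Proteins"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_synthetic_pin(n_psms: int, seed: int = 0) -> str:
    """Return tab-separated, PIN-like text with a header and n_psms data rows."""
    rng = random.Random(seed)  # fixed seed keeps the generated data reproducible
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for i in range(n_psms):
        is_target = rng.random() < 0.5
        label = 1 if is_target else -1
        mass = round(rng.uniform(800.0, 3000.0), 4)
        # Targets get a higher mean score so that a model has some signal to learn.
        score = round(rng.gauss(1.0 if is_target else 0.0, 1.0), 4)
        peptide = "".join(rng.choice(AMINO_ACIDS) for _ in range(rng.randint(7, 20)))
        protein = ("" if is_target else "decoy_") + f"prot_{rng.randint(1, 500)}"
        writer.writerow([f"spec_{i}", label, i + 1, mass, mass, score, f"K.{peptide}.R", protein])
    return out.getvalue()
```

Calling `make_synthetic_pin(10_000)` in a test fixture would regenerate the data on every run, at the cost of the data not resembling a real search result.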

Binary file not shown.
10,001 changes: 10,001 additions & 0 deletions data/10k_psms_test.pin

Large diffs are not rendered by default.

1,001 changes: 1,001 additions & 0 deletions data/confidence_results_test.tsv

Large diffs are not rendered by default.

1,001 changes: 1,001 additions & 0 deletions data/percolator-noSplit-extended-1000.tab

Large diffs are not rendered by default.

10,001 changes: 10,001 additions & 0 deletions data/percolator-noSplit-extended-10000.tab

Large diffs are not rendered by default.

1,001 changes: 1,001 additions & 0 deletions data/percolator-noSplit-extended-1000b.tab

Large diffs are not rendered by default.

1,001 changes: 1,001 additions & 0 deletions data/percolator-noSplit-extended-1000c.tab

Large diffs are not rendered by default.

202 changes: 202 additions & 0 deletions data/percolator-noSplit-extended-201-bad.tab

Large diffs are not rendered by default.

1,001 changes: 1,001 additions & 0 deletions data/percolator-noSplit-non-extended-1000.tab

Large diffs are not rendered by default.

10,001 changes: 10,001 additions & 0 deletions data/percolator-noSplit-non-extended-10000.tab

Large diffs are not rendered by default.

3,816 changes: 1,908 additions & 1,908 deletions data/phospho_rep1.pin
Collaborator

Since the trailing tab is the standard way in which Comet generates its output, I think it is critical to support it, so this change from tabs to colons needs to be reverted (or the file can be duplicated with the other option).

Contributor Author

Oh, I wasn't aware that this is due to Comet's output format. Using tabs as the separator within the protein column of a tab-separated file breaks some of the pandas functionality that we use, I believe. We need to specify the column names, their types, and a separator. If the pandas read function encounters such a row after these have been specified, it will throw an error.

I think that pandas' behaviour makes sense: if \t is the column separator, then it must not be used inside columns...

I would be in favour of a design that explicitly specifies how the input is formatted and has a single reader function per specified format. If other software does not adhere to the given specifications, one would need converters that could run as a preprocessing step.

We would probably need a follow-up on this as well. Let me know your thoughts!
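
The mismatch described above can be made concrete without pandas. The sketch below uses only the standard `csv` module to report rows whose field count differs from the header, which is the situation in which a strict tabular reader would typically fail. The function name `find_ragged_rows` and the sample columns are hypothetical and chosen for illustration.

```python
import csv
import io


def find_ragged_rows(text: str, sep: str = "\t") -> list[tuple[int, int]]:
    """Return (line_number, field_count) for rows whose field count differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=sep, quoting=csv.QUOTE_NONE)
    header = next(reader)
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)  # the header is line 1
        if len(row) != len(header)
    ]


pin_text = (
    "SpecId\tLabel\tscore\tPeptide\tProteins\n"
    "spec_1\t1\t2.5\tK.PEPTIDE.R\tprotA\n"
    "spec_2\t1\t1.7\tK.PEPTIDEK.A\tprotB\tprotC\n"  # two proteins separated by a tab
)
print(find_ragged_rows(pin_text))  # → [(3, 6)]
```

A check of this kind could run before parsing and point users to a converter, in line with the "one reader per specified format" design argued for here.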

Collaborator
@jspaezp Jul 17, 2024

I also hate that they do it that way, but alas I think it's widespread, so I think it's critical to support it (we believe this is a real requirement for us).

As an implementation alternative (revision: I tried it and it's harder than I thought ...): pandas can read anything that implements a .read() method, so we could do a try/except approach, where it by default tries to read the standard way, and if that errors out due to the non-uniform number of columns, we wrap the file so that it keeps the "right" number of columns and joins the proteins with the new separator.

I think that would be a pretty low-effort way to support it.
LMK what you think!

Related: UWPR/Comet#66
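
One way the wrapping idea might look is sketched below, using only the standard library. It assumes that the protein column is the last column of the header and that any surplus fields in a row are additional protein IDs; the function name `merge_protein_fields` and the `":"` joiner are illustrative choices, not mokapot's API. The reviewer notes that the real case turned out harder than expected, so this covers only the simple situation.

```python
import io
from typing import Iterable, Iterator


def merge_protein_fields(lines: Iterable[str], joiner: str = ":") -> Iterator[str]:
    """Yield lines in which surplus tab-separated fields are folded into the last column."""
    lines = iter(lines)
    header = next(lines).rstrip("\r\n")
    n_cols = len(header.split("\t"))
    yield header + "\n"
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > n_cols:
            # Everything from the last header column onwards is treated as protein IDs;
            # empty strings produced by trailing tabs are dropped.
            proteins = [f for f in fields[n_cols - 1:] if f]
            fields = fields[: n_cols - 1] + [joiner.join(proteins)]
        yield "\t".join(fields) + "\n"


raw = (
    "SpecId\tLabel\tPeptide\tProteins\n"
    "spec_1\t1\tK.PEPTIDE.R\tprotA\n"
    "spec_2\t-1\tK.EDITPEP.R\tprotB\tprotC\n"
)
# The result is a file-like object with a .read() method and uniform column counts.
fixed = io.StringIO("".join(merge_protein_fields(io.StringIO(raw))))
```

Because `fixed` exposes `.read()`, it could be handed to a tabular reader in the fallback branch after the standard read has failed.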

Collaborator

It seems like Jimmy will add that feature, so we would just need to add a warning pointing to a fix!

Collaborator

He also points to the fact that the Percolator-defined PIN format is tab-delimited between proteins ... https://github.com/percolator/percolator/wiki/Interface#tab-delimited-file-format ...

Contributor Author

Wow, this is amazing.

Large diffs are not rendered by default.

5,690 changes: 2,845 additions & 2,845 deletions data/scope2_FP97AA.pin

Large diffs are not rendered by default.
