windows- zipped files not detected/unzipped yet #60
Comments
@abbyjerger Is this still an issue for you? |
changelog:
- fix: update pyproject.toml to account for package files.
- build: change structure of repo to src-layout for automatic file detection and better compatibility with setuptools
- refactor: change wildcard globbing for input files to rely on snakemake.glob_wildcards rather than manual globbing via glob.glob (in progress - addresses #60)
- build: remove src from .gitignore
- (note: installation tested and is working locally. snekmer model will also now run, though with errors as the wildcard globbing is still in progress)
Just a status update: on Windows, I am still receiving this error while testing on the 'background' branch. Command, error message, and files in directory:
As I have a Windows PC available, I'll look more into this. @christinehc how would you like me to commit changes once this is working? Should I do this directly on the background branch or elsewhere? |
I suggest we (for now) drop gzipped file support. There’s no reason to have it other than convenience. We can continue to work on this issue on a development branch?
Jason McDermott, Ph.D. (he/him)
On Friday, April 12, 2024 at 11:28 AM, Jeremy Jacobson wrote:
@christinehc @biodataganache I can see why this issue has been so elusive - there is a lot of code built around this that is difficult to reconcile across multiple smk scripts. I was wondering if there might be advantages to an alternative approach?
Instead of unzipping these files as their own step, we could enable the script to read gzipped files directly.
In kmerize.smk, we would simply replace fasta = SeqIO.parse(input.fasta, "fasta") with the following:
if input.fasta.endswith('.gz'):
    fasta_handle = gzip.open(input.fasta, 'rt')
else:
    fasta_handle = open(input.fasta, 'r')
fasta = SeqIO.parse(fasta_handle, "fasta")
...
fasta_handle.close()
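To make the proposal above easier to try out in isolation, here is a minimal, standard-library-only sketch of the same if/else idea. The helper names `open_fasta` and `iter_fasta_ids` are hypothetical and not part of Snekmer; the header loop stands in for `SeqIO.parse`, which also accepts an open text handle. Wrapping the handle in a `with` block avoids the explicit `close()` call.

```python
import gzip
from typing import IO, Iterator


def open_fasta(path: str) -> IO[str]:
    """Open a FASTA file in text mode, whether or not it is gzipped."""
    if path.endswith(".gz"):
        # 'rt' makes gzip decode bytes to text, matching a plain open().
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_fasta_ids(path: str) -> Iterator[str]:
    """Yield the ID (first token of each '>' header line) in a FASTA file."""
    with open_fasta(path) as handle:  # closed automatically on exit
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                yield fields[0] if fields else ""
```

In a rule body, the same handle could be handed to the parser instead, for example `with open_fasta(input.fasta) as handle: fasta = list(SeqIO.parse(handle, "fasta"))`, assuming Biopython is installed.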
In learn.smk and others, we would remove all of the UZ variables and scripts relating to unzipping, and replace FA_MAP with the following code:
input_files = glob(join(input_dir, "*"))
FA_MAP = {
    f.split('.')[0]: '.'.join(f.split('.')[1:])
    for f in (os.path.basename(x) for x in input_files)
}
This would create a dictionary with something like this as an output:
{'UP000004358_314230': 'fasta', 'UP000031546_45670': 'fasta.gz',...}
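One caveat with splitting on the first dot is that a base name containing a dot (say, `sample.v2.faa`) would be cut short. The sketch below is one possible variant that matches against a list of known extensions instead. The function name `build_fa_map` and the `KNOWN_EXTS` tuple are assumptions for illustration, not Snekmer's actual configuration.

```python
import os
from glob import glob

# Hypothetical list of accepted extensions, longest first; adjust as needed.
KNOWN_EXTS = ("fasta.gz", "faa.gz", "fa.gz", "fasta", "faa", "fa")


def build_fa_map(input_dir: str) -> dict:
    """Map each input file's base name to its extension (e.g. 'fasta.gz')."""
    fa_map = {}
    for path in sorted(glob(os.path.join(input_dir, "*"))):
        name = os.path.basename(path)
        for ext in KNOWN_EXTS:
            if name.endswith("." + ext):
                # Strip only the matched suffix so dots in the stem survive.
                fa_map[name[: -(len(ext) + 1)]] = ext
                break
    return fa_map
```

Files with unrecognized extensions are skipped. If both a plain and a gzipped copy of the same base name exist, the later one in sorted order overwrites the earlier entry, so a real implementation would need to decide how to handle that case.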
Advantages:
* Simpler
* Less code to maintain long term
* There should be no differences between Mac, Windows and Linux
* No duplicate files created in a zipped directory within input (less overall storage space used, and files are never unzipped)
Disadvantages:
* Requires removing and updating existing code.
* Possible speed changes?
|
FYI the changes to gzipping have been implemented in model, search, and cluster modes but not to learn, apply, or motif modes, which have not been pulled into this branch yet. I changed the underlying code to use |
The initial reason I didn't use a similar if/else to handle file unzipping is that these can complicate Snakemake's understanding of how to handle files, hence a higher-level rule to optionally handle gzipped files. I would try testing the API changes for file getting (see syntax in model.smk lines 38-69: https://github.com/PNNL-CompBio/Snekmer/blob/background/snekmer/rules/model.smk#L38) and then unzipping (model.smk lines 142-152: https://github.com/PNNL-CompBio/Snekmer/blob/background/snekmer/rules/model.smk#L142) on snekmer learn/apply and see if those changes work. Before doing that, I would test snekmer model/cluster/search on Windows to answer the original question of whether the new unzipping code works on Windows systems. |
In Windows, when running Snekmer with an input of the 4 files from the /resources/tutorial/demo_files/input folder (2 .faa and 2 .faa.gz files), the following message is given in the command line and in the log:
Building DAG of jobs...
MissingInputException in line 46 of C:\Users\jerg881\Miniconda3\envs\snekmer\lib\site-packages\snekmer\rules\kmerize.smk:
Missing input files for rule vectorize:
input\NapB.faa
This output is given for all the commands "snekmer model --dryrun", "snekmer model", "snekmer cluster --dryrun", and "snekmer cluster". No changes to the directory are made. The 2 zipped input files remain zipped, and no output directory is generated.
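The log shows Snakemake looking for `input\NapB.faa`, written with a backslash separator. Whether the separator plays any part in the failure is not established in this thread. The sketch below is only a diagnostic aid for comparing paths across platforms, and the helper names `to_posix` and `is_gzipped` are hypothetical. It uses `PureWindowsPath` so that it behaves the same on any operating system.

```python
from pathlib import PureWindowsPath


def to_posix(path: str) -> str:
    """Return a Windows-style path rewritten with forward slashes."""
    return PureWindowsPath(path).as_posix()


def is_gzipped(path: str) -> bool:
    """Report whether the file name ends in a .gz suffix."""
    return PureWindowsPath(path).suffix == ".gz"


# Example: the path reported in the MissingInputException above.
print(to_posix(r"input\NapB.faa"))  # prints input/NapB.faa
```

Printing the normalized form of each expected input next to the files actually present in the input directory can help show whether a mismatch comes from the separator, the `.gz` suffix, or something else.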