windows- zipped files not detected/unzipped yet #60

Open
abbyjerger opened this issue Aug 10, 2022 · 6 comments · May be fixed by #118
Labels: bug (Something isn't working), HighPriority (Something that needs to be done soon)

Comments

@abbyjerger
Collaborator

On Windows, running Snekmer on the 4 files from the /resources/tutorial/demo_files/input folder (2 .faa and 2 .faa.gz files) produces the following message on the command line and in the log:

Building DAG of jobs...
MissingInputException in line 46 of C:\Users\jerg881\Miniconda3\envs\snekmer\lib\site-packages\snekmer\rules\kmerize.smk:
Missing input files for rule vectorize:
input\NapB.faa

The same output appears for each of the commands "snekmer model --dryrun", "snekmer model", "snekmer cluster --dryrun", and "snekmer cluster". The directory is left unchanged: the 2 zipped input files remain zipped, and no output directory is generated.

@abbyjerger abbyjerger added the bug Something isn't working label Aug 10, 2022
@christinehc christinehc added the HighPriority Something that needs to be done soon label Aug 11, 2022
@christinehc
Collaborator

@abbyjerger Is this still an issue for you?

christinehc added a commit that referenced this issue Dec 27, 2022
Changelog:
- Add Windows zip/unzip issue to docs (#60)
- Add Docker instructions for Apple silicon compatibility (#102)
- Add notes for BSF incompatibility with Apple silicon (#102)
- Specify `bash` command must be called to run demo example
christinehc added a commit that referenced this issue Jul 28, 2023
changelog:
- fix: update pyproject.toml to account for package files.
- build: change structure of repo to src-layout for automatic file detection and better compatibility with setuptools
- refactor: change wildcard globbing for input files to rely on snakemake.glob_wildcards rather than manual globbing via glob.glob (in progress - addresses #60)
- build: remove src from .gitignore
- (note: installation tested and is working locally. snekmer model will also now run, though with errors as the wildcard globbing is still in progress)
@christinehc christinehc linked a pull request Dec 12, 2023 that will close this issue
@jjacobson95
Collaborator

Just a status update: on Windows, I am still receiving this error while testing on the 'background' branch.

Command, Error message, Files in directory:

(snekmer) C:\Users\jaco059\Desktop\snekmer_test\learn>snakemake --snakefile ..\Snekmer\snekmer\rules\learn.smk --cores=1 --configfiles=config.yaml
Building DAG of jobs...
MissingInputException in line 57 of C:\Users\jaco059\Desktop\snekmer_test\Snekmer\snekmer\rules\kmerize.smk:
Missing input files for rule vectorize:
input\UP000004358_314230.fasta

(snekmer) C:\Users\jaco059\Desktop\snekmer_test\learn>ls
annotations  config.yaml  input

(snekmer) C:\Users\jaco059\Desktop\snekmer_test\learn>ls input
UP000004358_314230.fasta.gz  UP000056630_1739114.fasta.gz  UP000198893_569882.fasta.gz  UP000199168_556533.fasta.gz  UP000305778_2571141.fasta.gz
UP000031546_45670.fasta.gz   UP000057938_361183.fasta.gz   UP000199134_645273.fasta.gz  UP000239203_155976.fasta.gz  UP000323646_2593411.fasta.gz

As I have a Windows PC available, I'll look into this further. @christinehc how would you like me to commit changes once this is working? Should I do this directly on the background branch or elsewhere?

@jjacobson95
Collaborator

@christinehc @biodataganache I can see why this issue has been so elusive - there is a lot of code built around this that is difficult to reconcile across multiple smk scripts. I was wondering if there might be advantages to an alternative approach?

Instead of unzipping these files as their own step, we could enable the script to read gzipped files directly.

In kmerize.smk, we would simply replace fasta = SeqIO.parse(input.fasta, "fasta") with the following:

        # gzip must be imported in kmerize.smk for this to work.
        if input.fasta.endswith('.gz'):
            fasta_handle = gzip.open(input.fasta, 'rt')  # 'rt' = text mode
        else:
            fasta_handle = open(input.fasta, 'r')
        fasta = SeqIO.parse(fasta_handle, "fasta")

        ...
        fasta_handle.close()
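
The if/else above can also be wrapped in a small helper used as a context manager, so the handle is closed even if parsing fails. The following is only a minimal sketch: `open_fasta` and `count_records` are hypothetical names that do not appear in Snekmer, and counting header lines stands in for the `SeqIO.parse(handle, "fasta")` call so that the example runs with the standard library alone.

```python
import gzip
import os
import tempfile


def open_fasta(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    # Dispatch on the ".gz" suffix, as in the proposal above.
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, "rt")


def count_records(path):
    """Count FASTA records by counting header lines (those starting with '>')."""
    # Both open() and gzip.open() return context managers, so the handle
    # is closed automatically when the block exits.
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on throwaway files with made-up sequences.
demo_dir = tempfile.mkdtemp()
plain_path = os.path.join(demo_dir, "a.faa")
gz_path = os.path.join(demo_dir, "b.faa.gz")
records = ">seq1\nMKV\n>seq2\nMLA\n"
with open(plain_path, "w") as fh:
    fh.write(records)
with gzip.open(gz_path, "wt") as fh:
    fh.write(records)

plain_count = count_records(plain_path)
gz_count = count_records(gz_path)
```

In the real rule, the handle returned by the helper would presumably be passed to `SeqIO.parse` inside the `with` block instead of being iterated line by line.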

In learn.smk and others, we would remove all of the UZ variables and scripts relating to unzipping, and replace FA_MAP with the following code:

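import os
from glob import glob
from os.path import join

input_files = glob(join(input_dir, "*"))

# Map each base name (text before the first ".") to the rest of the filename.
FA_MAP = {
    f.split('.')[0]: '.'.join(f.split('.')[1:]) for f in (os.path.basename(x) for x in input_files)
}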

This would create a dictionary with something like this as an output:
{'UP000004358_314230': 'fasta', 'UP000031546_45670': 'fasta.gz',...}
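
To make the behavior of that dictionary concrete, here is a small self-contained sketch that builds the same mapping for a throwaway directory. `build_fa_map` is a hypothetical name, and the two empty demo files simply reuse filenames from the example above; nothing here is taken from the Snekmer code base.

```python
import os
import tempfile
from glob import glob


def build_fa_map(input_dir):
    """Map each file's base name to its extension(s), e.g. 'fasta' or 'fasta.gz'."""
    fa_map = {}
    for path in glob(os.path.join(input_dir, "*")):
        name = os.path.basename(path)
        # Split at the FIRST dot only, which gives the same result as
        # f.split('.')[0] and '.'.join(f.split('.')[1:]) in the proposal.
        base, _, ext = name.partition(".")
        fa_map[base] = ext
    return fa_map


# Demo with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for filename in ("UP000004358_314230.fasta", "UP000031546_45670.fasta.gz"):
    open(os.path.join(demo_dir, filename), "w").close()

fa_map = build_fa_map(demo_dir)
```

Two caveats follow from splitting at the first dot: a base name that itself contains a dot (say, `sample.v2.fasta`) would be keyed as `sample`, and a directory holding both `x.fasta` and `x.fasta.gz` would keep only one entry for `x`.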

Advantages:

  • Simpler
  • Less code to maintain long term
  • There should be no differences between Mac, Windows and Linux
  • No duplicate files created in a zipped directory within input. (less overall storage space used + files aren't unzipped)

Disadvantages:

  • Requires removing and updating current code.
  • Maybe speed changes?

@biodataganache
Collaborator

biodataganache commented Apr 12, 2024 via email

@christinehc
Collaborator

FYI, the gzip handling changes have been implemented in model, search, and cluster modes, but not in learn, apply, or motif modes, which have not been pulled into this branch yet. I changed the underlying code to use glob_wildcards rather than glob to collect files. I would therefore not expect unzipping to work with learn.smk, which is where the error is coming from.

@christinehc
Collaborator

The initial reason I didn't use a similar if/else to handle file unzipping is that such branches can complicate Snakemake's understanding of how to handle files, hence a higher-level rule to optionally handle gzipped files. I would try applying the API changes for file getting (see syntax in model.smk lines 38-69: https://github.com/PNNL-CompBio/Snekmer/blob/background/snekmer/rules/model.smk#L38) and then unzipping (model.smk lines 142-152: https://github.com/PNNL-CompBio/Snekmer/blob/background/snekmer/rules/model.smk#L142) to snekmer learn/apply and see if those changes work. Before doing that, I would test snekmer model/cluster/search on Windows to answer the original question of whether the new unzipping code works on Windows systems.
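
For anyone testing on Windows, the body of a separate unzip step can be written in pure Python so that it does not depend on a shell `gunzip` being available. This is only a sketch under that assumption: `gunzip_if_needed` is a hypothetical name, and this is not the code at the model.smk lines linked above.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_if_needed(path):
    """Return the path of an uncompressed copy of `path`, or `path` itself."""
    if not path.endswith(".gz"):
        return path  # already plain; nothing to do
    target = path[: -len(".gz")]
    # Stream the decompressed bytes to disk without loading the whole file.
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Demo on a throwaway gzipped file with made-up contents.
demo_dir = tempfile.mkdtemp()
gz_path = os.path.join(demo_dir, "NapB.faa.gz")
with gzip.open(gz_path, "wt") as fh:
    fh.write(">seq1\nMKV\n")

plain_path = gunzip_if_needed(gz_path)
with open(plain_path) as fh:
    contents = fh.read()
```

The compressed original is left in place, so both files exist afterward. That matches the storage trade-off raised earlier in the thread.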
