Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort input to krakenuniq to enable retrieval of cached batch runs. #570 #576

Merged
merged 14 commits into from
Mar 6, 2025

Conversation

muniheart
Copy link

@muniheart muniheart commented Feb 3, 2025

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/taxprofiler branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

jfy133 and others added 4 commits January 15, 2025 10:25
Copy link

github-actions bot commented Feb 3, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.1.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@jfy133
Copy link
Member

jfy133 commented Feb 4, 2025

I will need to sit down and study it @muniheart but likes very nice and concise!

In the meantime, don't forget to

  1. Update the CHANGELOG
  2. Add yourself to the contributors list on the README and nextflow.config manifest :) (can be anonymous if you prefer by using your github handle)

Copy link
Collaborator

@Midnighter Midnighter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for contributing the code for your idea 🙂. It makes sense to me and I've only added two nitpicks that you can do with as you please.

I would love to have a test for what you are trying to achieve here, but since we haven't historically tested any of the pipeline-level channel operations, I would feel bad about asking you to add tests now.

As suggested by @Midnighter, using toSortedList in place of collect improves clarity.

Also: [ suggested by @Midnighter ]
Remove closure argument to flatMap.  The operator with closure argument performed the default behavior of the operator without closure.
@muniheart
Copy link
Author

Thanks for suggestions, @Midnighter and @jfy133. I'm not comfortable adding myself to list of main contributors for this project at this point.

@muniheart muniheart closed this Feb 4, 2025
@muniheart muniheart reopened this Feb 4, 2025
@muniheart
Copy link
Author

Hi @Midnighter. At 'contributing guidlines' ( https://github.com/nf-core/taxprofiler/blob/master/.github/CONTRIBUTING.md ), I don't find information regarding the writing of tests. How would I upload test data to ${params.pipelines_testdata_base_path}, for example?

@jfy133
Copy link
Member

jfy133 commented Feb 6, 2025

@muniheart don't worry about the nf-test thing yet, this is not yet implemented within the pipeline

And sorry, I didn't get to this today, I promise to get to it though (that said, I very much trust @Midnighter so techcnially we could merge in already)

@Midnighter
Copy link
Collaborator

I appreciate your willingness to dive but as @jfy133 says, we're not ready for it. You don't break any existing tests, all good =]

@muniheart
Copy link
Author

muniheart commented Feb 6, 2025

I'm okay with not adding tests.

Concerning the use of groupTuple here, according to documentation it is blocking unless a size or groupKey is passed. Provided this is true, any effort to avoid blocking prior to calling groupTuple is moot.

I have only tested with paired-read samples. I suspect it won't work as expected intended when both single- and paired-read samples are present, and that, therefore, sort should: be called just prior to groupTuple; sort by: prefix, db_meta.db_name, db. sort order should be: id, run_accession, single_end, db_meta.db_name, db.

@jfy133
Copy link
Member

jfy133 commented Feb 12, 2025

Ah ok the new method does not seem to be working:

ERROR ~ Invalid method invocation `call` with arguments: java.lang.IllegalArgumentException: Cannot compare java.lang.String with value 'ERR3201952' and java.lang.Integer with value '2,613' (java.lang.IllegalArgumentException) on _closure36 type

@jfy133
Copy link
Member

jfy133 commented Feb 12, 2025

It maybe partially because the a[0].id etc calls doesn't work:

The channel structure going into the toSortListed looks like this:

[DUMP: info] [['id':2612, 'instrument_platform':'ILLUMINA', 'single_end':true, 'is_fasta':false, 'type':'short'], [/workspace/taxprofiler-dev/testing/work/b8/d0fd82b267f76746bb1f57fce6a6e1/2612.merged.fastq.gz], ['tool':'krakenuniq', 'db_name':'db6', 'db_params':'', 'type':'short'], /workspace/taxprofiler-dev/testing/work/0c/5a72e20f68794c13dcb893e5301f61/testdb-krakenuniq]

So I guess the toSortedList may be in the wrong place?

@muniheart
Copy link
Author

@jfy133, it looks like type conversion was performed on the input samplesheet and that the type of id is not constant across records. The two id values compared here are: 2612 and 'ERR3201952'. I had assumed id would always be a string value. I could cast to string before sort.

'

@jfy133
Copy link
Member

jfy133 commented Feb 12, 2025

Aha! Yes, casting as string would solve the problem:)

2612 is meant to be a sample name (just happens to be an integer)

A samplesheet with both integer and non-numeric ( string ) values of sample caused type-mismatch error when compared.  Generate prefix string from meta and compare on prefix.
@muniheart
Copy link
Author

@jfy133, I sort by prefix as defined in the call to map operator. This will resolve the type-mismatch. The added bonus is that the sort order should be preserved by the groupTuple call.

@jfy133
Copy link
Member

jfy133 commented Feb 13, 2025

@nf-core-bot fix linting

Sorting within each group produced by groupTuple is sufficient to ensure the batches will remain the same, regardless of the order of the input channel.  Position the sort after input channel has been grouped, and after prefix has been defined.  Sort by prefix, guaranteed to be distinct within each group, to preserve order across workflow runs.
@jfy133
Copy link
Member

jfy133 commented Feb 20, 2025

@muniheart I still haven't forgotten and still on my TODO list, sick kids (again) this week means I've not been able to work much :(

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean now, I've tested it and we have at least have no regressions

However I don't get the cached results unfortauntely, if I run

nextflow run ../main.nf -profile test_krakenuniq,docker --outdir ./results -ansi-log false -dump-channels

Then

nextflow run ../main.nf -profile test_krakenuniq,docker --outdir ./results -ansi-log false -dump-channels -resume --standardisation_taxpasta_format 'csv'

Krakenuniq executs again :(

When I .dump() around each function in the group/flatmap chain, I see no difference, so I'm not sure why this is happening :(

So we could theoretically merge this in but I don't think it makes a difference...

@jfy133
Copy link
Member

jfy133 commented Feb 27, 2025

We might need to ask a Nextflow expert who better understands what causes resume to retrigger

@jfy133
Copy link
Member

jfy133 commented Feb 27, 2025

(and sorry for long response again, new round of sick kids which my partner is dealing with this week but I'm very backuped )

@muniheart
Copy link
Author

@jfy133 Thanks for testing it again. I have reproduced your pair of runs, also dumping task hashes with '-dump-hashes json'. The hashes for the krakenuniq tasks on the resume run are distinct from those on the first run. Comparing the component hashes, I observer that only one component of each task hash, that corresponding to testdb-krakenuniq, differs between runs. This indicates the issue involves the preparation of a krakenuniq database. I had only tested with a locally installed database.
kuniq_hash.resume.json
kuniq_hash.json

.

@jfy133
Copy link
Member

jfy133 commented Feb 28, 2025

The database?!?!

By preparation do you mean the untarring?

I'm with my kids now so can't look more closely, but do you have any intuition what it could be?

I'm surprised because the database itself should not have anything different between runs (theoretically

@muniheart
Copy link
Author

@jfy133, I appreciate your skepticism :). I will explain in further detail my findings. The PRELOADEDKRAKENUNIQ process takes the db path as input, so a link to the db directory is created in the task's working directory. Based on timestamps in that directory, it appears that the krakenuniq executable creates/modifies files in that directory. Files that appear to be created/modified by the process are: database.kdb.counts, taxDB. Changes would be reflected in the corresponding component of the hash for that task. That is what I observer. I haven't determined conclusively that this is the problem. I will experiment with deleting these 2 files at the end of the task script.

@muniheart
Copy link
Author

I was able to get this to use cached results. I don't have time to update PR today, but I will explain how. In each BASH script block of process KRAKENUNIQ_PRELOADEDKRAKENUNIQ ( there are 2, joined by if-then-else ), make a copy of the database, following links:
process KRAKENUNIQ_PRELOADEDKRAKENUNIQ {

        """
        # Store the batch of samples for later command input.
        cat <<-END_INPUTS > ${command_inputs_file}
        ${command_inputs.join('\n        ')}
        END_INPUTS

        # Make copy of db and pass it to krakenuniq so we don't modify original db directory.
        cp -aL $db db

        # Preload the KrakenUniq database into memory.
        krakenuniq \\
            $args \\
            --db db \\
            --preload \\
            --preload-size $ram_chunk_size \\
            --threads $task.cpus

Pass 'db' as database directory when calling krakenuniq so that it does not modify original database. With these changes the 2nd run will retrieve cached results of process KRAKENUNIQ_PRELOADEDKRAKENUNIQ.

@Midnighter
Copy link
Collaborator

I see two potential other options:

  1. Perhaps the paths that are created could be ignored in the input to processes?
  2. Instead of using bash to copy, we can hard code or configure the stage in mode to using a copy. Creating a copy creates quite some space requirements, at least for local use, with remote executors the data has to be copied anyway.

@jfy133
Copy link
Member

jfy133 commented Mar 4, 2025

I see two potential other options:

  1. Perhaps the paths that are created could be ignored in the input to processes?
  2. Instead of using bash to copy, we can hard code or configure the stage in mode to using a copy. Creating a copy creates quite some space requirements, at least for local use, with remote executors the data has to be copied anyway.
  1. Im Not sure what you mean here... Can you expand?
  2. For this I'm not a huge fan because KrakenUniq databases can be VERY large (and if you have multiple batches with the same copied file in parallel that's quite scary

Ultimately I don't really understand what is changing within the database between each run to cause it invalid

@Midnighter
Copy link
Collaborator

Without having run any code locally myself, here is my understanding:

When run locally, the database is a directory and that gets passed around as a symlink. KrakenUniq made a poor design choice by writing files to that directory when profiling samples rather than to a temporary or cache directory. So after profiling once, the directory contents are changed and nextflow's cache gets invalidated.

I had similar problems during database creation and my solution was to specify concrete content of the database directory as output and then stage those files (symlinks) in a specific local directory. I know this works but will take some extra effort.

@muniheart
Copy link
Author

That is a nice explanation, @Midnighter. The file created in the database directory is 'database.kdb.counts'. The creation of this file also forces the timestamp of the database directory to change. An alternative work-around I have found:

  1. Set cache = 'lenient' for this process. This will ignore changed timestamp of database directory.
  2. Delete file database.kdb.counts within the BASH script after each krakenuniq call.

@Midnighter
Copy link
Collaborator

Does the first option really work? The docs say that file name and size are still considered, so there should still be a difference between the file present or not. Additionally, the size of that file would be different for every profiling.

Do we actually know what happens in a race condition where multiple KrakenUniq processes write to the same file? Or does it automatically generate new files with a different name for each profiling? I should play with the code, but I can't at the moment.

@muniheart
Copy link
Author

Thanks @Midnighter for pointing out the race condition problem. To resolve this, I will stage each database file as symlink in workdir or tmpdir therein.

@jfy133
Copy link
Member

jfy133 commented Mar 5, 2025

Or maybe the counts file is being rebuilt every time if it's not present in the database, due to a badly doucmented clean up step?

fbreitwieser/krakenuniq#96 (comment)

@Midnighter
Copy link
Collaborator

That seems to be the case then, yes.

@jfy133
Copy link
Member

jfy133 commented Mar 5, 2025

Ok then I think first before we make any more modifications to this PR, we need do the following:

  1. Update the kraken2-build module (not used in this pipeline, but related) to not remove that file
  2. Update our test data used in the pipeline to include the counts file

Then re-test this PR to see if resume then works

If yes, we can merge. If not we need to investigate further - does that sound good? If so I can try and do 1/2 tomorrow.

@muniheart
Copy link
Author

That sounds good, @jfy133.

@jfy133 jfy133 marked this pull request as draft March 6, 2025 08:33
@jfy133
Copy link
Member

jfy133 commented Mar 6, 2025

krakenuniq build module* obviously 😅 working on this now

@jfy133
Copy link
Member

jfy133 commented Mar 6, 2025

Woohoo! Fixing the database worked! I will push the new test data, then once that's done, I ask you @muniheart to verify the resume works and then I think we are ready!

@jfy133
Copy link
Member

jfy133 commented Mar 6, 2025

OK @muniheart the test data is updated, plesae just re run your tests as before to check it now works, but all krakenuniq proceses were cached and resumed for me :)

#576 (review)

@Midnighter
Copy link
Collaborator

Unfortunately, deNBI is down today and tomorrow. It would be great to test there on a larger scale.

@muniheart
Copy link
Author

I reran your test, @jfy133, and all krakenuniq tasks were cached/retrieved.

@jfy133
Copy link
Member

jfy133 commented Mar 6, 2025

Perfect thank you very much for your patience @muniheart !

I learnt a lot too!

@jfy133 jfy133 marked this pull request as ready for review March 6, 2025 21:03
@jfy133 jfy133 merged commit 3c82ed6 into nf-core:dev Mar 6, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suggestion: Maintain order of input to KRAKENUNIQ_PRELOADEDKRAKENUNIQ across runs
3 participants