Sort input to krakenuniq to enable retrieval of cached batch runs. #570 #576
Conversation
v1.2.2 - Bouncy Basenji Patch PR
Sort the input channel to ensure batches of input will remain constant across runs.
Warning: A newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.1.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
I will need to sit down and study it @muniheart, but it looks very nice and concise! In the meantime, don't forget to add yourself to the list of contributors!
Thanks a lot for contributing the code for your idea 🙂. It makes sense to me and I've only added two nitpicks that you can do with as you please.
I would love to have a test for what you are trying to achieve here, but since we haven't historically tested any of the pipeline-level channel operations, I would feel bad about asking you to add tests now.
As suggested by @Midnighter, using toSortedList in place of collect improves clarity. Also suggested by @Midnighter: remove the closure argument to flatMap, since the operator with the closure only reproduced the operator's default behaviour.
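The swap described above can be sketched like this (the channel name and tuple layout are illustrative assumptions, not the pipeline's actual code):

```nextflow
// Hypothetical sketch: deterministic ordering before batching.
// ch_samples is assumed to emit [ meta, reads ] tuples.
ch_samples
    // collect would emit items in upstream completion order; toSortedList
    // makes the order reproducible and states the intent directly
    .toSortedList { a, b -> a[0].id <=> b[0].id }
    // no closure needed: flatMap's default behaviour already re-emits each
    // element of the sorted list individually
    .flatMap()
    .set { ch_sorted }
```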
Thanks for the suggestions, @Midnighter and @jfy133. I'm not comfortable adding myself to the list of main contributors for this project at this point.
Hi @Midnighter. The contributing guidelines (https://github.com/nf-core/taxprofiler/blob/master/.github/CONTRIBUTING.md) don't include information on writing tests. How would I upload test data to ${params.pipelines_testdata_base_path}, for example?
@muniheart don't worry about the nf-test thing yet; it is not yet implemented within the pipeline. And sorry, I didn't get to this today. I promise to get to it, though (that said, I very much trust @Midnighter, so technically we could merge already).
I appreciate your willingness to dive in, but as @jfy133 says, we're not ready for it. You don't break any existing tests, all good =]
I'm okay with not adding tests. Concerning the use of groupTuple here: according to the documentation it is blocking unless a size or groupKey is passed. Provided this is true, any effort to avoid blocking prior to calling groupTuple is moot. I have only tested with paired-read samples; I suspect it won't work as…
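For reference, the non-blocking variant mentioned in the Nextflow docs attaches a groupKey with a known group size; the names below are illustrative assumptions:

```nextflow
// Sketch only: groupTuple can emit each group as soon as it is complete
// when the expected number of members is declared via groupKey.
ch_reads
    .map { meta, reads -> [ groupKey(meta.db_name, meta.group_size), meta, reads ] }
    .groupTuple()   // no longer waits for the whole channel before emitting a group
```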
Ah ok, the new method does not seem to be working.
It may be partially because of the channel structure going into the …
So I guess the toSortedList may be in the wrong place?
@jfy133, it looks like type conversion was performed on the input samplesheet and that the type of id is not constant across records. The two id values compared here are 2612 and 'ERR3201952'. I had assumed id would always be a string value. I could cast to string before sorting.
Aha! Yes, casting to string would solve the problem :) 2612 is meant to be a sample name (it just happens to be an integer).
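A minimal sketch of the fix under discussion; exactly where the cast goes in the pipeline is an assumption:

```nextflow
// Casting both ids to String avoids the Integer-vs-String comparison error
// that occurs when a sample name like 2612 is parsed as a number.
ch_samples.toSortedList { a, b ->
    a[0].id.toString() <=> b[0].id.toString()
}
```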
A samplesheet with both integer and non-numeric (string) values of sample caused a type-mismatch error when the values were compared. Generate a prefix string from meta and compare on the prefix.
@jfy133, I sort by prefix as defined in the call to the map operator. This will resolve the type mismatch. The added bonus is that the sort order should be preserved by the groupTuple call.
@nf-core-bot fix linting |
Sorting within each group produced by groupTuple is sufficient to ensure the batches remain the same, regardless of the order of the input channel. Position the sort after the input channel has been grouped and after the prefix has been defined. Sort by prefix, which is guaranteed to be distinct within each group, to preserve order across workflow runs.
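The commit message above can be sketched roughly as follows; the tuple layout and the `prefix` field are assumptions for illustration:

```nextflow
// Sketch: group first, then sort each group's members by a prefix string
// derived from meta. Prefixes are distinct within a group, so batch
// composition is stable no matter how the input channel was ordered.
ch_grouped.map { db_name, entries ->
    // sort(false) returns a sorted copy rather than mutating in place
    [ db_name, entries.sort(false) { it.prefix } ]
}
```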
@muniheart I still haven't forgotten and still on my TODO list, sick kids (again) this week means I've not been able to work much :( |
Very clean now. I've tested it, and we at least have no regressions.
However, I don't get the cached results, unfortunately. If I run
`nextflow run ../main.nf -profile test_krakenuniq,docker --outdir ./results -ansi-log false -dump-channels`
and then
`nextflow run ../main.nf -profile test_krakenuniq,docker --outdir ./results -ansi-log false -dump-channels -resume --standardisation_taxpasta_format 'csv'`
KrakenUniq executes again :(
When I .dump() around each function in the group/flatMap chain, I see no difference, so I'm not sure why this is happening :(
So we could theoretically merge this in, but I don't think it makes a difference...
We might need to ask a Nextflow expert who better understands what causes resume to retrigger.
(And sorry for the slow response again; a new round of sick kids that my partner is dealing with this week, but I'm very backed up.)
@jfy133, thanks for testing it again. I have reproduced your pair of runs, also dumping task hashes with '-dump-hashes json'. The hashes for the krakenuniq tasks on the resume run are distinct from those on the first run. Comparing the component hashes, I observe that only one component of each task hash, the one corresponding to testdb-krakenuniq, differs between runs. This indicates the issue involves the preparation of the krakenuniq database. I had only tested with a locally installed database.
The database?!?! By preparation do you mean the untarring? I'm with my kids now so can't look more closely, but do you have any intuition what it could be? I'm surprised, because the database itself should not have anything different between runs (theoretically).
@jfy133, I appreciate your skepticism :). I will explain my findings in further detail. The PRELOADEDKRAKENUNIQ process takes the db path as input, so a link to the db directory is created in the task's working directory. Based on timestamps in that directory, it appears that the krakenuniq executable creates/modifies files there. The files that appear to be created/modified by the process are database.kdb.counts and taxDB. Changes would be reflected in the corresponding component of the hash for that task. That is what I observe. I haven't determined conclusively that this is the problem; I will experiment with deleting these two files at the end of the task script.
I was able to get this to use cached results. I don't have time to update the PR today, but I will explain how. In each Bash script block of process KRAKENUNIQ_PRELOADEDKRAKENUNIQ (there are two, joined by if-then-else), make a copy of the database, following links.
Pass 'db' as the database directory when calling krakenuniq so that it does not modify the original database. With these changes the second run will retrieve the cached results of process KRAKENUNIQ_PRELOADEDKRAKENUNIQ.
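Sketched below is the copy step described above. The fixture set-up stands in for Nextflow's input staging, and the paths are illustrative; the actual module script may differ:

```shell
# Stand-in for how Nextflow stages a database directory: a symlink in the workdir.
mkdir -p real_db
printf 'kmers' > real_db/database.kdb
ln -sfn real_db database

# The workaround: copy the database while dereferencing symlinks (-L), so that
# KrakenUniq writes database.kdb.counts / taxDB into the copy rather than the
# staged input, leaving the original directory (and its cache hash) untouched.
cp -r -L database db

# The process would then point KrakenUniq at the copy, e.g.
#   krakenuniq --db db ...
```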
I see two potential other options:
Ultimately I don't really understand what is changing within the database between runs that invalidates the cache.
Without having run any code locally myself, here is my understanding: when run locally, the database is a directory, and that gets passed around as a symlink. KrakenUniq made a poor design choice by writing files to that directory when profiling samples, rather than to a temporary or cache directory. So after profiling once, the directory contents are changed and Nextflow's cache gets invalidated. I had similar problems during database creation, and my solution was to specify the concrete content of the database directory as output and then stage those files (symlinks) in a specific local directory. I know this works, but it will take some extra effort.
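The approach described could look roughly like this in a database-build process; the process name, file names, and build command are examples for a KrakenUniq-style database, not the pipeline's actual module:

```nextflow
process BUILD_KRAKENUNIQ_DB {
    // Emitting concrete files instead of the whole directory means Nextflow
    // hashes stable per-file content, so extra files written later into the
    // directory can no longer invalidate downstream cache entries.
    output:
    path 'db/database.kdb', emit: kdb
    path 'db/database.idx', emit: idx
    path 'db/taxDB'       , emit: taxdb

    script:
    """
    # placeholder for the actual build invocation, e.g. krakenuniq-build
    build_database.sh db
    """
}
```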
That is a nice explanation, @Midnighter. The file created in the database directory is 'database.kdb.counts'. The creation of this file also forces the timestamp of the database directory to change. An alternative workaround I have found:
Does the first option really work? The docs say that file name and size are still considered, so there should still be a difference between the file being present or not. Additionally, the size of that file would be different for every profiling. Do we actually know what happens in a race condition where multiple KrakenUniq processes write to the same file? Or does it automatically generate new files with a different name for each profiling? I should play with the code, but I can't at the moment.
Thanks @Midnighter for pointing out the race-condition problem. To resolve this, I will stage each database file as a symlink in the workdir or a tmpdir therein.
Or maybe the counts file is being rebuilt every time if it's not present in the database, due to a badly documented clean-up step?
That seems to be the case then, yes.
Ok, then before we make any more modifications to this PR, I think we need to do the following:
Then re-test this PR to see if resume works. If yes, we can merge. If not, we need to investigate further. Does that sound good? If so, I can try to do 1 and 2 tomorrow.
That sounds good, @jfy133.
*krakenuniq build module, obviously 😅 Working on this now.
Woohoo! Fixing the database worked! I will push the new test data; once that's done, I'll ask you @muniheart to verify that resume works, and then I think we are ready!
OK @muniheart, the test data is updated. Please just re-run your tests as before to check that it now works; all krakenuniq processes were cached and resumed for me :)
Unfortunately, deNBI is down today and tomorrow. It would be great to test there on a larger scale. |
I reran your test, @jfy133, and all krakenuniq tasks were cached/retrieved. |
Perfect thank you very much for your patience @muniheart ! I learnt a lot too! |
PR checklist
- Code lints (`nf-core pipelines lint`).
- The test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- The pipeline runs in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).