Stuck in Clustering MASH database step #259

Open

neptuneyt opened this issue Mar 10, 2025 · 4 comments

@neptuneyt

Dear developer,
Thanks for such an amazing tool.
When I used 30,000 genomes, the run finished successfully within 10 h; in particular, the step from "Clustering MASH database" to "primary clusters made" took only 7 minutes. However, when I increased to 58,375 genomes, the program stays stuck at "Clustering MASH database" and the "Clustering_files" directory remains empty. Below are the commands and a portion of the key logs for both runs. I would appreciate any idea of what might be causing this and how to solve it.

version

dRep v3.4.5

hardware

Linux node-fat 3.10.0-1160.el7.x86_64, 2.2 TB memory

30,000 genomes

command

$ dRep dereplicate 30kMAGs_dRep99.99_1st --genomeInfo 1th_30k_cm2.csv -g 1th_30k.path  -p 30 -sa 0.9999 -comp 50 -con 10 --skip_plots &>drep1st.log

key log

....
.:: dRep dereplicate Step 2. Cluster ::..
03-08 11:09 INFO Running primary clustering
03-08 11:09 INFO Running pair-wise MASH clustering
03-08 11:09 INFO Will split genomes into 6 groups for primary clustering
03-08 12:16 DEBUG Clustering MASH database
03-08 12:23 DEBUG Saving primary_linkage pickle to 30kMAGs_dRep99.99_1st/data/Clustering_files/
03-08 12:23 INFO 7292 primary clusters made
03-08 12:23 INFO Running secondary clustering
03-08 12:23 INFO Running 831313 fastANI comparisons- should take ~ 1511.4 min
03-08 12:23 DEBUG running cluster 4950
...

58,375 genomes

command

$ ulimit -s 10000000 # to avoid the 'argument list too long' error from mash
$ dRep dereplicate 58375MAGs_dRep99.99_all --genomeInfo 58375MAGs_sort_cm2.csv -g 58375MAGs_sort.path   -p 30 -sa 0.9999 -comp 50 -con 10 --primary_chunksize 10000 --skip_plots &>drep_all.log

key log and output

...
03-09 21:43 DEBUG Filtering genomes
03-09 21:43 INFO 98.87% of genomes passed checkM filtering
03-09 21:43 DEBUG Storing resulting files
03-09 21:43 INFO
..:: dRep dereplicate Step 2. Cluster ::..

03-09 21:43 INFO Running primary clustering
03-09 21:43 INFO Running pair-wise MASH clustering
03-09 21:43 INFO Will split genomes into 6 groups for primary clustering
03-10 01:09 DEBUG Clustering MASH database
This step has now been stuck for more than 48 h.

(drep) [yut@node-fat 58375MAGs_dRep99.99]$ ll 58375MAGs_dRep99.99_all/data/MASH_files/MASH_files/
total 576G
-rw-r--r-- 1 yut 510 452M Mar 9 09:49 ALL.msh
-rw-r--r-- 1 yut 510 575G Mar 9 23:17 chunk_all_MASH_table.tsv
drwxr-xr-x 14 yut 510 240 Mar 9 09:43 sketches
(drep) [yut@node-fat 58375MAGs_dRep99.99]$ ll 58375MAGs_dRep99.99_all/data/Clustering_files/
total 0

@MrOlm
Owner

MrOlm commented Mar 12, 2025

Hi @neptuneyt ,

Interesting- sorry you're hitting this error. I would normally say that the issue is running out of RAM, but 2.2 TB should be plenty. Do you have access to all that RAM, or are you requesting a smaller amount?

I'll also note that if you run with -d, it'll cache the results before clustering, so that if clustering fails, you can re-run with the same parameters and it'll pick back up at clustering.

Best,
Matt

@neptuneyt
Author

Thanks for your prompt reply. I'm sure there is no memory limit (see ulimit -a below). I'm currently the only user of this node and this is its only task; top shows the task is still running with about 300 GB of resident memory, but the load average is very low (1.05-1.1) given the 30 threads assigned to the job, and "Clustering_files" is still empty. Can we therefore conclude that this is a Mash clustering issue, and do you have any suggestions for it? Looking forward to your reply.

  • memory limit
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 9285790
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10000000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
  • top
$ top 
top - 09:20:40 up 1 day, 17:37,  1 user,  **load average: 1.10, 1.07, 1.05**
Tasks: 1152 total,   2 running, 1150 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  0.0 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 23774438+total, 12433056+free, 37380403+used, 76033420+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 20011358+avail Mem

   PID USER      PR  NI    VIRT    **RES**    SHR S  %CPU %MEM     TIME+ COMMAND
 54143 yut       20   0 2141.4g **289.7g**  33216 R 100.0 12.8   2149:08 dRep
  • mash version
(drep) [yut@node-fat 58375MAGs_dRep99.99]$ mash --version
1.1

@MrOlm
Owner

MrOlm commented Mar 12, 2025

Hi @neptuneyt,

The issue is actually with Python's clustering using SciPy, not Mash. Mash generates the distance matrix, while SciPy performs the clustering on it. Unfortunately, this step is single-threaded and can be extremely memory-intensive.
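
To put rough numbers on it (a back-of-the-envelope estimate, not a statement about dRep's exact internals): for 58,375 genomes the pairwise distance structures alone run into tens of gigabytes before any linkage work even begins, which is at least consistent with the single busy core and large resident memory you're seeing in top.

# Back-of-the-envelope sizes for an all-vs-all distance matrix of 58,375
# genomes (illustrative only; dRep's internal data structures may differ).
n = 58_375
pairs = n * (n - 1) // 2                # unique genome pairs
print(f"{pairs:,} pairs")               # 1,703,791,125
print(f"condensed float64 vector: ~{pairs * 8 / 1e9:.0f} GB")    # ~14 GB
print(f"full square float64 matrix: ~{n * n * 8 / 1e9:.0f} GB")  # ~27 GB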

If you run dRep with the -d flag, it will output the distance matrix directly, allowing you to attempt clustering with an alternative method if needed.
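
For example, here is a minimal, untested sketch of re-clustering a Mash distance table with SciPy outside of dRep. The column layout of chunk_all_MASH_table.tsv, the 0.9 primary ANI threshold, and the use of average linkage are all assumptions on my part - check them against your own files and settings before relying on the output.

# Minimal, untested sketch: hierarchical clustering of a Mash distance table
# with SciPy, outside of dRep. Column layout, the 0.9 primary ANI threshold,
# and the choice of average linkage are assumptions - verify them against
# your own files and dRep settings.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

MASH_CUTOFF = 0.1  # assumes a primary ANI threshold of 0.9

# Assumed layout: genome1, genome2, mash_distance in the first three columns.
df = pd.read_csv("chunk_all_MASH_table.tsv", sep="\t", header=None, usecols=[0, 1, 2])
df.columns = ["g1", "g2", "dist"]

genomes = sorted(set(df["g1"]) | set(df["g2"]))
idx = {g: i for i, g in enumerate(genomes)}
n = len(genomes)

# Build a symmetric square matrix, then condense it for linkage().
mat = np.zeros((n, n))
mat[df["g1"].map(idx).to_numpy(), df["g2"].map(idx).to_numpy()] = df["dist"].to_numpy()
mat = np.maximum(mat, mat.T)

Z = linkage(squareform(mat, checks=False), method="average")
clusters = fcluster(Z, t=MASH_CUTOFF, criterion="distance")

pd.DataFrame({"genome": genomes, "primary_cluster": clusters}).to_csv(
    "primary_clusters.csv", index=False)

At 58,375 genomes this hits the same wall as dRep itself (a 575 GB table plus a dense matrix in a single process), so it is mainly useful for smaller subsets or as a starting point for swapping in a more memory-frugal clustering approach.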

Just a heads-up- this clustering step is typically the main RAM bottleneck in dRep. I wish I had a better workaround to offer- apologies for that!

Best,
Matt

@neptuneyt
Author

Thank you so much for helping me troubleshoot the problem. I've also noticed in #174 that dRep's practical limit seems to be around 50,000 genomes; even though I can split the genomes into groups, I still run into this problem whenever the final merged set exceeds that size. Nonetheless, dRep is an outstanding tool and has become an important part of many standard workflows. As omics data keeps expanding, dRep will be even more useful if this problem can be solved.
Thanks again anyway.
