Stuck in Clustering MASH database step #259
Hi @neptuneyt,
Interesting; sorry you're hitting this error. I would normally say the issue is running out of RAM, but 2.2 TB should be plenty. Do you have access to all that RAM, or are you requesting a smaller amount? I'll also note that if you run with
Best,
Thanks for your prompt reply. I'm confident there is no memory limit:

```shell
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 9285790
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10000000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ top
top - 09:20:40 up 1 day, 17:37,  1 user,  load average: 1.10, 1.07, 1.05
Tasks: 1152 total,   2 running, 1150 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  0.0 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 23774438+total, 12433056+free, 37380403+used, 76033420+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 20011358+avail Mem

  PID USER   PR NI    VIRT    RES   SHR S  %CPU %MEM   TIME+ COMMAND
54143 yut    20  0 2141.4g 289.7g 33216 R 100.0 12.8 2149:08 dRep

(drep) [yut@node-fat 58375MAGs_dRep99.99]$ mash --version
1.1
```
Hi @neptuneyt,
The issue is actually with Python's clustering using SciPy, not Mash. Mash generates the distance matrix, while SciPy performs the clustering on it. Unfortunately, this step is single-threaded and can be extremely memory-intensive. If you run dRep with the -d flag, it will output the distance matrix directly, allowing you to attempt clustering with an alternative method if needed. Just a heads-up: this clustering step is typically the main RAM bottleneck in dRep. I wish I had a better workaround to offer; apologies for that!
Best,
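If you do extract the distance matrix, the clustering dRep hands off to SciPy can be approximated directly with `scipy.cluster.hierarchy`. The sketch below is illustrative only: the matrix values, the average-linkage method, and the 0.1 distance cutoff are assumptions for demonstration, not dRep's exact internals.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical example: a tiny symmetric Mash-distance matrix for 4 genomes.
# In practice you would load the matrix that dRep writes out.
dist = np.array([
    [0.00, 0.02, 0.40, 0.41],
    [0.02, 0.00, 0.42, 0.40],
    [0.40, 0.42, 0.00, 0.03],
    [0.41, 0.40, 0.03, 0.00],
])

# linkage() expects the condensed (upper-triangle) form: n*(n-1)/2 entries.
condensed = squareform(dist, checks=False)

# Single-threaded hierarchical clustering -- this is the kind of SciPy step
# whose memory use blows up as the number of genomes grows.
Z = linkage(condensed, method="average")

# Cut the tree at a Mash-distance threshold of 0.1 (an assumed cutoff here).
clusters = fcluster(Z, t=0.1, criterion="distance")
print(clusters)  # genomes 0/1 and 2/3 fall into two separate clusters
```

Anything that accepts a precomputed condensed distance matrix (e.g. `fastcluster`, or a greedy single-pass grouping) could be substituted at the `linkage` step if SciPy's memory footprint is the blocker.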
Thank you so much for helping me troubleshoot the problem. I've also noticed in #174 that dRep seems to hit a limit at around 50,000 genomes; even though I can split the genomes into groups, I still run into this problem whenever the final merge exceeds that number. Nonetheless, dRep is an outstanding tool and has become an important part of standard genome-dereplication workflows. As omics data expands rapidly, dRep will be even more useful if this problem can be solved.
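For scale, here is a back-of-envelope estimate of just the condensed distance matrix at 58,375 genomes, assuming float64 entries and ignoring the extra working memory SciPy allocates during linkage (which pushes peak usage far higher):

```python
# Rough size of the condensed pairwise-distance matrix alone (float64).
# SciPy's clustering step allocates additional working memory on top of this.
n = 58_375
pairs = n * (n - 1) // 2      # number of unique genome pairs
gib = pairs * 8 / 2**30       # 8 bytes per float64 entry
print(f"{pairs:,} pairwise distances, about {gib:.1f} GiB")
```

The matrix itself (roughly 1.7 billion pairs, on the order of 13 GiB) is modest next to 2.2 TB, so the ~290 GB resident size seen in `top` suggests the bottleneck is the clustering algorithm's temporaries rather than matrix storage alone.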
Dear developer,
Thanks for such an amazing tool.
With 30,000 genomes I got results successfully within 10 h; notably, the run went from "Clustering MASH database" to "primary clusters made" in only 7 minutes. However, after increasing to 58,375 genomes, the program stays stuck at "Clustering MASH database" and the "Clustering_files" directory remains empty. Below are the commands and a portion of the key logs for both runs. I would appreciate any insight into the cause and a possible solution.
version
dRep v3.4.5
hardware
Linux node-fat 3.10.0-1160.el7.x86_64, 2.2 TB memory
30,000 genomes
command
key log
58,375 genomes
command
key log and output
```shell
03-09 21:43 INFO Running primary clustering
03-09 21:43 INFO Running pair-wise MASH clustering
03-09 21:43 INFO Will split genomes into 6 groups for primary clustering
03-10 01:09 DEBUG Clustering MASH database
```
The run has now been stuck in this state for more than 48 hours.