FAQ
Why does BinSanity use more memory than other programs like CONCOCT or MetaBAT?
- BinSanity implements Affinity Propagation (AP), an algorithm that identifies exemplars among data points and forms clusters around those exemplars. It does this by treating every data point as a potential exemplar and exchanging messages between points until a good set of clusters emerges. AP is a deterministic algorithm whose time and memory requirements scale linearly with the number of similarities. BinSanity's accuracy is due in part to its biphasic approach, which separates coverage and composition during clustering, but it also relies heavily on the implementation of AP. Unfortunately, our attempts to use less computationally intensive clustering algorithms ultimately sacrificed accuracy.
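To make the exemplar idea concrete, here is a minimal sketch of Affinity Propagation on toy coverage profiles using scikit-learn's general-purpose AffinityPropagation. This is not BinSanity's exact invocation; the data sizes and the damping value are illustrative placeholders, not BinSanity defaults.

```python
# Minimal sketch of exemplar-based clustering with Affinity Propagation using
# scikit-learn; not BinSanity's exact invocation. Toy data and parameter
# values are placeholders, not BinSanity defaults.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
coverage = rng.random((500, 4))  # toy coverage profiles: contigs x samples

# AP treats every row as a potential exemplar and passes "responsibility" and
# "availability" messages over an N x N similarity matrix, which is why memory
# grows with the number of pairwise similarities. The preference is left at
# its default (the median similarity).
ap = AffinityPropagation(damping=0.9, random_state=0)
labels = ap.fit_predict(coverage)

print("clusters found:", len(ap.cluster_centers_indices_))
```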
How much memory will BinSanity require?
- On a Dell PowerEdge R920 with 1 TB of available RAM and Intel Xeon 2.3 GHz processors, it took 191 minutes and ~54 GB of RAM to run 27,643 contigs. Due to the linear increase of memory with the number of similarities, we have chosen to cap input at 100,000 contigs on this machine by choosing appropriate size cut-offs. Please contact us with any questions regarding this or suggestions on the best way to run BinSanity on whatever computer cluster you have access to.
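As a rough, assumption-laden illustration of that scaling (an estimate, not a measurement): a dense Affinity Propagation implementation keeps several N x N matrices of 8-byte floats in RAM (similarity, responsibility, availability), so memory grows quadratically with the number of contigs. The helper below is hypothetical and only counts those core matrices; real peak usage, such as the ~54 GB reported above, is higher because of temporaries and the rest of the workflow.

```python
# Rough back-of-the-envelope estimate (an assumption, not a measurement) of the
# memory needed just for the dense N x N float64 matrices a typical Affinity
# Propagation implementation holds (similarity, responsibility, availability).
def ap_core_matrix_gb(n_contigs, n_matrices=3, bytes_per_value=8):
    """Estimate memory in GB for n_matrices dense N x N float64 matrices."""
    return n_contigs ** 2 * n_matrices * bytes_per_value / 1e9

for n in (27_643, 100_000):
    print(f"{n:>7} contigs -> ~{ap_core_matrix_gb(n):.0f} GB for the core matrices")
# ~18 GB for 27,643 contigs and ~240 GB for 100,000 contigs; actual peak usage
# is higher (e.g. the ~54 GB reported above) because of temporaries and overhead.
```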
If Binsanity-lc uses less memory, why not use it all the time?
- Binsanity-lc reduces memory demands by subsetting the contigs into groups via a rough k-means clustering. Unlike Affinity Propagation, k-means requires the user to specify the ultimate number of clusters (N). You could estimate this number by using single-copy genes to estimate how many genomes may be present in an assembly (such as here) and use that as a guide to initialize clustering. Methods that require a priori specification of the cluster number (N) can in some cases mis-cluster contigs, because a contig may be forced into one of the N bins even when no good fit exists. So in essence the most memory-efficient route isn't always the best one. The computational intensity of Affinity Propagation may make the method more difficult to run, but it ultimately maintains a consistent level of accuracy. That being said, I have done a lot of testing with Binsanity-lc and find the results to still be highly robust. Depending on the number of contigs, I have set the initial cluster number -C anywhere between 10 and 500 on datasets of 100K to 1M contigs with very good results. A simplified sketch of this two-stage approach is shown below.
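The sketch below illustrates the two-stage idea described above; it is not Binsanity-lc's actual code. K-means roughly partitions the contigs first, and Affinity Propagation then clusters within each partition, so no single AP run has to hold the full N x N similarity matrix. The data sizes, initial cluster count, and damping value are illustrative placeholders.

```python
# Simplified sketch of the two-stage Binsanity-lc idea (not its actual code):
# k-means roughly partitions the contigs, then Affinity Propagation clusters
# within each partition. All sizes and parameters are illustrative placeholders.
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

rng = np.random.default_rng(0)
coverage = rng.random((5_000, 4))   # toy coverage profiles: contigs x samples
initial_clusters = 25               # analogous in spirit to the -C value above

kmeans_labels = KMeans(n_clusters=initial_clusters, n_init=10,
                       random_state=0).fit_predict(coverage)

final_labels = np.empty(len(coverage), dtype=int)
next_bin_id = 0
for k in range(initial_clusters):
    members = np.where(kmeans_labels == k)[0]
    ap = AffinityPropagation(damping=0.9, random_state=0)
    ap_labels = ap.fit_predict(coverage[members])
    final_labels[members] = ap_labels + next_bin_id  # keep bin IDs globally unique
    next_bin_id += ap_labels.max() + 1

print("total bins across all partitions:", next_bin_id)
```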
Other questions?
Post them here!
Please reach out if there are any questions or comments.