Usage
Before running `Binsanity-profile`, you should check your fasta headers. Problems arise if your fasta headers are all numerical, like below:
```
>1
AGTCGATGCAGCTAGCTGATGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
>2
ACGTGATGCTAGCTAGCTGACTAGCTGCGTTTTACCTATCTATCATCATCTC
....
```
Or if your headers are not simplified. Headers should be a single word with no spaces as a descriptor, as follows:
```
>Contig_1
AGTCGATGCAGCTAGCTGATGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
>Contig_2
ACGTGATGCTAGCTAGCTGACTAGCTGCGTTTTACCTATCTATCATCATCTC
....
```
The `Binsanity-profile` step or subsequent BinSanity clustering may fail in either of these cases. If your contig headers are not simplified or are all numerical, use the `simplify-fasta` script provided with BinSanity:
```
usage: simplify-fasta -i inputFasta -o outputFasta

optional arguments:
  -h, --help     show this help message and exit
  -i INPUTFASTA  Specify the name of the input file
  -o INPUTOUT    Specify the name for the output file
```
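A minimal invocation looks like this (the file names here are hypothetical; substitute your own):

```
$ simplify-fasta -i assembly.fa -o assembly-simple.fa
```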
BINSANITY-PROFILE USAGE
```
usage: Binsanity-profile -i fasta_file -s {sam,bam}_file -c output_file

***********************************************************************
******************************BinSanity********************************
**                                                                   **
**  Binsanity-profile is used to generate coverage files for         **
**  input to BinSanity. This uses featureCounts to generate a        **
**  coverage profile and transforms data for input into Binsanity,   **
**  Binsanity-refine, and Binsanity-wf                               **
**                                                                   **
***********************************************************************
***********************************************************************

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTFASTA         Specify fasta file being profiled
  -s INPUTMAPLOC        Identify location of BAM files
                        BAM files should be indexed and sorted
  -c OUTCOV             Identify name of output file for coverage information
  --transform TRANSFORM
                        Indicate what type of data transformation you want in the final file [Default: log]:
                        scale --> Scaled by multiplying by 100 and log transforming
                        log   --> Log transform
                        None  --> Raw coverage values
                        X5    --> Multiplication by 5
                        X10   --> Multiplication by 10
                        X100  --> Multiplication by 100
                        SQR   --> Square root
                        We recommend using a scaled log transformation for initial testing.
                        Other transformations can be useful on a case-by-case basis
  -T THREADS            Specify number of threads for featureCounts [Default: 1]
  -o OUTDIRECTORY       Specify directory for output files to be deposited [Default: working directory]
  --version             show program's version number and exit
```
First you need to generate input files for BinSanity (e.g. the coverage profile). The script `Binsanity-profile` is provided for this. `Binsanity-profile` generates a coverage file containing average contig coverage, calculated from a `.bam` file via featureCounts. In our tests we used Bowtie2 to produce a `.sam` file, then converted it to a `.bam` file using SAMtools. To maintain consistency we used the `.cov` suffix for all files output from this script. The script also takes as input the contig ids to be used in generating a profile. This reduces the size of the file by excluding contigs that won't be clustered (for example, if you wanted to cluster only contigs greater than 1000 bp, you would provide a list of contig ids greater than 1000 bp).
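The mapping step itself happens outside BinSanity; as a rough sketch of the Bowtie2 plus SAMtools route we describe, something like the following should produce a sorted, indexed BAM (all file names here are hypothetical):

```
# Build a Bowtie2 index for the assembly and map paired reads
bowtie2-build assembly.fa assembly_index
bowtie2 -x assembly_index -1 reads_R1.fastq -2 reads_R2.fastq -S sample.sam

# Sort and index the alignment, as Binsanity-profile expects
samtools sort -o sample.sorted.bam sample.sam
samtools index sample.sorted.bam
```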
- There are multiple transformation options identified by the `--transform` flag. We recommend the scaled option:
  - scale --> Scaled by multiplying by 100 and log transforming
  - log --> Log transform
  - None --> Raw coverage values
  - X5 --> Multiplication by 5
  - X10 --> Multiplication by 10
  - X100 --> Multiplication by 100
  - SQR --> Square root

  Other transformations can be useful in cases where the distribution of coverages has an extremely low range and when coverage values are low.
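For example, to skip the transformation and keep raw coverage values, you could pass `--transform None` (the file and directory names here are placeholders):

```
$ Binsanity-profile -i assembly.fa -s directory/to/BAM/files --transform None
```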
```
$ Binsanity-profile -i assembly.fa -s directory/to/BAM/files
```
- Note that it will read in ALL BAM files in the indicated directory and try to incorporate them into the coverage profile.
- This script will output two files: the raw `.cov` file and the transformed `.cov.x100.lognorm` coverage profile.
```
$ less assembly.cov.x100.lognorm
contig-1    1.2    0.4
contig-2    1.0    0.4
contig-3    1.3    4.2
contig-4    1.1    5.1
....
```
EXAMPLE
Let's say you have a directory system that looks like this:

```
$ ls
igmBAM_Files  igm.fa
```
Where `igmBAM_Files` is a directory containing the mapping files for the contigs contained in `igm.fa`. To run `Binsanity-profile` you would do this:

```
$ Binsanity-profile -i igm.fa -s igmBAM_Files -c igm_coverage
```

This will output two files, `igm_coverage.cov` and `igm_coverage.cov.x100.lognorm`, into the working directory. The file you will input into the next step is the transformed file `igm_coverage.cov.x100.lognorm`.
Once the coverage profile is generated for your fasta file you can run `Binsanity-wf`, `Binsanity`, `Binsanity-refine`, or `Binsanity-lc`. Here we will explain the usage of `Binsanity-wf`.
`Binsanity-wf` is a workflow designed to implement the biphasic approach of BinSanity, whereby `Binsanity` (which only uses coverage for clustering) and `Binsanity-refine` (which incorporates k-mers and %GC into clustering) are integrated into a single workflow that optimizes the final cluster output. The workflow does the following:

- Uses the `Binsanity` script to cluster contigs using ONLY coverage
- Uses CheckM to estimate completeness and redundancy prior to refinement
- Uses `Binsanity-refine` to refine bins from Step 1 that have high redundancy and low completion (defined in the section below)
- Creates a final set of refined bins

Please take note of the default preferences for the initial binning and refinement steps (-3 and -25, respectively). In many cases these work well, but modifications should be made relative to the sample type and expected level of diversity.
```
usage: Binsanity-wf -f [/path/to/fasta] -l [FastaFile] -c [CoverageFile] -o [OutputDirectory]

************************************************************************************************
**************************************BinSanity*************************************************
**  Binsanity-wf is a workflow script that runs Binsanity and Binsanity-refine sequentially. **
**  The following is included in the workflow:                                               **
**  STEP 1: Run Binsanity                                                                    **
**  STEP 2: Run CheckM to estimate completeness for refinement                               **
**  STEP 3: Run Binsanity-refine                                                             **
**  STEP 4: Create final BinSanity clusters                                                  **
**                                                                                           **
************************************************************************************************

optional arguments:
  -h, --help            show this help message and exit
  -c INPUTCOVFILE       Specify a transformed coverage file, e.g. log transformed
  -f FastaLocation      Specify directory containing your contigs
  -p PREFERENCE         Specify a preference [Default: -3]
                        Note: decreasing the preference leads to more lumping,
                        increasing will lead to more splitting. If your range
                        of coverages is low you will want to decrease the
                        preference; if you have 10 or fewer replicates, increasing
                        the preference could benefit you.
  -m MAXITER            Specify a max number of iterations [Default: 4000]
  -v ConvergenceIteration
                        Specify the convergence iteration number [Default: 400]
                        e.g. number of iterations with no change in the number
                        of estimated clusters that stops the convergence.
  -d DampeningFactor    Specify a damping factor between 0.5 and 1 [Default: 0.95]
  -l FastaFile          Specify the fasta file containing contigs you want to cluster
  -x SizeThreshold      Specify the contig size cut-off [Default: 1000 bp]
  -o OutputDirectory    Give a name to the directory BinSanity results will be output in
                        [Default: 'BINSANITY-RESULTS']
  --threads THREADS     Indicate how many threads you want dedicated to the subprocess CheckM [Default: 1]
  --kmer KMER           Indicate a number for the kmer calculation [Default: 4]
  --Prefix PREFIX       Specify a prefix to append to the start of all files generated during Binsanity
  --refine-preference INPUTREFINEDPREF
                        Specify a preference for refinement [Default: -25]
  --binPrefix BinPrefix
                        Specify what prefix you want appended to final bins {optional}
  --version             show program's version number and exit
```
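For instance, following the note above, a sample with a low range of coverages might warrant decreasing the preference. This is a sketch only; the file names are placeholders and the value is illustrative, not a recommendation:

```
$ Binsanity-wf -f . -l assembly.fa -c /path/to/coverage.cov.x100.lognorm -p -10
```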
Following the example schema above, if your working directory has the following files:

```
$ ls
igmBAM_Files  igm.fa  igm_contig_ids.txt  igm_coverage.cov  igm_coverage.cov.x100.lognorm
```

You would run the following command while in this directory to use `Binsanity-wf`:
```
$ Binsanity-wf -f . -l igm.fa -c igm_coverage.cov.x100.lognorm -o IGM-BinsanityWF
```
NOTE: there have been reports that if you DO NOT provide the FULL path to the coverage file, there can be issues reading and opening it.
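One way to work around this, assuming a POSIX shell, is to expand the full path yourself:

```
# "$(pwd)" expands to the absolute path of the working directory
$ Binsanity-wf -f . -l igm.fa -c "$(pwd)/igm_coverage.cov.x100.lognorm" -o IGM-BinsanityWF
```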
The final output will be deposited in the directory specified by `-o`, or by default in a directory called `BINSANITY-RESULTS`. In this case we chose to output into a directory called `IGM-BinsanityWF`, which contains the following after running `Binsanity-wf`:
```
$ ls IGM-BinsanityWF
BinSanity-Final-bins  BinSanityWf.log  Binsanity-records.tar.gz
```
Now we can break down each of these files to better understand the output.
- `BinSanity-Final-bins` is a directory that will contain all of the final bins generated by `Binsanity-wf`. When running the example dataset you should have 33 final bins output into this directory.
- `BinSanityWf.log` is a copy of the logfile for the run, which you can scroll through to track the clustering process.
- `Binsanity-records.tar.gz` is a gzipped archive that contains all the intermediate results of clustering, plus the CheckM results following the initial clustering with Affinity Propagation, classified as either high completion, high redundancy, or strain heterogeneous. Here is a brief description of some of the files you will find in there:
  - `[Prefix]_4mer_frequencies.txt` contains the k-mer frequency table for the last refined cluster
  - `[Prefix]_GC_count.txt` contains the GC% table for the last refined cluster
  - `[Prefix]_kmerGC.txt` contains the combined coverage profile with both GC% and k-mer frequencies for the last refined cluster
  - `[Prefix]_checkm_lineagewf-results.txt` contains the `checkm qa` results following Step 1 clustering with Binsanity
  - `[Prefix]_binsanity-checkm` is the output file for CheckM run at Step 2
  - `high_completion` contains high completion genomes generated during Step 1 clustering
  - `high_redundancy` contains high contamination genomes generated during Step 1 clustering
  - `strain_redundancy` contains genomes with high strain heterogeneity from Step 1
  - `REFINED-BINS` contains the refined genomes from Step 3, where high_redundancy genomes are refined
The most important output is `BinSanity-Final-bins`, which contains the final clusters.
- When running BinSanity, if you encounter an error similar to

```
S += ((np.finfo(np.double).eps * S + np.finfo(np.double).tiny * 100) * ...
MemoryError
```

  you may be hitting a memory wall when attempting to cluster. Affinity Propagation (AP) has time and memory requirements that scale quadratically with the number of inputs. In essence this means the difference between clustering 90,000 contigs and 120,000 contigs can be hundreds of gigabytes of memory. The most common symptom is NumPy crashing with a MemoryError due to insufficient memory. As of this update, there are two ways I suggest dealing with this error:
  - You could look at your assembly statistics. If you are running BinSanity with defaults, the size cut-off is 1000 bp. So looking at your assembly, how many contigs are less than 5 kb? Although we set a default cut-off of 1000 bp, contigs this small can have slightly divergent coverage and compositional profiles and may cause difficulty while clustering. While running the Tara Oceans dataset in our lab we found that, based on our assemblies, ~100,000 contigs was what we were left with when we clustered only contigs greater than 5 kb. This cut-off worked well for our machine, a Dell PowerEdge R920 with 1 TB of available RAM and an Intel Xeon 2.3 GHz processor. When we hit ~120,000 contigs, though, the program typically crashed due to its memory consumption.
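As a quick way to check this, here is a small shell sketch that counts contigs at or above 5 kb (the assembly file name is hypothetical, and multi-line fasta records are handled):

```
# For each header, score the previous record's length; count records >= 5000 bp
awk '/^>/ {if (len >= 5000) n++; len = 0; next} {len += length($0)} END {if (len >= 5000) n++; print n}' assembly.fa
```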
  - The second option would be to try `Binsanity-lc`. `Binsanity-lc` is a workflow similar to `Binsanity-wf`, but prior to clustering with AP, `Binsanity-lc` initially subsets contigs using coverage and k-means clustering. K-means is more computationally feasible for assemblies with greater than 100,000 contigs. Following this, `Binsanity-lc` follows the same workflow as `Binsanity-wf`. The parameters for `Binsanity-lc` are shown below:
```
usage: Binsanity-lc -f [/path/to/fasta] -l [fastafile] -c [coverage file] -o [output directory]

************************************************************************************************
**************************************BinSanity*************************************************
**  Binsanity-lc is a workflow script that will subset assemblies larger than 100,000        **
**  contigs using coverage prior to running Binsanity and Binsanity-refine sequentially.     **
**  The following is included in the workflow:                                               **
**  STEP 1: Use coverage to subsample contigs with k-means clustering                        **
**  STEP 2: Run Binsanity                                                                    **
**  STEP 3: Run CheckM to estimate completeness for refinement                               **
**  STEP 4: Run Binsanity-refine                                                             **
**  STEP 5: Create final BinSanity clusters                                                  **
**                                                                                           **
************************************************************************************************

optional arguments:
  -h, --help            show this help message and exit
  -c CoverageFile       Specify a coverage file
  -f FastaLocation      Specify directory containing the fasta file to be clustered
  -p Preference         Specify a preference [Default: -3]
                        Note: decreasing the preference leads to more lumping,
                        increasing will lead to more splitting. If your range
                        of coverages is low you will want to decrease the
                        preference; if you have 10 or fewer replicates, increasing
                        the preference could benefit you.
  -m MaximumIterations  Specify a max number of iterations [Default: 4000]
  -v ConvergenceIterations
                        Specify the convergence iteration number [Default: 400]
                        e.g. number of iterations with no change in the number
                        of estimated clusters that stops the convergence.
  -d DampeningFactor    Specify a damping factor between 0.5 and 1 [Default: 0.95]
  -l FastaFile          Specify the fasta file containing contigs you want to cluster
  -x SizeCutOff         Specify the contig size cut-off [Default: 1000 bp]
  -o OutputDirectory    Give a name to the directory BinSanity results will be output in
                        [Default: 'BINSANITY-RESULTS']
  --checkm_threads Threads
                        Indicate how many threads you want dedicated to the subprocess CheckM [Default: 1]
  --kmer Kmer           Indicate a number for the kmer calculation [Default: 4]
  --refine-preference   Specify a preference for refinement [Default: -25]
  -C ClusterNumber      Indicate a number of initial clusters for k-means [Default: 100]
  --Prefix Prefix       Specify a prefix to append to the start of all directories generated during Binsanity
  --version             show program's version number and exit
```
- For `Binsanity-lc` the key parameters that need to be set are `-C` and `--checkm_threads`.
  - `-C` sets the cluster number for k-means. K-means, unlike AP, classifies and clusters a dataset assuming a number of clusters k that is fixed a priori. The default is 100, but if you have fewer than 100,000 contigs and are choosing to use `Binsanity-lc`, you should reduce this number. The idea behind subsetting is to split the contigs into more manageable subunits, but if the number is too high compared to the number of contigs you will end up running out of memory again when AP attempts to cluster. So, for example, if you had 50,000 contigs I would set `-C` to 50; see the sketch after this list.
  - `--checkm_threads` dictates how many threads to dedicate to k-means AND CheckM. The default is 1, and if you forget to set it for your machine, expect the whole process to take a very long time.
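For instance, a run on a hypothetical 50,000-contig assembly might look like this (file names are placeholders):

```
$ Binsanity-lc -f . -l assembly.fa -c assembly.cov.x100.lognorm -C 50
```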
Using the example dataset in the GitHub repository, you can test a `Binsanity-lc` run:

1. `cp` the `igm.fa` and `Infant_gut_assembly.cov.x100.lognorm` files into your working directory.
2. While in this directory, run the command:

```
$ Binsanity-lc -f . -l igm.fa -c Infant_gut_assembly.cov.x100.lognorm -C 5
```
Let this run until it finishes (if you receive an error before it finishes, check that you have installed all of the dependencies). When you start the program, a log will begin printing to the screen so you can track your progress. It should look like this:
```
  ******************************************************
  *******************BinSanity-lc***********************
 |____________________________________________________|
 |                                                    |
 |             Computing Coverage Array               |
 |____________________________________________________|

K-Mean cluster number: 5
Fasta File: igm.fa
Coverage File: Infant_gut_assembly.cov.x100.lognorm
Fasta File: igm.fa
Output Directory: BINSANITY-RESULTS
Contig Cut-Off: 1000
(4189, 11)
  ____________________________________________________
 |                                                    |
 |        Initializing clustering via K-means         |
 |____________________________________________________|

Cluster 0: 1069
Cluster 1: 2563
Cluster 2: 239
Cluster 3: 159
Cluster 4: 159
Total Number of Bins: 5
.............
```
- Once `Binsanity-lc` is finished running, there should be a final directory called `BINSANITY-RESULTS` (or whatever you specified using `-o`). There will be multiple files in this directory, which we break down below:
```
$ ls BINSANITY-RESULTS
BinSanity-Final-bins  BinSanityLC-Binsanity-log.txt  Binsanity-records.tar.gz
```
- `Binsanity-Final-bins` is a directory that will contain all of the final bins generated by `Binsanity-lc`. When running the example dataset you should have 32 final bins output into this directory.
- `[Prefix]-BinsanityLC-log.txt` is a copy of the logfile for the run, which you can scroll through to track the clustering process.
- `Binsanity-records.tar.gz` is a zipped directory that contains the initial k-means clustering results, the unrefined clustering results, and the CheckM results following the initial clustering with Affinity Propagation, classified as either high completion, high redundancy, or strain heterogeneous based on initial CheckM results. The following are other files or folders found in the `Binsanity-records` directory:
  - `[Prefix]-high_completion` is a directory containing bins classified as high completion using results from CheckM following Step 2 clustering
  - `[Prefix]-high_redundancy` is a directory containing bins classified as high redundancy using results from CheckM following Step 2 clustering
  - `[Prefix]-strain_redundancy` is a directory containing bins with high strain heterogeneity using results from CheckM following Step 2 clustering
  - `[Prefix]-KMEAN-BINS` is a directory that contains the subset contigs (output into bins) generated by the k-means clustering step (i.e. results from Step 1 of `Binsanity-lc`)
  - `[Prefix]-4mer-frequencies.txt` contains the k-mer frequency table for the last refined cluster
  - `[Prefix]-GC_count.txt` contains the GC% table for the last refined cluster
  - `[Prefix]-kmerGC.txt` contains the combined coverage profile with both GC% and k-mer frequencies for the last refined cluster
  - `[Prefix]-REFINED-BINS` contains the refined versions of the genomes in `[Prefix]-high_redundancy`
  - `[Prefix]-binsanity-checkm` contains the CheckM results associated with Step 3 of `Binsanity-lc`
  - `[Prefix]-checkm_lineagewf-binsanity.out` contains the tab-delimited output of `checkm qa`
The most important output, though, is `BinSanity-Final-bins`, which will contain the final set of bins.
For the purposes of our analysis, we used CheckM as a means of generally indicating high and low redundancy bins to target with the refinement script. To speed up this process, a script called `checkm_analysis` was written to parse the output of `checkm qa` and separate BinSanity-produced bins into categories of high redundancy, low completion, high completion, and strain redundancy. Currently the thresholds written into the script place bins into categories using the following parameters:
- High completion: greater than 95% complete with < 10% redundancy, greater than 80% complete with < 5% redundancy, or greater than 50% complete with < 2% redundancy
- Low completion: less than 50% complete with < 5% redundancy
- Strain redundancy: greater than 90% complete with > 90% strain heterogeneity
- High redundancy: greater than 80% complete with > 10% redundancy, or greater than 50% complete with > 5% redundancy
The program is built into the `Binsanity-wf` script, but can also be called as a standalone script available in the Utils. It is run as:

`checkm_analysis -checkM [checkm_qa tab delimited output]`
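To produce a tab-delimited table from CheckM for this step, something along these lines should work with CheckM v1 (the marker file and directory names are placeholders; check the help for your CheckM version):

```
# Write checkm qa results as a tab-delimited table, then categorize the bins
$ checkm qa --tab_table -f checkm_qa.txt lineage.ms checkm_output_dir
$ checkm_analysis -checkM checkm_qa.txt
```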
It should be noted that the high and low redundancy values are arbitrary cut-offs, and the generally accepted values for redundancy, completion, and strain heterogeneity are up for debate, so if you use the script it is recommended that you decide what the best cut-off values are for your purposes. For example, if you are looking for members of the [candidate phyla radiation (CPR)](doi:10.1038/nature14486), the completion estimates may place a bin below the accepted thresholds of 'high completion' and 'low completion' in this context.
CheckM is also only one means of evaluating bins. This script is provided to make refinement using BinSanity slightly simpler by quickly moving bins produced during a first pass of BinSanity into smaller categories for further analysis. If you wish to use another means of evaluating genomes and discerning completion/contamination while still utilizing BinSanity, we have provided the standalone script `Binsanity`, which can be followed by genome evaluation and extraction of high quality genomes, and then usage of the script `Binsanity-refine`. Please reach out if there are any questions or comments.