
Script Usage

Binsanity-profile

Before running Binsanity-profile, you should check your fasta headers. Headers that are purely numerical, like the example below, can cause problems:

>1
AGTCGATGCAGCTAGCTGATGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
>2
ACGTGATGCTAGCTAGCTGACTAGCTGCGTTTTACCTATCTATCATCATCTC
....

Headers should instead be simplified, meaning each header is a single word with no spaces used as a descriptor, as follows:

>Contig_1
AGTCGATGCAGCTAGCTGATGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
>Contig_2
ACGTGATGCTAGCTAGCTGACTAGCTGCGTTTTACCTATCTATCATCATCTC
....

If your headers are purely numerical or otherwise not simplified, the Binsanity-profile step or subsequent Binsanity clustering may fail. In either case, use the simplify-fasta script provided with Binsanity to rewrite the headers.

usage: simplify-fasta -i inputFasta -o outputFasta

optional arguments:
  -h, --help     show this help message and exit
  -i INPUTFASTA  Specify the name of the input file
  -o INPUTOUT    Specify the name for the output file
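
For example, to produce a copy of an assembly with simplified headers (the file names here are just illustrations):

$ simplify-fasta -i assembly.fa -o assembly_simplified.fa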

BINSANITY-PROFILE USAGE

usage: Binsanity-profile -i fasta_file -s {sam,bam}_file -c output_file

    ***********************************************************************
    ******************************BinSanity********************************
    **                                                                   **
    **  Binsanity-profile is used to generate coverage files for         **
    **  input to BinSanity. This uses Featurecounts to generate a        **
    **  a coverage profile and transforms data for input into Binsanity, **
    **  Binsanity-refine, and Binsanity-wf                               **
    **                                                                   **
    ***********************************************************************
    ***********************************************************************

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTFASTA         Specify fasta file being profiled
  -s INPUTMAPLOC
                            identify location of BAM files
                            BAM files should be indexed and sorted
  -c OUTCOV
                            Identify name of output file for coverage information
  --transform TRANSFORM

                            Indicate what type of data transformation you want in the final file [Default:log]:
                            scale --> Scaled by multiplying by 100 and log transforming
                            log --> Log transform
                            None --> Raw Coverage Values
                            X5 --> Multiplication by 5
                            X10 --> Multiplication by 10
                            X100 --> Multiplication by 100
                            SQR --> Square root
                            We recommend using a scaled log transformation for initial testing.
                            Other transformations can be useful on a case by case basis
  -T THREADS            Specify Number of Threads For Feature Counts [Default: 1]
  -o OUTDIRECTORY       Specify directory for output files to be deposited [Default: Working Directory]
  --version             show program's version number and exit

First you need to generate the input files for Binsanity (e.g. the coverage profile). The script `Binsanity-profile` is provided for this: it generates a `coverage` file containing the average coverage of each contig, calculated from `.bam` files via featureCounts. In our tests we used Bowtie2 to produce a `.sam` file and then converted it to a `.bam` file using SamTools. To maintain consistency we used the `.cov` suffix for all files output from this script. The script can also take a list of contig ids to be used in generating the profile; this reduces the size of the file by excluding contigs that won't be clustered (for example, if you wanted to cluster only contigs greater than 1000 bp, you would supply a list of the contig ids longer than 1000 bp).
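
As a minimal sketch of that mapping step (index name, read file names, output directory, and thread count are illustrative and not part of BinSanity), the BAM files should end up sorted and indexed as Binsanity-profile expects:

$ bowtie2-build assembly.fa assembly_index
$ bowtie2 -x assembly_index -1 reads_R1.fastq -2 reads_R2.fastq -p 4 -S sample.sam
$ mkdir -p BAM_files
$ samtools view -bS sample.sam | samtools sort -o BAM_files/sample.bam -
$ samtools index BAM_files/sample.bam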

  • There are multiple transformation options identified by the flag --transform. We recommend the scale option.
    • scale --> Scaled by multiplying by 100 and log transformed
    • log --> Log transform
    • None --> Raw Coverage Values
    • X5 --> Multiplication by 5
    • X10 --> Multiplication by 10
    • X100 --> Multiplication by 100
    • SQR --> Square root

Other transformations can be useful in cases where the distribution of coverages has an extremely low range and when coverage values are low.
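
To make the scale option concrete, here is a rough awk approximation over a whitespace-delimited raw .cov file. The log base (log10 here) and the +1 offset guarding against zero coverage are assumptions for illustration, not necessarily what Binsanity-profile computes internally:

$ awk '{printf "%s", $1; for (i = 2; i <= NF; i++) printf "\t%.4f", log($i * 100 + 1) / log(10); print ""}' assembly.cov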

$ Binsanity-profile -i assembly.fa -s directory/to/BAM/files
  • Note that it will read in ALL BAM files in the indicated directory and try to incorporate them into the coverage profile
  • This script will output two files: the raw .cov file and the transformed .cov.x100.lognorm coverage profile.
$ less assembly.cov.x100.lognorm

contig-1  1.2 0.4
contig-2  1.0 0.4
contig-3  1.3 4.2
contig-4  1.1 5.1
....

EXAMPLE

Let's say you have a directory system that looks like this:

$ ls
igmBAM_Files igm.fa 

Where igmBAM_Files is a directory containing the mapping files for the contigs contained in igm.fa. To run Binsanity-profile you would run:

Binsanity-profile -i igm.fa -s igmBAM_Files -c igm_coverage 

This will output two files, igm_coverage.cov and igm_coverage.cov.x100.lognorm into the working directory. The file you will input into the next step will be the transformed file igm_coverage.cov.x100.lognorm.


BinSanity-wf

Once the coverage profile is generated for your fasta file you can run either Binsanity-wf, Binsanity, Binsanity-refine, or Binsanity-lc. Here we will explain the usage of Binsanity-wf.

  • Binsanity-wf is a workflow designed to implement the biphasic approach of BinSanity, whereby Binsanity (which uses only coverage for clustering) and Binsanity-refine (which incorporates kmers and %GC into clustering) are integrated into a single workflow that optimizes the final cluster output. The workflow does the following:
    1. Uses the Binsanity script to cluster contigs using ONLY coverage
    2. Uses CheckM to estimate completeness and redundancy prior to refinement
    3. Uses Binsanity-refine to refine bins from Step 1 that have high redundancy and low completion (defined in the section below)
    4. Creates a final set of refined bins

Please take note of the default preferences for the initial binning and refinement steps (-3 and -25 respectively). In many cases these work well, but they should be adjusted relative to the sample type and the expected level of diversity.
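
For instance, a hypothetical invocation that lowers the initial preference for a low-diversity sample might look like this (the file names follow the earlier profile example and are purely illustrative):

$ Binsanity-wf -f . -l igm.fa -c igm_coverage.cov.x100.lognorm -p -10 --refine-preference -25 -o IGM-BinsanityWF

The full usage is shown below: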

usage: Binsanity-wf -f [/path/to/fasta] -l [FastaFile] -c [CoverageFile] -o [OutputDirectory]

    ************************************************************************************************
    **************************************BinSanity*************************************************
    **  Binsanity-wf is a workflow script that runs Binsanity and Binsanity-refine sequentially.  **
    **  The following is included in the workflow:                                                **
    **  STEP 1. Run Binsanity                                                                     **
    **  STEP 2: Run CheckM to estimate completeness for Refinement                                **
    **  STEP 3: Run Binsanity-refine                                                              **
    **  STEP 4: Create Final BinSanity Clusters                                                   **
    **                                                                                            **
    ************************************************************************************************


optional arguments:
  -h, --help            show this help message and exit
  -c INPUTCOVFILE
                            Specify a Transformed Coverage File
                            e.g Log transformed

  -f FastaLocation      Specify directory containing your contigs
  -p PREFERENCE         Specify a preference [Default: -3]
                            Note: decreasing the preference leads to more lumping,
                            increasing will lead to more splitting. If your range
                            of coverages is low you will want to decrease the
                            preference; if you have 10 or fewer replicates, increasing
                            the preference could benefit you.
  -m MAXITER
                            Specify a max number of iterations [Default: 4000]
  -v ConvergenceIteration
                        Specify the convergence iteration number [Default: 400]
                            e.g Number of iterations with no change in the number
                            of estimated clusters that stops the convergence.
  -d DampeningFactor    Specify a damping factor between 0.5 and 1, [Default: 0.95]
  -l FastaFile          Specify the fasta file containing contigs you want to cluster
  -x SizeThreshold      Specify the contig size cut-off [Default: 1000 bp]
  -o OutputDirectory    Give a name to the directory BinSanity results will be output in
                            [Default: 'BINSANITY-RESULTS']
  --threads THREADS     Indicate how many threads you want dedicated to the subprocess CheckM. [Default=1]
  --kmer KMER           Indicate a number for the kmer calculation [Default: 4]
  --Prefix PREFIX       Specify a prefix to append to the start of all files generated during Binsanity
  --refine-preference INPUTREFINEDPREF
                        Specify a preference for refinement. [Default: -25]
  --binPrefix BinPrefix 
                        Specify what prefix you want appended to final Bins {optional}
  --version             show program's version number and exit

EXAMPLE

Following the example schema above if your working directory has the following files:

$ ls

igmBAM_Files  igm.fa  igm_contig_ids.txt  igm_coverage.cov  igm_coverage.cov.x100.lognorm

You would run the following command while in this directory to use Binsanity-wf:

$ Binsanity-wf -f . -l igm.fa -c igm_coverage.cov.x100.lognorm -o IGM-BinsanityWF

NOTE: there have been reports that if you do NOT provide the FULL path to the coverage file, Binsanity-wf can have issues reading and opening it.
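
A simple workaround is to expand the path yourself, for example:

$ Binsanity-wf -f . -l igm.fa -c "$(pwd)/igm_coverage.cov.x100.lognorm" -o IGM-BinsanityWF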

The final output will be deposited in the directory specified by -o, or by default in a directory called BINSANITY-RESULTS. In this case we chose to output into a directory called IGM-BinsanityWF, which contains the following after running Binsanity-wf:

$ ls  IGM-BinsanityWF
BinSanity-Final-bins BinSanityWf.log Binsanity-records.tar.gz

Now we can break down each of these files to better understand the output.

  • BinSanity-Final-bins is a directory that will contain all of the final bins generated by Binsanity-wf. When running the example dataset you should have 33 final bins output into this directory.

  • BinSanityWf.log is a copy of the logfile for the run which you can scroll through to track the clustering process.

  • Binsanity-records.tar.gz is a gzipped tar archive that contains all the intermediate results of clustering, along with the CheckM results following the initial clustering with Affinity Propagation, classified as high completion, high redundancy, or strain heterogeneous. Here is a brief description of some of the files you will find in there:

    • [Prefix]_4mer_frequencies.txt contains the k-mer frequency table for the last refined cluster

    • [Prefix]_GC_count.txt contains the GC% table for the last refined cluster

    • [Prefix]_kmerGC.txt contains the combined coverage profile with both GC% and k-mer frequencies for the last refined cluster.

    • [Prefix]_checkm_lineagewf-results.txt contains the checkm qa results following Step 1 clustering with Binsanity.

    • [Prefix]_binsanity-checkm is the output file for CheckM run at Step 2.

    • high_completion contains high completion genomes generated during Step 1 clustering

    • high_redundancy contains high contamination genomes generated during Step 1 clustering

    • strain_redundancy contains genomes with high strain heterogeneity from Step 1 clustering

    • REFINED-BINS contains the refined genomes from Step 3, where the high_redundancy genomes are refined.

BinSanity-Final-bins contains the final clusters
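
If you want to see what is inside the records archive without unpacking it, standard tar tools work, for example:

$ tar -tzf Binsanity-records.tar.gz | head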


Binsanity-lc

  • When running BinSanity, if you encounter an error similar to S += ((np.finfo(np.double).eps * S + np.finfo(np.double).tiny * 100) *MemoryError, you may be hitting a memory wall when attempting to cluster. Affinity propagation (AP) has time and memory requirements that grow quadratically with the number of inputs, since it works on an N x N similarity matrix. In essence this means the difference between clustering 90,000 contigs and 120,000 contigs can be hundreds of gigabytes of memory. The most common error you'll get is numpy crashing due to insufficient memory. As of this update there are two ways I suggest dealing with this error.
    1. You could look at your assembly statistics. If you are running BinSanity with defaults, the size cut-off is 1000 bp. Looking at your assembly, how many contigs are shorter than 5 kbp? Although we set a default cut-off of 1000 bp, small contigs like this can have slightly divergent coverage and compositional profiles and may cause difficulty while clustering (a quick way to count contigs above a given cut-off is sketched after the parameter list below). While running the Tara Oceans dataset in our lab, we found that, based on our assemblies, ~100,000 contigs was what we were left with when we clustered only contigs greater than 5 kbp. This cut-off worked well for our machine, a Dell PowerEdge R920 with 1 TB of available RAM and an Intel Xeon 2.3 GHz processor. When we hit ~120,000 contigs, though, the program typically crashed due to its memory consumption.

    2. The second option is to try Binsanity-lc. Binsanity-lc is a workflow similar to Binsanity-wf, but prior to clustering with AP it first subsets contigs using coverage and K-Means clustering. K-Means is more computationally feasible for assemblies with more than 100,000 contigs. After this subsetting, Binsanity-lc follows the same workflow as Binsanity-wf. The parameters for Binsanity-lc are shown below:

usage: Binsanity-lc -f [/path/to/fasta] -l [fastafile] -c [coverage file] -o [output directory]

    ************************************************************************************************
    **************************************BinSanity*************************************************
    **    Binsanity-lc is a workflow script that will subset assemblies larger than 100,000       **
    **    contigs using coverage prior to running Binsanity and Binsanity-refine sequentially.    **
    **    The following is included in the workflow:                                              **
    **       STEP 1: Use Coverage to Subsample contigs with K-means Clustering                    **
    **       STEP 2: Run Binsanity                                                                **
    **       STEP 3: Run CheckM to estimate completeness for Refinement                           **
    **       STEP 4: Run Binsanity-refine                                                         **
    **       STEP 5: Create Final BinSanity Clusters                                              **
    **                                                                                            **
    ************************************************************************************************


optional arguments:
  -h, --help            show this help message and exit
  -c CoverageFile       Specify a Coverage File

  -f FastaLocation      Specify directory containing Fasta File to be clustered

  -p Preference         Specify a preference [Default: -3]
                            Note: decreasing the preference leads to more lumping,
                            increasing will lead to more splitting. If your range
                            of coverages is low you will want to decrease the
                            preference; if you have 10 or fewer replicates, increasing
                            the preference could benefit you.

  -m MaximumIterations  Specify a max number of iterations [Default: 4000]

  -v ConvergenceIterations
                        Specify the convergence iteration number [Default:400]
                            e.g Number of iterations with no change in the number
                            of estimated clusters that stops the convergence.

  -d DampeningFactor    Specify a damping factor between 0.5 and 1 [Default: 0.95]

  -l FastaFile Name     Specify the fasta file containing contigs you want to cluster

  -x SizeCutOff         Specify the contig size cut-off [Default:1000 bp]

  -o Output Directory   Give a name to the directory BinSanity results will be output in
                            [Default:'BINSANITY-RESULTS']

  --checkm_threads Threads
                        Indicate how many threads you want dedicated to the subprocess CheckM [Default: 1]

  --kmer Kmer           Indicate a number for the kmer calculation [Default: 4]

  --refine-preference   Specify a preference for refinement [Default: -25]

  -C ClusterNumber      Indicate a number of initial clusters for kmean [Default:100]

  --Prefix Prefix       Specify a prefix to append to the start of all directories generated during Binsanity

  --version             show program's version number and exit

  • For Binsanity-lc the key parameters that need to be set are -C and --checkm_threads.
    • -C sets the cluster number for K-Means. K-Means, unlike AP, classifies and clusters a dataset assuming a number of clusters k that is fixed a priori. The default is 100, but if you have fewer than 100,000 contigs and are choosing to use Binsanity-lc, you should reduce this number. The idea behind subsetting is to split the contigs into more manageable subunits, but if the number is too high relative to the number of contigs, you will end up running out of memory again when AP attempts to cluster. So, for example, if you had 50,000 contigs I would set -C to 50 (see the sketch after this list).
    • --checkm_threads dictates how many threads to dedicate to K-Means AND CheckM. The default is 1, and if you forget to set it for your machine, expect the whole process to take a very long time.
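
As a rough sketch of sizing -C (the file name and the 5 kbp cut-off are just illustrations), you can count the contigs that pass your length cut-off with awk:

$ awk '/^>/ {if (len >= 5000) n++; len = 0; next} {len += length($0)} END {if (len >= 5000) n++; print n + 0}' igm.fa

If this prints roughly 50,000, setting -C to 50 follows the one-cluster-per-1,000-contigs rule of thumb above.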

EXAMPLE OF BINSANITY-LC

Using the example dataset in the GitHub repository, you can test a Binsanity-lc run.

  1. Copy the igm.fa and Infant_gut_assembly.cov.x100.lognorm files into your working directory.

  2. While in this directory run the command Binsanity-lc -f . -l igm.fa -c Infant_gut_assembly.cov.x100.lognorm -C 5

Let this run until it finishes (if you receive an error before it finishes, check that you have installed all of the dependencies). When you start the program, a log will start printing to the screen so you can track your progress. It should look like this:


        ******************************************************
        *******************BinSanity-lc***********************
        |____________________________________________________|
        |                                                    |
        |             Computing Coverage Array               |
        |____________________________________________________|

          K-Mean cluster number: 5
          Fasta File: igm.fa
          Coverage File: Infant_gut_assembly.cov.x100.lognorm
          Fasta File: igm.fa
          Output Directory: BINSANITY-RESULTS
          Contig Cut-Off: 1000
          (4189, 11)

        ____________________________________________________
       |                                                    |
       |        Initializing clustering via K-means         |
       |____________________________________________________|

          Cluster 0: 1069
          Cluster 1: 2563
          Cluster 2: 239
          Cluster 3: 159
          Cluster 4: 159
          Total Number of Bins: 5
.............
  3. Once Binsanity-lc is finished running there should be a final directory called BINSANITY-RESULTS (or whatever you specified using -o). There will be multiple files in this directory, which we break down below:
$ ls BINSANITY-RESULTS
BinSanity-Final-bins              BinSanityLC-Binsanity-log.txt  Binsanity-records.tar.gz
  • Binsanity-Final-bins is a directory that will contain all of the final bins generated from Binsanity-lc. When running the example dataset you should have 32 final bins output into this directory.

  • [Prefix]-BinsanityLC-log.txt is a copy of the logfile for the run which you can scroll through to track the clustering process.

  • Binsanity-records.tar.gz is a gzipped tar archive that contains the initial K-Means clustering results, the unrefined clustering results, and the CheckM results following the initial clustering with Affinity Propagation, classified as high completion, high redundancy, or strain heterogeneous based on the initial CheckM results. The following are other files or folders found in the Binsanity-records directory.

    • [Prefix]-high_completion is a directory containing bins classified as high completion using results from CheckM following Step 2 clustering

    • [Prefix]-high_redundancy is a directory containing bins classified as high redundancy using results from CheckM following Step 2 clustering

    • [Prefix]-strain_redundancy is a directory containing bins having high strain heterogeneity using results from CheckM following Step 2 clustering

    • [Prefix]-KMEAN-BINS is a directory that contains the subsetted contigs (output into bins) generated by the K-Means clustering step (i.e. the results from Step 1 of Binsanity-lc)

    • [Prefix]-4mer-frequencies.txt contains the k-mer frequency table for the last refined cluster

    • [Prefix]-GC_count.txt contains the GC% table for the last refined cluster

    • [Prefix]-kmerGC.txt contains the combined coverage profile with both GC% and k-mer frequencies for the last refined cluster.

    • [Prefix]-REFINED-BINS contains the refined versions of the genomes in [Prefix]-high_redundancy

    • [Prefix]-binsanity-checkm contains the CheckM results associated with Step 3. of Binsanity-lc.

    • [Prefix]-checkm_lineagewf-binsanity.out contains the tab-delimited output of checkm qa

The most important output, though, is the BinSanity-Final-bins folder, which will contain the final set of bins.


Setting Completion and Redundancy Estimates for refinement

For the purposes of our analysis, we used CheckM as a general means of flagging high and low redundancy bins for the refinement script. To speed up this process, a script called checkm_analysis was written to parse the output of checkm qa and separate Binsanity-produced bins into categories of high redundancy, low completion, high completion, and strain redundancy.

Currently the thresholds written into the script place bins into categories using the following parameters:

  • High completion: greater than 95% complete with < 10% redundancy, greater than 80% complete with < 5% redundancy, or greater than 50% complete with < 2% redundancy
  • Low completion: less than 50% complete with < 5% redundancy
  • Strain redundancy: greater than 90% complete with > 90% strain heterogeneity
  • High redundancy: greater than 80% complete with > 10% redundancy, or greater than 50% complete with > 5% redundancy
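
As a rough illustration only (not the script's actual implementation), the thresholds above could be expressed as an awk filter over a tab-delimited table. The column positions for completeness, redundancy, and strain heterogeneity, the header row, and the precedence between overlapping categories are all assumptions here; adjust them to your checkm qa output format:

$ awk -F'\t' 'NR > 1 {
      bin = $1; comp = $2; red = $3; het = $4
      if (comp > 90 && het > 90) category = "strain_redundancy"
      else if ((comp > 95 && red < 10) || (comp > 80 && red < 5) || (comp > 50 && red < 2)) category = "high_completion"
      else if ((comp > 80 && red > 10) || (comp > 50 && red > 5)) category = "high_redundancy"
      else if (comp < 50 && red < 5) category = "low_completion"
      else category = "unclassified"
      print bin "\t" category
  }' checkm_qa_results.txt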

The program is built into the `Binsanity-wf` script, but it can also be called as a stand-alone script available in the Utils. It is run as:

`checkm_analysis -checkM [checkm_qa tab delimited output]`

It should be noted that the high and low redundancy values are arbitrary cut-offs, and the generally accepted values for redundancy, completion, and strain heterogeneity are up for debate, so if you use the script it is recommended that you decide what the best cut-off values are for your purposes. For example, if you are looking for members of the [candidate phyla radiation (CPR)](doi:10.1038/nature14486), the completion estimates may place a bin below the thresholds used here for 'high completion' and 'low completion'.

CheckM is also only one means of evaluating bins. This script is provided to make refinement using BinSanity slightly simpler by quickly moving bins produced during a first pass of BinSanity into smaller categories for further analysis. If you wish to use another means of evaluating genomes and discerning completion/contamination while still utilizing Binsanity, we have provided the stand-alone script Binsanity, which can then be followed by genome evaluation and extraction of high quality genomes, followed by usage of the script Binsanity-refine. Please reach out if you have any questions regarding this.