Bin Evaluation

Elaina edited this page Dec 12, 2019 · 5 revisions

At some point you may want to evaluate the performance of BinSanity or quickly extract high-quality genomes from a large set of bins. BinSanity provides two tools to help you do this.

bin_evaluation

  • bin_evaluation is used as shown in the help message below.
  • Note that this script only works if you have reference bins with the optimal or correct contig assignments.
  • This script has not been verified to run in Python 3.
usage: bin_evaluation -b Putative Genomes -r reference genomes -l suffix of fasta files

    *****************************************************************************
    *********************************BinSanity***********************************
    **   The script `bin_evaluation` uses sklearn metrics                      **
    **   (http://scikit-learn.org/stable/modules/classes.html) to calculate    **
    **   the adjusted rand index, homogeneity, completeness, and v-measure to  **
    **   evaluate clustering results compared to a of known clusters. See the  **
    **   BinSanity paper ( https://doi.org/10.7717/peerj.3035) for a full      **
    **   description of how these are used.                                    **
    **                                                                         **
    **   The `bin_evaluation` script can be used to compare the statistical    **
    **   accuracy of multiple clustering methods on a set of contigs with      **
    **   known identity. To use it you must have two directories. One          **
    **   containing genome with the expected cluster outcomes (identified with **
    **   `-r`), and the other containing genomes generated with clustering     **
    **   method you wish to evaluate (identified with `-b`).                   **
    *****************************************************************************

optional arguments:
  -h, --help  show this help message and exit
  -b          Specify the directory containing Putative genomes
  -r          Specify directory containing reference genomes
  -l          specify suffix of bins e.g .fa, .fna, .fasta, etc.
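
Conceptually, comparing two directories of bins comes down to giving every contig two labels: the reference genome it belongs to and the putative bin it was placed in. The sketch below shows one way this could be done. It is not the code used by `bin_evaluation`; the function names `contig_to_bin` and `paired_labels` are hypothetical, and restricting the comparison to contigs present in both directories is an assumption made for illustration.

```python
import os


def contig_to_bin(directory, suffix):
    """Map each contig ID (first word of a FASTA header) to its bin file name."""
    assignments = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(suffix):
            continue  # ignore files that do not carry the bin suffix (-l)
        with open(os.path.join(directory, name)) as handle:
            for line in handle:
                if line.startswith(">"):
                    fields = line[1:].split()
                    if fields:
                        assignments[fields[0]] = name
    return assignments


def paired_labels(reference_dir, putative_dir, suffix):
    """Return contig IDs plus their reference and putative labels, in the same order."""
    reference = contig_to_bin(reference_dir, suffix)
    putative = contig_to_bin(putative_dir, suffix)
    # Assumption: only contigs found in both directories are compared.
    shared = sorted(set(reference) & set(putative))
    reference_labels = [reference[contig] for contig in shared]
    putative_labels = [putative[contig] for contig in shared]
    return shared, reference_labels, putative_labels
```

The two label lists returned by `paired_labels` are in the form that clustering metrics expect: one "true" label and one "predicted" label per contig.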

So what does the bin_evaluation script really tell you?

  1. Precision

    • Precision (reported by scikit-learn as homogeneity) measures whether each cluster contains only members of a single class; a score of 1 means every bin contains contigs from only one source. In terms of your genomes, this tells you whether the contigs in each bin all come from the same reference genome. Precision can be very high even when your genomes are very incomplete: if a single genome is split across 5 bins, but each of those 5 bins contains contigs from only that one source, precision is still 1.
  2. Recall

    • Recall (reported by scikit-learn as completeness) measures whether all members of a class are assigned to the same bin, that is, whether contigs from the same reference genome end up in the same cluster. In contrast to precision, recall can be very high for a highly contaminated genome: if two reference genomes are clustered into one bin, and all the contigs from each reference are in that bin, recall is 1 because all the contigs from each source sit in the same bin.
  3. V-measure

    • The V-measure is the harmonic mean of precision and recall, so it is high only when both are high. This makes it a more effective way to evaluate clustering accuracy than precision or recall alone.
  4. Adjusted-Rand Index (ARI)

    • Like the V-measure, the ARI measures the similarity between predicted and true cluster labels. It is additionally corrected for chance: the agreement expected from random label assignments is subtracted, so random clusterings score close to 0 and a perfect match scores 1.

    See the paper by Rosenberg and Hirschberg (2007) for more information on recall, precision, and the V-measure. See the paper by Hubert and Arabie (1985) for more information on the Adjusted Rand Index.
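
To make these four scores concrete, the sketch below recomputes them in plain Python from two label lists (one reference label and one bin label per contig). It is an illustration of the definitions, not the code in `bin_evaluation`, which relies on scikit-learn; the corresponding scikit-learn functions are `homogeneity_score`, `completeness_score`, `v_measure_score`, and `adjusted_rand_score`. The helper names used here are hypothetical.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    """Shannon entropy of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def precision_recall_v(reference, predicted):
    """Precision (homogeneity), recall (completeness), and their harmonic mean."""
    h_ref, h_pred = entropy(reference), entropy(predicted)
    precision = 1.0 if h_ref == 0 else 1.0 - conditional_entropy(reference, predicted) / h_ref
    recall = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, reference) / h_pred
    total = precision + recall
    v_measure = 0.0 if total == 0 else 2 * precision * recall / total
    return precision, recall, v_measure


def adjusted_rand_index(reference, predicted):
    """Rand index on contig pairs, corrected for the agreement expected by chance."""
    n = len(reference)
    same_both = sum(comb(c, 2) for c in Counter(zip(reference, predicted)).values())
    same_ref = sum(comb(c, 2) for c in Counter(reference).values())
    same_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = same_ref * same_pred / comb(n, 2)
    maximum = (same_ref + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (same_both - expected) / (maximum - expected)


reference = ["A"] * 4 + ["B"] * 4

# Genome A split across two pure bins: precision stays at 1, recall drops.
split = [0, 0, 1, 1, 2, 2, 2, 2]
print(precision_recall_v(reference, split), adjusted_rand_index(reference, split))

# Genomes A and B merged into one bin: recall is 1, precision drops to 0.
merged = [0] * 8
print(precision_recall_v(reference, merged), adjusted_rand_index(reference, merged))
```

The two toy cases mirror the split-genome and merged-genome scenarios described in the list above: each one scores perfectly on one of precision or recall while the V-measure and ARI both fall below 1.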


checkm_analysis

  • The help message for checkm_analysis is shown below.
usage: checkm_analysis -checkM checkm_qa -f fasta suffix [.fa,.fasta,.fna]

    *****************************************************************************
    *********************************BinSanity***********************************
    **   The script `checkm_analysis` is a simple parser that extracts the     **
    **   completion, contamination, and strain heterogeneity values from the   **
    **   output of `checkm qa`. Then it splits the corresponding genomes into  **
    **   categories of high completion, low completion, and high redundancy    **
    **   prior to moving the bins into appropriate subfolders.                 **
    *****************************************************************************

optional arguments:
  -h, --help       show this help message and exit
  -checkM INPUTQA  Specify a checkM file
  -f INPUTFA       Identify what your suffix for fasta file is [default: .fna]
  • checkm_analysis takes the output of checkm qa (generated with the default output parameters) and parses out the completion, contamination (redundancy), and strain heterogeneity values. These values are used to classify genomes into the four categories defined below. The thresholds currently written into the script are:

    • High completion: greater than 95% complete with less than 10% redundancy, greater than 80% complete with less than 5% redundancy, or greater than 50% complete with less than 2% redundancy
    • Low completion: less than 50% complete with less than 2% redundancy
    • Strain redundancy: greater than 90% complete, with greater than 10% redundancy and greater than 90% strain heterogeneity
    • High redundancy: any bin not fitting in the categories above is considered high redundancy.
  • To adjust these thresholds you will need to edit them directly in the checkm_analysis source code; they are not exposed as command-line options.
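
As a reading aid, the category rules listed above can be written as a single function. This is a minimal sketch of the stated thresholds, not an excerpt from `checkm_analysis`; the function name `classify_bin` and the category strings are hypothetical. It assumes all three values are percentages and applies the comparisons exactly as written above, so a bin sitting exactly on a boundary (for example 50% complete) falls through to high redundancy.

```python
def classify_bin(completion, redundancy, strain_heterogeneity):
    """Assign a bin to one of the four categories using the thresholds listed above."""
    high_completion = (
        (completion > 95 and redundancy < 10)
        or (completion > 80 and redundancy < 5)
        or (completion > 50 and redundancy < 2)
    )
    if high_completion:
        return "high_completion"
    if completion < 50 and redundancy < 2:
        return "low_completion"
    if completion > 90 and redundancy > 10 and strain_heterogeneity > 90:
        return "strain_redundancy"
    # Everything else, including values exactly on a threshold.
    return "high_redundancy"


print(classify_bin(97.0, 8.0, 0.0))    # high_completion
print(classify_bin(30.0, 1.0, 0.0))    # low_completion
print(classify_bin(95.0, 20.0, 95.0))  # strain_redundancy
print(classify_bin(70.0, 15.0, 10.0))  # high_redundancy
```

Keeping the thresholds in one function like this makes it easy to see which numbers to change if you want stricter or looser categories.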