Utilities

Elaina edited this page Dec 12, 2019 · 8 revisions

Other Useful Scripts

Packaged with BinSanity are several other scripts that may help with your bioinformatics workflow.


concat

  • NOT YET TESTED IN PYTHON 3; CURRENTLY COMPATIBLE WITH PYTHON 2.7

  • When conducting a phylogenomic analysis it is becoming increasingly common to build large phylogenetic trees using concatenated alignments of multiple conserved genes. Check out these papers: 1, 2, 3.

    Why use a concatenated alignment in building a phylogenetic tree?

    Concatenated gene trees can improve the resolution of organismal phylogenies. Single-gene trees have several limitations: a gene may contain few informative sites for differentiation, evolutionary rates can vary between lineages, and a single gene could be affected by horizontal gene transfer (HGT). Concatenating many sequence alignments (many papers use ribosomal proteins) can overcome some of these issues. The final phylogeny is then a sort of consensus of the phylogenies for the genes used.

Usage is shown below:

usage: concat -f directory -e Alignment Extension --Prefix file linker -o output

    *****************************************************************************
    *********************************BinSanity***********************************
    **     The `concat` script is used to concatenate multiple sequence        **
    **     alignments for conducting a phylogenomic analysis. Note that you    **
    **     receive an error if there are any duplicate sequence ids in an      **
    **     alignment.                                                          **
    *****************************************************************************

optional arguments:
  -h, --help  show this help message and exit
  -f          Specify directory where alignments are
  -e          Specify the extension for your alignments (must be in Fasta format)
  --Prefix    Specify the prefix that links your alignments (ex: if you have two alignments TOBG_RpL10, TOBG_RpL24, the --Prefix would be TOBG)
  -o          Specify output file
  -N          Specify the minimum number of sequences needed to be included in concatenation

EXAMPLE

If you have three alignment files called alignment_ribosomal1.aln, alignment_ribosomal2.aln, and alignment_ribosomal3.aln with the following contents:

$ head *aln
==> alignment_ribosomal1.aln <==
>Org1
ACGTACTGTGCGTCATGCA
>Org3
AA--ACGTATG--CCC-TA
>Org4
AA-GTCATA------CATT
==> alignment_ribosomal2.aln <==
>Org1
TCGTTC-GCGTCAGG----CA
>Org2
TCGTT--ACGTATG-CCTAGA
>Org3
TCGATCA--CGTATGCCTAGA
==> alignment_ribosomal3.aln <==
>Org1
ACGAAC-GCGA-AG-T-C-CA
>Org2
ACGA-A-GCGTATG--CCTAGA
>Org3
AC-AACAGCC-TATGCCT--A

To concatenate these alignments, requiring that a sequence record be present in at least two of the files to be included in the final output, run the following command:

concat -f . -e .aln --Prefix alignment -o concatenated_riboalign.aln -N 2

The final file would look like:

$ head concatenated_riboalign.aln
>Org1
ACGTACTGTGCGTCATGCA-TCGTTC-GCGTCAGG----CA-ACGAAC-GCGA-AG-T-C-CA
>Org2
--------------------TCGTT--ACGTATG-CCTAGA-ACGA-A-GCGTATG--CCTAGA
>Org3
AA--ACGTATG--CCC-TA-TCGATCA--CGTATGCCTAGA-AC-AACAGCC-TATGCCT--A
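
The idea behind the example above can be sketched in a few lines of Python. This is a minimal illustration, not the `concat` source: the function names `read_fasta` and `concatenate` are hypothetical, and gap-filling a missing sequence to the width of the widest sequence in each alignment is an assumption. It does not attempt to reproduce the exact output formatting of `concat`.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a {sequence id: sequence} dict.

    Raises ValueError on a duplicate id, mirroring the error that the
    help text says `concat` reports for duplicate sequence ids.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            if current in records:
                raise ValueError(f"duplicate sequence id: {current}")
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def concatenate(alignments, min_count=1):
    """Join alignments per sequence id (hypothetical helper).

    Ids found in fewer than `min_count` alignments are dropped, which is
    the role the -N option plays in the example. An id missing from one
    alignment is filled with gap characters for that block (assumption).
    """
    counts = Counter(seq_id for aln in alignments for seq_id in aln)
    keep = sorted(seq_id for seq_id, n in counts.items() if n >= min_count)
    result = {seq_id: "" for seq_id in keep}
    for aln in alignments:
        width = max(len(seq) for seq in aln.values())
        for seq_id in keep:
            block = aln.get(seq_id, "-" * width)
            result[seq_id] += block.ljust(width, "-")
    return result
```

Calling `concatenate([aln1, aln2, aln3], min_count=2)` on three parsed alignments would keep only the ids that appear in at least two of them, which is why Org4 is absent from the output shown above.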


simplify-fasta

  • This script simplifies fasta headers. BinSanity does not handle fasta headers that contain multiple fields well. For example, BinSanity will throw an error if your headers look like:

>AB004394.1 Mus musculus DNA, chromosome 17, clone: BAC 238P22, genomic survey sequence

The spaces and commas will confuse the program, so this script simplifies the names by systematically renaming each contig as >contig_[count].
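
The renaming step can be sketched as follows. This is an illustrative Python snippet, not the `simplify-fasta` source: the function name `simplify_headers` is hypothetical, and returning a lookup from new to original headers is an addition for convenience rather than something the script is documented to do.

```python
def simplify_headers(lines, prefix="contig"):
    """Rename every FASTA header to >prefix_<count> (hypothetical helper).

    Returns the rewritten lines plus a {new id: original header} mapping
    so the original names can be recovered later.
    """
    new_lines = []
    mapping = {}
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            new_id = f"{prefix}_{count}"
            mapping[new_id] = line[1:].strip()
            new_lines.append(f">{new_id}\n")
        else:
            # Sequence lines pass through unchanged.
            new_lines.append(line)
    return new_lines, mapping
```

Keeping a mapping of this kind is worth considering whenever headers are renamed, since the original accession numbers are otherwise lost.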

Usage is shown below:

usage: simplify-fasta -i inputFasta -o outputFasta

        *****************************************************************************
        *********************************BinSanity***********************************
        **    The `simplify-fasta` script is built to simplify fasta headers so as **
        **    not to run into errors when running BinSanity. Simplified headers    **
        **    means that every contig id is only made up of a single word. This    **
        **    will rename your fasta ids as `>contig_1`, `>contig_2`, and so on.   **
        *****************************************************************************

optional arguments:
  -h, --help  show this help message and exit
  -i          Specify the name of the input file
  -o          Specify the name for the output file


transform-coverage-profile

  • This script is used to transform a raw coverage profile for clustering.

Usage is shown below:

usage: transform-coverage-profile -c Output -t transform

        *****************************************************************************
        ********************************BinSanity************************************
        **       The `transform-coverage-profile` script is made to expedite the   **
        **       transformation of a raw coverage profile without re-running       **
        **       Binsanity-profile. The script takes as input the `.cov` file      **
        **       output from Binsanity-profile (or another coverage profile)       **
        *****************************************************************************
optional arguments:
  -h, --help      show this help message and exit
  -c INPUTOUTPUT  Specify the cov
  -t TRANSFORM
                      Indicate what type of data transformation you want in the final file (default is log):
                      scale --> Multiplication by 100 and log transform
                      log --> Log transform
                      X5 --> Multiplication by 5
                      X10 --> Multiplication by 10
                      SQR --> Square root
                      We recommend using a log transformation for initial testing. Other transformations can be useful in cases where there is an extremely low range distribution of coverages and when coverage values are low
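
The transformations listed above can be sketched in Python as shown below. This is a rough illustration, not the `transform-coverage-profile` source: the help text does not state the logarithm base or whether a pseudocount is used, so the base-10 logarithm and the `+ 1` offset (which keeps zero coverage defined) are assumptions, as are the names `TRANSFORMS` and `transform_row`.

```python
import math

# Keys follow the -t options in the help text. The log base (10) and the
# +1 pseudocount are assumptions, not taken from the BinSanity source.
TRANSFORMS = {
    "scale": lambda x: math.log10(x * 100 + 1),  # multiply by 100, then log
    "log": lambda x: math.log10(x + 1),          # log transform
    "X5": lambda x: x * 5,                       # multiply by 5
    "X10": lambda x: x * 10,                     # multiply by 10
    "SQR": lambda x: math.sqrt(x),               # square root
}


def transform_row(coverages, method="log"):
    """Apply one named transformation to a list of raw coverage values."""
    if method not in TRANSFORMS:
        raise ValueError(f"unknown transformation: {method}")
    func = TRANSFORMS[method]
    return [func(value) for value in coverages]
```

For instance, `transform_row(values, "SQR")` would compress high coverage values less aggressively than `"log"`, which may help when the coverage range is narrow.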