seqProcessing

Bioinformatic utilities for nucleotide sequences, written in Bash, Perl, or Python.

Contents

  • count_GC_content.py
    Given a FASTA file in single-line format and a window size (integer), this script calculates the GC percentage of every non-overlapping window along each sequence and writes the result in a BED-like text format.

    Input: Single-line FASTA (a preprocessing script such as PAGIT's fasta2singleLine.pl can produce it)
    Output: BED-like file with one line per window: "chromosome startPos endPos GC%"
    Used for: This script was created specifically for visualization in Circos, whose data tracks use this chromosome/start/end/value layout.

  • extractVariableSites_aln.py
    Finds the positions in a multiFASTA alignment that are constant across every sequence and removes them, so the output keeps only the nucleotides/amino acids that vary in at least one sequence.

    Input: MultiFASTA alignment file, such as the output of any MSA software
    Output: Another multiFASTA file containing only the variable columns
    Used for: Downstream phylogenetic SNP analysis, for example. RAxML, in particular, requires that the input FASTA or PHYLIP alignment contain only variable sites when the ASC_ prefix is used in the -m flag (page 27 of the RAxML manual).
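
The core logic of the two utilities above can be sketched in a few lines of Python. This is a minimal illustration, not the code of the scripts themselves: the function names `gc_windows` and `variable_sites` are hypothetical, the windows are assumed to use 0-based, half-open coordinates with a possibly shorter final window, and gaps or ambiguity codes are treated like any other character.

```python
from typing import Dict, Iterator, Tuple


def gc_windows(name: str, seq: str, window: int) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (name, start, end, GC%) for consecutive non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        # Assumption: 0-based, half-open coordinates; the last window may be shorter.
        yield name, start, start + len(chunk), 100.0 * gc / len(chunk)


def variable_sites(alignment: Dict[str, str]) -> Dict[str, str]:
    """Drop every alignment column in which all sequences share one character."""
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # A column is variable when it contains more than one distinct character.
    # Assumption: gaps ('-') and ambiguity codes count as ordinary characters.
    keep = [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


if __name__ == "__main__":
    for chrom, start, end, gc in gc_windows("chr1", "GGCCAATTGC", 4):
        print(f"{chrom}\t{start}\t{end}\t{gc:.2f}")
    print(variable_sites({"a": "ACGT", "b": "ACTT", "c": "AGGT"}))
```

Whether the real scripts report 0-based or 1-based positions, or handle gaps and `N` characters differently, should be checked against their source before relying on this sketch.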