emmaielle / seqProcessing Public

Bioinformatic utilities for processing nucleotide sequences.

seqProcessing

Bioinformatic utilities for nucleotide sequences, written in Bash, Perl, or Python.

Contents

count_GC_content.py
Given an input FASTA file in single-line format and a window size (int), this script calculates the GC percentage of every non-overlapping window along each sequence and writes the result in a BED-like text format.

Input: Single-line FASTA (you can use a preprocessing script like PAGIT's fasta2singleLine.pl)
Output: BED-like text file with one row per window: "chromosome startPos endPos GC%"
Used for: This script was created specifically for visualization in the Circos software, whose data tracks expect input rows of this form.
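
As a rough illustration of the windowing described above, the following is a minimal sketch and not the code in count_GC_content.py. The function names are hypothetical, and several choices are assumptions made for the example: coordinates are 0-based and half-open, the last window may be shorter than the window size, and every character in a window (including N) counts toward the denominator.

```python
def read_single_line_fasta(lines):
    """Yield (name, sequence) pairs from single-line FASTA text.

    Assumes each header line is followed by exactly one sequence line.
    """
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
        elif name is not None:
            yield name, line
            name = None


def gc_windows(name, sequence, window):
    """Yield (name, start, end, gc_percent) for each non-overlapping window.

    Assumption: 0-based, half-open coordinates; the final window may be
    shorter than `window` when the sequence length is not a multiple of it.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    seq = sequence.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        yield name, start, start + len(chunk), 100.0 * gc / len(chunk)


if __name__ == "__main__":
    example = [">chr1 example", "GGCCATATGC"]
    for seq_name, seq in read_single_line_fasta(example):
        for row in gc_windows(seq_name, seq, 4):
            # One row per window: chromosome, start, end, GC%
            print("{}\t{}\t{}\t{:.2f}".format(*row))
```

With a window of 4, the example sequence is split into GGCC, ATAT, and the shorter final window GC. Whether the real script keeps or drops a trailing partial window is not stated in this README, so treat that behavior as a guess.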
extractVariableSites_aln.py
Finds the positions in a multiFASTA alignment file that are constant across every sequence and removes them, leaving as output only the nucleotides or amino acids at positions that vary in at least one of the sequences.

Input: MultiFASTA alignment file, such as the output of any MSA software
Output: Another multiFASTA file containing only the variable sites
Used for: Downstream phylogenetic SNP analysis, for example. RAxML, in particular, requires that the input FASTA or PHYLIP (.phy) file contain only variable sites when the ASC_ prefix is used in the -m flag (page 27 of this)
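
The column filtering described above can be sketched as follows. This is an illustrative example under stated assumptions, not the code in extractVariableSites_aln.py: the function names are hypothetical, comparison is case-insensitive, and gaps or ambiguity codes are treated as ordinary characters, so a column containing a gap in one sequence counts as variable.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs; sequences may span lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def variable_columns(sequences):
    """Return the 0-based indices of alignment columns with more than one state."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return [
        i
        for i, column in enumerate(zip(*sequences))
        # Assumption: case-insensitive; gaps count as a state like any other.
        if len({c.upper() for c in column}) > 1
    ]


def extract_variable_sites(records):
    """Keep only the variable columns of every (header, sequence) record."""
    cols = variable_columns([seq for _, seq in records])
    return [(header, "".join(seq[i] for i in cols)) for header, seq in records]


if __name__ == "__main__":
    alignment = [">a", "ACGT", ">b", "ACGA", ">c", "AGGT"]
    for header, seq in extract_variable_sites(read_fasta(alignment)):
        print(">" + header)
        print(seq)
```

In the example alignment, columns 0 and 2 are identical in all three sequences, so only columns 1 and 3 survive. Tools differ in how they treat gaps and ambiguous characters when deciding whether a site is variable, so the rule used here may need adjusting for a given downstream program.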
