GenoTools

Hello! In this repo I collect some of the functions, tools, scripts and handy commands that I use when working on genomic data.

Calculate SFS across windows in the genome

This script estimates the site frequency spectrum (SFS) for pre-defined windows along the genome. Required inputs are a BED file for each chromosome listing the windows of interest in the format chromosome\tstart\tend, the reference FASTA file, the ancestral FASTA file (which can be replaced with the reference file for a folded SFS), and a list of BAM files for the individuals to use.

The output of this script can then be used as input for David Marques' Python script to calculate Dxy from the SFS.
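For readers unsure what "folded" means here: when the ancestral state is unknown, counts of derived alleles cannot be told apart from counts of ancestral alleles, so the spectrum is collapsed onto minor-allele counts. The sketch below is a generic illustration of that folding step, not code from this repo; the function name `fold_sfs` and the plain-list input are assumptions made for the example.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS into a minor-allele SFS.

    `sfs` is a list with one entry per derived-allele count 0..n,
    where n is the number of sampled chromosomes (hypothetical input
    layout chosen for this sketch).
    """
    n = len(sfs) - 1
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        # Classes i and n - i are indistinguishable without polarisation,
        # so they are summed; the middle class (i == n - i) is kept once.
        folded.append(sfs[i] if i == j else sfs[i] + sfs[j])
    return folded
```

For example, an unfolded spectrum with five classes (n = 4) folds into three classes, for minor-allele counts 0, 1 and 2.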

SAM/BAM scripts

The only scripts in here for now are an awk script and a Python script that calculate the mean, standard deviation, non-zero mean, median, non-zero count, and genome coverage proportions from the output of samtools depth. I wrote them for use with the lcWGS tutorial by the Therkildsen lab, to assess read depth per sample before estimating genotype likelihoods. The input file contains one column, with one integer per base pair, giving the read depth at that position in a BAM/SAM alignment file. The output is a table with the filename, followed by the metrics mentioned above. DISCLAIMER: These scripts have only been tested on a handful of my own datasets, so please test them thoroughly on a small file (< 10,000,000 rows) first.
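To make the metrics concrete, here is a minimal Python sketch of how they could be computed. It is not the script from this repo. The function names are hypothetical, the population standard deviation is an arbitrary choice, and "coverage proportion" is assumed to mean the fraction of positions with depth above zero. The reader takes the last field of each line, so it accepts both the one-column input described above and raw three-column samtools depth output.

```python
import statistics


def read_depths(path):
    """Read per-base depths, taking the last whitespace-separated field per line."""
    with open(path) as handle:
        return [int(line.split()[-1]) for line in handle if line.strip()]


def depth_summary(depths):
    """Summarise a list of per-base read depths (one integer per position)."""
    if not depths:
        raise ValueError("no depth values given")
    nonzero = [d for d in depths if d > 0]
    return {
        "mean": statistics.fmean(depths),
        # Population standard deviation (assumption; the repo scripts may differ).
        "stdev": statistics.pstdev(depths),
        "nonzero_mean": statistics.fmean(nonzero) if nonzero else 0.0,
        "median": statistics.median(depths),
        "nonzero_count": len(nonzero),
        # Fraction of positions covered by at least one read (assumption).
        "covered_proportion": len(nonzero) / len(depths),
    }
```

Calling `depth_summary(read_depths("sample.depth"))`, where `sample.depth` is a placeholder filename, returns the metrics as a dictionary. The whole file is held in memory, which is another reason to start with a small input.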

VCF processing scripts

The script 0_AddNamesTo012.sh takes as input the 3 files produced by vcftools when using the --012 output format (i.e., filename.012, filename.012.indv and filename.012.pos) and merges them into one file. The three input files contain:

  • filename.012: genotype information in the 012 format, where genotypes are coded as 0 (homozygous reference), 1 (heterozygous) or 2 (homozygous alternate).
  • filename.012.indv: sample IDs of the genotyped individuals
  • filename.012.pos: locus ID and SNP position within the locus, separated by a tab

The output file, which has the naming format filename.012.all, contains information from all three files, with individuals as rows and loci as columns.

Assuming both the script and the 012 files are in the current working directory, the script can be run with

./0_AddNamesTo012.sh filename

The only argument you need to provide is the value of filename, which is the prefix shared by all three files.
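For anyone who wants to see the merge logic spelled out, the following Python sketch shows one way to combine the three files. It is an illustration only and does not reproduce 0_AddNamesTo012.sh. The function name `merge_012`, the `chrom_pos` column labels, the `sample` header and the `.012.merged` output suffix are all assumptions made for the example (the suffix differs on purpose, so the sketch never overwrites the script's filename.012.all).

```python
def merge_012(prefix):
    """Merge prefix.012, prefix.012.indv and prefix.012.pos into one table."""
    with open(f"{prefix}.012.indv") as handle:
        samples = [line.strip() for line in handle if line.strip()]
    with open(f"{prefix}.012.pos") as handle:
        # Join locus ID and position into one column label, e.g. chr1_100.
        loci = ["_".join(line.split()) for line in handle if line.strip()]
    with open(f"{prefix}.012") as handle:
        rows = [line.split() for line in handle if line.strip()]

    if len(rows) != len(samples):
        raise ValueError("number of genotype rows and sample IDs differ")

    out_path = f"{prefix}.012.merged"  # hypothetical output name
    with open(out_path, "w") as out:
        out.write("\t".join(["sample"] + loci) + "\n")
        for sample, fields in zip(samples, rows):
            # vcftools writes an individual index as the first column; drop it.
            genotypes = fields[1:]
            if len(genotypes) != len(loci):
                raise ValueError(f"locus count mismatch for {sample}")
            out.write("\t".join([sample] + genotypes) + "\n")
    return out_path
```

Calling `merge_012("filename")` writes one row per individual and one column per locus. Missing genotypes, which vcftools codes as -1, are passed through unchanged.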