The bioinformatics pipeline for the generation and analyses of genomics data in Kawecki et al. 2020. (doi://10.1101/2020.12.01.406686)
We closely followed the bioinformatics pipeline of the data analyses in Kapun et al. 2020 (doi://10.1093/molbev/msaa120). See the documentation for more details
We closely followed the bioinformatics pipeline of the data analyses in Kapun et al. 2020 (doi://10.1093/molbev/msaa120). See the documentation of the DrosEU bioinformatics pipeline and documentation of PoolSNP for details. Commands that differ from the standard pipeline and additional scripts are listed below.
sh /GitHub/PoolSNP/PoolSNP.sh \
mpileup=Input.mpileup.gz \
reference=Dmel_6.04_hologenome_v2.fasta \
names=C1,C2,C3,C4,C5,C6,S1,S2,S3,S4,S5,S6 \
max-cov=0.95 \
min-cov=10 \
min-count=5 \
min-freq=0.002 \
miss-frac=0.1 \
jobs=22 \
badsites \
base-quality=15 \
output=SNPdata
See the documentation of the DrosEU bioinformatics pipeline for more details
After annotating the SNPs in VCF file format, we converted the VCF to the SYNC file format. See the documentation of the DrosEU bioinformatics pipeline for more details
C) Calculation of unbiased population genetics estimators Tajima's pi, Watterson's Theta and Tajima's D
We closely followed the bioinformatics pipeline of the data analyses in Kapun et al. 2020 (doi://10.1093/molbev/msaa120). See the documentation for more details
D) Statistical identification of candidate SNPs based on (i) Generalized linear mixed models with a binomial error structure (GLMM) and (ii) Fisher's exact test (FET)
1) calculate GLMM and retain only major chromosomes using GNU parallel
gunzip -c SNPdata.sync.gz \
| parallel \
--jobs 22 \
--pipe \
--no-notice \
-k --cat python2.7 /scripts/glmm.py \
--input {} \
--order 1,2,3,4,5,6+7,8,9,10,11,12-1,2,3,7,8,9+4,5,6,10,11,12-1,3,5,7,9,11+2,4,6,8,10,12-3,4,5,9,10,11+1,2,6,7,8,12-2,3,4,8,9,10+1,5,6,7,11,12 \
--group 0,0,0,0,0,0,1,1,1,1,1,1 \
--pop 0,1,2,3,4,5,6,7,8,9,10,11 \
| awk '$1=="2L" || $1 =="2R" || $1=="3L" || $1=="3R" || $1=="X" || $1=="4"' \
| gzip > SNPdata.glmm.gz
See the documentation of the DrosEU bioinformatics pipeline for more details
3) calculate FET of dataset with normalized read-depth and retain only major chromosomes using GNU parallel
gunzip -c SNPdata-30x.sync.gz \
| parallel \
--jobs 22 \
--pipe \
--no-notice \
-k --cat python2.7 /scripts/fet.py \
--input {} \
--order 1,2,3,4,5,6+7,8,9,10,11,12-1,2,3,7,8,9+4,5,6,10,11,12-1,3,5,7,9,11+2,4,6,8,10,12-3,4,5,9,10,11+1,2,6,7,8,12-2,3,4,8,9,10+1,5,6,7,11,12 \
| awk '$1=="2L" || $1 =="2R" || $1=="3L" || $1=="3R" || $1=="X" || $1=="4"' \
| gzip > SNPdata-30x.fet.gz
The method is based on the approach in Jha et al. (2015) (doi://10.1093/molbev/msv248).
python2.7 /scripts/FDR-rank.py \
--input SNPdata.glmm.gz \
--true -6 \
--permuted -2,-3,-4,-5 \
--output SNPdata.glmm.fdr.gz
awk '$4<=0.05' SNPdata.glmm.fdr.gz > SNPdata_cand005.glmm.fdr.gz
python2.7 /scripts/FDR-rank.py \
--input SNPdata.fet.gz \
--true -6 \
--permuted -2,-3,-4,-5 \
--output SNPdata.fet.fdr.gz
awk '$4<=0.05' SNPdata.fet.fdr.gz > SNPdata_cand005.fet.fdr.gz
python2.7 /scripts/AFbyAllele.py \
--input SNPdata.sync.gz \
| gzip > SNPdata.af.gz
5) further select candidates with an absolute average allele frequency difference (AFD) between selected and controls >= 0.3
python /GitHub/DrosEU_pipeline/scripts/OverlapSNPs.py \
--source SNPdata_cand005.glmm.fdr.gz \
--target SNPdata.af.gz \
| awk '{C=$(NF-11)+$(NF-10)+$(NF-9)+$(NF-8)+$(NF-7)+$(NF-6); S=$(NF-6)+$(NF-5)+$(NF-4)+$(NF-3)+$(NF-2)+$(NF-1)+$(NF); if(sqrt((S/6-C/6)^2)>=0.3){print}}' \
> SNPdata_cand005_AFD03.fdr.glmm
python /GitHub/DrosEU_pipeline/scripts/OverlapSNPs.py \
--source SNPdata_cand005.fet.fdr.gz \
--target SNPdata.af.gz \
| awk '{C=$(NF-11)+$(NF-10)+$(NF-9)+$(NF-8)+$(NF-7)+$(NF-6); S=$(NF-6)+$(NF-5)+$(NF-4)+$(NF-3)+$(NF-2)+$(NF-1)+$(NF); if(sqrt((S/6-C/6)^2)>=0.3){print}}' \
> SNPdata_cand005_AFD03.fdr.fet
6) Merge and define candidates based on allele frequency patterns in "High" and "Mid" frequency candidates (or "ambiguous")
python /scripts/CandTypes.py \
--glmm SNPdata_cand005_AFD03.fdr.glmm \
--fet SNPdata_cand005_AFD03.fdr.fet \
> SNPdata_cand005_AFD03.cand