Bioinformatics pipeline of Kawecki et al. 2020.

The bioinformatics pipeline for the generation and analyses of genomics data in Kawecki et al. 2020. (doi://10.1101/2020.12.01.406686)

A) trim, map and re-align around InDels

We closely followed the bioinformatics pipeline of the data analyses in Kapun et al. 2020 (doi://10.1093/molbev/msaa120). See the documentation for more details

B) Merging BAM files and joint SNP calling

We closely followed the bioinformatics pipeline of the data analyses in Kapun et al. 2020 (doi://10.1093/molbev/msaa120). See the documentation of the DrosEU bioinformatics pipeline and documentation of PoolSNP for details. Commands that differ from the standard pipeline and additional scripts are listed below.

1) SNP calling

sh /GitHub/PoolSNP/PoolSNP.sh  \
mpileup=Input.mpileup.gz \
reference=Dmel_6.04_hologenome_v2.fasta  \
names=C1,C2,C3,C4,C5,C6,S1,S2,S3,S4,S5,S6 \
max-cov=0.95 \
min-cov=10 \
min-count=5 \
min-freq=0.002 \
miss-frac=0.1 \
jobs=22 \
badsites \
base-quality=15 \
output=SNPdata

2) annotation with SNPeff

See the documentation of the DrosEU bioinformatics pipeline for more details

3) Conversion to sync file format

After annotating the SNPs in VCF file format, we converted the VCF to the SYNC file format. See the documentation of the DrosEU bioinformatics pipeline for more details

C) Calculation of unbiased population genetics estimators Tajima's pi, Watterson's Theta and Tajima's D

We closely followed the bioinformatics pipeline of the data analyses in Kapun et al. 2020 (doi://10.1093/molbev/msaa120). See the documentation for more details

D) Statistical identification of candidate SNPs based on (i) Generalized linear mixed models with a binomial error structure (GLMM) and (ii) Fisher's exact test (FET)

1) calculate GLMM and retain only major chromosomes using GNU parallel

gunzip -c SNPdata.sync.gz  \
| parallel \
--jobs 22 \
--pipe \
--no-notice \
-k --cat python2.7 /scripts/glmm.py \
--input {} \
--order 1,2,3,4,5,6+7,8,9,10,11,12-1,2,3,7,8,9+4,5,6,10,11,12-1,3,5,7,9,11+2,4,6,8,10,12-3,4,5,9,10,11+1,2,6,7,8,12-2,3,4,8,9,10+1,5,6,7,11,12 \
--group 0,0,0,0,0,0,1,1,1,1,1,1 \
--pop 0,1,2,3,4,5,6,7,8,9,10,11 \
| awk '$1=="2L" || $1 =="2R" || $1=="3L" || $1=="3R" || $1=="X" || $1=="4"' \
| gzip > SNPdata.glmm.gz

2) subsample the SNP dataset in SYNC file format to an even 30x read-depth.

See the documentation of the DrosEU bioinformatics pipeline for more details

3) calculate FET of dataset with normalized read-depth and retain only major chromosomes using GNU parallel

gunzip -c SNPdata-30x.sync.gz  \
| parallel \
--jobs 22 \
--pipe \
--no-notice \
-k --cat python2.7 /scripts/fet.py \
--input {} \
--order 1,2,3,4,5,6+7,8,9,10,11,12-1,2,3,7,8,9+4,5,6,10,11,12-1,3,5,7,9,11+2,4,6,8,10,12-3,4,5,9,10,11+1,2,6,7,8,12-2,3,4,8,9,10+1,5,6,7,11,12 \
| awk '$1=="2L" || $1 =="2R" || $1=="3L" || $1=="3R" || $1=="X" || $1=="4"' \
| gzip > SNPdata-30x.fet.gz

4) calculate empirical false discovery rate and identify candidates with q-value <=0.05

The method is based on the approach in Jha et al. (2015) (doi://10.1093/molbev/msv248).

python2.7 /scripts/FDR-rank.py \
--input SNPdata.glmm.gz \
--true -6 \
--permuted -2,-3,-4,-5 \
--output SNPdata.glmm.fdr.gz

awk '$4<=0.05' SNPdata.glmm.fdr.gz > SNPdata_cand005.glmm.fdr.gz

python2.7 /scripts/FDR-rank.py \
--input SNPdata.fet.gz \
--true -6 \
--permuted -2,-3,-4,-5 \
--output SNPdata.fet.fdr.gz

awk '$4<=0.05' SNPdata.fet.fdr.gz > SNPdata_cand005.fet.fdr.gz

5) calculate frequencies of the allele at higher frequencies in the selected populations

python2.7 /scripts/AFbyAllele.py \
--input SNPdata.sync.gz \
| gzip > SNPdata.af.gz

5) further select candidates with an absolute average allele frequency difference (AFD) between selected and controls >= 0.3

python /GitHub/DrosEU_pipeline/scripts/OverlapSNPs.py \
--source SNPdata_cand005.glmm.fdr.gz \
--target SNPdata.af.gz \
| awk '{C=$(NF-11)+$(NF-10)+$(NF-9)+$(NF-8)+$(NF-7)+$(NF-6); S=$(NF-6)+$(NF-5)+$(NF-4)+$(NF-3)+$(NF-2)+$(NF-1)+$(NF); if(sqrt((S/6-C/6)^2)>=0.3){print}}' \
> SNPdata_cand005_AFD03.fdr.glmm

python /GitHub/DrosEU_pipeline/scripts/OverlapSNPs.py \
--source SNPdata_cand005.fet.fdr.gz \
--target SNPdata.af.gz \
| awk '{C=$(NF-11)+$(NF-10)+$(NF-9)+$(NF-8)+$(NF-7)+$(NF-6); S=$(NF-6)+$(NF-5)+$(NF-4)+$(NF-3)+$(NF-2)+$(NF-1)+$(NF); if(sqrt((S/6-C/6)^2)>=0.3){print}}' \
> SNPdata_cand005_AFD03.fdr.fet

6) Merge and define candidates based on allele frequency patterns in "High" and "Mid" frequency candidates (or "ambiguous")

python /scripts/CandTypes.py \
--glmm SNPdata_cand005_AFD03.fdr.glmm \
--fet SNPdata_cand005_AFD03.fdr.fet \
> SNPdata_cand005_AFD03.cand

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
scripts		scripts
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bioinformatics pipeline of Kawecki et al. 2020.

A) trim, map and re-align around InDels

B) Merging BAM files and joint SNP calling

1) SNP calling

2) annotation with SNPeff

3) Conversion to sync file format

C) Calculation of unbiased population genetics estimators Tajima's pi, Watterson's Theta and Tajima's D

D) Statistical identification of candidate SNPs based on (i) Generalized linear mixed models with a binomial error structure (GLMM) and (ii) Fisher's exact test (FET)

1) calculate GLMM and retain only major chromosomes using GNU parallel

2) subsample the SNP dataset in SYNC file format to an even 30x read-depth.

3) calculate FET of dataset with normalized read-depth and retain only major chromosomes using GNU parallel

4) calculate empirical false discovery rate and identify candidates with q-value <=0.05

5) calculate frequencies of the allele at higher frequencies in the selected populations

5) further select candidates with an absolute average allele frequency difference (AFD) between selected and controls >= 0.3

6) Merge and define candidates based on allele frequency patterns in "High" and "Mid" frequency candidates (or "ambiguous")

About

Releases

Packages

Languages

License

capoony/GenomicsOfLarvalMalnutrition

Folders and files

Latest commit

History

Repository files navigation

Bioinformatics pipeline of Kawecki et al. 2020.

A) trim, map and re-align around InDels

B) Merging BAM files and joint SNP calling

1) SNP calling

2) annotation with SNPeff

3) Conversion to sync file format

C) Calculation of unbiased population genetics estimators Tajima's pi, Watterson's Theta and Tajima's D

D) Statistical identification of candidate SNPs based on (i) Generalized linear mixed models with a binomial error structure (GLMM) and (ii) Fisher's exact test (FET)

1) calculate GLMM and retain only major chromosomes using GNU parallel

2) subsample the SNP dataset in SYNC file format to an even 30x read-depth.

3) calculate FET of dataset with normalized read-depth and retain only major chromosomes using GNU parallel

4) calculate empirical false discovery rate and identify candidates with q-value <=0.05

5) calculate frequencies of the allele at higher frequencies in the selected populations

5) further select candidates with an absolute average allele frequency difference (AFD) between selected and controls >= 0.3

6) Merge and define candidates based on allele frequency patterns in "High" and "Mid" frequency candidates (or "ambiguous")

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages