- Understand what is aligning reads to reference genome
- Understand genome indexing
- Understand SAM/BAM file format
- Understand the reason of sorting SAM/BAM file by coordinates
- Understand what is read coverage of positions
- Understand VCF/BCF file format
- Be able to perform each steps of the variant calling workflow
We mentioned before that we are working with files from a long-term evolution study of an E. coli population (designated Ara-3). Now that we have looked at out data to make sure that it is high quality, we can perform variant calling to see how the population changed over time. We care how this population changed relative to the original population E. coli strain REL606. Therefore, we will align each of our samples to the E. coli REL606 reference genome, and see what differences exist in our reads versus the genome.
We perform read alignment or mapping to determine where in the genome our reads originated from. There are a number of tools to choose from and, while there is no gold standard, there are some tools that are better suited for particular Next Generation Sequencing (NGS) analyses.
We will use the Burrows Wheeler Aligner (BWA), which is a software package for mapping low-divergent sequences against a large reference genome. It consists of 3 algorithms: BWA-backtrack, BWA-SW, and BWA-MEM.
BWA-backtrack is designed for Illumina sequence reads up to 100bp. while the rest two for longer sequences ranged from 70bp to 1MBP. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.
The alignment process consists of two steps:
- Indexing the reference genome
- Aligning the reads to the reference genome
First we need to download the reference genome.
mkdir -p ~/workshops/variant-calling/ref-genome
cd ~/workshops/variant-calling/ref-genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
# curl -O for Mac User
mv GCA_000017985.1_ASM1798v1_genomic.fna.gz ecoli_rel606.fasta.gz
gunzip ecoli_rel606.fasta.gz
Then, we need to create directories for the results that will be generated as part of this workflow.
cd ~/workshops/variant-calling/
mkdir -p results/sam results/bam results/bcf results/vcf
Q: what does -p
mean?
The first step of aligning reads to reference genome is to index the genome. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment.
Indexing only needs to be run once. Because when you index a reference genome, it will generate a few result files, and the result files store all the indexing information. The generated files are always there and will not disappear if you don't delete it.
The only reason you would want to create a new index is if you are working with a different reference genome or you are using a different alignment tool.
The command to index a reference genome is:
bwa index [genome_file]
Please index the E. coli REL606 genome.
Q: Did you get any output files? What are they?
In this step, we will use the BWA-MEM algorithm, which is the latest and is generally recommended for high-quality queries as it is faster and more accurate.
An example command of how to do the alignment is:
bwa mem [ref_genome] [sample_R1.fastq] [sample_R2.fastq] > [output.sam]
Please run the alignment on sample SRR2584863
.
It takes about 2 minutes to run. After it finishes running, you will get an output SAM file.
SAM stands for Sequence Alignment Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Header lines start with @
, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information.
We can take a look at our SAM file:
@SQ SN:CP000819.1 LN:4629812
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem ref-genome/ecoli_rel606.fasta trimmed-fastq/SRR2584863_1.trim.fastq trimmed-fastq/SRR2584863_2.trim.fastq
SRR2584863.3 99 CP000819.1 2067177 60 138M = 2067947 798 ATCAACAACACGCGTTTATTGGTCTGGCTGATCACCGCCGCCAGGTTGGCGCAGACAAAGGTTTTACCGATTGACGGGCTAACCCCGGTCATCATCAACACATTGTTCTGCGCCTGCATCATCGCGAAGTGCAGGCTG C@CFFFFFHHHGHIIJJJJJJIIJIIJJJIIJJJJJGGGGGBGHHHFFBEDDDBBDDDDDD:>CBCCC>5;B?CDBBDBBBCCA?BD5<<@AAACCDCDCDDCDDDCDDCDBBDDBD<@>@@CCBDB<@BACCC<38? NM:i:0 MD:Z:138 MC:Z:28M AS:i:138 XS:i:0
SRR2584863.3 147 CP000819.1 2067947 60 28M = 2067177 -798 CAGCGGAAGATCAACAGAATCTTTATCC IIIIIIIHHHDHHHIDHHF>DDDDA@<@ NM:i:0 MD:Z:28 MC:Z:138M AS:i:28 XS:i:0
SRR2584863.5 99 CP000819.1 1231976 60 150M = 1232448 498 TCCGTAAACATTTTAATGTCGTGCTCGAAAGAACGGTGCGTGAACTGCGCGGCGAACCCTGTTTGCAACTGGAAGAGTTTGCACCGACGAAGCAGGAAATTATCTGTTCCCGCTCGTTTGGTGAACGCATCACGGATTATCCGTCGATGC @@CFBADBFHFHHFIJIGGGGFGHIIIJJJEHIIIGHIJJGIIJGGIIIJJHFDDDBDDDDCDCCCCCDDCCCDDDDACCCDDCDDDDDBD?>CC@DDDDB@ACDCACCCDDBBBB<@D?@BBCCDDDDBBCDB>B>@CDDDDBDDDDC@ NM:i:0 MD:Z:150 MC:Z:26M AS:i:150 XS:i:0
SRR2584863.5 147 CP000819.1 1232448 60 26M = 1231976 -498 GCAATTGATGACGGTAATGGATACAC GHIGHHIIJIIEHHFHHHEDEDF@CC NM:i:0 MD:Z:26 MC:Z:150M AS:i:26 XS:i:0
SRR2584863.7 83 CP000819.1 2562378 60 64M = 2561695 -747 CGGCACGTTAACCTGCTGTTTGATGAGTTTGAACGCTTCTGCCGCGTCCATCGTCGGTACGGAT ;B?<<<:BABBB?;;3BBCCCB@>B>BDDBDFC@==;CFBFBG<?GC1C8BFF@FFDDDDD@@@ NM:i:0 MD:Z:64 MC:Z:40M AS:i:64 XS:i:0
SRR2584863.7 163 CP000819.1 2561695 60 40M = 2562378 747 GGGCCATTCACCACGCAGCCGATAATCGAAACGTCCATCG ;?<ABDDDHDHFFGGI@GGBBGGEHEGGG9D<GFGCDH)< NM:i:0 MD:Z:40 MC:Z:64M AS:i:40 XS:i:0
SRR2584863.8 83 CP000819.1 4263438 60 101M = 4263359 -180 AAGGTGTAGCGCCCCAGGTTACGCAGACGTTCGGCAATCAGGAACAAAATGATCGGCCAGCCCACCAGGAAGCCCAGCGAGTAAATTAAGCCGTCATAGCC CA3?>95555<5DB?4:@<<5@@<?<8?@@@>CCCCCACECECFFDFFFDB:GHHGBHDEEGD9IJIGIGIHEJIGIIIGJJIJJJJJHHHHHFFFFFBC@ NM:i:0 MD:Z:101 MC:Z:150M AS:i:101 XS:i:0
SRR2584863.8 163 CP000819.1 4263359 60 150M = 4263438 180 CGCCACCACCACCAGAGAACCACAGGCCGAAAGAATACGAATCGGCCCTTGTTTCAGACGGTAAGAGGCCACATCGGCAAAGGTGTAGCGCCCCAGGTTACGCAGACGTTCGGCAATCAGGAACAAAATGATCGGCCAGCCCACCAGGAA CCCFFFFFHHGHHJIJIIHJJIIJJJJJJJJJIJICHJGIIDEHGIIIEHCHFDDDCC@CB;AB;@C@BDB@DACDD@BDBDD4::A@C5@BDBBD?CDCC<[email protected]@D??B@@BB>>:AACD?BDDBC@CA>?9@9<?BD@2<<8A8AA NM:i:0 MD:Z:150 MC:Z:101M AS:i:150 XS:i:0
...
We can see that there are 2 header lines then following with alignment lines.
Q: why is there 2 lines have the same sequence name?
BAM is the compressed binary version of SAM. We will convert the SAM file to BAM format using the samtools
program with the view
command and specify the output format in BAM using -b
option.
samtools view -b [aligned.sam] > [aligned.bam]
This takes about 1 minute to run.
Please compare the size of the SAM and BAM file, how much the file size has decreased?
Normally, to save the disk space we don't keep the SAM files. We can pipe the alignment step and the convert format step together to get the BAM file directly.
bwa mem [ref_genome] [sample_R1.fastq] [sample_R2.fastq] | samtools view -b > [aligned.bam]
Please try this method on sample SRR2584866
.
If you have a look of the previous SAM file, you will find out that the sequences are ordered the same as the fastq file. The reads in fastq files are ordered by the time they go through the sequencing machine.
In this step, we will sort the BAM file so the sequences are in order of the position they were mapped to the genome from left to right.
We sort the BAM file using the sort
command from samtools
. -o
tells the command where to write the output. Run this on sample SRR2584863
.
samtools sort -o [aligned.sorted.bam] [aligned.bam]
This takes about 1 minute to run.
To better understand the sorted result, we can set the output to SAM file. Then we can have a look and compare the sorted SAM with the original one.
samtools sort -o [aligned.sorted.sam] [aligned.bam]
The sorted SAM file should look like this:
@HD VN:1.6 SO:coordinate
@SQ SN:CP000819.1 LN:4629812
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem ref-genome/ecoli_rel606.fasta trimmed-fastq/SRR2584863_1.trim.fastq trimmed-fastq/SRR2584863_2.trim.fastq
SRR2584863.72312 2147 CP000819.1 1 60 116H34M = 4629703 4629812 AGCTTTTCATTCTGACTGCAACGGGCAATATGTC DDDADDDDDEDDEEEDDDDDCDDBDB<?CCDDDD NM:i:0 MD:Z:34 MC:Z:110M40H AS:i:34 XS:i:0 SA:Z:CP000819.1,4629697,+,116M34S,60,0;
SRR2584863.236239 99 CP000819.1 1 60 55S95M = 304 453 GCATGATATTGAAAAAAATATCACCAAATAAAAAACGCCTTAGTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAG @BCFFFFFHHHHHJJJJJIIJJIJJJJIIJJJJJJIGJIJIIIAHGHCFGIJJJIIIHHHHHGHFFFFFBDECACCDDDB=BBCDDEDDEDCDCDDDB>@@>CDDDDBBDCDDACC@@CCCCC:<A@CCCD@@>C>CC@C>A4<8099?B NM:i:0 MD:Z:95 MC:Z:150M AS:i:95 XS:i:0 SA:Z:CP000819.1,4629758,+,55M95S,60,0;
SRR2584863.275624 99 CP000819.1 1 60 14S136M = 145 294 AGTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAA @B@FFFBDHHHHHJJIGJJJJJJJJJJJJIJJJJJJIJJJIJIGIGJJJJJJIJFGIHIJJJJJIJHFFFDFEEEEEEEDDDDDDDDDDEDDDDDCDDDDDCBDDBDDDCDDDEEDDDDEDDDEDDDDDDDDDDDDDDDDCDCDDDDEDD NM:i:0 MD:Z:136 MC:Z:150M AS:i:136 XS:i:0
SRR2584863.275984 2209 CP000819.1 1 60 87H63M = 848 896 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGA CCCCCCCCCACDDDCCCCCCCCBB>BBBCCCE>>A@A>ACABBC@>>:@>@B@BB<>:@>CCC NM:i:0 MD:Z:63 MC:Z:49M AS:i:63 XS:i:0 SA:Z:CP000819.1,4629726,+,87M63S,60,0;
SRR2584863.404450 2209 CP000819.1 1 60 91H59M = 437 481 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGT DDDDDDCDDEDDEEEDDDDDDCDDDBDDDDCEEDDCCDDDDDDCDCDDCDDDDBDDCDC NM:i:0 MD:Z:59 MC:Z:45M AS:i:59 XS:i:SA:Z:CP000819.1,4629722,+,91M59S,60,0;
SRR2584863.651714 2145 CP000819.1 1 60 95H55M = 42 175 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGA DDDDDDACDEEEEEEDDDDDCDBBD@BDCCDEDDEDDDCD@BDDDDCCDCCB@B? NM:i:0 MD:Z:55 MC:Z:134M AS:i:55 XS:i:0 SA:Z:CP000819.1,4629718,+,95M55S,60,0;
SRR2584863.702752 163 CP000819.1 1 60 28S122M = 653 802 ATAAAAAACGCCTTAGTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTC <@BFFFFFHHHHHJIJIJJJHIIJJJJJIJJJJJJHHIIJJJJJJIGIJIJIIJJJJJJJHGHHHHHFDFFFEDECEDDDDBBCACDDDDDEDACDDDDDDDDEDDDCDDDDDDDDBDDDBDDDDDCDEDDDDDCDDDDDCDCDDDDCCD NM:i:0 MD:Z:122 MC:Z:150M AS:i:122 XS:i:0
...
You can see that the lines are ordered by the position in the chromosome rather than the read names.
SAM/BAM files can be sorted in multiple ways, e.g. by location of alignment on the chromosome, by read name, etc. It is important to be aware that different alignment tools will output differently sorted SAM/BAM, and different downstream tools require differently sorted alignment files as input.
For example, the alignment tool BWA-MEM that we used gave us read name sorted file, and the variant calling requires coordinate sorted BAM file as input.
A variant call is a conclusion that there is a nucleotide difference compare to the reference genome at a given position, often referred to as a Single Nucleotide Variant (SNV). The call is usually accompanied by an estimate of variant frequency and some measure of confidence.
Similar to other steps in this workflow, there are a number of tools available for variant calling. We will use bcftools
.
There are a few steps we need to do before the actual call of the variants.
This step is to generate a pileup format representation of sequence read data aligned to the reference genome. A pileup format is a textual representation that shows the aligned reads at each position in the reference genome, along with various summary statistics and variant information.
We will use the mpileup
command from bcftools
. The flag -O v
tells bcftools to generate a VCF format output, -o
specifies where to write the output file, and -f
tells the path to the reference genome.
The Variant Call Format (VCF) is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.
bcftools mpileup -O v -o [sample_raw.vcf] -f [reg_genome] [aligned.sorted.bam]
It takes about 10 minutes to run. Your result VCF file should look like this:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.9+htslib-1.9
##bcftoolsCommand=mpileup -O v -o vcf/SRR2584863_raw.vcf -f ../ref-genome/ecoli_rel606.fasta bam/SRR2584863.sorted.aligned.bam
##reference=file://../ref-genome/ecoli_rel606.fasta
##contig=<ID=CP000819.1,length=4629812>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bam/SRR2584863.sorted.aligned.bam
CP000819.1 1 . A <*> 0 . DP=27;I16=1,18,0,0,676,24074,0,0,1140,68400,0,0,0,0,0,0;QS=1,0;MQSB=1;MQ0F=0 PL 0,57,229
CP000819.1 2 . G <*> 0 . DP=27;I16=2,18,0,0,745,28387,0,0,1200,72000,0,0,26,68,0,0;QS=1,0;MQSB=1;MQ0F=0 PL 0,60,255
CP000819.1 3 . C <*> 0 . DP=27;I16=2,18,0,0,736,27594,0,0,1200,72000,0,0,46,140,0,0;QS=1,0;MQSB=1;MQ0F=0 PL 0,60,255
CP000819.1 4 . T <*> 0 . DP=27;I16=2,18,0,0,741,27889,0,0,1200,72000,0,0,66,252,0,0;QS=1,0;MQSB=1;MQ0F=0 PL 0,60,255
CP000819.1 5 . T <*> 0 . DP=27;I16=2,18,0,0,754,28670,0,0,1200,72000,0,0,86,404,0,0;QS=1,0;MQSB=1;MQ0F=0 PL 0,60,255
CP000819.1 6 . T <*> 0 . DP=27;I16=2,18,0,0,778,30440,0,0,1200,72000,0,0,106,596,0,0;QS=1,0;MQSB=1;MQ0F=0 PL 0,60,255
CP000819.1 7 . T <*> 0 . DP=27;I16=2,18,0,0,763,29341,0,0,1200,72000,0,0,126,828,0,0;QS=1,0;MQSB=1;MQ0F=0 PL 0,60,255
CP000819.1 8 . C <*> 0 . DP=27;I16=2,18,0,0,786,30962,0,0,1200,72000,0,0,146,1100,0,0;QS=1,0;MQSB=1;MQ0F=0 PL 0,60,255
...
Where it stores the information start from the first base of the reference genome.
CHROM
- chromosomePOS
- position: the reference position, with the 1st base having position 1.ID
- identifierREF
- reference baseALT
- alternate baseQUAL
- quality scoreFILTER
- filter status: PASS if this position has passed all filters (a call is made at this position).INFO
- additional information
Like the alignment steps, we don't normally need to read the coverage information so it would be good we can also output this information to a binary file to reduce the storage space used.
Please read the documentation of bcftools mpileup
to see if you can make the output into a binary format? If can please do so.
The above step we only calculated the read coverage but didn't do anything about it, so it contains everything on every position of the genome. As we can see from the VCF file, most of the positions don't have an alternate base.
In this step, we will use bcftools call
to identify SNVs. We need to specify ploidy with the flag --ploidy
, -m
allows for multiallelic and rare-variant calling, -v
to output variant sites only, and -o
specifies where to write the output file.
bcftools call --ploidy 1 -m -v -o [sample_variants.vcf] [sample_raw.bcf]
The variants VCF file should look like this:
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bam/SRR2584863.sorted.aligned.bam
CP000819.1 9972 . T G 225 . DP=81;VDB=0.387587;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,31,48;MQ=60 GT:PL 1:255,0
CP000819.1 263235 . G T 228 . DP=65;VDB=0.00108727;SGB=-0.693146;RPB=0.0433688;MQB=3.80355e-08;MQSB=0.368447;BQB=0.00167695;MQ0F=0.246154;AC=1;AN=1;DP4=10,6,15,27;MQ=28 GT:PL 1:255,0
CP000819.1 281923 . G T 225 . DP=69;VDB=0.0125476;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,40,24;MQ=60 GT:PL 1:255,0
CP000819.1 377000 . T G 228 . DP=60;VDB=0.575301;SGB=-0.693146;RPB=0.467702;MQB=1.73175e-07;MQSB=0.239114;BQB=0.319368;MQ0F=0.233333;AC=1;AN=1;DP4=12,2,22,21;MQ=30 GT:PL 1:255,0
CP000819.1 433359 . CTTTTTTT CTTTTTTTT 116 . INDEL;IDV=73;IMF=0.948052;DP=77;VDB=0.0026231;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=1,3,25,48;MQ=60 GT:PL 1:143,0
CP000819.1 473901 . CCGC CCGCGC 228 . INDEL;IDV=68;IMF=0.944444;DP=72;VDB=0.802322;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=2,1,21,48;MQ=60 GT:PL 1:255,0
CP000819.1 648692 . C T 225 . DP=84;VDB=0.151313;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,58,21;MQ=60 GT:PL 1:255,0
CP000819.1 1331794 . C A 225 . DP=54;VDB=0.330798;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,27,25;MQ=60 GT:PL 1:255,0
CP000819.1 1733343 . G A 225 . DP=95;VDB=0.996142;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,47,44;MQ=60 GT:PL 1:255,0
CP000819.1 2103887 . ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAG ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAG 7.30276 . INDEL;IDV=12;IMF=0.444444;DP=27;VDB=0.969575;SGB=-0.686358;MQSB=0.951472;MQ0F=0;AC=1;AN=1;DP4=9,4,4,10;MQ=58 GT:PL1:255,221
CP000819.1 2333538 . AT ATT 228 . INDEL;IDV=78;IMF=0.962963;DP=81;VDB=0.283362;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,9,40,32;MQ=60 GT:PL 1:255,0
...
Now, there are only variant sites left.
The above step gave us the result of all variant sites in our sample, but not all the sites have high quality score. In this step, we will filter out low quality variants using vcfutils.pl varFilter
.
The vcfutils.pl varFilter
filters out variants that do not meet minimum quality default criteria, which can be changed through its options.
vcfutils.pl varFilter [sample_variants.vcf] > [sample_final_variants.vcf]
Let's take a look of our final result, it should only contain the filtered variants.
Exercise: take a guess of how many variants left?
Exercise: how can you count it using the vcf file?
Here, we have finished all the steps for variant calling. Next, we are going to use a genome browser to look at our data.
Run the variant calling workflow on the other 2 samples.
- Data Carpentry - Data Wrangling and Processing for Genomics
- samtools - Sequence Alignment Map Format Specification
- Wikipedia - Binary Alignment Map
- samtools - The Variant Call Format (VCF) Version 4.2 Specification