In this tutorial, you will learn the basics of TRGT and TRVZ by analyzing a tiny example dataset included in this repository.
- Download the latest TRGT and TRVZ binaries
- Download the tiny example dataset
- Install recent versions of
samtools
andbcftools
Our example dataset consists of a reference genome (reference.fasta
), a file
with aligned reads (sample.bam
), and the repeat definition file
(repeat.bed
). The repeat definitions are stored in a BED file that, among
other information, contains repeat coordinates, repeat identifiers, and motifs:
$ cat repeat.bed
chrA 10000 10061 ID=TR1;MOTIFS=CAG;STRUC=(CAG)n
To genotype the repeat, run:
./trgt --genome example/reference.fasta \
--repeats example/repeat.bed \
--reads example/sample.bam \
--output-prefix sample
The output consists of files sample.vcf.gz
and sample.spanning.bam
. The VCF
file contains repeat genotypes while the BAM file contains pieces of HiFi reads
that fully span the repeat sequences.
For example, here is the first entry of the VCF file:
$ bcftools view --no-header sample.vcf.gz | head -n 1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chrA 10002 . CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG,CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG 0 . TRID=TR1;END=10061;MOTIFS=CAG;STRUC=(CAG)n GT:AL:ALCI:SD:MC:MS:AP:AM 1/1:33,33:30-39,33-33:16,16:11,11:0(0-33),0(0-33):1,1:.,.
It says that:
- There is a tandem repeat starting at position 10001 of chrA
- The reference sequence of this repeat is CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG
- This repeat has two alleles spanning 16 CAG motifs each
TRGT outputs are not sorted. So you need to sort and index the VCF:
bcftools sort -Ob -o sample.sorted.vcf.gz sample.vcf.gz
bcftools index sample.sorted.vcf.gz
and then the BAM:
samtools sort -o sample.spanning.sorted.bam sample.spanning.bam
samtools index sample.spanning.sorted.bam
And that's it! The output files are now ready for downstream analysis.
To visualize the repeat with the identifier "TR1", run:
./trvz --genome example/reference.fasta \
--repeats example/repeat.bed \
--vcf sample.sorted.vcf.gz \
--spanning-reads sample.spanning.sorted.bam \
--repeat-id TR1 \
--image TR1.svg
TRVZ outputs a file TR1.svg
that contains the read pileup image. Note that the
SVG file can be directly edited in vector graphics editing software like
Inkscape.
The resulting pileup plot shows the sequences of reads spanning each repeat allele (blue) and the surrounding flanking sequence (green):