Skip to content

Latest commit

 

History

History
236 lines (157 loc) · 10.2 KB

03_Basecall_Alignment.md

File metadata and controls

236 lines (157 loc) · 10.2 KB

Edinburgh Genomics - MinION Bioinformatics

Alternative Basecallers

Up until recently, the only basecaller for MinION was the ONT cloud-based basecaller EPI2ME/Metrichor.

Community R7 Basecallers

In March 2016, two alternative community-developed basecallers were released: nanocall and deepnano. Both targeted the then current R7 data.

Nanocall was developed in Jared Simpson's lab, and performs basecalling using a hidden markov model (like the original R7 Metrichor).

Deepnano is a python package built on the Theano framework, and uses a deep recurrent neural network (RNN) model to call bases. ONT has also moved towards RNN basecalling and this is now the main method for calling R9 reads.

Live/Local Basecalling in MinKNOW

In August 2016, local basecalling was enabled in a new release of ONT's MinKNOW software (version 1.0.2 onwards). At the moment, the MinKNOW local basecaller only works on R9 1D rapid sequencing data; 2D support in under development. Local basecalling is enabled by two new protocol scripts in MinKNOW, so it forms part of live run processing. Since we aren't doing a live run here, we are unable to demonstrate it.

Albacore - The Basecaller Source Code

Albacore is a C++ project designed to provide a high-performance end-to-end analysis pipeline that can be run on (potentially) any platform.

Albacore is currently only available through the ONT Developer Channel to users who have signed the Developer terms and conditions.

Nanonet - The Research Basecaller Source Code

Nanonet uses the same neural network that is used in Albacore, EPI2ME and live basecalling in MinKNOW. However, Nanonet is research code that is continually under development, not production code. Nanonet is available to the Nanopore Community through the ONT Github

Nanonet provides recurrent neural network basecalling via CURRENNT. Event data is extracted from .fast5 files to create feature vectors to input into a pre-trained network. Output is as a single .fasta file. CURRENT provides GPU acceleration, but nanonet is pure python. Nanonet-Dev provides both 1D and 2D Basecalling.

Read alignment

Aligning reads to a reference

A number of different alignment tools have been proposed for nanopore data. Three good ones are BWA, LAST and Graphmap. Working in a terminal window, we will start with BWA.

BWA straight to sorted BAM:

# start by moving to a tutorial folder
cd ~/tutorial

# bwa index genome
bwa index ~/Data/reference/Ecoli_MG1655/MG1655.fa

# create fasta index for samtools
samtools faidx ~/Data/reference/Ecoli_MG1655/MG1655.fa

# run bwa and pipe straight to samtools to create BAM
bwa mem -x ont2d ~/Data/reference/Ecoli_MG1655/MG1655.fa \
        ~/Data/read_data/R9_1Drapid_2100000-2600000.fasta | \
        samtools view -T ~/Data/reference/Ecoli_MG1655/MG1655.fa -bS - | \
        samtools sort -T r9_1d.bwa -o r9_1d.bwa.bam -

# index bam
samtools index r9_1d.bwa.bam

LAST straight to sorted BAM:

# create a LAST db of the reference genome
lastdb MG1655 ~/Data/reference/Ecoli_MG1655/MG1655.fa

# align reads to reference genome with LAST
lastal -q 1 -a 1 -b 1 MG1655 ~/Data/read_data/R9_1Drapid_2100000-2600000.fasta > r9_1d.maf

# convert the MAF to BAM with complete CIGAR (matches and mismatches)
maf-convert.py sam r9_1d.maf | \
    samtools view -T ~/Data/reference/Ecoli_MG1655/MG1655.fa -bS - | \
    samtools sort -T r9_1d.last -o r9_1d.last.bam -

# index bam
samtools index r9_1d.last.bam

Finally, Graphmap:

~/graphmap/bin/Linux-x64/graphmap align -t 2 -r ~/Data/reference/Ecoli_MG1655/MG1655.fa \
        -d ~/Data/read_data/R9_1Drapid_2100000-2600000.fasta -o r9_1d.graphmap.sam

samtools view -Sb r9_1d.graphmap.sam | \
        samtools sort -o r9_1d.graphmap.bam - 

# index bam
samtools index r9_1d.graphmap.bam

Viewing Alignments in IGV

You can read about viewing alignments in IGV here. We start by running

igv.sh

If required, we load a genome from file ~/Data/reference/Ecoli_MG1655/MG1655.fa, and after that we can load one or more r9_1d.*.bam files to see how they compare. Which one looks best? See if you can find a SNP.

Aligning events with Nanopolish EventAlign

Nanopolish was written by Jared Simpson and represents a suite of tools for aligning events to a reference sequence and working with those (using HMMs) to improve e.g. SNP calling and consensus accuracy

Nanopolish eventalign aligns the raw squiggle data directly to a reference genome, but requires a sequence-based alignment to guide it.

The command below is an example only. It won't work on the VM because nanopolish requires a set of fast5 read files corresponding to the fasta and bam, and that the headers in the fasta give the path to the corresponding reads.

nanopolish eventalign --print-read-names --scale-events \
        --reads ~/Data/read_data/R9_1Drapid_2100000-2600000.fasta \
        -b r9_1d.bwa.bam -g Data/reference/Ecoli_MG1655/MG1655.fa \
        > Data/eventalign/test.eventalign.txt

# eventalign can also output SAM format with the --sam flag

We have a smaller dataset pre-run in Data/eventalign/singleread.eventalign.txt so let's take a look at that:

head ~/Data/eventalign/singleread.eventalign.txt

Now we can visualise in R

R
# Load the poRe library for the plot.as.squiggle function...
library(poRe)

# read in eventalign output as a data.frame
d <- read.table("~/Data/eventalign/singleread.eventalign.txt", header=TRUE, sep="\t")

# let's just look at the template strand
dt <- d[d$strand=="t",]

# plot.as.squiggle is a utility function in R that takes a vector of
# numbers and plots them as if they were an event-driven nanopore
# squiggle.  We need this because in real nanopore data, the events
# are differing times apart, so can be hard to visualise, whereas
# plot.as.squiggle simply uses the same false time interval

# get data for the first 100 event means
em <- plot.as.squiggle(dt$event_level_mean[1:100], plot=FALSE)

# get data for the first 100 model means they are aligned to
mm <- plot.as.squiggle(dt$model_mean[1:100], plot=FALSE)

# plot and visualise
plot(em$times, em$means, type="l", ylab="mean signal", xlab="fake time")
lines(mm$times, mm$means, col="red")

Hairpins

Finally, we will take a brief look at hairpins. In 2D MinION reads, hairpins divide the template and complement components of a MinION read; hairpins are absent from 1D rapid reads.

The hairpins in 2D reads form a characteristic current signal that is detected at the start of the basecalling process and accurate detection is crucial to accurate basecalling.

As I will show in a demo, they look rather different in R7 and R9, and are not always easy to detect correctly.