Skip to content

RRBS Benchmarking

Mike Lloyd edited this page Jun 8, 2022 · 3 revisions

RRBS Benchmark Testing

Two RRBS simulation tools were tested:

  1. RRBSsim
  2. Sherman

In testing these tools:

  1. RRBSsim produced highly duplicated (~90%) libraries, and proved difficult to validate. This tool would be prefered if simulation is attempted again.
  2. Sherman does not provide a 'truth set' output, and proved difficult to validate the resulting methylation calls.

Therefore, simulation tools were abandoned and the pipeline was tested against published results. These data were obtained from GEO and SRA. Accession numbers are noted below.

Library Load

library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
library(data.table)
Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘data.table’


The following objects are masked from ‘package:dplyr’:

    between, first, last

Mouse RRBS

Dataset 1:

GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4626846

SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRX8582110

Note 1 : The calls from the NXF pipeline are +1 bp relative to the published GEO data. Note 2 : The above data used bismark for alignment, and methyldackel for calls. It is therefore not 1:1 with our method. These data are kept in case we eventually try methyldackel.

Dataset 2:

GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3940770

SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRR9681755

Note 1 : Compairing The GEO results, to primary testing showed that Bismark Deduplication was not used prior to methylation extraction.

Truth Set

Mm_RRBS_truth <- read.table('/projects/omics_share/mouse/GRCm38/supporting_files/benchmarking_data/RRBS/SRA/GSM3940770_RRBS-Control-3W-WK-6.bismark.cov', 
                            skip = 1, 
                            col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'score_1', 'score_2'))
head(Mm_RRBS_truth)
A data.frame: 6 × 6
Chrom Start End Meth_Perc score_1 score_2
<chr> <int> <int> <dbl> <int> <int>
1 chr19 3079679 3079679 91.80328 112 10
2 chr19 3079682 3079682 100.00000 122 0
3 chr19 3079728 3079728 100.00000 121 0
4 chr19 3079791 3079791 41.66667 50 70
5 chr19 3079909 3079909 100.00000 33 0
6 chr19 3079916 3079916 60.60606 20 13
Mm_RRBS_truth$Chrom<-gsub("chr","",as.character(Mm_RRBS_truth$Chrom))
head(Mm_RRBS_truth)
A data.frame: 6 × 6
Chrom Start End Meth_Perc score_1 score_2
<chr> <int> <int> <dbl> <int> <int>
1 19 3079679 3079679 91.80328 112 10
2 19 3079682 3079682 100.00000 122 0
3 19 3079728 3079728 100.00000 121 0
4 19 3079791 3079791 41.66667 50 70
5 19 3079909 3079909 100.00000 33 0
6 19 3079916 3079916 60.60606 20 13

NXF Calls

Mm_RRBS_nxf <- read.table('/projects/omics_share/mouse/GRCm38/supporting_files/benchmarking_data/RESULTS/RRBS_mouse_nxf_bench/SRR9681755/bismark_results/SRR9681755_val_1_bismark_bt2_pe.bismark.cov', 
                            skip = 1, 
                            col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'score_1', 'score_2'))
head(Mm_RRBS_nxf)
A data.frame: 6 × 6
Chrom Start End Meth_Perc score_1 score_2
<chr> <int> <int> <dbl> <int> <int>
1 1 3014742 3014742 100 5 0
2 1 3014865 3014865 0 0 6
3 1 3014929 3014929 100 7 0
4 1 3014975 3014975 100 7 0
5 1 3020689 3020689 100 10 0
6 1 3020690 3020690 100 6 0

Join Calls

joined_Mm_calls <- dplyr::left_join(Mm_RRBS_truth, Mm_RRBS_nxf, by=c('Chrom', 'Start', 'End'))
head(joined_Mm_calls)
A data.frame: 6 × 9
Chrom Start End Meth_Perc.x score_1.x score_2.x Meth_Perc.y score_1.y score_2.y
<chr> <int> <int> <dbl> <int> <int> <dbl> <int> <int>
1 19 3079679 3079679 91.80328 112 10 91.59664 109 10
2 19 3079682 3079682 100.00000 122 0 100.00000 119 0
3 19 3079728 3079728 100.00000 121 0 100.00000 119 0
4 19 3079791 3079791 41.66667 50 70 42.37288 50 68
5 19 3079909 3079909 100.00000 33 0 100.00000 32 0
6 19 3079916 3079916 60.60606 20 13 59.37500 19 13
samp <- sample(nrow(joined_Mm_calls), 10000, replace = FALSE)

random_10000 <- joined_Mm_calls[samp,]
plot(random_10000$Meth_Perc.x, random_10000$Meth_Perc.y, xlab = 'GEO Dataset', ylab = 'NXF Pipeline', main = 'Mouse Test Data')

png

Human RRBS

GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4082740

SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRX6862244

Note 1 : Compairing The GEO results, to primary testing showed that Bismark Deduplication was not used prior to methylation extraction.

Truth Set

Hs_RRBS_truth <- read.table('/projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/RRBS/SRA/GSM4082740_281_W766F_father_ICSI-ET.CpG.bismark.cov', 
                            skip = 1, 
                            col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'count_1', 'count_2'))
head(Hs_RRBS_truth)
A data.frame: 6 × 6
Chrom Start End Meth_Perc count_1 count_2
<chr> <int> <int> <dbl> <int> <int>
1 chr19 103586 103586 96.61017 57 2
2 chr19 103660 103660 80.00000 8 2
3 chr19 103661 103661 91.93548 57 5
4 chr19 103674 103674 90.00000 9 1
5 chr19 103675 103675 82.25806 51 11
6 chr19 103690 103690 100.00000 10 0

NXF Set

Hs_RRBS_nxf <- read.table('/projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/RESULTS/RRBS_human_nxf_bench/SRR10133746/bismark_results/SRR10133746_val_1_bismark_bt2_pe.bismark.cov', 
                            skip = 1, 
                            col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'count_1', 'count_2'))

Join Datasets

joined_Hs_calls <- dplyr::left_join(Hs_RRBS_truth, Hs_RRBS_nxf, by=c('Chrom', 'Start', 'End'))
samp <- sample(nrow(joined_Hs_calls), 10000, replace = FALSE)

random_10000 <- joined_Hs_calls[samp,]
plot(random_10000$Meth_Perc.x, random_10000$Meth_Perc.y, xlab = 'GEO Dataset', ylab = 'NXF Pipeline', main = 'Human Test Data')

png

Clone this wiki locally