RRBS Benchmarking

RRBS Benchmark Testing

Two RRBS simulation tools were tested:

In testing these tools:

RRBSsim produced highly duplicated (~90%) libraries, and proved difficult to validate. This tool would be preferred if simulation is attempted again.
Sherman does not provide a 'truth set' output, and proved difficult to validate the resulting methylation calls.

Therefore, simulation tools were abandoned and the pipeline was tested against published results. These data were obtained from GEO and SRA. Accession numbers are noted below.

Library Load

library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
library(data.table)

Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘data.table’


The following objects are masked from ‘package:dplyr’:

    between, first, last

Mouse RRBS

Dataset 1:

GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4626846

SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRX8582110

Note 1 : The calls from the NXF pipeline are +1 bp relative to the published GEO data. Note 2 : The above data used bismark for alignment, and methyldackel for calls. It is therefore not 1:1 with our method. These data are kept in case we eventually try methyldackel.

Dataset 2:

GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3940770

SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRR9681755

Note 1 : Compairing The GEO results, to primary testing showed that Bismark Deduplication was not used prior to methylation extraction.

Truth Set

Mm_RRBS_truth <- read.table('/projects/omics_share/mouse/GRCm38/supporting_files/benchmarking_data/RRBS/SRA/GSM3940770_RRBS-Control-3W-WK-6.bismark.cov', 
                            skip = 1, 
                            col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'score_1', 'score_2'))

head(Mm_RRBS_truth)

A data.frame: 6 × 6
	Chrom	Start	End	Meth_Perc	score_1	score_2
	<chr>	<int>	<int>	<dbl>	<int>	<int>
1	chr19	3079679	3079679	91.80328	112	10
2	chr19	3079682	3079682	100.00000	122	0
3	chr19	3079728	3079728	100.00000	121	0
4	chr19	3079791	3079791	41.66667	50	70
5	chr19	3079909	3079909	100.00000	33	0
6	chr19	3079916	3079916	60.60606	20	13

Mm_RRBS_truth$Chrom<-gsub("chr","",as.character(Mm_RRBS_truth$Chrom))

head(Mm_RRBS_truth)

A data.frame: 6 × 6
	Chrom	Start	End	Meth_Perc	score_1	score_2
	<chr>	<int>	<int>	<dbl>	<int>	<int>
1	19	3079679	3079679	91.80328	112	10
2	19	3079682	3079682	100.00000	122	0
3	19	3079728	3079728	100.00000	121	0
4	19	3079791	3079791	41.66667	50	70
5	19	3079909	3079909	100.00000	33	0
6	19	3079916	3079916	60.60606	20	13

NXF Calls

Mm_RRBS_nxf <- read.table('/projects/omics_share/mouse/GRCm38/supporting_files/benchmarking_data/RESULTS/RRBS_mouse_nxf_bench/SRR9681755/bismark_results/SRR9681755_val_1_bismark_bt2_pe.bismark.cov', 
                            skip = 1, 
                            col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'score_1', 'score_2'))

head(Mm_RRBS_nxf)

A data.frame: 6 × 6
	Chrom	Start	End	Meth_Perc	score_1	score_2
	<chr>	<int>	<int>	<dbl>	<int>	<int>
1	1	3014742	3014742	100	5	0
2	1	3014865	3014865	0	0	6
3	1	3014929	3014929	100	7	0
4	1	3014975	3014975	100	7	0
5	1	3020689	3020689	100	10	0
6	1	3020690	3020690	100	6	0

Join Calls

joined_Mm_calls <- dplyr::left_join(Mm_RRBS_truth, Mm_RRBS_nxf, by=c('Chrom', 'Start', 'End'))

head(joined_Mm_calls)

A data.frame: 6 × 9
	Chrom	Start	End	Meth_Perc.x	score_1.x	score_2.x	Meth_Perc.y	score_1.y	score_2.y
	<chr>	<int>	<int>	<dbl>	<int>	<int>	<dbl>	<int>	<int>
1	19	3079679	3079679	91.80328	112	10	91.59664	109	10
2	19	3079682	3079682	100.00000	122	0	100.00000	119	0
3	19	3079728	3079728	100.00000	121	0	100.00000	119	0
4	19	3079791	3079791	41.66667	50	70	42.37288	50	68
5	19	3079909	3079909	100.00000	33	0	100.00000	32	0
6	19	3079916	3079916	60.60606	20	13	59.37500	19	13

samp <- sample(nrow(joined_Mm_calls), 10000, replace = FALSE)

random_10000 <- joined_Mm_calls[samp,]

plot(random_10000$Meth_Perc.x, random_10000$Meth_Perc.y, xlab = 'GEO Dataset', ylab = 'NXF Pipeline', main = 'Mouse Test Data')

Human RRBS

GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4082740

SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRX6862244

Note 1 : Compairing The GEO results, to primary testing showed that Bismark Deduplication was not used prior to methylation extraction.

Truth Set

Hs_RRBS_truth <- read.table('/projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/RRBS/SRA/GSM4082740_281_W766F_father_ICSI-ET.CpG.bismark.cov', 
                            skip = 1, 
                            col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'count_1', 'count_2'))

head(Hs_RRBS_truth)

A data.frame: 6 × 6
	Chrom	Start	End	Meth_Perc	count_1	count_2
	<chr>	<int>	<int>	<dbl>	<int>	<int>
1	chr19	103586	103586	96.61017	57	2
2	chr19	103660	103660	80.00000	8	2
3	chr19	103661	103661	91.93548	57	5
4	chr19	103674	103674	90.00000	9	1
5	chr19	103675	103675	82.25806	51	11
6	chr19	103690	103690	100.00000	10	0

NXF Set

Hs_RRBS_nxf <- read.table('/projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/RESULTS/RRBS_human_nxf_bench/SRR10133746/bismark_results/SRR10133746_val_1_bismark_bt2_pe.bismark.cov', 
                            skip = 1, 
                            col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'count_1', 'count_2'))

Join Datasets

joined_Hs_calls <- dplyr::left_join(Hs_RRBS_truth, Hs_RRBS_nxf, by=c('Chrom', 'Start', 'End'))

samp <- sample(nrow(joined_Hs_calls), 10000, replace = FALSE)

random_10000 <- joined_Hs_calls[samp,]

plot(random_10000$Meth_Perc.x, random_10000$Meth_Perc.y, xlab = 'GEO Dataset', ylab = 'NXF Pipeline', main = 'Human Test Data')

Home

Quick Start for JAX Users

Troubleshooting

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RRBS Benchmarking

RRBS Benchmark Testing

Library Load

Mouse RRBS

Truth Set

NXF Calls

Join Calls

Human RRBS

Truth Set

NXF Set

Join Datasets

Home

Documentation

Pipeline Documentation

Benchmarking Documentation

Pipeline development and Release Documentation

Clone this wiki locally