-
Notifications
You must be signed in to change notification settings - Fork 10
RRBS Benchmarking
Two RRBS simulation tools were tested:
In testing these tools:
- RRBSsim produced highly duplicated (~90%) libraries, and proved difficult to validate. This tool would be preferred if simulation is attempted again.
- Sherman does not provide a 'truth set' output, and proved difficult to validate the resulting methylation calls.
Therefore, simulation tools were abandoned and the pipeline was tested against published results. These data were obtained from GEO and SRA. Accession numbers are noted below.
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
library(data.table)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Attaching package: ‘data.table’
The following objects are masked from ‘package:dplyr’:
between, first, last
Dataset 1:
GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4626846
SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRX8582110
Note 1 : The calls from the NXF pipeline are +1 bp relative to the published GEO data. Note 2 : The above data used bismark for alignment, and methyldackel for calls. It is therefore not 1:1 with our method. These data are kept in case we eventually try methyldackel.
Dataset 2:
GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3940770
SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRR9681755
Note 1 : Compairing The GEO results, to primary testing showed that Bismark Deduplication was not used prior to methylation extraction.
Mm_RRBS_truth <- read.table('/projects/omics_share/mouse/GRCm38/supporting_files/benchmarking_data/RRBS/SRA/GSM3940770_RRBS-Control-3W-WK-6.bismark.cov',
skip = 1,
col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'score_1', 'score_2'))
head(Mm_RRBS_truth)
Chrom | Start | End | Meth_Perc | score_1 | score_2 | |
---|---|---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <int> | <int> | |
1 | chr19 | 3079679 | 3079679 | 91.80328 | 112 | 10 |
2 | chr19 | 3079682 | 3079682 | 100.00000 | 122 | 0 |
3 | chr19 | 3079728 | 3079728 | 100.00000 | 121 | 0 |
4 | chr19 | 3079791 | 3079791 | 41.66667 | 50 | 70 |
5 | chr19 | 3079909 | 3079909 | 100.00000 | 33 | 0 |
6 | chr19 | 3079916 | 3079916 | 60.60606 | 20 | 13 |
Mm_RRBS_truth$Chrom<-gsub("chr","",as.character(Mm_RRBS_truth$Chrom))
head(Mm_RRBS_truth)
Chrom | Start | End | Meth_Perc | score_1 | score_2 | |
---|---|---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <int> | <int> | |
1 | 19 | 3079679 | 3079679 | 91.80328 | 112 | 10 |
2 | 19 | 3079682 | 3079682 | 100.00000 | 122 | 0 |
3 | 19 | 3079728 | 3079728 | 100.00000 | 121 | 0 |
4 | 19 | 3079791 | 3079791 | 41.66667 | 50 | 70 |
5 | 19 | 3079909 | 3079909 | 100.00000 | 33 | 0 |
6 | 19 | 3079916 | 3079916 | 60.60606 | 20 | 13 |
Mm_RRBS_nxf <- read.table('/projects/omics_share/mouse/GRCm38/supporting_files/benchmarking_data/RESULTS/RRBS_mouse_nxf_bench/SRR9681755/bismark_results/SRR9681755_val_1_bismark_bt2_pe.bismark.cov',
skip = 1,
col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'score_1', 'score_2'))
head(Mm_RRBS_nxf)
Chrom | Start | End | Meth_Perc | score_1 | score_2 | |
---|---|---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <int> | <int> | |
1 | 1 | 3014742 | 3014742 | 100 | 5 | 0 |
2 | 1 | 3014865 | 3014865 | 0 | 0 | 6 |
3 | 1 | 3014929 | 3014929 | 100 | 7 | 0 |
4 | 1 | 3014975 | 3014975 | 100 | 7 | 0 |
5 | 1 | 3020689 | 3020689 | 100 | 10 | 0 |
6 | 1 | 3020690 | 3020690 | 100 | 6 | 0 |
joined_Mm_calls <- dplyr::left_join(Mm_RRBS_truth, Mm_RRBS_nxf, by=c('Chrom', 'Start', 'End'))
head(joined_Mm_calls)
Chrom | Start | End | Meth_Perc.x | score_1.x | score_2.x | Meth_Perc.y | score_1.y | score_2.y | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <int> | <int> | <dbl> | <int> | <int> | |
1 | 19 | 3079679 | 3079679 | 91.80328 | 112 | 10 | 91.59664 | 109 | 10 |
2 | 19 | 3079682 | 3079682 | 100.00000 | 122 | 0 | 100.00000 | 119 | 0 |
3 | 19 | 3079728 | 3079728 | 100.00000 | 121 | 0 | 100.00000 | 119 | 0 |
4 | 19 | 3079791 | 3079791 | 41.66667 | 50 | 70 | 42.37288 | 50 | 68 |
5 | 19 | 3079909 | 3079909 | 100.00000 | 33 | 0 | 100.00000 | 32 | 0 |
6 | 19 | 3079916 | 3079916 | 60.60606 | 20 | 13 | 59.37500 | 19 | 13 |
samp <- sample(nrow(joined_Mm_calls), 10000, replace = FALSE)
random_10000 <- joined_Mm_calls[samp,]
plot(random_10000$Meth_Perc.x, random_10000$Meth_Perc.y, xlab = 'GEO Dataset', ylab = 'NXF Pipeline', main = 'Mouse Test Data')
GEO Data was taken from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4082740
SRA reads were taken from: https://www.ncbi.nlm.nih.gov/sra?term=SRX6862244
Note 1 : Compairing The GEO results, to primary testing showed that Bismark Deduplication was not used prior to methylation extraction.
Hs_RRBS_truth <- read.table('/projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/RRBS/SRA/GSM4082740_281_W766F_father_ICSI-ET.CpG.bismark.cov',
skip = 1,
col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'count_1', 'count_2'))
head(Hs_RRBS_truth)
Chrom | Start | End | Meth_Perc | count_1 | count_2 | |
---|---|---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <int> | <int> | |
1 | chr19 | 103586 | 103586 | 96.61017 | 57 | 2 |
2 | chr19 | 103660 | 103660 | 80.00000 | 8 | 2 |
3 | chr19 | 103661 | 103661 | 91.93548 | 57 | 5 |
4 | chr19 | 103674 | 103674 | 90.00000 | 9 | 1 |
5 | chr19 | 103675 | 103675 | 82.25806 | 51 | 11 |
6 | chr19 | 103690 | 103690 | 100.00000 | 10 | 0 |
Hs_RRBS_nxf <- read.table('/projects/omics_share/human/GRCh38/supporting_files/benchmarking_data/RESULTS/RRBS_human_nxf_bench/SRR10133746/bismark_results/SRR10133746_val_1_bismark_bt2_pe.bismark.cov',
skip = 1,
col.names = c('Chrom', 'Start', 'End', 'Meth_Perc', 'count_1', 'count_2'))
joined_Hs_calls <- dplyr::left_join(Hs_RRBS_truth, Hs_RRBS_nxf, by=c('Chrom', 'Start', 'End'))
samp <- sample(nrow(joined_Hs_calls), 10000, replace = FALSE)
random_10000 <- joined_Hs_calls[samp,]
plot(random_10000$Meth_Perc.x, random_10000$Meth_Perc.y, xlab = 'GEO Dataset', ylab = 'NXF Pipeline', main = 'Human Test Data')