Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import of BigWig files with rtracklayer fails under MultiCoreParam #111

Open
MalteThodberg opened this issue Mar 4, 2020 · 21 comments
Open

Comments

@MalteThodberg
Copy link

I'm seeing a weird bug in in my CAGEfightR package when trying to import BigWigFiles.

Here's a reproducible example:

# Load the example data
library(BiocParallel)
library(rtracklayer)
library(CAGEfightR)
data('exampleDesign')

# Example bigwig files
bw_plus <- system.file('extdata', exampleDesign$BigWigPlus,
                       package = 'CAGEfightR')
bw_plus <- BigWigFileList(bw_plus)

# Genome
mm9 <- SeqinfoForUCSCGenome("mm9")
grl <- tileGenome(mm9, ntile=3)

# Import, knowing that some areas will be empty
import(bw_plus[[1]], which=grl[[1]]) # empty
import(bw_plus[[1]], which=grl[[2]]) # empty
import(bw_plus[[1]], which=grl[[3]]) # has signal

# This works
lapply(bw_plus, import, which=grl[[1]])
lapply(bw_plus, import, which=grl[[2]])
lapply(bw_plus, import, which=grl[[3]])

# This works
register(SerialParam())
bplapply(bw_plus, import, which=grl[[1]])
bplapply(bw_plus, import, which=grl[[2]])
bplapply(bw_plus, import, which=grl[[3]])

# This works
register(MulticoreParam(3))
bplapply(bw_plus, import)

# This fails!
register(MulticoreParam(3))
bplapply(bw_plus, import, which=grl[[1]])
bplapply(bw_plus, import, which=grl[[2]])
bplapply(bw_plus, import, which=grl[[3]])

To summarize:

  • You can import entire BigWig-files using bplapply.
  • You can import part of a BigWig using bplapply and SerialParam
  • You cannot import an empty tile from a BigWig file with bplapply and MultiCoreParam.

What's causing this strange behavior?

@mtmorgan
Copy link
Collaborator

mtmorgan commented Mar 4, 2020

Can you clarify what 'fails' mean? I see

> lapply(bw_plus, import, which=grl[[1]])
[[1]]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     score
      <Rle> <IRanges>  <Rle> | <numeric>
  -------
  seqinfo: 2 sequences from an unspecified genome

[[2]]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     score
      <Rle> <IRanges>  <Rle> | <numeric>
  -------
  seqinfo: 2 sequences from an unspecified genome

[[3]]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     score
      <Rle> <IRanges>  <Rle> | <numeric>
  -------
  seqinfo: 2 sequences from an unspecified genome

Warning messages:
1: In .local(con, format, text, ...) :
  'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
2: In .normarg_seqnames2(seqnames, seqinfo) :
  levels in 'seqnames' with no entries in 'seqinfo' were dropped
3: In .local(con, format, text, ...) :
  'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
4: In .normarg_seqnames2(seqnames, seqinfo) :
  levels in 'seqnames' with no entries in 'seqinfo' were dropped
5: In .local(con, format, text, ...) :
  'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
6: In .normarg_seqnames2(seqnames, seqinfo) :
  levels in 'seqnames' with no entries in 'seqinfo' were dropped
> bplapply(bw_plus, import, which=grl[[1]])
[[1]]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     score
      <Rle> <IRanges>  <Rle> | <numeric>
  -------
  seqinfo: 2 sequences from an unspecified genome

[[2]]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     score
      <Rle> <IRanges>  <Rle> | <numeric>
  -------
  seqinfo: 2 sequences from an unspecified genome

[[3]]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     score
      <Rle> <IRanges>  <Rle> | <numeric>
  -------
  seqinfo: 2 sequences from an unspecified genome

Warning messages:
1: In .local(con, format, text, ...) :
  'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
2: In .normarg_seqnames2(seqnames, seqinfo) :
  levels in 'seqnames' with no entries in 'seqinfo' were dropped
3: In .local(con, format, text, ...) :
  'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
4: In .normarg_seqnames2(seqnames, seqinfo) :
  levels in 'seqnames' with no entries in 'seqinfo' were dropped
5: In .local(con, format, text, ...) :
  'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
6: In .normarg_seqnames2(seqnames, seqinfo) :
  levels in 'seqnames' with no entries in 'seqinfo' were dropped

@MalteThodberg
Copy link
Author

Fails here means an error is raised, and a rather cryptic one relating to the parallel processing.

When calling import with the which argument, it will produce a warning if you request one or more seqlevels not in the bigwig file.

@mtmorgan
Copy link
Collaborator

mtmorgan commented Mar 4, 2020

can you cut-and-paste the command and error message? It doesn't fail in this way for me... Also please add the output of sessionInfo() and BiocManager::valid()

@MalteThodberg
Copy link
Author

Sure, continuing from the example in the first post:

> bplapply(bw_plus, import, which=grl[[1]])
Error in result[[njob]] <- value : 
  attempt to select less than one element in OneIndex
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
  3 parallel jobs did not deliver results

Session info:

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin19.2.0 (64-bit)
Running under: macOS Catalina 10.15.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] CAGEfightR_1.7.1            SummarizedExperiment_1.16.1 DelayedArray_0.12.2        
 [4] BiocParallel_1.20.1         matrixStats_0.55.0          Biobase_2.46.0             
 [7] rtracklayer_1.46.0          GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
[10] IRanges_2.20.2              S4Vectors_0.24.3            BiocGenerics_0.32.0        
[13] rstudioapi_0.11            

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.18.0      bitops_1.0-6             bit64_0.9-7             
 [4] RColorBrewer_1.1-2       progress_1.2.2           httr_1.4.1              
 [7] GenomicFiles_1.22.0      tools_3.6.2              backports_1.1.5         
[10] R6_2.4.1                 rpart_4.1-15             Hmisc_4.3-1             
[13] DBI_1.1.0                lazyeval_0.2.2           Gviz_1.30.3             
[16] colorspace_1.4-1         nnet_7.3-13              tidyselect_1.0.0        
[19] gridExtra_2.3            prettyunits_1.1.1        bit_1.1-15.2            
[22] curl_4.3                 compiler_3.6.2           htmlTable_1.13.3        
[25] microbenchmark_1.4-7     scales_1.1.0             checkmate_2.0.0         
[28] askpass_1.1              rappdirs_0.3.1           stringr_1.4.0           
[31] digest_0.6.25            Rsamtools_2.2.3          foreign_0.8-76          
[34] XVector_0.26.0           dichromat_2.0-0          base64enc_0.1-3         
[37] jpeg_0.1-8.1             pkgconfig_2.0.3          htmltools_0.4.0         
[40] regioneR_1.18.1          ensembldb_2.10.2         BSgenome_1.54.0         
[43] dbplyr_1.4.2             htmlwidgets_1.5.1        rlang_0.4.5             
[46] pryr_0.1.4               RSQLite_2.2.0            DelayedMatrixStats_1.8.0
[49] acepack_1.4.1            dplyr_0.8.4              VariantAnnotation_1.32.0
[52] RCurl_1.98-1.1           magrittr_1.5             GenomeInfoDbData_1.2.2  
[55] Formula_1.2-3            Matrix_1.2-18            Rcpp_1.0.3              
[58] munsell_0.5.0            lifecycle_0.1.0          stringi_1.4.6           
[61] zlibbioc_1.32.0          BiocFileCache_1.10.2     grid_3.6.2              
[64] blob_1.2.1               crayon_1.3.4             lattice_0.20-40         
[67] Biostrings_2.54.0        splines_3.6.2            GenomicFeatures_1.38.2  
[70] hms_0.5.3                knitr_1.28               pillar_1.4.3            
[73] codetools_0.2-16         biomaRt_2.42.0           XML_3.99-0.3            
[76] glue_1.3.1               biovizBase_1.34.1        latticeExtra_0.6-29     
[79] data.table_1.12.8        png_0.1-7                vctrs_0.2.3             
[82] gtable_0.3.0             openssl_1.4.1            purrr_0.3.3             
[85] assertthat_0.2.1         ggplot2_3.2.1            xfun_0.12               
[88] AnnotationFilter_1.10.0  survival_3.1-8           tibble_2.1.3            
[91] GenomicAlignments_1.22.1 AnnotationDbi_1.48.0     memoise_1.1.0           
[94] cluster_2.1.0    

BiocManager::valid():


* sessionInfo()

R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin19.2.0 (64-bit)
Running under: macOS Catalina 10.15.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] CAGEfightR_1.7.1            SummarizedExperiment_1.16.1 DelayedArray_0.12.2        
 [4] BiocParallel_1.20.1         matrixStats_0.55.0          Biobase_2.46.0             
 [7] rtracklayer_1.46.0          GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
[10] IRanges_2.20.2              S4Vectors_0.24.3            BiocGenerics_0.32.0        
[13] rstudioapi_0.11            

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.18.0      bitops_1.0-6             bit64_0.9-7             
 [4] RColorBrewer_1.1-2       progress_1.2.2           httr_1.4.1              
 [7] GenomicFiles_1.22.0      tools_3.6.2              backports_1.1.5         
[10] R6_2.4.1                 rpart_4.1-15             Hmisc_4.3-1             
[13] DBI_1.1.0                lazyeval_0.2.2           Gviz_1.30.3             
[16] colorspace_1.4-1         nnet_7.3-13              tidyselect_1.0.0        
[19] gridExtra_2.3            prettyunits_1.1.1        bit_1.1-15.2            
[22] curl_4.3                 compiler_3.6.2           htmlTable_1.13.3        
[25] microbenchmark_1.4-7     scales_1.1.0             checkmate_2.0.0         
[28] askpass_1.1              rappdirs_0.3.1           stringr_1.4.0           
[31] digest_0.6.25            Rsamtools_2.2.3          foreign_0.8-76          
[34] XVector_0.26.0           dichromat_2.0-0          base64enc_0.1-3         
[37] jpeg_0.1-8.1             pkgconfig_2.0.3          htmltools_0.4.0         
[40] regioneR_1.18.1          ensembldb_2.10.2         BSgenome_1.54.0         
[43] dbplyr_1.4.2             htmlwidgets_1.5.1        rlang_0.4.5             
[46] pryr_0.1.4               RSQLite_2.2.0            DelayedMatrixStats_1.8.0
[49] acepack_1.4.1            dplyr_0.8.4              VariantAnnotation_1.32.0
[52] RCurl_1.98-1.1           magrittr_1.5             GenomeInfoDbData_1.2.2  
[55] Formula_1.2-3            Matrix_1.2-18            Rcpp_1.0.3              
[58] munsell_0.5.0            lifecycle_0.1.0          stringi_1.4.6           
[61] zlibbioc_1.32.0          BiocFileCache_1.10.2     grid_3.6.2              
[64] blob_1.2.1               crayon_1.3.4             lattice_0.20-40         
[67] Biostrings_2.54.0        splines_3.6.2            GenomicFeatures_1.38.2  
[70] hms_0.5.3                knitr_1.28               pillar_1.4.3            
[73] codetools_0.2-16         biomaRt_2.42.0           XML_3.99-0.3            
[76] glue_1.3.1               biovizBase_1.34.1        latticeExtra_0.6-29     
[79] BiocManager_1.30.10      data.table_1.12.8        png_0.1-7               
[82] vctrs_0.2.3              gtable_0.3.0             openssl_1.4.1           
[85] purrr_0.3.3              assertthat_0.2.1         ggplot2_3.2.1           
[88] xfun_0.12                AnnotationFilter_1.10.0  survival_3.1-8          
[91] tibble_2.1.3             GenomicAlignments_1.22.1 AnnotationDbi_1.48.0    
[94] memoise_1.1.0            cluster_2.1.0           

Bioconductor version '3.10'

  * 2 packages out-of-date
  * 1 packages too new

create a valid installation with

  BiocManager::install(c(
    "CAGEfightR", "MultiAssayExperiment", "nloptr"
  ), update = TRUE, ask = FALSE)

more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date

Warning message:
2 packages out-of-date; 1 packages too new 

@mtmorgan
Copy link
Collaborator

mtmorgan commented Mar 4, 2020

Your reproducible example doesn't reproduce for me ;) does

bplapply(1:5, log, base = 2)

work?

@MalteThodberg
Copy link
Author

Ah, the plot thickens then!

It seems bplapply fails specifically when run with MultiCoreParam AND the which argument is used with import.

Here's another example (reusing files from the original example):

# bplapply test works (no warnings)
z1 <- bplapply(1:5, log, base = 2)

# Single import works (only raises a warning)
y1 <- import(bw_plus[[1]], which=grl[[1]]) # empty
y2 <- import(bw_plus[[1]], which=grl[[2]]) # empty
y3 <- import(bw_plus[[1]], which=grl[[3]]) # has signal

# All these works (only raises a warning)
x1 <- lapply(bw_plus, import)
x2 <- bplapply(bw_plus, import, BPPARAM=SerialParam())
x3 <- bplapply(bw_plus, import, BPPARAM=MulticoreParam(3))
x4 <- lapply(bw_plus, import, which=grl[[1]])
x5 <- bplapply(bw_plus, import, which=grl[[1]], BPPARAM=SerialParam())

# This specific case raises an error!
x6 <- bplapply(bw_plus, import, which=grl[[1]], BPPARAM=MulticoreParam(3))
Error in result[[njob]] <- value : 
  attempt to select less than one element in OneIndex
In addition: Warning messages:
1: In parallel::mccollect(wait = TRUE) :
  1 parallel job did not deliver a result
2: In parallel::mccollect(wait = FALSE, timeout = 1) :
  3 parallel jobs did not deliver results

I have no idea how to decipher the error message. Could it be:

  • RStudio-related?
  • rtracklayer::import related?
  • macOS related?

I originally encountered the problem using GenomicFiles::reduceByRange, which needs to load separate parts of a BigWigFile.

@MalteThodberg
Copy link
Author

And some more testing:

Both of these work:

x7 <- bplapply(bw_plus, import, which=grl[[1]], BPPARAM=MulticoreParam(1))
x8 <- bplapply(bw_plus, import, which=grl[[1]], BPPARAM=SnowParam(2))

So MultiCoreParam only fails when more than 1 core is used. SnowParam works without errors.

@HenrikBengtsson
Copy link
Contributor

HenrikBengtsson commented Mar 5, 2020

FYI, they recommend against using forked parallel processing in the RStudio Console. I would rule out that player. Try to replicate it without RStudio by running R in the terminal.

@MalteThodberg
Copy link
Author

Still gives an error message in the terminal:
image

@HenrikBengtsson
Copy link
Contributor

Next I would try with:

x <- parallel::mclapply(bw_plus, import, which=grl[[1]], mc.cores=3L)

@MalteThodberg
Copy link
Author

Aha, Mother Of All Error Messages:

> x <- parallel::mclapply(bw_plus, import, which=grl[[1]], mc.cores=3L)

 *** caught segfault ***
 *** caught segfault ***

address 0x10bdb8020, cause 'memory not mapped'address 0x10bdb8020, cause 'memory not mapped'


 *** caught segfault ***
address 0x10bdb8020, cause 'memory not mapped'

Traceback:
 1: strsplit(x, "\n[ \t\n]*\n", perl = TRUE)
 2: lapply(strsplit(x, "\n[ \t\n]*\n", perl = TRUE), strsplit, "[ \t\n]",     perl = TRUE)
 3: strwrap(paste0(c(...), collapse = ""))
 4: paste0(strwrap(paste0(c(...), collapse = "")), collapse = "\n  ")

Traceback:
 5:  1: wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped")strsplit(x, "\n[ \t\n]*\n", perl = TRUE)
 6:
 2: warning(wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped"))lapply(strsplit(x, "\n[ \t\n]*\n", perl = TRUE), strsplit, "[ \t\n]",
 7:     perl = TRUE).normarg_seqnames2(seqnames, seqinfo)

 3:  8: strwrap(paste0(c(...), collapse = ""))new_GRanges("GRanges", seqnames = seqnames, ranges = ranges,
    strand = strand, mcols = mcols, seqinfo = seqinfo) 4:
paste0(strwrap(paste0(c(...), collapse = "")), collapse = "\n  ") 9:
GRanges(rep(seqnames(which), nhits), C_ans[[1L]], seqinfo = si) 5:
wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped")10:
.local(con, format, text, ...) 6:
warning(wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped"))11: FUN(X[[i]], ...)
 7:
12: .normarg_seqnames2(seqnames, seqinfo)FUN(X[[i]], ...)

Traceback:
 8:
13: new_GRanges("GRanges", seqnames = seqnames, ranges = ranges,  1: lapply(X = S, FUN = FUN, ...)strsplit(x, "\n[ \t\n]*\n", perl = TRUE)

    strand = strand, mcols = mcols, seqinfo = seqinfo) 2: 14:
 9: lapply(strsplit(x, "\n[ \t\n]*\n", perl = TRUE), strsplit, "[ \t\n]", doTryCatch(return(expr), name, parentenv, handler)GRanges(rep(seqnames(which), nhits), C_ans[[1L]], seqinfo = si)    perl = TRUE)


15: 10:  3: tryCatchOne(expr, names, parentenv, handlers[[1L]]).local(con, format, text, ...)strwrap(paste0(c(...), collapse = ""))


16: 11:  4: tryCatchList(expr, classes, parentenv, handlers)paste0(strwrap(paste0(c(...), collapse = "")), collapse = "\n  ")FUN(X[[i]], ...)

 5: 12: wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped")
FUN(X[[i]], ...)
17:
 6: tryCatch(expr, error = function(e) {13: warning(wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped"))    call <- conditionCall(e)lapply(X = S, FUN = FUN, ...)
    if (!is.null(call)) {
 7: 14: .normarg_seqnames2(seqnames, seqinfo)        if (identical(call[[1L]], quote(doTryCatch)))
doTryCatch(return(expr), name, parentenv, handler)            call <- sys.call(-4L) 8:         dcall <- deparse(call)[1L]
        prefix <- paste("Error in", dcall, ": ")15: new_GRanges("GRanges", seqnames = seqnames, ranges = ranges,         LONG <- 75LtryCatchOne(expr, names, parentenv, handlers[[1L]])    strand = strand, mcols = mcols, seqinfo = seqinfo)        sm <- strsplit(conditionMessage(e), "\n")[[1L]]

16:  9: tryCatchList(expr, classes, parentenv, handlers)        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")GRanges(rep(seqnames(which), nhits), C_ans[[1L]], seqinfo = si)
        if (is.na(w))
17:             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], 10:                 type = "b")tryCatch(expr, error = function(e) {.local(con, format, text, ...)        if (w > LONG)     call <- conditionCall(e)
            prefix <- paste0(prefix, "\n  ")    if (!is.null(call)) {11:     }        if (identical(call[[1L]], quote(doTryCatch))) FUN(X[[i]], ...)    else prefix <- "Error : "            call <- sys.call(-4L)0m
    msg <- paste0(prefix, conditionMessage(e), "\n")        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L12:         sm <- strsplit(conditionMessage(e), "\n")[[1L]]    .Internal(seterrmessage(msg[1L]))        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))     if (!silent && isTRUE(getOption("show.error.messages"))) {            w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],         cat(msg, file = outFile)                type = "b")        .Internal(printDeferredWarnings())FUN(X[[i]], ...)        if (w > LONG)             prefix <- paste0(prefix, "\n  ")
    }    }13: lapply(X = S, FUN = FUN, ...)
14:     else prefix <- "Error : "doTryCatch(return(expr), name, parentenv, handler)    invisible(structure(msg, class = "try-error", condition = e))
15:     msg <- paste0(prefix, conditionMessage(e), "\n")})tryCatchOne(expr, names, parentenv, handlers[[1L]])    .Internal(seterrmessage(msg[1L]))

    if (!silent && isTRUE(getOption("show.error.messages"))) {16: 18:         cat(msg, file = outFile)tryCatchList(expr, classes, parentenv, handlers)try(lapply(X = S, FUN = FUN, ...), silent = TRUE)        .Internal(printDeferredWarnings())
    }
17: 19:     invisible(structure(msg, class = "try-error", condition = e))tryCatch(expr, error = function(e) {})sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))    call <- conditionCall(e)

    if (!is.null(call)) {18: 20:         if (identical(call[[1L]], quote(doTryCatch))) try(lapply(X = S, FUN = FUN, ...), silent = TRUE)FUN(X[[i]], ...)            call <- sys.call(-4L)

        dcall <- deparse(call)[1L]19:         prefix <- paste("Error in", dcall, ": ")21: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))        LONG <- 75Llapply(seq_len(cores), inner.do)

        sm <- strsplit(conditionMessage(e), "\n")[[1L]]20: 22: parallel::mclapply(bw_plus, import, which = grl[[1]], mc.cores = 3L)        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")FUN(X[[i]], ...)        if (is.na(w))

            w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], 21:                 type = "b")lapply(seq_len(cores), inner.do)        if (w > LONG)
            prefix <- paste0(prefix, "\n  ")22:     }    else prefix <- "Error : "parallel::mclapply(bw_plus, import, which = grl[[1]], mc.cores = 3L)
    msg <- paste0(prefix, conditionMessage(e), "\n")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace    .Internal(seterrmessage(msg[1L]))
    if (!silent && isTRUE(getOption("show.error.messages"))) {election: m
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
18: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
19: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))
20: FUN(X[[i]], ...)
21: Selection: lapply(seq_len(cores), inner.do)
22: parallel::mclapply(bw_plus, import, which = grl[[1]], mc.cores = 3L)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

@mtmorgan
Copy link
Collaborator

mtmorgan commented Mar 6, 2020

I'm still not able to reproduce this. Can you confirm that a file 'test.R' containing only

# Load the example data
library(BiocParallel)
library(rtracklayer)
library(CAGEfightR)

# Example bigwig files
bw_plus <- system.file('extdata', exampleDesign$BigWigPlus,
                       package = 'CAGEfightR')
bw_plus <- BigWigFileList(bw_plus)

# Genome
mm9 <- SeqinfoForUCSCGenome("mm9")
grl <- tileGenome(mm9, ntile=3)

register(MulticoreParam(3))
bplapply(bw_plus, import, which=grl[[1]])

generates the error when run from the command line with R --vanilla -f test.R.

I guess that more careful management of seqlevels means that the warning messages and hence crash goes away...?

bplapply(bw_plus, function(bw, which) {
    seqlevels(which, pruning.mode = "coarse") <- seqlevels(bw)
    import(bw, which = which)
}, which = grl[[3]])

@MalteThodberg
Copy link
Author

Still raises an error:

maltethodberg@ip164-91:~/Desktop$ R --vanilla -f test.R

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin19.2.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # Load the example data
> library(BiocParallel)
> library(rtracklayer)
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
> library(CAGEfightR)
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians


Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

    aperm, apply, rowsum

>
> # Example bigwig files
> bw_plus <- system.file('extdata', exampleDesign$BigWigPlus,
+                        package = 'CAGEfightR')
> bw_plus <- BigWigFileList(bw_plus)
>
> # Genome
> mm9 <- SeqinfoForUCSCGenome("mm9")
> grl <- tileGenome(mm9, ntile=3)
>
> register(MulticoreParam(3))
> bplapply(bw_plus, import, which=grl[[1]])
Error in result[[njob]] <- value :
  attempt to select less than one element in OneIndex
Calls: bplapply ... bplapply -> bplapply -> bplapply -> bploop -> bploop.lapply
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result
Execution halted

If this is some deep unexplained bug, you're right that it might be easier to modify the code to avoid warnings.

@HenrikBengtsson
Copy link
Contributor

The warning:

Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result

strongly suggests that the forked worker process died, cf. Issue #112. The reason for why it died seems to be suggested by #111 (comment):

> x <- parallel::mclapply(bw_plus, import, which=grl[[1]], mc.cores=3L)

 *** caught segfault ***
 *** caught segfault ***

address 0x10bdb8020, cause 'memory not mapped'address 0x10bdb8020, cause 'memory not mapped'


 *** caught segfault ***
address 0x10bdb8020, cause 'memory not mapped'

To me, the problem is most likely not in BiocParallel.

I would:

  1. Troubleshoot in a plain R and definitely not in RStudio
  2. Always troubleshoot in a fresh R session, i.e. use R --vanilla to avoid .Rprofile etc.
  3. Replicate with two workers (mc.cores = 2L) instead of three. It's not possible to go lower, i.e. don't use mc.cores = 1L - it's different.
  4. Minimize the example further, e.g. can the error be reproduced with a single BigWig file (now three)?
  5. Post a new minimal example where BiocParallel is not attached/used.
  6. Share a new sessionInfo() and more output, e.g. print(grl[[1]]).
  7. Do also last comment in Import of BigWig files with rtracklayer fails under MultiCoreParam #111 (comment)

@mtmorgan
Copy link
Collaborator

mtmorgan commented Mar 6, 2020

I'll mention that valgrind reports an illegal read (the code takes strlen(x) where x is not null terminated) but it seems unlikely that that would cause a segfault, and it's not actually at the place where the traceback suggests things are going south (in R code signaling a warning...)

@bschilder
Copy link

bschilder commented May 18, 2022

Having very similar issues with rtracklayer::import on remote bigwig files when using which=.
parallel::mclapply(..., mc.core=1) works fine, but parallel::mclapply(..., mc.core=2) does not, nor does BioParallel:bplapply when I register workers >1 with MulticoreParam

@MalteThodberg Base on your first post, I'm assuming your package also relies on rtracklayer. Trying to figure out the common denominator.

It seems rtracklayer has been accumulating quite a few issues for several years, so it's possible something is broken that's related to this:
https://github.com/lawremi/rtracklayer/issues

@bschilder
Copy link

bschilder commented May 18, 2022

I'm wondering if this has something to do with repeated calls to the underlying C library that are made every time you call import on a bigwig with rtracklayer. When multiple calls are made at once (i.e. during parallelization) this might cause issues. @sanchit-saini

I ran into similar issues recently when trying to parallelize VariantAnnotation::readVcf, with an explanation by @mtmorgan here.

@mtmorgan
Copy link
Collaborator

I'm not exactly sure why we went down the rabbit hole above, but I think it's likely (though a bit mysterious) to say that the C code in rtracklayer is not re-entrant, so access via forked processes has unpredictable results.

I would guess that SnowParam (i.e., independent rather than forked processes) is more robust. This does require (a) remembering that these processes are independent, so individual packages need to be loaded explicitly (e.g., via rtracklayer::...()) on the worker and (b) that data is serialized both to and from the worker. The later means that one wants to construct small objects on the way there (e.g., file names, rather than the parsed content of the file) and back (e.g,. summary statistics rather than a GRanges representation of the input). This also means that there can be relatively considerable costs of 'start-up' (loading packages) that make inexpensive 'worker' operations not worth the effort -- the worker needs to do a lot of work.

I don't think there is an easy solution here -- if the C library is not re-entrant, then very considerable work will be needed to make it so.

@DarwinAwardWinner
Copy link

Is it possible to detect if the current R process is forked and warn when the non-reentrant method is called from a forked process? That would make it easier for users to discover the underlying issue.

@HenrikBengtsson
Copy link
Contributor

Is it possible to detect if the current R process is forked and warn when the non-reentrant method is called from a forked process? That would make it easier for users to discover the underlying issue.

parallel:::isChild()

See also PR#18230 (WISH: Export parallel:::isChild() to make it possible to protect against forked processing).

@mtmorgan
Copy link
Collaborator

mtmorgan commented May 19, 2022

As @HenrikBengtsson notes it's possible to find out that one is in a forked process (useful for avoiding nested forked processes and over-subscription to the available CPUs) but unfortunately @DarwinAwardWinner there is no way to know that a particular function call is non-reentrant (n.b., it is speculation that this is the problem in the current thread or for @bschilder 's use; it would be useful, though not definitive, to show that the code works without problem under SnowParam()).

Despite the appeal of forked (MulticoreParam()) processing for scripts and 'light-weight' tasks, I have come to believe that developers (rather than users writing scripts) should implement their code with SnowParam() and independent processes. In addition to the problem in this thread, there is a discussion in the Bioconductor slack about interaction between the garbage collector and memory allocation in the forked process, and the basic observation that approximately 1/2 of Bioconductor package downloads are to Windows machines -- targeting MulticoreParam() as the 'efficient' solution excludes 1/2 a package author's audience. Using separate processes requires much more discipline / sophistication to manage workers and data transfer. There have been efforts in BiocParallel to make this more transparent, especially in the last release and thanks to @Jiefei-Wang.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants