devtools::install_github("SydneyBioX/BenchHub")
- Clean and simple user interface (API), with clear information and error messages using modern CLI outputs (progress bars, stack tracing, colours, …).
- Data handling, including caching and data integrity verification (MD5 checksum?).
- Simplify novel method evaluation by providing sample code and inputs for the evaluate function.
- Ensure that it is easy and highly flexible to add new datasets from various sources.
- Use YAML to encode dataset metadata to ensure language interoperability.
We could find datasets by datatype, task and patterns.
library(BenchHub)
findData(dataType = "SST", task = "Celltype Classification")
#outputs a data frame with metadata about the matched datasets
findData(name = "*IMC*") # match datasets which have IMC in their name
Get data from various sources.
trio <- getTrio("openproblems_v1/tenx_1k_pbmc") # some automatic caching, etc.
trio$tasks()
# outputs information about tasks, metrics and "gold standards" of the included data
Evaluate your method.
myCellTypes <- myMethod1(trio$getSCE())
mySegementation <- myMethod2(trio$getRaster())
outputs <- list("cellTypes" = myCellTypes, "segmentation" = mySegementation)
evaluation <- trio$evaluate(outputs)
Generate a temple of what an input to trio$evaluate()
would have to look like for a successful evaluation.
template <- trio$getTemplate(task = c("Celltype Classification", "Cell Segmentation"))
names(template) # [1] "celltype" "segmentation"
length(template$celltype) # [1] 1000
typeof(template$celltype[1]) # [1] "character"
unique(template$celltype) # [1] "a" "b"
Generate an assertion that checks the format of the method output.
checkInput <- trio$checkInput(myCelltypes, task = "Celltype Classification")
# informative errors that allow the user to make their outputs conform to the input of trio$evaluate()
Get multiple datasets in one trio and perform the evaluation on all relevant dataset within the trio.
trio <- getTrio(c("openproblems_v1/tenx_1k_pbmc", "openproblems_v1/tenx_5k_pbmc"))
Allow users to define a cache directory for the datasets (i.e., in our case, biostat
folder).