diff --git a/DESCRIPTION b/DESCRIPTION index b8cbaa93..4089189d 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -49,6 +49,7 @@ Suggests: mia, stats, limma, - readr + readr, + BiocStyle URL: https://github.com/waldronlab/bugphyzz BugReports: https://github.com/waldronlab/bugphyzz/issues diff --git a/vignettes/bugphyzz.Rmd b/vignettes/bugphyzz.Rmd index 6842ab9d..894c2206 100644 --- a/vignettes/bugphyzz.Rmd +++ b/vignettes/bugphyzz.Rmd @@ -1,8 +1,12 @@ --- title: "bugphyzz" subtitle: "A harmonized data resource and software for enrichment analysis of microbial physiologies" -output: - rmarkdown::html_vignette: +author: + - name: "Samuel Gamboa" + email: "Samuel.Gamboa.Tuz@gmail.com" + - name: "Levi Waldron" +output: + BiocStyle::html_document: fig_caption: true toc: true vignette: > @@ -18,110 +22,102 @@ knitr::opts_chunk$set( ) ``` -## Introduction +# Introduction -[Bugphyzz](https://github.com/waldronlab/bugphyzzExports) -is an electronic resource of harmonized microbial annotations from -different sources. These annotations can be used to create signatures of -microbes sharing attributes and used for bug set enrichment analysis. +[bugphyzz](https://github.com/waldronlab/bugphyzzExports) +is an electronic database of standardized microbial annotations. +It facilitates the creation of microbial signatures based on shared attributes, +which are utilized for bug set enrichment analysis. -## Data schema +# Data schema Annotations in bugphyzz represent the link between a taxon (Bacteria/Archaea) -and an attribute as described in the data schema below. +and an attribute, as outlined in the data schema provided below.
-![**Data schema**](bugphyzz_data_schema.png){height="300px" width="600px"} +![**Data schema**](bugphyzz_data_schema.png){width="100%"}
**Taxon-related** -Taxonomic data was harmonized according to the NCBI taxonomy: +Taxonomic data in bugphyzz is standardized according to the NCBI taxonomy: -1. _NCBI ID_. An integer. The NCBI taxonomy ID (taxid) associated with a -taxon. -2. _Rank_. A character string describing the taxonomy rank. Valid values: -superkingdom, kingdom, phylum, class, order, family, genus, species, strain. -3. _Taxon name_. A character string describing the scientific name of the taxon. +1. _NCBI ID_: An integer representing the NCBI taxonomy ID (taxid) associated +with a taxon. +2. _Rank_: A character string indicating the taxonomy rank, +including superkingdom, kingdom, phylum, class, order, family, genus, species, +or strain. +3. _Taxon name_: A character string denoting the scientific name of the taxon. **Attribute-related** -Attribute data was harmonized with a controlled vocabulary based on -available ontology terms. Attributes, ontology terms, and ontology libraries +Attribute data is harmonized with a controlled vocabulary based on available +ontology terms. Details of attributes, ontology terms, and ontology libraries can be found in the 'Attribute and sources' vignette. -4. _Attribute_. A character string describing the name of a trait that can be +4. _Attribute_: A character string describing the name of a trait that can be observed or measured. -5. _Attribute type_. A character string describing the data type. - * numeric. Attributes that can take numeric values. For example, attribute: - growth temperature; attribute value: 25 C. - * binary. Attributes that can take booleans. For example, - attribute: butyrate-producing; attribute value: TRUE. - * multistate-intersection. A set of related binary attributes. For example, - habitat. - * multistate-union. Attribute that can take three or more values. These - values are always character strings. For example, attribute: aerophilicity; - attribute values: aerobic, anaerobic, or facultatively anaerobic. -6. *Attribute value*. 
The values that an attribute could take. Either a -character string, a boolean, or a number. +5. _Attribute type_: A character string indicating the data type: + * numeric: Attributes with numeric values (e.g., growth temperature: 25°C). + * binary: Attributes with boolean values (e.g., butyrate-producing: TRUE). + * multistate-intersection: A set of related binary attributes (e.g., habitat). + * multistate-union: Attributes with three or more values represented as + character strings (e.g., aerophilicity: aerobic, anaerobic, + or facultatively anaerobic). +6. _Attribute value_: The possible values that an attribute could take, +represented as character strings, booleans, or numbers. **Attribute value-related** -Metadata associated with the attribute values: - -7. _Attribute source_. The source of the information. -8. _Evidence_. The type of evidence that supports an annotation. Valid options: - * EXP = experiment. - * IGC = inferred from genomic context. - * TAS = traceable author statement. - * NAS = non-traceable author statement. - * IBD = inferred from biological aspect of descendant. - * ASR = ancestral state reconstruction. - -9. _Support values_. - * Frequency and Score. Confidence that a given taxon exhibits a trait based - on the curator’s knowledge or results of ASR or IBD. - * Validation. Score of the [10-fold cross-validation analysis](https://github.com/waldronlab/taxPProValidation). - Matthews correlation coefficient (MCC) for discrete attributes and +Metadata associated with attribute values: + +7. _Attribute source_: The source of the information. +8. _Evidence_: The type of evidence supporting an annotation, including: + * EXP: Experiment + * IGC: Inferred from genomic context + * TAS: Traceable author statement + * NAS: Non-traceable author statement + * IBD: Inferred from biological aspect of descendant + * ASR: Ancestral state reconstruction + +9. 
_Support values_: + * Frequency and Score: Confidence that a given taxon exhibits a trait + based on the curator’s knowledge or results of ASR or IBD. + * Validation: Score from the [10-fold cross-validation analysis](https://github.com/waldronlab/taxPProValidation). + Matthews correlation coefficient (MCC) for discrete attributes and R-squared for numeric attributes. Default threshold value is 0.5 and above. - * NSTI. Nearest sequence taxon index as described in - [PICRUSt](https://doi.org/10.1038/nbt.2676) or the - [castor](https://cran.r-project.org/web/packages/castor/index.html) package. + * NSTI: Nearest sequence taxon index from [PICRUSt](https://doi.org/10.1038/nbt.2676) + or the [castor package](https://cran.r-project.org/web/packages/castor/index.html). Relevant for numeric values only. **Attribute source-related** -10. _Confidence in curation_. A character string describing the confidence -value of a source based on three criteria: 1) it has a source, 2) it has valid -references, and 3) the curation was peer-reviewed. -Valid options: high, medium, or low, used when a source satisfied three, two, -or one of these criteria. +10. _Confidence in curation_: A character string indicating the confidence +value of a source based on three criteria: 1) presence of a source, 2) valid +references, and 3) peer-reviewed curation. +Valid options include high, medium, or low, corresponding to satisfaction of +three, two, or one of these criteria. 
-**More information** +**Additional information** -+ Description of **sources** and **attributes** can be found here: -https://waldronlab.io/bugphyzz/articles/attributes.html ++ Description of **sources** and **attributes**: https://waldronlab.io/bugphyzz/articles/attributes.html -+ Description of ontology **evidence** codes (8) can be found here: -https://geneontology.org/docs/guide-go-evidence-codes/ ++ Description of ontology **evidence** codes: https://geneontology.org/docs/guide-go-evidence-codes/ -+ Description of **frequency** keywords and scores is based on: -https://grammarist.com/grammar/adverbs-of-frequency/ ++ Description of **frequency** keywords and scores is based on: https://grammarist.com/grammar/adverbs-of-frequency/ -+ IBD and ASR were performed with taxPPro: -https://github.com/waldronlab/taxPPro ++ IBD and ASR were performed with taxPPro: https://github.com/waldronlab/taxPPro -## Analysis and Stats +# Analysis and Stats -This vignette only covers the main use of bugphyzz. Detailed analysis and -stats of the bugphyzz annotations can be found here: +This vignette serves as an introduction to the basic functionalities of +bugphyzz. 
For a more in-depth analysis and detailed statistics +utilizing bugphyzz annotations, please visit: https://github.com/waldronlab/bugphyzzAnalyses -## Installation - -The bugphyzz package can be installed with: +# Installation ```{r, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) @@ -131,9 +127,9 @@ if (!require("BiocManager", quietly = TRUE)) BiocManager::install("waldronlab/bugphyzz") ``` -## Import bugphyzz +# Import bugphyzz data -Load bugphyzz and other packages: +Load the bugphyzz package and additional packages for data manipulation: ```{r load package, message=FALSE} library(bugphyzz) @@ -141,19 +137,19 @@ library(dplyr) library(purrr) ``` -bugphyzz is imported with the `importBugphyzz` function as a list of -tidy data.frames, each of them corresponding to an attribute -or group of related attributes in the case of the multistate-union type -(check the data schema description above). - -Import bugphyzz and explore available attributes with `names`: +bugphyzz data is imported using the `importBugphyzz` function, +resulting in a list of tidy data frames. Each data frame corresponds to an +attribute or a group of related attributes. This is particularly evident in +the case of the multistate-union type described in the data schema above, +where related attributes are grouped together in a single data frame. Available +attribute names can be inspected with the `names` function: ```{r import data, message=FALSE} bp <- importBugphyzz() names(bp) ``` -Let's take a glimpse at one of the data.frames: +Let's take a glimpse at one of the data frames: ```{r a glimpse} glimpse(bp$aerophilicity, width = 50) @@ -161,16 +157,19 @@ glimpse(bp$aerophilicity, width = 50) Compare the column names with the data schema described above. -## Create signatures +# Create microbial signatures -After the attributes have been imported, we can use the `makeSignatures` -function to create a list of signatures. 
`makeSignatures` accepts a few -arguments for filtering such as evidence, frequency, and minimum and maximum -values for numeric attributes. If a more refined filtering is required, -a user could use regular data manipulation functions on the data.frame of -interest (e.g., `dplyr::filter`). +bugphyzz's primary function is to facilitate the creation of microbial +signatures, which are essentially lists of microbes sharing specific taxonomy +ranks and attribute values. Once the data frames containing attribute +information are imported, the `makeSignatures` function can be employed to +generate these signatures. `makeSignatures` offers various filtering options, +including evidence, frequency, and minimum and maximum values for numeric +attributes. For more precise filtering requirements, users can leverage +standard data manipulation functions on the relevant data frame, +such as `dplyr::filter`. -Some examples: +Examples: + Create signatures of taxon names at the genus level for the aerophilicity attribute (discrete): @@ -214,7 +213,7 @@ ap_sigs_mix <- makeSignatures( map(ap_sigs_mix, head) ``` -+ Make signatures for all of the data.frames: ++ Create signatures for all of the bugphyzz data frames: ```{r} sigs <- map(bp, makeSignatures) |> @@ -226,15 +225,24 @@ length(sigs) head(map(sigs, head)) ``` -## Run an enrichment analysis +# Run a bug set enrichment analysis -Bugphyzz signatures can be used for running enrichment analysis with -existing tools developed in R. For example, using EnrichmenBrowser. +Bugphyzz signatures are suitable for conducting bug set enrichment analysis +using existing tools available in R. In this example, we will perform a set enrichment analysis using a dataset +with a known biological ground truth. -Here is an example of how to run an enrichment analysis using GSEA and -a benchmark dataset. +The dataset originates from the Human Microbiome Project (2012) and compares +subgingival and supragingival plaque. 
+This data will be imported using the [MicrobiomeBenchmarkData package](https://bioconductor.org/packages/release/data/experiment/html/MicrobiomeBenchmarkData.html). +For the implementation of the enrichment analysis, we will utilize the +Gene Set Enrichment Analysis (GSEA) method available in the +[EnrichmentBrowser package](https://bioconductor.org/packages/release/bioc/html/EnrichmentBrowser.html). +The expected outcome is an enrichment of aerobic taxa in the supragingival +plaque (positive enrichment score) and anaerobic taxa in the subgingival plaque +(negative enrichment score). -Load packages: + +Load necessary packages: ```{r, message=FALSE} library(EnrichmentBrowser) @@ -242,7 +250,7 @@ library(MicrobiomeBenchmarkData) library(mia) ``` -Load benchmark data: +Import benchmark data: ```{r, warning=FALSE} dat_name <- 'HMP_2012_16S_gingival_V35' @@ -253,7 +261,7 @@ tse_subset <- tse_genus[rowSums(assay(tse_genus) >= 1) >= min_n_samples,] tse_subset ``` -Differential abundance (DA) analysis: +Perform differential abundance (DA) analysis to get sets of microbes: ```{r} tse_subset$GROUP <- ifelse( @@ -271,7 +279,7 @@ assay(edger) <- limma::voom( )$E ``` -Enrichment analysis using GSEA: +Perform GSEA and display the results: ```{r, message=FALSE} gsea <- sbea( @@ -287,12 +295,14 @@ gsea_tbl <- as.data.frame(gsea$res.tbl) |> knitr::kable(gsea_tbl) ``` -## Get taxon signatures +# Get signatures associated with a specific microbe -Finally, a user could get all of the signature names to which a given taxon -belongs to. Only taxids should be used. +To retrieve all signature names associated with a specific taxon, +users can utilize the `getTaxonSignatures` function. +It's important to note that only taxids should be used as input for this +function. -An example using _Escherichia coli_ (taxid: 562). +Let's see an example using _Escherichia coli_ (taxid: 562). Get taxid for _E. 
coli_ using taxize: @@ -301,12 +311,13 @@ taxid <- as.character(taxize::get_uid("Escherichia coli")) taxid ``` -Get all signature names related to the _E. coli_ taxid: +Get all signature names associated with the _E. coli_ taxid: ```{r} getTaxonSignatures(tax = taxid, bp = bp) ``` -## Session information: + +# Session information: ```{r} sessioninfo::session_info()