Skip to content

6. Gumshoe Functions

Gurveer Gill edited this page Sep 12, 2023 · 5 revisions

Gumshoe Tutorial

Introduction

This tutorial is based on aggregated single-cell data from the following paper:

  • Usoskin D, Furlan A, Islam S, Abdo H, Lönnerberg P, Lou D, Hjerling-Leffler J, Haeggström J, Kharchenko O, Kharchenko PV, Linnarsson S, Ernfors P (2015) Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci 18:145–153.

For the sake of simplicity and as the focus of Gumshoe is supplemental utilities for Sleuth, we can skip the step of transcript abundance quantification using Kallisto. We have done this step, and the output files are below. A corresponding metadata file for the dataset has been included. The data is from a single-cell experiment on colonic sensory neurons in mice. The data has been combined and converted into bulk RNA-Seq data, and has four tissue sources (NF, NP, PEP, TH) and two sexes (male and female).


Downloading The Data

Tutorial data can be found in the data folder of the package. For this tutorial, you should find the following in the folder:

Tutorial_Gumeshoe_Functions
├── gumshoe_functions_analysis.R
├── gumshoe_functions_tutorial.ipynb
├── metadata.txt
├── output
│   ├── female_NF.0.kallisto
│   ├── female_NF.1.kallisto
│   ├── female_NF.2.kallisto
│   ├── female_NP.0.kallisto
│   ├── female_NP.1.kallisto
│   ├── female_NP.2.kallisto
│   ├── female_PEP.0.kallisto
│   ├── female_PEP.1.kallisto
│   ├── female_PEP.2.kallisto
│   ├── female_TH.0.kallisto
│   ├── female_TH.1.kallisto
│   ├── female_TH.2.kallisto
│   ├── male_NF.0.kallisto
│   ├── male_NF.1.kallisto
│   ├── male_NF.2.kallisto
│   ├── male_NP.0.kallisto
│   ├── male_NP.1.kallisto
│   ├── male_NP.2.kallisto
│   ├── male_PEP.0.kallisto
│   ├── male_PEP.1.kallisto
│   ├── male_PEP.2.kallisto
│   ├── male_TH.0.kallisto
│   ├── male_TH.1.kallisto
│   └── male_TH.2.kallisto
├── plot
│   ├── pca_plot.pdf
│   └── sample_volc_plot.jpeg
└── results
    ├── NF Metadata
    └── TH Metadata

Analysis Using Sleuth and Gumshoe

NOTE: Please ensure that your working directory is the current directory, as this is critical for the remainder of the tutorial.

Loading Libraries

After downloading or generating the data, our first step should be loading the libraries we will be using throughout the program.

library(tidyverse)
library(sleuth)
library(biomaRt)
library(gumshoe)
library(ggrepel)

Metadata File Creation

Sleuth requires a metadata file, and, in our case, the file contains the sample name, path, sex, and tissue type. If you are well-versed with RNASeq analysis, feel free to create this file yourself and load it in; otherwise, we can either use the metadata.txt file that is provided alongside the downloadable data or run the following code to create our own:

# Create a dataframe that is 4 columns by 24 rows and rename the columns
metadata <- data.frame(matrix(NA, nrow = 24, ncol = 4))
colnames(metadata) <- c("sample", "path", "sex", "tissue")

# Assign sample names
# The naming format is sex, tissue type, and replicate number
metadata[1] <- c("female_NF.0", "female_NF.1", "female_NF.2", "female_NP.0", "female_NP.1", "female_NP.2", "female_PEP.0", "female_PEP.1", "female_PEP.2", "female_TH.0", "female_TH.1", "female_TH.2", "male_NF.0", "male_NF.1", "male_NF.2", "male_NP.0", "male_NP.1", "male_NP.2", "male_PEP.0", "male_PEP.1", "male_PEP.2", "male_TH.0", "male_TH.1", "male_TH.2")

# Assign the file path
# NOTE: This line of code assumes that your data is contained in the working directory in a subfolder named "output"
for (cell in 1:24){
metadata[cell, 2] <- paste("output/", metadata[cell, 1], sep = "")
metadata[cell, 2] <- paste(metadata[cell, 2], ".kallisto", sep = "")
}

# Assign sex
metadata[1:12, 3] <- "F"
metadata[13:24, 3] <- "M"

# Assign the tissue type
for (cell in 1:24){
metadata[cell, 4] <- unlist(strsplit(metadata[cell, 1], "_"))[2]
metadata[cell, 4] <- gsub("\\.[0-9]", "", metadata[cell, 4])
}

Metadata File Verification

We can verify the accuracy of the file by viewing it View(metadata), and it should look like this:

       sample                        path sex tissue
1 female_NF.0 output/female_NF.0.kallisto   F     NF
2 female_NF.1 output/female_NF.1.kallisto   F     NF
3 female_NF.2 output/female_NF.2.kallisto   F     NF
4 female_NP.0 output/female_NP.0.kallisto   F     NP
5 female_NP.1 output/female_NP.1.kallisto   F     NP
6 female_NP.2 output/female_NP.2.kallisto   F     NP

We can also run the following command to compare it to the metadata included with our downloaded data. Ideally, the output should be a 4x24 matrix containing TRUE in each index.

default_metadata <- read.table("default_metadata.txt", header = TRUE)
metadata == default_metadata

Creating a Data Frame for Gumshoe

All the steps we have done up till now are ubiquitous when running a Sleuth analysis. Typically, we would now begin to construct a sleuth_object. According to the Sleuth tutorial (found here), a sleuth_object will hold some experimental information, model details for differential testing, and the results from the tests. Although, the method employed by Sleuth to construct this object is very fixed and requires you to run sleuth_prep and sleuth_fit on all provided metadata files and the associated models, which can require extensive time to code. It's particularly time-consuming and repetitive when, for example, using different models on the same metadata file or comparing results between an analysis using a metadata file that includes a sample or omits it.

Gumshoe uses the sleuth_interpret function to compensate for this shortcoming; however, particular attention to detail is needed when initially setting up the required data structure, but afterwards, the entire process is, for the most part, automated and hands-off. These steps might seem redundant and unnecessary, but let's consider that we wanted to convert the tissue column in our metadata file from a numeric to a factor and compare the analysis between two baseline levels like NF and TH. A typical approach would require double the amount of lines, or we can add this information to our data frame and leave the remainder of the code untouched.

First, let's convert the tissue column to a factor and create a new variable to hold our releveled metadata.

# Convert the tissue type column to a factor, and based on how R works, NF will be the first-factor level
metadata$tissue <- as.factor(metadata$tissue)

# Create a new variable that will contain a re-levelled version of the metadata
metadata_releveled <- metadata
metadata_releveled$tissue <- relevel(metadata_releveled$tissue, ref = "TH")

We can now create a data frame that contains the information that Gumshoe will use to automate the analysis.

# Define the metadata file name(s), the associated model name(s), the corresponding model(s) to be used, and the model parameters. 
# You can also place an `I` before the mathematical expressions, which informs R that the expression should remain uninterpreted until 
# Sleuth processes the model. 
# NOTE: Model data order must align with the model names
metadata_names <- c("metadata_nominal_NF_baseline",
                    "metadata_nominal_TH_baseline")
model_names <- c("NF_model_sex,NF_model_sex_tissue,NF_model_interaction",
                 "TH_model_sex,TH_model_sex_tissue,TH_model_interaction")
model_data <- c("~sex, ~sex + tissue, ~sex*tissue",
                "~sex, ~sex + tissue, ~sex*tissue")
model_parameters <- c("", 
                      "")

# Take the metadata names, model names, and model data information and combine it into a single data frame
analysis_data <- data.frame(metadata_name = metadata_names, metadata_file = tibble(list(metadata)), model_name = model_names, model_data = model_data, model_parameters = model_parameters)

NOTE: The model names must be unique or they will be overwritten

Let's go over what we did; we defined two metadata names corresponding to two different metadata files, which have a varying factor baseline, named three models to use for each metadata, and created formulae associated with each model name. Each model uses different factors to run the analysis. The first model is solely sex, the second is sex and tissue type, and the third is sex, tissue type, and the interaction between both. In this tutorial, we are not passing any arguments to sleuth_prep or sleuth_fit.

As some of you might have noticed, the analysis_data data frame contains identical metadata files for each metadata name. We will need to change the metadata file for the 'metadata_nominal_TH_baseline' from the default file to the intended altered file for our analysis to function, and, to do so, we can use the following command:

analysis_data[[2]][[2]] <- metadata_releveled

To confirm that we have the correct metadata in the analysis_data data frame, we can check the factor levels of the tissue column in the metadata files contained in the analysis_data data frame.

levels(analysis_data[[2]][[1]]$tissue)
# "NF"  "NP"  "PEP" "TH" 
levels(analysis_data[[2]][[2]]$tissue)
# "TH"  "NF"  "NP"  "PEP"

Using Gumshoe

Finally, with a few lines of code, we can analyze all the data. First, we will be using the sleuth_interpret function that comes with Gumshoe. We do not need to worry about assigning the function output to a variable, as this step is also automated for you. You should also note that sleuth_interpret will allow you to pass additional arguments to sleuth_fit and sleuth_prep, such as aggregation column names and a boolean value for gene_mode.

sleuth_interpret(analysis_data)

Gumshoe follows this notion of automation throughout all of its functions. For example, after running sleuth_interpret, which runs sleuth_prep and sleuth_fit, you might wish to run either the wald test (WT) or the likelihood-ratio test (LRT) on all sleuth_objects and to accomplish this Gumshoe includes sleuth_test_wt and sleuth_test_lrt. The test statistics are ran for each coefficient in the models of the sleuth_object the code was ran on. Similarly, the output of the functions is assigned to the sleuth_object the code was ran on.

# Run a Wald test on each sleuth_object
sleuth_test_wt(so_NF_model_sex)
sleuth_test_wt(so_NF_model_sex_tissue)
sleuth_test_wt(so_NF_model_interaction)
sleuth_test_wt(so_TH_model_sex)
sleuth_test_wt(so_TH_model_sex_tissue)
sleuth_test_wt(so_TH_model_interaction)

To retrieve the significant results (FDR < .01) from the sleuth_object we can run sleuth_object_result, if you require the non-significant results, the all_data = FALSE parameter can be excluded. No variable assignment is required at this stage either as Gumshoe automatically assigns the results to a variable.

# Obtain the results from the sleuth_object
sleuth_object_result(so_NF_model_sex, all_data = FALSE, q_max = 0.01)
sleuth_object_result(so_NF_model_sex_tissue, all_data = FALSE, q_max = 0.01)
sleuth_object_result(so_NF_model_interaction, all_data = FALSE, q_max = 0.01)
sleuth_object_result(so_TH_model_sex, all_data = FALSE, q_max = 0.01)
sleuth_object_result(so_TH_model_sex_tissue, all_data = FALSE, q_max = 0.01)
sleuth_object_result(so_TH_model_interaction, all_data = FALSE, q_max = 0.01)

We should now have all the significant results (FDR < .01) from each sleuth_object. At this point, you can save the data or manipulate it as you wish. To save the results we will need to write each result to our drive:

# Save all the results.
# Save the NF baseline results as .txt in the current directory
write.table(sig_wald_NF_model_interaction_sexM, row.names = FALSE, sep = "\t", quote = FALSE, "NF_interaction_model_sexM.txt")
write.table(sig_wald_NF_model_interaction_sexM_tissueNP, row.names = FALSE, sep = "\t", quote = FALSE, "NF_interaction_model_sexM_NP.txt")
write.table(sig_wald_NF_model_interaction_sexM_tissuePEP, row.names = FALSE, sep = "\t", quote = FALSE, "NF_interaction_model_sexM_PEP.txt")
write.table(sig_wald_NF_model_interaction_sexM_tissueTH, row.names = FALSE, sep = "\t", quote = FALSE, "NF_interaction_model_sexM_TH.txt")
write.table(sig_wald_NF_model_interaction_tissueNP, row.names = FALSE, sep = "\t", quote = FALSE, "NF_interaction_model_NP.txt")
write.table(sig_wald_NF_model_interaction_tissuePEP, row.names = FALSE, sep = "\t", quote = FALSE, "NF_interaction_model_PEP.txt")
write.table(sig_wald_NF_model_interaction_tissueTH, row.names = FALSE, sep = "\t", quote = FALSE, "NF_interaction_model_TH.txt")
write.table(sig_wald_NF_model_sex_sexM, row.names = FALSE, sep = "\t", quote = FALSE, "NF_sex_model_sexM.txt")
write.table(sig_wald_NF_model_sex_tissue_sexM, row.names = FALSE, sep = "\t", quote = FALSE, "NF_sex_tissue_model_sexM.txt")
write.table(sig_wald_NF_model_sex_tissue_tissueNP, row.names = FALSE, sep = "\t", quote = FALSE, "NF_sex_tissue_model_NP.txt")
write.table(sig_wald_NF_model_sex_tissue_tissuePEP, row.names = FALSE, sep = "\t", quote = FALSE, "NF_sex_tissue_model_PEP.txt")
write.table(sig_wald_NF_model_sex_tissue_tissueTH, row.names = FALSE, sep = "\t", quote = FALSE, "NF_sex_tissue_model_TH.txt")

# Save the TH baseline results as .txt in the current directory
write.table(sig_wald_TH_model_interaction_sexM, row.names = FALSE, sep = "\t", quote = FALSE, "TH_interaction_model_sexM.txt")
write.table(sig_wald_TH_model_interaction_sexM_tissueNF, row.names = FALSE, sep = "\t", quote = FALSE, "TH_interaction_model_sexM_NF.txt")
write.table(sig_wald_TH_model_interaction_sexM_tissueNP, row.names = FALSE, sep = "\t", quote = FALSE, "TH_interaction_model_sexM_NP.txt")
write.table(sig_wald_TH_model_interaction_sexM_tissuePEP, row.names = FALSE, sep = "\t", quote = FALSE, "TH_interaction_model_sexM_PEP.txt")
write.table(sig_wald_TH_model_interaction_tissueNF, row.names = FALSE, sep = "\t", quote = FALSE, "TH_interaction_model_NF.txt")
write.table(sig_wald_TH_model_interaction_tissueNP, row.names = FALSE, sep = "\t", quote = FALSE, "TH_interaction_model_NP.txt")
write.table(sig_wald_TH_model_interaction_tissuePEP, row.names = FALSE, sep = "\t", quote = FALSE, "TH_interaction_model_PEP.txt")
write.table(sig_wald_TH_model_sex_sexM, row.names = FALSE, sep = "\t", quote = FALSE, "TH_sex_model_sexM.txt")
write.table(sig_wald_TH_model_sex_tissue_sexM, row.names = FALSE, sep = "\t", quote = FALSE, "TH_sex_tissue_model_sexM.txt")
write.table(sig_wald_TH_model_sex_tissue_tissueNF, row.names = FALSE, sep = "\t", quote = FALSE, "TH_sex_tissue_model_NF.txt")
write.table(sig_wald_TH_model_sex_tissue_tissueNP, row.names = FALSE, sep = "\t", quote = FALSE, "TH_sex_tissue_model_NP.txt")
write.table(sig_wald_TH_model_sex_tissue_tissuePEP, row.names = FALSE, sep = "\t", quote = FALSE, "TH_sex_tissue_model_PEP.txt")

The results can be found in the tutorial folder under results_NF_metadata and results_TH_metadata, respectively,

Gumshoe Ensembl to ID and Plotting

Gumshoe does include a few other additional functions. This part of the tutorial will briefly cover a few of them like how to convert the Ensembl IDs to gene names using the included ensembl_to_id function and creating a volcano plot. We will go over how to annotate a single set of results and generate a volcano plot for them, but the process is the same for all results. The downloadable data contains all the annotated results and associated volcano plots for comparison.

First we must create a mart object that contains all the information required to retrieve the data needed to annotate our results.

# Create a mart object to retrive the required information
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

The next step is quite simple, we can call the ensembl_to_id function on one of the data frames that we have and it will convert the ensembl id's to gene id's. The function automatically assigns the results to a variable called annotated_sig_results.

# Call the ensembl_to_id function on a data frame containing results from the NF baseline metadata file with
# the model ~sex
ensembl_to_id(sig_wald_NF_model_sex_sexM)

We can now take the annotated_sig_results variable and call the volc_plot function on it to create a volcano plot of the differentially expressed genes. The ensembl_to_id and/or volc_plot function can also be called on the non-significant results.

# Create a volcano plot of the annotated data
volc_plot(annotated_sig_results, graph_name = "Sample Graph - NF Baseline, ~sex, sexM")

You can find this figure in the downloadable data.

NOTE: If there is a lot of data, the plotting function will take a long time to process.


Conclusion

This takes us to the end of the tutorial, and based on our dataset, we have covered nearly all applicable functions. For reference, Gumshoe also includes the following functions that were not used in this tutorial: design_filter, fdr_cutoff (built-into sleuth_object_result), sleuth_prep_transcript_dosage, sleuth_reliable_target_ids, and sleuth_u. Gumshoe also includes several other functions, but they have not been exported and are not available for use right now. If you have any questions, suggestions, or feedback feel free to send an email!


Supplemental Information

Kallisto Quantification

If you would like to quantify the transcript abundances yourself, the following command was used when running Kallisto:

Note the use of file_1 after the -o and in the FASTQ input file portion of the command (More information can be found here)

kallisto quant -t 12 -b 40 -i ~/m39.refMrna --single -l 180 -s 20 -o ~/output/file_1.kallisto ~/raw_read/file_1.fastq