This repository houses the exploratory work we are doing to evaluate the role of proteomics and phosphoproteomics in measuring patient diversity in acute myeloid leukemia (AML), both in patient outcomes and in the response of patient samples to drugs ex vivo.
This repository contains scripts that pull data from a Synapse repository to carry out the various analysis steps. You will need a Synapse account to access Synapse, and you will need to become a certified user to add data; after that you will be set for future projects as well. You will then need to navigate to the PNNL/OHSU Synapse page to request access to the data.
This repository provides only the basic structure of the tools needed, not the end-to-end analysis. Here are the steps you'll need to follow to use it:
- Read up on the tools. GitHub requires some basic protocols such as pull requests and commits, so you should try to get a basic understanding; I found this tutorial that can serve as a starting point. Synapse also has a bit of a learning curve. To understand what Synapse is and isn't, check out this document.
- Get RStudio. Basic R is essential, but RStudio will make your life a lot easier, I promise!
- Install the Synapse Python client and create a `.synapseConfig` file in your home directory.
- Clone this repository - it has all that you will need to contribute to and run this analysis.
Here we describe the processing of the BeatAML data.
This repository contains the code for normalization and processing. It is derived from the P3 proteomics workflow and can be found in the proteomics folder, which holds the scripts required to process and normalize the data, along with the study design files required to do so.
Once the data is processed from DMS, it is uploaded to Synapse in the Proteomics and Quality Control folder.
The files are all stored in Synapse so that they can be downloaded and shared.
Description | Link |
---|---|
Global proteomics data files | syn25714186 |
Phosphoproteomics data files | syn25714185 |
Metadata file | syn25807733 |
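As a sketch, the files above could be retrieved with the Synapse Python client; the dictionary keys, function name, and destination directory here are illustrative, not part of the pipeline:

```python
# Synapse IDs for the data files listed in the table above
DATA_FILES = {
    "global_proteomics": "syn25714186",
    "phosphoproteomics": "syn25714185",
    "metadata": "syn25807733",
}

def download_all(dest_dir="data"):
    """Download each file to `dest_dir`; credentials come from ~/.synapseConfig."""
    # Imported here so the ID map can be inspected without the client installed.
    import synapseclient
    syn = synapseclient.login()
    return {name: syn.get(syn_id, downloadLocation=dest_dir)
            for name, syn_id in DATA_FILES.items()}
```

Calling `download_all()` returns a dict of Synapse File entities, each with a local `.path` pointing at the downloaded file.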
The data was converted from raw files to long-form tables for easy querying and viewing:
Description | Normalization/filtering | Link |
---|---|---|
Global Proteomics | Uncorrected | syn25808625 |
Global Proteomics | Batch-corrected, no missing batches | syn25808020 |
Global phosphoproteomics | Uncorrected | syn25808685 |
Global phosphoproteomics | Batch-corrected, no missing batches | syn25808662 |
Global phosphoproteomics | Batch-corrected, at most 4 missing batches | syn26469873 |
Global phosphoproteomics | Batch-corrected, at most 10 missing batches | syn26477193 |
Before batch correction, we filter out features that contain too much missing data. Originally, we removed any feature that had at least one batch containing ONLY missing values, i.e., "no missing batches". However, this filter was too strict for our large phosphoproteomics dataset, leaving relatively few phosphosites, so we applied two less conservative filters prior to batch correction. Since the global proteomics dataset has very little missing data, the same issue did not arise there.
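The filtering rule can be sketched as follows. This is a minimal plain-Python illustration, not the actual pipeline code (which lives in the proteomics folder); the default `max_missing_batches=0` corresponds to the strict "no missing batches" filter, and larger values give the relaxed filters:

```python
import math

def count_missing_batches(values, batches):
    """Count batches in which a feature was never observed.

    values  : per-sample measurements; math.nan marks a missing value
    batches : batch label for each sample, aligned with `values`
    """
    by_batch = {}
    for v, b in zip(values, batches):
        by_batch.setdefault(b, []).append(v)
    return sum(all(math.isnan(v) for v in vals) for vals in by_batch.values())

def filter_features(matrix, batches, max_missing_batches=0):
    """Keep features with at most `max_missing_batches` entirely-missing batches."""
    return {feat: vals for feat, vals in matrix.items()
            if count_missing_batches(vals, batches) <= max_missing_batches}

# Example: "site1" has no observations in batch B, so the strict
# default filter removes it, while the relaxed filter keeps it.
nan = math.nan
batches = ["A", "A", "B", "B"]
data = {"site1": [1.0, 2.0, nan, nan],
        "site2": [1.0, nan, 3.0, nan]}
strict = filter_features(data, batches)                          # only "site2"
relaxed = filter_features(data, batches, max_missing_batches=1)  # both sites
```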
The genomic, transcriptomic, and drug response data are also uploaded to Synapse and then parsed.
Description | Link |
---|---|
Waves 1-4 WES data | syn2648827 |
Waves 1-4 RNA-seq data | syn26545877 |
Drug response data | syn25813252 |
These tables were pulled from spreadsheets that are also stored on Synapse in this folder. Clinical data is still being updated, but is currently stored in an Excel spreadsheet.
This section aims to serve as an outline for the manuscript we are building. It is still rough, so as analyses merge or overlap we may restructure the repository to reflect the latest state of the manuscript.
The first figure of the manuscript will require visualizing the Beat AML cohort and the data we have. To date, this analysis requires the following steps, each of which should produce either data for downstream analysis or figure panels for Figure 1. The code should be deposited in the `cohort_summary/` directory.
We are looking into circos plots to summarize the data types; this will let us see how much data there is for each patient.
This will take a multi-omics approach to clustering all samples. The clusters and metagenes will be stored on Synapse for future analysis.
Once we have the patient cluster assignments, we can ask whether there are survival differences between patients in each cluster, whether there are genetic mutation differences, or whether other clinical properties vary.
Last, we need to investigate the 'metagenes' that define the clusters and determine whether they show functional enrichment, or whether any phospho networks are activated or depleted.
Figure 2 will compare the genetic mutation data to the other data types; this code will go into the `mutational_analysis/` directory. The analysis focuses on the transcriptomic and proteomic differences between patients with various combinations of mutations.
Figure 3 will focus on the immune infiltration of various tumors using the BayesDeBulk analysis within the Decomprolute framework.
Lastly, we will devote Figures 4 and 5 to investigating the drug response profiles that are unique to this dataset.
What can we learn from the regression? Are there gene sets of interest?
Should we do differential expression as well?
This section is reserved for the manuscript figures, as we work on the analysis.