8_Compare_proteomics_preprocessing_v2.Rmd

---
title: "8 Compare proteomics preprocessing for reviewer"
author: "Melina"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  BiocStyle::html_document:
    toc_float: TRUE
    code_folding: hide
    toc: TRUE
---

```{r setup, include=FALSE}
require("knitr")
knitr::opts_chunk$set(warning=FALSE, message=FALSE, cache=TRUE) #, fig.pos = "!H", out.extra = ""
```

# What is done here?

- Types of bound genes
- Distribution of binding sites over gene regions (with and without normalisation for region length)
- Distribution of crosslink event over gene regions, metagene profile 
- Distribution of binding sites per gene and in relation to TPM
- Relative maps of crosslink distribution
- GeneOntology and REACTOME enrichment analysis


```{r libraries, include=FALSE}
library(knitr)
library(GenomicRanges)
library(tidyverse)
library(ComplexHeatmap)
library(hypeR)
library(DEqMS)
library(matrixStats)
library(ggrastr)
library(ggpubr)

source("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/XX-helpful-chunks/theme_paper.R")

``` 


```{r}

### TODO MK

################################################################################
###                       CODE REVIEW SUMMARY                                ###
################################################################################
### 
### Review by:          Mario Keller
### Comment shortcut:   MK
### Date:               17.08.2022 / 18.08.2022
### Time spent:         ~2.5h

# Main objective: 
# - Comparison of Thermofisher and Maxquant detection and quantificantion
#
# Input:
# - Thermofisher and MaxQuant protein and peptide quantifications as  XLSX-Files
#
# # 
# Major/ issues:
# - As discussed in Slack Thermofisher has some genes grouped (sep=";")
# - As discussed in Slack in the MaxQuant output there are multiple
#   Majority Protein IDs but only a single gene name
#
# Minor/ suggestions:
# - Most of the time you use the dplyr way to filter things but sometime you
#   switch to subsetting data.frames via brackets (see line 644)
# - You often do the same analyses on different data.frames (e.g the 3 processing
#   approaches or PURA/PURAB). At some point you could combine things into a 
#   single data.frame and differentia via a column you could add (e.g. $processing)
# - Sometimes things can be done with less code. I placed some shorter code at 
#   selected positions
# - In your ggplot code you often use xlim() and ylim() to adjust axes. This may
#   be risky as you might remove entries from your data.frame. Better use 
#   coord_cartesian(xlim=X, ylim=X)
# - At some points you use code that has no effect like when you apply na.omit()
#   after you have removed all NA containing entries
# - There is one duplicated plot, which just lacks the title
# - In another plot you use geom_point() and geom_density(), whereby the geom_point()
#   is completely overplotted
# - You often remove the color legend, which makes it hard to understand the plot
# 
# Conclusion:
# - Analyses are reliable and trustworthy
# - Well commented
#
```

# What is done?
- this is an additional analysis for the reply to the reviewers
- comparison of the differential protein expression from the same PURA proteomics experiment with three different methods of protein discovery from the measured peptides
- the thermofisher proteome discoverer is compared to Maxquant
- moreover Maxquant results with and without isobaric matching are compared (isobaric matching matched ions of similar run-times if no unique peptide is found)
- I also look at the discovered proteins for PURA and PURB depending on these methods

# Input

```{r}
# proteomics table
proteomics_tables <- list( thermo_fisher = readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/data-without-outlier-rep-3/MOL13920_noRepl3_SwissHum_perc_precDet_int_ratio_TTEST_Peptide-high_proteins.xlsx", sheet = 2),
                           maxquant_matching = readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/XX_reprocessed_data/MOL13920_wo-repl3_MQ1670_withMBR_proteinGroups_LFQ1_matching.xlsx"),
                          maxquant_no_matching = readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/XX_reprocessed_data/MOL13920_wo-repl3_MQ1670_noMBR_proteinGroups_LFQ1.xlsx"))

p_cut_prot <- 0.05

```


# Clean up count tabels
## Number of proteins detected with at least one unique peptide

```{r}
map(proteomics_tables, ~nrow(.x))

```
## How are proteins matched to genes?
```{r}
proteomics_tables[[1]] <- proteomics_tables[[1]] %>% dplyr::rename(., gene_name = `Gene Symbol`)
proteomics_tables[[2]] <- proteomics_tables[[2]] %>% dplyr::rename(., gene_name = `Gene symbol`)
proteomics_tables[[3]] <- proteomics_tables[[3]] %>% dplyr::rename(., gene_name = `Gene symbol`)
```

Found proteins, that are not matched to any gene:

```{r}
map(proteomics_tables, ~sum(is.na(.x$gene_name)))

```

Genes, that are matched to multiple proteins:

```{r}

map(proteomics_tables, ~sum(duplicated(.x$gene_name, incomparables=NA)))
map(proteomics_tables, ~.x[duplicated(.x$gene_name, incomparables=NA),]$gene_name)
```

Removing both duplicated genes and unknown genes

```{r}
proteomics_tables <- lapply(proteomics_tables, function(x){
  x <- x[!is.na(x$gene_name),]
  x <- x[!duplicated(x$gene_name),]
})

```

Protein groups with no specific protein

```{r eval=FALSE, include=T}
##########################
# get alternative gene names from ensemble
#########################

#get names
ensembl38p13 <- useMart(biomart='ENSEMBL_MART_ENSEMBL',
                        dataset='hsapiens_gene_ensembl')

my_attributes <- c("ensembl_gene_id", "hgnc_symbol", "external_synonym", "uniprot_isoform", "protein_id", "ensembl_peptide_id")

info_biomart_new<- getBM(attributes=my_attributes,
                        filters = "hgnc_symbol",
                        values = proteomics_tables[[1]]$gene_name,
                        mart = ensembl38p13, useCache = F)

#combine
idx_biomart_new <- match(proteomics_tables[[1]]$gene_name, info_biomart_new$hgnc_symbol)
proteomics_tables[[1]] <- cbind(proteomics_tables[[1]], info_biomart_new[idx_biomart_new,])

#save
saveRDS(proteomics_tables[[1]], "/Users/melinaklostermann/Documents/projects/PURA/Molitor-et-al-2022/thermofisher_table.rds")
```

```{r}
#####################
# resolve results with multiple major protein ids
######################
# split up mutiple ids into separate columns

proteomics_tables[[1]] <- readRDS("/Users/melinaklostermann/Documents/projects/PURA/Molitor-et-al-2022/thermofisher_table.rds")

multiple_protein_ids <- map(proteomics_tables[2:3], ~.x[grepl(pattern = ";", .x$`Majority protein IDs`),])
multiple_protein_ids <- map(multiple_protein_ids, ~ .x %>%
                              separate(`Majority protein IDs`,
                                       sep = ";", into = c("ID1", "ID2", "ID3")) )
multiple_protein_ids  <- map(multiple_protein_ids, ~ .x   %>%
                               rowwise(.) %>% 
                               mutate(across(c("ID1", "ID2", "ID3"), 
                ~unlist(strsplit(.x, split = "\\|"))[2] )))


# remove proteins that were detected in group in maxquant also from thermofisher
proteomics_tables[[1]] <- proteomics_tables[[1]][-which(proteomics_tables[[1]]$Accession %in% c(multiple_protein_ids[[1]]$ID1, multiple_protein_ids[[1]]$ID2, multiple_protein_ids[[1]]$ID3)),]

proteomics_tables[[1]] <- proteomics_tables[[1]][-which(proteomics_tables[[1]]$Accession %in% c(multiple_protein_ids[[2]]$ID1, multiple_protein_ids[[2]]$ID2, multiple_protein_ids[[2]]$ID3) ),]


######################
# resolve results with multiple gene names
######################

multiple_gene_names <- proteomics_tables[[1]][grepl(pattern = ";", proteomics_tables[[1]]$gene_name),] %>%
  separate(gene_name, sep = ";", into = c("name1", "name2", "name3")) 

# proteomics_tables[[1]][which(c(multiple_gene_names$name1, multiple_gene_names$name2, multiple_gene_names$name3) %in% proteomics_tables[[2]]$gene_name),]


# search vor major proteins in alternative names
multiple_gene_names <- multiple_gene_names  %>% 
  rowwise(.) %>%
  mutate(synonymous_names = case_when(
  (name1 == external_synonym |
    name2 ==  external_synonym |
    name3 == external_synonym) ~ T,
  T~ F
    
))

# how many of the mutilple major gene ids belong to alternative names of the same gene
table(multiple_gene_names$synonymous_names)

# none? remove all?


```


## Overlap of found genes, that the proteins belong to

```{r}

### TODO MK
### You removed all NAs in the chunk above. So you can remove the na.omit()

### TODO MK
### In proteomics_tables[[1]]$gene_name some entries are combined genes (e.g GPR64; ADGRG2)


all = c(proteomics_tables[[1]]$gene_name, proteomics_tables[[2]]$gene_name, proteomics_tables[[3]]$gene_name) %>% unique() 

overlap_mat <- cbind(
thermo_fisher = all %in% proteomics_tables[[1]]$gene_name,
maxquant_matching = all  %in% proteomics_tables[[2]]$gene_name,
maxquant_no_matching = all  %in% proteomics_tables[[3]]$gene_name)

fit <- eulerr::euler(overlap_mat)

plot(fit,  quantities = TRUE)


  head(all[!(all  %in% proteomics_tables[[2]]$gene_name)])

```

# Normalised abundances


```{r clean_lfq}
# decoy matches and matches to contaminant should be removed --> already happened?
# any(proteomics_tables[[2]]$Reverse == "+", na.rm = T)
# any(proteomics_tables[[2]]$`Potential contaminant` == "+", na.rm = T)
# 
# any(proteomics_tables[[3]]$Reverse == "+", na.rm = T)
# any(proteomics_tables[[3]]$`Potential contaminant` == "+", na.rm = T)


### TODO MK
### Could be done with dplyr
### LFQs <- list(thermo_fisher = proteomics_tables[[1]] %>% select(gene_name, starts_with("Abundances (Normalized):")),
###              maxquant_matching = proteomics_tables[[2]] %>% select(gene_name, starts_with("Abundances (Normalized):")),
###              maxquant_no_matchingproteomics_tables[[2]] %>% select(gene_name, starts_with("Abundances (Normalized):")))

# LFQ table
LFQs <- list(thermo_fisher = proteomics_tables[[1]][,c(grepl(pattern = "(Normalized)", colnames(proteomics_tables[[1]])) |
                                                         grepl(pattern = "gene_name", colnames(proteomics_tables[[1]])))] %>%
               as.data.frame(),
              maxquant_matching = proteomics_tables[[2]][,c(grepl(pattern = "LFQ", colnames(proteomics_tables[[2]]))|
                                                         grepl(pattern = "gene_name", colnames(proteomics_tables[[2]])))] %>%
               as.data.frame(),
             maxquant_no_matching = proteomics_tables[[3]][,c(grepl(pattern = "LFQ", colnames(proteomics_tables[[3]]))|
                                                         grepl(pattern = "gene_name", colnames(proteomics_tables[[3]])))] %>%
               as.data.frame())


LFQs <- lapply(LFQs, function(x){
  colnames(x) <- c("gene_name", "kd1", "kd2", "kd3", "kd4", "wt1", "wt2", "wt3", "wt4")
  return(x)})

LFQs_t <- map(LFQs, ~reshape2::melt(.x))

<<<<<<< HEAD
LFQs_t2 <- s(LFQs_t[[1]], LFQs_t[[2]], by = c("gene_name", "variable"), suffix = c(".tf", ".mq_ma"))
=======
### TODO MK
### You could rename the columns of LFQs_t2 as value.tf, value.mq_ma and especially
### value are not very informative

LFQs_t2 <- full_join(LFQs_t[[1]], LFQs_t[[2]], by = c("gene_name", "variable"), suffix = c(".tf", ".mq_ma"))
>>>>>>> bb1a0b1937fbba76ccd712dfa5ba04bec0e5ad6f
LFQs_t2 <- full_join(LFQs_t2, LFQs_t[[3]], by = c("gene_name", "variable"), suffix = c("", ".mq_noma"))

LFQs_t3 <- reshape2::melt(LFQs_t2)

colnames(LFQs_t3) <- c("gene_name", "sample", "processing", "abundance")

### TODO MK
### Be careful with xlim() and ylim(). I always prefer coord_cartesian() as it
### will never eliminate data

### TODO MK
### you can remove  ggrastr::rasterise(geom_point(size = 0.1), dpi = 400)+
### as you overplot it completely with the pointdensity

### TODO MK
### If find it kind of strange that Proteins appear multiple times in the 
### scatter plot as they were measured in 8 experiments. Wouldn't a merge
### of replicates or facetwrap on ~variable would make more sense?

ggplot(LFQs_t2, aes(x=log10(value.tf), y= log10(value.mq_ma)))+
  xlim(5,11)+
  ylim(5,11)+
  ggrastr::rasterise(geom_point(size = 0.1), dpi = 400)+
  ggrastr::rasterise(ggpointdensity::geom_pointdensity(size = 0.1), dpi = 400)+
  xlab("thermofisher abundances")+
  ylab("maxquant - matching abundances")+
  geom_point(data=LFQs_t2[(LFQs_t2$gene_name == "PURA" & grepl(LFQs_t2$variable, pattern = "kd")), ], aes(x=log10(value.tf), y= log10(value.mq_ma)), color = "orange", size = 0.5)+
  geom_point(data=LFQs_t2[(LFQs_t2$gene_name == "PURA" & grepl(LFQs_t2$variable, pattern = "wt")), ], aes(x=log10(value.tf), y= log10(value.mq_ma)), color = "red", size = 0.5)+
  theme_paper()+
  theme(legend.key.size = unit(0.15, 'cm'))
  

#ggsave(paste0(outpath, "Scatter_tf_MQ.pdf"), width = 7, height = 6, units = "cm")

ggplot(LFQs_t2, aes(x=log10(value), y= log10(value.mq_ma)))+
  ggrastr::rasterise(geom_point(size = 0.1), dpi = 400)+
  ggrastr::rasterise(ggpointdensity::geom_pointdensity(size = 0.1), dpi = 400)+
  xlab("maxquant - no matching abundances")+
  ylab("maxquant - matching abundances")+
  geom_point(data=LFQs_t2[(LFQs_t2$gene_name == "PURA" & grepl(LFQs_t2$variable, pattern = "wt")), ], aes(x=log10(value), y= log10(value.mq_ma)), color = "red", size = 0.5)+
  geom_point(data=LFQs_t2[(LFQs_t2$gene_name == "PURA" & grepl(LFQs_t2$variable, pattern = "kd")), ], aes(x=log10(value), y= log10(value.mq_ma)), color = "orange", size = 0.5)+
  theme_paper()+
  theme(legend.key.size = unit(0.15, 'cm'))

#ggsave(paste0(outpath, "Scatter_MQ_match_no_match.pdf"), width = 7, height = 6, units = "cm")


```


## Excluding to high fallout rates

--> We exclude proteins with 2 or 3 missing values in either of the conditions

```{r zero_filt}

# turn 0 into NA
LFQs[[1]][LFQs[[1]]==0] = NA
LFQs[[2]][LFQs[[2]]==0] = NA
LFQs[[3]][LFQs[[3]]==0] = NA


# sum fall outs
LFQs <- lapply(LFQs, function(x){
  x$na_kd <- apply(x,1, function(y) sum(is.na(y[2:5])))
  x$na_wt <- apply(x,1, function(y) sum(is.na(y[6:9])))
  return(x)
})

LFQ_filt <- LFQs %>% map(., ~dplyr::filter(.x, (na_wt %in% c(0,1)) & (na_kd %in% c(0,1))))

LFQ_4_zeros <- LFQs %>% map(., ~dplyr::filter(.x, (((na_wt == 4) & (na_kd %in% c(0,1)))| ((na_kd == 4)& (na_wt %in% c(0,1))))))

#saveRDS(LFQ_filt, paste0(outpath, "LFQ_tables.rds"))

```

Number of Proteins after zero filtering:
(max one missing value per condition)

```{r}
map(LFQ_filt, ~ nrow(.x))

```

Proteins with fall-out only in one condition:
(4 zeros in one condition and max 1 zero in other condition)

```{r}
map(LFQ_4_zeros, ~ .x$gene_name)
map(LFQ_4_zeros, ~ nrow(.x))


```


# Differential analysis

There are multiple options for differential analysis

- Thermo fisher in build analysis
- limma
- DEqMs
- Perseus

## Is data median centered?

Use boxplot to check if the samples have medians centered. if not, do median centering.

```{r fit, echo=FALSE}

### TODO MK
### You write that you use a boxplot to check if things are median centered
### but don't show a boxplot.

protein.matrix <- LFQ_filt %>% map(., ~.x[,2:9])
#map(protein.matrix, ~invisible(plotDensities(.x, legend = F)))

# log transformation to linerase changes
protein.matrix <- protein.matrix %>% map( ~log2(.x))

```

--> yes


```{r}
###################
# function DEqMS
###################

f_DEqMS <- function(protein_matrix, pep_count_table){

# model
##################
condition = as.factor(c(rep("kd",4), rep("wt",4)))
design = model.matrix(~0+condition)

contrasts <- c("conditionkd-conditionwt")
contrast <- makeContrasts(contrasts=contrasts, levels = design)

# fit model to matrix
########################

fit1 = lmFit(protein_matrix, design = design)
fit2 = contrasts.fit(fit1, contrasts = contrast)
fit3 <- eBayes(fit2)


# correct bias of variance estimate
####################################

fit3$count = pep_count_table[rownames(fit3$coefficients),"count"]


fit4 <- spectraCounteBayes(fit3)

VarianceBoxplot(fit4, main = "Fit of variance estimate",
                xlab="peptide count + 1")

VarianceScatterplot(fit4, main = "Fit of variance estimate",
                xlab="peptide count + 1")


# adding gene info to results
DEqMS.results = outputResult(fit4, coef_col = 1)

DEqMS.results$gene_name <- rownames(DEqMS.results)

return(DEqMS.results)
}


```


DEqMS is based on limma, but adjustes the p-values in the end for the variance given for the specific number of unique peptides found

### Fit variance to peptide count in DEqMS

```{r}
#####################
# peptide count tables
########################
# Maxquant: we use minimum peptide count among six samples
# count unique+razor peptides used for quantification

pep.count.table = list( thermo_fisher = data.frame(count = proteomics_tables[[1]]$`# Unique Peptides`,
                             row.names = proteomics_tables[[1]]$gene_name),
  maxquant_mat = data.frame(count = rowMins(as.matrix(proteomics_tables[[2]][,20:27])),
                             row.names = proteomics_tables[[2]]$gene_name),
  maxquant_nomat = data.frame(count = rowMins(as.matrix(proteomics_tables[[3]][,20:27])),
                             row.names = proteomics_tables[[3]]$gene_name))
# Minimum peptide count of some proteins can be 0
# add pseudocount 1 to all proteins
pep.count.table = map(pep.count.table, ~mutate(.x, count = count+1))

# run
rownames(protein.matrix[[1]]) = LFQ_filt[[1]]$gene_name
rownames(protein.matrix[[2]]) = LFQ_filt[[2]]$gene_name
rownames(protein.matrix[[3]]) = LFQ_filt[[3]]$gene_name

deqms <- map2(protein.matrix, pep.count.table, ~ f_DEqMS(protein_matrix = .x, pep_count_table = .y))

```

## Vulcanos 

```{r}
titles <- as.list(names(deqms))

map2(deqms, titles, ~ggplot(.x, aes(x = logFC, y = (-log10(sca.adj.pval))))+
   geom_point(color ="grey", shape =1)+
   geom_point(data = .x[(.x$sca.adj.pval< 0.05) & (abs(.x$logFC) > log(0.5)), ],
              aes(x= logFC, y = (-log10(sca.adj.pval))))+
   geom_point(data = .x[.x$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB", "PURA", "PURB"),],
              aes(x= logFC, y = -log10(sca.adj.pval)), color = "orange")+
  ggrepel::geom_label_repel( data = .x[.x$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB","PURA", "PURB"),],
                             aes( x = logFC,  y = (-log10(sca.adj.pval)), label = gene_name), max.overlaps = 10)+
   xlab("Protein fold change (kd/wt) [log2]")+
   ylab("adj. p-value (protein fold change)")+
  theme_paper()+
   theme(legend.position = "none", aspect.ratio = 1/1)+
    ggtitle(paste(.y, "\n DEqMS differential analysis")))

```

```{r eval=FALSE, include=FALSE}

### TODO MK
### You could remove the chunk if it is not included

p <- map2(deqms, titles, ~ggplot(.x, aes(x = logFC, y = (-log10(sca.adj.pval))))+
            geom_vline(xintercept = 0) +
  ggrastr::rasterise( geom_point(color ="grey", shape =1, size = 0.5), dpi = 300)+
   geom_point(data = .x[(.x$sca.adj.pval< 0.05) & (abs(.x$logFC) > log(0.5)), ],
              aes(x= logFC, y = (-log10(sca.adj.pval))), size = 0.5)+
   geom_point(data = .x[.x$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB", "PURA", "PURB"),],
              aes(x= logFC, y = -log10(sca.adj.pval)), color = "blue", size = 0.5)+
  ggrepel::geom_label_repel( data = .x[.x$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB","PURA", "PURB"),],
                             aes( x = logFC,  y = (-log10(sca.adj.pval)),
                                  label = gene_name), size = 2, label.size = 0.2,label.padding = 0.2, max.overlaps = 10)+
   xlab("Protein fold change (kd/wt) [log2]")+
   ylab("adj. p-value (protein fold change)")+
    xlim(c(-4,2))+
  theme_paper()+
   theme(legend.position = "none", aspect.ratio = 1/1)
   )

#ggsave(plot = p[[1]], paste0(outpath, "vulcano_tf.pdf"), width = 6, height = 6, unit = "cm")
#ggsave(plot = p[[2]], paste0(outpath, "vulcano_mq.pdf"), width = 6, height = 6, unit = "cm")
#ggsave(plot = p[[3]], paste0(outpath, "vulcano_mq_no_match.pdf"), width = 6, height = 6, unit = "cm")

```

##  default differential analysis by Thermofisher result
--> striped


```{r}

### TODO MK
### You could provide some more information as a comment. If I understand it
### correctly the Thermofisher output already contained diff. analysis results,
### which is not useful/reliable. Did I get it right?

ggplot(proteomics_tables[[1]], aes(x = log2(`Abundance Ratio: (PurA-KD) / (WT)`), y = `Exp. q-value: Combined`))+
         geom_point(color ="grey", shape =1)+
   geom_point(data = proteomics_tables[[1]][(proteomics_tables[[1]]$`Exp. q-value: Combined`< 0.05) & (abs(log2(proteomics_tables[[1]]$`Abundance Ratio: (PurA-KD) / (WT)`)) > log(0.5)), ],
              aes(x=log2(`Abundance Ratio: (PurA-KD) / (WT)`), y = (-log10(`Exp. q-value: Combined`))))+
   geom_point(data = proteomics_tables[[1]][proteomics_tables[[1]]$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB", "PURA", "PURB"),],
              aes(x=log2(`Abundance Ratio: (PurA-KD) / (WT)`), y = -log10(`Exp. q-value: Combined`)), color = "orange")+
  ggrepel::geom_label_repel( data = proteomics_tables[[1]][proteomics_tables[[1]]$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB","PURA", "PURB"),],
                             aes( x =log2(`Abundance Ratio: (PurA-KD) / (WT)`),  y = (-log10(`Exp. q-value: Combined`)), label = gene_name), max.overlaps = 10)+
   xlab("Protein fold change (kd/wt) [log2]")+
   ylab("adj. p-value (protein fold change)")+
  theme_paper()+
   theme(legend.position = "none", aspect.ratio = 1/1)+
    ggtitle(paste(" Detection & quantification from pipeline"))


```

# Why are the results for PURA and PURB differnet for the different preprocessings?
## Countplots PURA, PURB 


```{r}
### TODO MK
### You could think about combining PURA and PURB information in the same 
### data.frame and use facet_wrap() in your ggplot call.

###############
# countplot PURA
#################
LFQ_filt_PURA <- map(LFQ_filt, ~.x[.x$gene_name=="PURA",])

LFQ_filt_PURA <- map(LFQ_filt_PURA, ~reshape2::melt(.x))

LFQ_filt_PURA <- map(LFQ_filt_PURA, ~.x[1:8,]) %>% map(., ~mutate(.x, condition = substr(variable,1,2)))

LFQ_filt_PURA_no_ma <- LFQ_4_zeros[[3]][LFQ_4_zeros[[3]]$gene_name=="PURA",] %>% reshape2::melt()
LFQ_filt_PURA_no_ma <- LFQ_filt_PURA_no_ma[1:8,] %>% mutate(., condition = substr(variable,1,2))

LFQ_filt_PURA[[1]]$processing <- "thermo fisher"
LFQ_filt_PURA[[2]]$processing <- "maxquant + matching"
LFQ_filt_PURA_no_ma$processing <- "maxquant, no matching"

gg_df <- rbind(LFQ_filt_PURA[[1]], LFQ_filt_PURA[[2]], LFQ_filt_PURA_no_ma)

### TODO MK
### Why do you remove the color legend? One does not know which color corresponds
### to thermo fisher and maxquant. I know you show the legend in the PURB count
### plot but still would put it here aswell

ggplot(gg_df, aes(x = variable, y = log(value), color = processing))+
  geom_point()+
  ggtitle("Intensities of PURA from thermo fisher or maxquant processing")+
  theme_paper()+
  theme(panel.grid.major =  element_line(color = "lightgrey", size = 0.25),
        legend.position = "None")

### TODO MK
### This plot is a duplicate of the plot before

ggplot(gg_df, aes(x = variable, y = log(value), color = processing))+
  geom_point()+
  theme_paper()+
  theme(panel.grid.major =  element_line(color = "lightgrey", size = 0.25),
        legend.position = "None")

#ggsave(filename = paste0(outpath, "protein_counts.pdf"), width = 6, height = 6, units = "cm")

###############
# countplot PURB
#################

LFQ_filt_PURB <- map(LFQ_filt, ~.x[.x$gene_name=="PURB",])

LFQ_filt_PURB <- map(LFQ_filt_PURB, ~reshape2::melt(.x))

LFQ_filt_PURB <- map(LFQ_filt_PURB, ~.x[1:8,]) %>% map(., ~mutate(.x, condition = substr(variable,1,2)))

# LFQ_filt_PURB_no_ma <- LFQ_4_zeros[[3]][LFQ_4_zeros[[3]]$gene_name=="PURB",] %>% reshape2::melt()
# LFQ_filt_PURB_no_ma <- LFQ_filt_PURB_no_ma[1:8,] %>% mutate(., condition = substr(variable,1,2))

LFQ_filt_PURB[[1]]$processing <- "thermo fisher"
LFQ_filt_PURB[[2]]$processing <- "maxquant + matching"
LFQ_filt_PURB[[3]]$processing <- "maxquant, no matching"

gg_df <- rbind(LFQ_filt_PURB[[1]], LFQ_filt_PURB[[2]], LFQ_filt_PURB[[3]])

### TODO MK
### Why don't you use your theme_paper()?

ggplot(gg_df, aes(x = variable, y = log(value), color = processing))+
  geom_point()+
  ggtitle("Intensities of PURB from thermo fisher or maxquant processing")
  

#LFQs[[3]][LFQs[[3]]$gene_name=="PURB",]

```
--> different quantification of proteins

# Look at  PURA and PURB found peptides 

```{r}
peptides_tf <- readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/peptide_lists/MOL13920/MOL13920_noRepl3_SwissHum_perc_precDet_int_ratio_TTEST_Peptide-high_peptides.xlsx")

peptides_mq <- readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/peptide_lists/MOL13920/MOL13920_wo-repl3_MQ1670_withMBR_peptides_LFQ1.xlsx")

PURA_protein_id <- "Q00577"
PURB_protein_id <- "Q96QR8"

# peptides maxquant
peptides_mq <- peptides_mq %>% 
  rowwise() %>%
  mutate(., protein_id = unlist(strsplit(`Leading razor protein`, split = "|", fixed = T ))[2],
         protein_name = unlist(strsplit(`Leading razor protein`, split = "|", fixed = T))[3],
         protein_name = unlist(strsplit(protein_name, split = "_", fixed = T))[1]) %>% as.data.frame(.)

peptides_PURA_mq <- peptides_mq %>% filter(protein_id == PURA_protein_id )
peptides_PURB_mq <- peptides_mq %>% filter(protein_id == PURB_protein_id )

# peptides thermo fisher
peptides_PURA_tf <- peptides_tf %>% filter(`Master Protein Accessions` == PURA_protein_id )
peptides_PURB_tf <- peptides_tf %>% filter(`Master Protein Accessions` == PURB_protein_id )

peptides_PURA_mq[,c("Sequence", "PEP")]
peptides_PURA_tf[,c("Annotated Sequence", "Abundance Ratio: (PurA-KD) / (WT)", "Qvality PEP", "Qvality q-value" , "Percolator q-Value (by Search Engine): Sequest HT")]

peptides_PURB_mq[,c("Sequence", "PEP")]
peptides_PURB_tf[,c("Annotated Sequence", "Abundance Ratio: (PurA-KD) / (WT)", "Qvality PEP", "Qvality q-value" , "Percolator q-Value (by Search Engine): Sequest HT")]


```


## Countplots of peptides

```{r}

### TODO MK
### This would be the dplyr way
### gg_pura_mq <- peptides_PURA_mq %>% select(starts_with("LFQ"), starts_with("Sequence")) %>% pivot_longer(1:8)

# PURA
gg_pura_mq <- peptides_PURA_mq[,(grepl(pattern="LFQ", colnames(peptides_PURA_mq))|
                   grepl(pattern="Sequence", colnames(peptides_PURA_mq)))] %>% reshape2::melt(.)
gg_pura_tf <- peptides_PURA_tf[,(grepl(pattern="Normalized", colnames(peptides_PURA_tf))|
                   grepl(pattern="Annotated Sequence", colnames(peptides_PURA_tf)))] %>% reshape2::melt(.)
gg_pura_tf <- gg_pura_tf %>% rowwise(.) %>% mutate(`Annotated Sequence` = unlist(strsplit(`Annotated Sequence`, split = ".", fixed =T))[2]) %>%
  dplyr::rename(Sequence = `Annotated Sequence`)

gg_pura_mq$condition = c(rep("kd1", 2), rep("kd2", 2), rep("kd3", 2), rep("kd4", 2),
                         rep("wt1", 2), rep("wt2", 2), rep("wt3", 2), rep("wt4", 2))

gg_pura_tf$condition = c(rep("kd1", 3), rep("kd2", 3), rep("kd3", 3), rep("kd4", 3),
                         rep("wt1", 3), rep("wt2", 3), rep("wt3", 3), rep("wt4", 3))

gg_pura_mq$processing = "maxquant"
gg_pura_tf$processing = "thermo fisher"


gg_pura <- rbind(gg_pura_mq, gg_pura_tf)

### TODO MK
### For FFFDVGSNK the KD peptide counts are 0. In the plot half of the points
### is visible at the bottom of the plot, which might be misleading

ggplot(gg_pura, aes(x = condition, y = log(value), color = processing ))+
  geom_point()+
  facet_wrap(~Sequence)+
  theme_paper()


# PURB
gg_purb_mq <- peptides_PURB_mq[,(grepl(pattern="LFQ", colnames(peptides_PURB_mq))|
                   grepl(pattern="Sequence", colnames(peptides_PURB_mq)))] %>% reshape2::melt(.)
gg_purb_tf <- peptides_PURB_tf[,(grepl(pattern="Normalized", colnames(peptides_PURB_tf))|
                   grepl(pattern="Annotated Sequence", colnames(peptides_PURB_tf)))] %>% reshape2::melt(.)
gg_purb_tf <- gg_purb_tf %>% rowwise(.) %>% mutate(`Annotated Sequence` = unlist(strsplit(`Annotated Sequence`, split = ".", fixed =T))[2]) %>%
  dplyr::rename(Sequence = `Annotated Sequence`)

gg_purb_mq$condition = c(rep("kd1", 2), rep("kd2", 2), rep("kd3", 2), rep("kd4", 2),
                         rep("wt1", 2), rep("wt2", 2), rep("wt3", 2), rep("wt4", 2))

gg_purb_tf$condition = c(rep("kd1", 2), rep("kd2", 2), rep("kd3", 2), rep("kd4", 2),
                         rep("wt1", 2), rep("wt2", 2), rep("wt3", 2), rep("wt4", 2))

gg_purb_mq$processing = "maxquant"
gg_purb_tf$processing = "thermo fisher"


gg_purb <- rbind(gg_purb_mq, gg_purb_tf)


### TODO MK
### Why don't you use the theme_paper()

ggplot(gg_purb, aes(x = condition, y = log(value), color = processing ))+
  geom_point()+
  facet_wrap(~Sequence)  


```
--> third peptide only detected in thermofisher preprocessing

## countplots unique peptide vs matching

```{r echo=FALSE}
gg_pura_mq_detect_mod <- peptides_PURA_mq[,(grepl(pattern="Identification type", colnames(peptides_PURA_mq))|
                   grepl(pattern="Sequence", colnames(peptides_PURA_mq)))] %>% reshape2::melt(., id.vars = "Sequence")
gg_pura_tf_detect_mod <- peptides_PURA_tf[,(grepl(pattern="Found in Sample", colnames(peptides_PURA_tf))|
                   grepl(pattern="Annotated Sequence", colnames(peptides_PURA_tf)))] %>% reshape2::melt(., id.vars = "Annotated Sequence")


gg_pura$identification <- c(gg_pura_mq_detect_mod$value, gg_pura_tf_detect_mod$value  )

gg_pura <- gg_pura %>% mutate(identification = case_when(
  identification == "By MS/MS" ~ "By unique peptide",
  identification == "High" ~ "By unique peptide",
  identification == "Peak Found" ~ "By matching",
  is.na(identification) ~ "Not found",
  identification == "Not Found" ~ "Not found",
  T ~ identification
  ))

### TODO MK
### As you remove the color legend with legend.position = "None" no one will 
### understand the conclusion you draw. I suggest to add the legend so everyone
### sees that for the third peptide only one point is labeled as "By unique peptide"

ggplot(gg_pura, aes(x = condition, y = log(value), color = identification ))+
  geom_point()+
  facet_grid(processing~Sequence)  +
  theme_paper()+
  theme(panel.border=element_rect(color="black", fill = NA),
        panel.grid.major =  element_line(color = "lightgrey", size = 0.25),
        legend.position = "None")+
  scale_color_manual(values =  wesanderson::wes_palette("GrandBudapest1")[c(2,3,1)])

# ggsave(filename = paste0(outpath, "peptide_counts.pdf"), width = 12, height = 8, units = "cm")

```

--> thrid peptide is only found once in thermofisher and then matched into all other samples, this did not happen for maxquant