forked from MelinaKlostermann/Molitor-et-al-2022
-
Notifications
You must be signed in to change notification settings - Fork 0
/
8_Compare_proteomics_preprocessing_v2.Rmd
818 lines (575 loc) · 30.5 KB
/
8_Compare_proteomics_preprocessing_v2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
---
title: "8 Compare proteomics preprocessing for reviewer"
author: "Melina"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
BiocStyle::html_document:
toc_float: TRUE
code_folding: hide
toc: TRUE
---
```{r setup, include=FALSE}
require("knitr")
knitr::opts_chunk$set(warning=FALSE, message=FALSE, cache=TRUE) #, fig.pos = "!H", out.extra = ""
```
# What is done here?
- Types of bound genes
- Distribution of binding sites over gene regions (with and without normalisation for region length)
- Distribution of crosslink event over gene regions, metagene profile
- Distribution of binding sites per gene and in relation to TPM
- Relative maps of crosslink distribution
- GeneOntology and REACTOME enrichment analysis
```{r libraries, include=FALSE}
library(knitr)
library(GenomicRanges)
library(tidyverse)
library(ComplexHeatmap)
library(hypeR)
library(DEqMS)
library(matrixStats)
library(ggrastr)
library(ggpubr)
source("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/XX-helpful-chunks/theme_paper.R")
```
```{r}
### TODO MK
################################################################################
### CODE REVIEW SUMMARY ###
################################################################################
###
### Review by: Mario Keller
### Comment shortcut: MK
### Date: 17.08.2022 / 18.08.2022
### Time spent: ~2.5h
# Main objective:
# - Comparison of Thermofisher and Maxquant detection and quantificantion
#
# Input:
# - Thermofisher and MaxQuant protein and peptide quantifications as XLSX-Files
#
# #
# Major/ issues:
# - As discussed in Slack Thermofisher has some genes grouped (sep=";")
# - As discussed in Slack in the MaxQuant output there are multiple
# Majority Protein IDs but only a single gene name
#
# Minor/ suggestions:
# - Most of the time you use the dplyr way to filter things but sometime you
# switch to subsetting data.frames via brackets (see line 644)
# - You often do the same analyses on different data.frames (e.g the 3 processing
# approaches or PURA/PURAB). At some point you could combine things into a
# single data.frame and differentia via a column you could add (e.g. $processing)
# - Sometimes things can be done with less code. I placed some shorter code at
# selected positions
# - In your ggplot code you often use xlim() and ylim() to adjust axes. This may
# be risky as you might remove entries from your data.frame. Better use
# coord_cartesian(xlim=X, ylim=X)
# - At some points you use code that has no effect like when you apply na.omit()
# after you have removed all NA containing entries
# - There is one duplicated plot, which just lacks the title
# - In another plot you use geom_point() and geom_density(), whereby the geom_point()
# is completely overplotted
# - You often remove the color legend, which makes it hard to understand the plot
#
# Conclusion:
# - Analyses are reliable and trustworthy
# - Well commented
#
```
# What is done?
- this is an additional analysis for the reply to the reviewers
- comparison of the differential protein expression from the same PURA proteomics experiment with three different methods of protein discovery from the measured peptides
- the thermofisher proteome discoverer is compared to Maxquant
- moreover Maxquant results with and without isobaric matching are compared (isobaric matching matched ions of similar run-times if no unique peptide is found)
- I also look at the discovered proteins for PURA and PURB depending on these methods
# Input
```{r}
# proteomics table
proteomics_tables <- list( thermo_fisher = readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/data-without-outlier-rep-3/MOL13920_noRepl3_SwissHum_perc_precDet_int_ratio_TTEST_Peptide-high_proteins.xlsx", sheet = 2),
maxquant_matching = readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/XX_reprocessed_data/MOL13920_wo-repl3_MQ1670_withMBR_proteinGroups_LFQ1_matching.xlsx"),
maxquant_no_matching = readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/XX_reprocessed_data/MOL13920_wo-repl3_MQ1670_noMBR_proteinGroups_LFQ1.xlsx"))
p_cut_prot <- 0.05
```
# Clean up count tabels
## Number of proteins detected with at least one unique peptide
```{r}
map(proteomics_tables, ~nrow(.x))
```
## How are proteins matched to genes?
```{r}
proteomics_tables[[1]] <- proteomics_tables[[1]] %>% dplyr::rename(., gene_name = `Gene Symbol`)
proteomics_tables[[2]] <- proteomics_tables[[2]] %>% dplyr::rename(., gene_name = `Gene symbol`)
proteomics_tables[[3]] <- proteomics_tables[[3]] %>% dplyr::rename(., gene_name = `Gene symbol`)
```
Found proteins, that are not matched to any gene:
```{r}
map(proteomics_tables, ~sum(is.na(.x$gene_name)))
```
Genes, that are matched to multiple proteins:
```{r}
map(proteomics_tables, ~sum(duplicated(.x$gene_name, incomparables=NA)))
map(proteomics_tables, ~.x[duplicated(.x$gene_name, incomparables=NA),]$gene_name)
```
Removing both duplicated genes and unknown genes
```{r}
proteomics_tables <- lapply(proteomics_tables, function(x){
x <- x[!is.na(x$gene_name),]
x <- x[!duplicated(x$gene_name),]
})
```
Protein groups with no specific protein
```{r eval=FALSE, include=T}
##########################
# get alternative gene names from ensemble
#########################
#get names
ensembl38p13 <- useMart(biomart='ENSEMBL_MART_ENSEMBL',
dataset='hsapiens_gene_ensembl')
my_attributes <- c("ensembl_gene_id", "hgnc_symbol", "external_synonym", "uniprot_isoform", "protein_id", "ensembl_peptide_id")
info_biomart_new<- getBM(attributes=my_attributes,
filters = "hgnc_symbol",
values = proteomics_tables[[1]]$gene_name,
mart = ensembl38p13, useCache = F)
#combine
idx_biomart_new <- match(proteomics_tables[[1]]$gene_name, info_biomart_new$hgnc_symbol)
proteomics_tables[[1]] <- cbind(proteomics_tables[[1]], info_biomart_new[idx_biomart_new,])
#save
saveRDS(proteomics_tables[[1]], "/Users/melinaklostermann/Documents/projects/PURA/Molitor-et-al-2022/thermofisher_table.rds")
```
```{r}
#####################
# resolve results with multiple major protein ids
######################
# split up mutiple ids into separate columns
proteomics_tables[[1]] <- readRDS("/Users/melinaklostermann/Documents/projects/PURA/Molitor-et-al-2022/thermofisher_table.rds")
multiple_protein_ids <- map(proteomics_tables[2:3], ~.x[grepl(pattern = ";", .x$`Majority protein IDs`),])
multiple_protein_ids <- map(multiple_protein_ids, ~ .x %>%
separate(`Majority protein IDs`,
sep = ";", into = c("ID1", "ID2", "ID3")) )
multiple_protein_ids <- map(multiple_protein_ids, ~ .x %>%
rowwise(.) %>%
mutate(across(c("ID1", "ID2", "ID3"),
~unlist(strsplit(.x, split = "\\|"))[2] )))
# remove proteins that were detected in group in maxquant also from thermofisher
proteomics_tables[[1]] <- proteomics_tables[[1]][-which(proteomics_tables[[1]]$Accession %in% c(multiple_protein_ids[[1]]$ID1, multiple_protein_ids[[1]]$ID2, multiple_protein_ids[[1]]$ID3)),]
proteomics_tables[[1]] <- proteomics_tables[[1]][-which(proteomics_tables[[1]]$Accession %in% c(multiple_protein_ids[[2]]$ID1, multiple_protein_ids[[2]]$ID2, multiple_protein_ids[[2]]$ID3) ),]
######################
# resolve results with multiple gene names
######################
multiple_gene_names <- proteomics_tables[[1]][grepl(pattern = ";", proteomics_tables[[1]]$gene_name),] %>%
separate(gene_name, sep = ";", into = c("name1", "name2", "name3"))
# proteomics_tables[[1]][which(c(multiple_gene_names$name1, multiple_gene_names$name2, multiple_gene_names$name3) %in% proteomics_tables[[2]]$gene_name),]
# search vor major proteins in alternative names
multiple_gene_names <- multiple_gene_names %>%
rowwise(.) %>%
mutate(synonymous_names = case_when(
(name1 == external_synonym |
name2 == external_synonym |
name3 == external_synonym) ~ T,
T~ F
))
# how many of the mutilple major gene ids belong to alternative names of the same gene
table(multiple_gene_names$synonymous_names)
# none? remove all?
```
## Overlap of found genes, that the proteins belong to
```{r}
### TODO MK
### You removed all NAs in the chunk above. So you can remove the na.omit()
### TODO MK
### In proteomics_tables[[1]]$gene_name some entries are combined genes (e.g GPR64; ADGRG2)
all = c(proteomics_tables[[1]]$gene_name, proteomics_tables[[2]]$gene_name, proteomics_tables[[3]]$gene_name) %>% unique()
overlap_mat <- cbind(
thermo_fisher = all %in% proteomics_tables[[1]]$gene_name,
maxquant_matching = all %in% proteomics_tables[[2]]$gene_name,
maxquant_no_matching = all %in% proteomics_tables[[3]]$gene_name)
fit <- eulerr::euler(overlap_mat)
plot(fit, quantities = TRUE)
head(all[!(all %in% proteomics_tables[[2]]$gene_name)])
```
# Normalised abundances
```{r clean_lfq}
# decoy matches and matches to contaminant should be removed --> already happened?
# any(proteomics_tables[[2]]$Reverse == "+", na.rm = T)
# any(proteomics_tables[[2]]$`Potential contaminant` == "+", na.rm = T)
#
# any(proteomics_tables[[3]]$Reverse == "+", na.rm = T)
# any(proteomics_tables[[3]]$`Potential contaminant` == "+", na.rm = T)
### TODO MK
### Could be done with dplyr
### LFQs <- list(thermo_fisher = proteomics_tables[[1]] %>% select(gene_name, starts_with("Abundances (Normalized):")),
### maxquant_matching = proteomics_tables[[2]] %>% select(gene_name, starts_with("Abundances (Normalized):")),
### maxquant_no_matchingproteomics_tables[[2]] %>% select(gene_name, starts_with("Abundances (Normalized):")))
# LFQ table
LFQs <- list(thermo_fisher = proteomics_tables[[1]][,c(grepl(pattern = "(Normalized)", colnames(proteomics_tables[[1]])) |
grepl(pattern = "gene_name", colnames(proteomics_tables[[1]])))] %>%
as.data.frame(),
maxquant_matching = proteomics_tables[[2]][,c(grepl(pattern = "LFQ", colnames(proteomics_tables[[2]]))|
grepl(pattern = "gene_name", colnames(proteomics_tables[[2]])))] %>%
as.data.frame(),
maxquant_no_matching = proteomics_tables[[3]][,c(grepl(pattern = "LFQ", colnames(proteomics_tables[[3]]))|
grepl(pattern = "gene_name", colnames(proteomics_tables[[3]])))] %>%
as.data.frame())
LFQs <- lapply(LFQs, function(x){
colnames(x) <- c("gene_name", "kd1", "kd2", "kd3", "kd4", "wt1", "wt2", "wt3", "wt4")
return(x)})
LFQs_t <- map(LFQs, ~reshape2::melt(.x))
<<<<<<< HEAD
LFQs_t2 <- s(LFQs_t[[1]], LFQs_t[[2]], by = c("gene_name", "variable"), suffix = c(".tf", ".mq_ma"))
=======
### TODO MK
### You could rename the columns of LFQs_t2 as value.tf, value.mq_ma and especially
### value are not very informative
LFQs_t2 <- full_join(LFQs_t[[1]], LFQs_t[[2]], by = c("gene_name", "variable"), suffix = c(".tf", ".mq_ma"))
>>>>>>> bb1a0b1937fbba76ccd712dfa5ba04bec0e5ad6f
LFQs_t2 <- full_join(LFQs_t2, LFQs_t[[3]], by = c("gene_name", "variable"), suffix = c("", ".mq_noma"))
LFQs_t3 <- reshape2::melt(LFQs_t2)
colnames(LFQs_t3) <- c("gene_name", "sample", "processing", "abundance")
### TODO MK
### Be careful with xlim() and ylim(). I always prefer coord_cartesian() as it
### will never eliminate data
### TODO MK
### you can remove ggrastr::rasterise(geom_point(size = 0.1), dpi = 400)+
### as you overplot it completely with the pointdensity
### TODO MK
### If find it kind of strange that Proteins appear multiple times in the
### scatter plot as they were measured in 8 experiments. Wouldn't a merge
### of replicates or facetwrap on ~variable would make more sense?
ggplot(LFQs_t2, aes(x=log10(value.tf), y= log10(value.mq_ma)))+
xlim(5,11)+
ylim(5,11)+
ggrastr::rasterise(geom_point(size = 0.1), dpi = 400)+
ggrastr::rasterise(ggpointdensity::geom_pointdensity(size = 0.1), dpi = 400)+
xlab("thermofisher abundances")+
ylab("maxquant - matching abundances")+
geom_point(data=LFQs_t2[(LFQs_t2$gene_name == "PURA" & grepl(LFQs_t2$variable, pattern = "kd")), ], aes(x=log10(value.tf), y= log10(value.mq_ma)), color = "orange", size = 0.5)+
geom_point(data=LFQs_t2[(LFQs_t2$gene_name == "PURA" & grepl(LFQs_t2$variable, pattern = "wt")), ], aes(x=log10(value.tf), y= log10(value.mq_ma)), color = "red", size = 0.5)+
theme_paper()+
theme(legend.key.size = unit(0.15, 'cm'))
#ggsave(paste0(outpath, "Scatter_tf_MQ.pdf"), width = 7, height = 6, units = "cm")
ggplot(LFQs_t2, aes(x=log10(value), y= log10(value.mq_ma)))+
ggrastr::rasterise(geom_point(size = 0.1), dpi = 400)+
ggrastr::rasterise(ggpointdensity::geom_pointdensity(size = 0.1), dpi = 400)+
xlab("maxquant - no matching abundances")+
ylab("maxquant - matching abundances")+
geom_point(data=LFQs_t2[(LFQs_t2$gene_name == "PURA" & grepl(LFQs_t2$variable, pattern = "wt")), ], aes(x=log10(value), y= log10(value.mq_ma)), color = "red", size = 0.5)+
geom_point(data=LFQs_t2[(LFQs_t2$gene_name == "PURA" & grepl(LFQs_t2$variable, pattern = "kd")), ], aes(x=log10(value), y= log10(value.mq_ma)), color = "orange", size = 0.5)+
theme_paper()+
theme(legend.key.size = unit(0.15, 'cm'))
#ggsave(paste0(outpath, "Scatter_MQ_match_no_match.pdf"), width = 7, height = 6, units = "cm")
```
## Excluding to high fallout rates
--> We exclude proteins with 2 or 3 missing values in either of the conditions
```{r zero_filt}
# turn 0 into NA
LFQs[[1]][LFQs[[1]]==0] = NA
LFQs[[2]][LFQs[[2]]==0] = NA
LFQs[[3]][LFQs[[3]]==0] = NA
# sum fall outs
LFQs <- lapply(LFQs, function(x){
x$na_kd <- apply(x,1, function(y) sum(is.na(y[2:5])))
x$na_wt <- apply(x,1, function(y) sum(is.na(y[6:9])))
return(x)
})
LFQ_filt <- LFQs %>% map(., ~dplyr::filter(.x, (na_wt %in% c(0,1)) & (na_kd %in% c(0,1))))
LFQ_4_zeros <- LFQs %>% map(., ~dplyr::filter(.x, (((na_wt == 4) & (na_kd %in% c(0,1)))| ((na_kd == 4)& (na_wt %in% c(0,1))))))
#saveRDS(LFQ_filt, paste0(outpath, "LFQ_tables.rds"))
```
Number of Proteins after zero filtering:
(max one missing value per condition)
```{r}
map(LFQ_filt, ~ nrow(.x))
```
Proteins with fall-out only in one condition:
(4 zeros in one condition and max 1 zero in other condition)
```{r}
map(LFQ_4_zeros, ~ .x$gene_name)
map(LFQ_4_zeros, ~ nrow(.x))
```
# Differential analysis
There are multiple options for differential analysis
- Thermo fisher in build analysis
- limma
- DEqMs
- Perseus
## Is data median centered?
Use boxplot to check if the samples have medians centered. if not, do median centering.
```{r fit, echo=FALSE}
### TODO MK
### You write that you use a boxplot to check if things are median centered
### but don't show a boxplot.
protein.matrix <- LFQ_filt %>% map(., ~.x[,2:9])
#map(protein.matrix, ~invisible(plotDensities(.x, legend = F)))
# log transformation to linerase changes
protein.matrix <- protein.matrix %>% map( ~log2(.x))
```
--> yes
```{r}
###################
# function DEqMS
###################
f_DEqMS <- function(protein_matrix, pep_count_table){
# model
##################
condition = as.factor(c(rep("kd",4), rep("wt",4)))
design = model.matrix(~0+condition)
contrasts <- c("conditionkd-conditionwt")
contrast <- makeContrasts(contrasts=contrasts, levels = design)
# fit model to matrix
########################
fit1 = lmFit(protein_matrix, design = design)
fit2 = contrasts.fit(fit1, contrasts = contrast)
fit3 <- eBayes(fit2)
# correct bias of variance estimate
####################################
fit3$count = pep_count_table[rownames(fit3$coefficients),"count"]
fit4 <- spectraCounteBayes(fit3)
VarianceBoxplot(fit4, main = "Fit of variance estimate",
xlab="peptide count + 1")
VarianceScatterplot(fit4, main = "Fit of variance estimate",
xlab="peptide count + 1")
# adding gene info to results
DEqMS.results = outputResult(fit4, coef_col = 1)
DEqMS.results$gene_name <- rownames(DEqMS.results)
return(DEqMS.results)
}
```
DEqMS is based on limma, but adjustes the p-values in the end for the variance given for the specific number of unique peptides found
### Fit variance to peptide count in DEqMS
```{r}
#####################
# peptide count tables
########################
# Maxquant: we use minimum peptide count among six samples
# count unique+razor peptides used for quantification
pep.count.table = list( thermo_fisher = data.frame(count = proteomics_tables[[1]]$`# Unique Peptides`,
row.names = proteomics_tables[[1]]$gene_name),
maxquant_mat = data.frame(count = rowMins(as.matrix(proteomics_tables[[2]][,20:27])),
row.names = proteomics_tables[[2]]$gene_name),
maxquant_nomat = data.frame(count = rowMins(as.matrix(proteomics_tables[[3]][,20:27])),
row.names = proteomics_tables[[3]]$gene_name))
# Minimum peptide count of some proteins can be 0
# add pseudocount 1 to all proteins
pep.count.table = map(pep.count.table, ~mutate(.x, count = count+1))
# run
rownames(protein.matrix[[1]]) = LFQ_filt[[1]]$gene_name
rownames(protein.matrix[[2]]) = LFQ_filt[[2]]$gene_name
rownames(protein.matrix[[3]]) = LFQ_filt[[3]]$gene_name
deqms <- map2(protein.matrix, pep.count.table, ~ f_DEqMS(protein_matrix = .x, pep_count_table = .y))
```
## Vulcanos
```{r}
titles <- as.list(names(deqms))
map2(deqms, titles, ~ggplot(.x, aes(x = logFC, y = (-log10(sca.adj.pval))))+
geom_point(color ="grey", shape =1)+
geom_point(data = .x[(.x$sca.adj.pval< 0.05) & (abs(.x$logFC) > log(0.5)), ],
aes(x= logFC, y = (-log10(sca.adj.pval))))+
geom_point(data = .x[.x$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB", "PURA", "PURB"),],
aes(x= logFC, y = -log10(sca.adj.pval)), color = "orange")+
ggrepel::geom_label_repel( data = .x[.x$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB","PURA", "PURB"),],
aes( x = logFC, y = (-log10(sca.adj.pval)), label = gene_name), max.overlaps = 10)+
xlab("Protein fold change (kd/wt) [log2]")+
ylab("adj. p-value (protein fold change)")+
theme_paper()+
theme(legend.position = "none", aspect.ratio = 1/1)+
ggtitle(paste(.y, "\n DEqMS differential analysis")))
```
```{r eval=FALSE, include=FALSE}
### TODO MK
### You could remove the chunk if it is not included
p <- map2(deqms, titles, ~ggplot(.x, aes(x = logFC, y = (-log10(sca.adj.pval))))+
geom_vline(xintercept = 0) +
ggrastr::rasterise( geom_point(color ="grey", shape =1, size = 0.5), dpi = 300)+
geom_point(data = .x[(.x$sca.adj.pval< 0.05) & (abs(.x$logFC) > log(0.5)), ],
aes(x= logFC, y = (-log10(sca.adj.pval))), size = 0.5)+
geom_point(data = .x[.x$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB", "PURA", "PURB"),],
aes(x= logFC, y = -log10(sca.adj.pval)), color = "blue", size = 0.5)+
ggrepel::geom_label_repel( data = .x[.x$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB","PURA", "PURB"),],
aes( x = logFC, y = (-log10(sca.adj.pval)),
label = gene_name), size = 2, label.size = 0.2,label.padding = 0.2, max.overlaps = 10)+
xlab("Protein fold change (kd/wt) [log2]")+
ylab("adj. p-value (protein fold change)")+
xlim(c(-4,2))+
theme_paper()+
theme(legend.position = "none", aspect.ratio = 1/1)
)
#ggsave(plot = p[[1]], paste0(outpath, "vulcano_tf.pdf"), width = 6, height = 6, unit = "cm")
#ggsave(plot = p[[2]], paste0(outpath, "vulcano_mq.pdf"), width = 6, height = 6, unit = "cm")
#ggsave(plot = p[[3]], paste0(outpath, "vulcano_mq_no_match.pdf"), width = 6, height = 6, unit = "cm")
```
## default differential analysis by Thermofisher result
--> striped
```{r}
### TODO MK
### You could provide some more information as a comment. If I understand it
### correctly the Thermofisher output already contained diff. analysis results,
### which is not useful/reliable. Did I get it right?
ggplot(proteomics_tables[[1]], aes(x = log2(`Abundance Ratio: (PurA-KD) / (WT)`), y = `Exp. q-value: Combined`))+
geom_point(color ="grey", shape =1)+
geom_point(data = proteomics_tables[[1]][(proteomics_tables[[1]]$`Exp. q-value: Combined`< 0.05) & (abs(log2(proteomics_tables[[1]]$`Abundance Ratio: (PurA-KD) / (WT)`)) > log(0.5)), ],
aes(x=log2(`Abundance Ratio: (PurA-KD) / (WT)`), y = (-log10(`Exp. q-value: Combined`))))+
geom_point(data = proteomics_tables[[1]][proteomics_tables[[1]]$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB", "PURA", "PURB"),],
aes(x=log2(`Abundance Ratio: (PurA-KD) / (WT)`), y = -log10(`Exp. q-value: Combined`)), color = "orange")+
ggrepel::geom_label_repel( data = proteomics_tables[[1]][proteomics_tables[[1]]$gene_name %in% c("CTNNA1", "SQSTM1", "LSM14A", "RBFOX2", "DDX6", "GAPDH", "TUBB","PURA", "PURB"),],
aes( x =log2(`Abundance Ratio: (PurA-KD) / (WT)`), y = (-log10(`Exp. q-value: Combined`)), label = gene_name), max.overlaps = 10)+
xlab("Protein fold change (kd/wt) [log2]")+
ylab("adj. p-value (protein fold change)")+
theme_paper()+
theme(legend.position = "none", aspect.ratio = 1/1)+
ggtitle(paste(" Detection & quantification from pipeline"))
```
# Why are the results for PURA and PURB differnet for the different preprocessings?
## Countplots PURA, PURB
```{r}
### TODO MK
### You could think about combining PURA and PURB information in the same
### data.frame and use facet_wrap() in your ggplot call.
###############
# countplot PURA
#################
LFQ_filt_PURA <- map(LFQ_filt, ~.x[.x$gene_name=="PURA",])
LFQ_filt_PURA <- map(LFQ_filt_PURA, ~reshape2::melt(.x))
LFQ_filt_PURA <- map(LFQ_filt_PURA, ~.x[1:8,]) %>% map(., ~mutate(.x, condition = substr(variable,1,2)))
LFQ_filt_PURA_no_ma <- LFQ_4_zeros[[3]][LFQ_4_zeros[[3]]$gene_name=="PURA",] %>% reshape2::melt()
LFQ_filt_PURA_no_ma <- LFQ_filt_PURA_no_ma[1:8,] %>% mutate(., condition = substr(variable,1,2))
LFQ_filt_PURA[[1]]$processing <- "thermo fisher"
LFQ_filt_PURA[[2]]$processing <- "maxquant + matching"
LFQ_filt_PURA_no_ma$processing <- "maxquant, no matching"
gg_df <- rbind(LFQ_filt_PURA[[1]], LFQ_filt_PURA[[2]], LFQ_filt_PURA_no_ma)
### TODO MK
### Why do you remove the color legend? One does not know which color corresponds
### to thermo fisher and maxquant. I know you show the legend in the PURB count
### plot but still would put it here aswell
ggplot(gg_df, aes(x = variable, y = log(value), color = processing))+
geom_point()+
ggtitle("Intensities of PURA from thermo fisher or maxquant processing")+
theme_paper()+
theme(panel.grid.major = element_line(color = "lightgrey", size = 0.25),
legend.position = "None")
### TODO MK
### This plot is a duplicate of the plot before
ggplot(gg_df, aes(x = variable, y = log(value), color = processing))+
geom_point()+
theme_paper()+
theme(panel.grid.major = element_line(color = "lightgrey", size = 0.25),
legend.position = "None")
#ggsave(filename = paste0(outpath, "protein_counts.pdf"), width = 6, height = 6, units = "cm")
###############
# countplot PURB
#################
LFQ_filt_PURB <- map(LFQ_filt, ~.x[.x$gene_name=="PURB",])
LFQ_filt_PURB <- map(LFQ_filt_PURB, ~reshape2::melt(.x))
LFQ_filt_PURB <- map(LFQ_filt_PURB, ~.x[1:8,]) %>% map(., ~mutate(.x, condition = substr(variable,1,2)))
# LFQ_filt_PURB_no_ma <- LFQ_4_zeros[[3]][LFQ_4_zeros[[3]]$gene_name=="PURB",] %>% reshape2::melt()
# LFQ_filt_PURB_no_ma <- LFQ_filt_PURB_no_ma[1:8,] %>% mutate(., condition = substr(variable,1,2))
LFQ_filt_PURB[[1]]$processing <- "thermo fisher"
LFQ_filt_PURB[[2]]$processing <- "maxquant + matching"
LFQ_filt_PURB[[3]]$processing <- "maxquant, no matching"
gg_df <- rbind(LFQ_filt_PURB[[1]], LFQ_filt_PURB[[2]], LFQ_filt_PURB[[3]])
### TODO MK
### Why don't you use your theme_paper()?
ggplot(gg_df, aes(x = variable, y = log(value), color = processing))+
geom_point()+
ggtitle("Intensities of PURB from thermo fisher or maxquant processing")
#LFQs[[3]][LFQs[[3]]$gene_name=="PURB",]
```
--> different quantification of proteins
# Look at PURA and PURB found peptides
```{r}
peptides_tf <- readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/peptide_lists/MOL13920/MOL13920_noRepl3_SwissHum_perc_precDet_int_ratio_TTEST_Peptide-high_peptides.xlsx")
peptides_mq <- readxl::read_excel("/Users/melinaklostermann/Documents/projects/PURA/02_R_new_pip/04-BioID-SILAC/03-Proteomics/peptide_lists/MOL13920/MOL13920_wo-repl3_MQ1670_withMBR_peptides_LFQ1.xlsx")
PURA_protein_id <- "Q00577"
PURB_protein_id <- "Q96QR8"
# peptides maxquant
peptides_mq <- peptides_mq %>%
rowwise() %>%
mutate(., protein_id = unlist(strsplit(`Leading razor protein`, split = "|", fixed = T ))[2],
protein_name = unlist(strsplit(`Leading razor protein`, split = "|", fixed = T))[3],
protein_name = unlist(strsplit(protein_name, split = "_", fixed = T))[1]) %>% as.data.frame(.)
peptides_PURA_mq <- peptides_mq %>% filter(protein_id == PURA_protein_id )
peptides_PURB_mq <- peptides_mq %>% filter(protein_id == PURB_protein_id )
# peptides thermo fisher
peptides_PURA_tf <- peptides_tf %>% filter(`Master Protein Accessions` == PURA_protein_id )
peptides_PURB_tf <- peptides_tf %>% filter(`Master Protein Accessions` == PURB_protein_id )
peptides_PURA_mq[,c("Sequence", "PEP")]
peptides_PURA_tf[,c("Annotated Sequence", "Abundance Ratio: (PurA-KD) / (WT)", "Qvality PEP", "Qvality q-value" , "Percolator q-Value (by Search Engine): Sequest HT")]
peptides_PURB_mq[,c("Sequence", "PEP")]
peptides_PURB_tf[,c("Annotated Sequence", "Abundance Ratio: (PurA-KD) / (WT)", "Qvality PEP", "Qvality q-value" , "Percolator q-Value (by Search Engine): Sequest HT")]
```
## Countplots of peptides
```{r}
### TODO MK
### This would be the dplyr way
### gg_pura_mq <- peptides_PURA_mq %>% select(starts_with("LFQ"), starts_with("Sequence")) %>% pivot_longer(1:8)
# PURA
gg_pura_mq <- peptides_PURA_mq[,(grepl(pattern="LFQ", colnames(peptides_PURA_mq))|
grepl(pattern="Sequence", colnames(peptides_PURA_mq)))] %>% reshape2::melt(.)
gg_pura_tf <- peptides_PURA_tf[,(grepl(pattern="Normalized", colnames(peptides_PURA_tf))|
grepl(pattern="Annotated Sequence", colnames(peptides_PURA_tf)))] %>% reshape2::melt(.)
gg_pura_tf <- gg_pura_tf %>% rowwise(.) %>% mutate(`Annotated Sequence` = unlist(strsplit(`Annotated Sequence`, split = ".", fixed =T))[2]) %>%
dplyr::rename(Sequence = `Annotated Sequence`)
gg_pura_mq$condition = c(rep("kd1", 2), rep("kd2", 2), rep("kd3", 2), rep("kd4", 2),
rep("wt1", 2), rep("wt2", 2), rep("wt3", 2), rep("wt4", 2))
gg_pura_tf$condition = c(rep("kd1", 3), rep("kd2", 3), rep("kd3", 3), rep("kd4", 3),
rep("wt1", 3), rep("wt2", 3), rep("wt3", 3), rep("wt4", 3))
gg_pura_mq$processing = "maxquant"
gg_pura_tf$processing = "thermo fisher"
gg_pura <- rbind(gg_pura_mq, gg_pura_tf)
### TODO MK
### For FFFDVGSNK the KD peptide counts are 0. In the plot half of the points
### is visible at the bottom of the plot, which might be misleading
ggplot(gg_pura, aes(x = condition, y = log(value), color = processing ))+
geom_point()+
facet_wrap(~Sequence)+
theme_paper()
# PURB
gg_purb_mq <- peptides_PURB_mq[,(grepl(pattern="LFQ", colnames(peptides_PURB_mq))|
grepl(pattern="Sequence", colnames(peptides_PURB_mq)))] %>% reshape2::melt(.)
gg_purb_tf <- peptides_PURB_tf[,(grepl(pattern="Normalized", colnames(peptides_PURB_tf))|
grepl(pattern="Annotated Sequence", colnames(peptides_PURB_tf)))] %>% reshape2::melt(.)
gg_purb_tf <- gg_purb_tf %>% rowwise(.) %>% mutate(`Annotated Sequence` = unlist(strsplit(`Annotated Sequence`, split = ".", fixed =T))[2]) %>%
dplyr::rename(Sequence = `Annotated Sequence`)
gg_purb_mq$condition = c(rep("kd1", 2), rep("kd2", 2), rep("kd3", 2), rep("kd4", 2),
rep("wt1", 2), rep("wt2", 2), rep("wt3", 2), rep("wt4", 2))
gg_purb_tf$condition = c(rep("kd1", 2), rep("kd2", 2), rep("kd3", 2), rep("kd4", 2),
rep("wt1", 2), rep("wt2", 2), rep("wt3", 2), rep("wt4", 2))
gg_purb_mq$processing = "maxquant"
gg_purb_tf$processing = "thermo fisher"
gg_purb <- rbind(gg_purb_mq, gg_purb_tf)
### TODO MK
### Why don't you use the theme_paper()
ggplot(gg_purb, aes(x = condition, y = log(value), color = processing ))+
geom_point()+
facet_wrap(~Sequence)
```
--> third peptide only detected in thermofisher preprocessing
## countplots unique peptide vs matching
```{r echo=FALSE}
gg_pura_mq_detect_mod <- peptides_PURA_mq[,(grepl(pattern="Identification type", colnames(peptides_PURA_mq))|
grepl(pattern="Sequence", colnames(peptides_PURA_mq)))] %>% reshape2::melt(., id.vars = "Sequence")
gg_pura_tf_detect_mod <- peptides_PURA_tf[,(grepl(pattern="Found in Sample", colnames(peptides_PURA_tf))|
grepl(pattern="Annotated Sequence", colnames(peptides_PURA_tf)))] %>% reshape2::melt(., id.vars = "Annotated Sequence")
gg_pura$identification <- c(gg_pura_mq_detect_mod$value, gg_pura_tf_detect_mod$value )
gg_pura <- gg_pura %>% mutate(identification = case_when(
identification == "By MS/MS" ~ "By unique peptide",
identification == "High" ~ "By unique peptide",
identification == "Peak Found" ~ "By matching",
is.na(identification) ~ "Not found",
identification == "Not Found" ~ "Not found",
T ~ identification
))
### TODO MK
### As you remove the color legend with legend.position = "None" no one will
### understand the conclusion you draw. I suggest to add the legend so everyone
### sees that for the third peptide only one point is labeled as "By unique peptide"
ggplot(gg_pura, aes(x = condition, y = log(value), color = identification ))+
geom_point()+
facet_grid(processing~Sequence) +
theme_paper()+
theme(panel.border=element_rect(color="black", fill = NA),
panel.grid.major = element_line(color = "lightgrey", size = 0.25),
legend.position = "None")+
scale_color_manual(values = wesanderson::wes_palette("GrandBudapest1")[c(2,3,1)])
# ggsave(filename = paste0(outpath, "peptide_counts.pdf"), width = 12, height = 8, units = "cm")
```
--> thrid peptide is only found once in thermofisher and then matched into all other samples, this did not happen for maxquant