---
title: "Steps to impute multiple columns in a data frame using the recursive partitioning and regression trees (rpart) method implemented in the {dlookr} R package"
author: "Alison Felipe Alencar Chaves"
execute:
  echo: true
  warning: false
  message: false
format: html
code-fold: true
editor_options:
  chunk_output_type: console
---
## What does our function do?

1. It creates a list to store the imputed columns;
2. It loops over the columns to be imputed, applying the `dlookr::imputate_na()` function from the **{dlookr}** package to each column;
3. It returns the list with the imputed columns.

Load the necessary libraries:

```{r}
library(tidyverse)
library(dlookr)
library(ggridges) # provides geom_density_ridges(), used in the final plot
```
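The three steps listed above can be sketched in base R with a stand-in imputer. Note the assumptions: `impute_median()` and the `toy` data frame are hypothetical and only illustrate the "loop over columns and collect the results in a named list" pattern; they are not the rpart method from **{dlookr}**.

```r
# Stand-in imputer (assumption, not dlookr): replace NA with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical toy data: one id column and two numeric columns with gaps
toy <- data.frame(
  id = c("a", "b", "c", "d"),
  s1 = c(1, NA, 3, 5),
  s2 = c(NA, 2, 4, 6)
)

# 1. create a list to store the imputed columns
imputed <- list()

# 2. loop over the columns to be imputed, storing each result under its name
for (col in c("s1", "s2")) {
  imputed[[col]] <- impute_median(toy[[col]])
}

# 3. the list now holds one imputed vector per column
str(imputed)
```

Indexing the list by column name keeps each imputed vector traceable to its source column, which is what later allows the list to be turned into a data frame.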
## Arguments required by our function

1. `data` = a data frame containing the columns to be imputed and the id column;
2. `columns` = the names of the columns to be imputed. Here we use every column from the second to the last one;
3. `id_col` = the name of the column to be used as id (e.g., gene, protein, phosphosite, etc.);
4. `method` = the method to be used for imputation. Here we use the **rpart** method, but you can use any other method available in `dlookr::imputate_na()`.

```{r}
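imputate_multiple_columns <- function(data, columns,
                                      id_col, method = "rpart") {
  # list that will hold one imputed vector per column
  imputation_list <- list()
  for (i in columns) {
    # impute column `i`; `id_col` is passed as the third positional
    # argument of dlookr::imputate_na()
    imputation_list[[i]] <- dlookr::imputate_na(data,
                                                i, id_col,
                                                method = method)
  }
  return(imputation_list)
}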
```
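The chunks that follow expect a data frame named `imputation_df`, which this document does not create. As a minimal sketch of the expected shape (id column first, numeric abundance columns after it), a hypothetical stand-in can be simulated in base R. The name `toy_df`, the sample names, and the simulated values are all assumptions made for illustration, not real proteomics data.

```r
set.seed(123) # make the simulated values reproducible

n_sites <- 50

# Hypothetical input: first column is the id, the rest are log2 abundances
toy_df <- data.frame(
  phosphoSite = paste0("site_", seq_len(n_sites)),
  sample_A = rnorm(n_sites, mean = 20, sd = 2),
  sample_B = rnorm(n_sites, mean = 20, sd = 2),
  sample_C = rnorm(n_sites, mean = 20, sd = 2)
)

# Knock out five randomly chosen values per sample to mimic missing intensities
for (col in c("sample_A", "sample_B", "sample_C")) {
  toy_df[sample(n_sites, 5), col] <- NA
}

# Number of missing values per column
colSums(is.na(toy_df))
```

To try the workflow without real data, `toy_df` could be assigned to `imputation_df` before running the next chunk.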
## Using the Recursive Partitioning and Regression Trees (rpart) method to impute data
```{r}
# apply the function to the abundance data
imputation_list <- imputate_multiple_columns(
  data = imputation_df,
  columns = colnames(imputation_df)[2:ncol(imputation_df)],
  id_col = "phosphoSite",
  method = "rpart"
) %>%
  as.data.frame() %>%
  # prefix every imputed column so it can be told apart from the original
  dplyr::rename_with(~ str_c("imputed_", .x)) %>%
  # add the id column back so the result can be joined to the original data
  dplyr::mutate(phosphoSite = imputation_df$phosphoSite)
```
Combine the imputed columns with the original data frame and reshape the result to long format:
```{r}
combined_data <- imputation_list %>%
  left_join(imputation_df, by = "phosphoSite") %>%
  # reshape to long format: one row per phosphosite and column
  gather(key = "sample", value = "abundance",
         -phosphoSite, na.rm = FALSE) %>%
  dplyr::mutate(
    # columns prefixed with "imputed_" come from the imputation step
    imputation_status = case_when(
      str_detect(sample, "imputed_") ~ "imputed",
      TRUE ~ "not imputed"),
    imputation_status = as.factor(imputation_status),
    # drop the prefix so imputed and original values share a sample name
    sample = str_remove(sample, "imputed_"),
    sample = as.factor(sample)
  )
```
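A quick sanity check before plotting is to count the missing values left in each group: the "imputed" rows should contain none. Below is a minimal base R sketch on a hypothetical table shaped like `combined_data`. The name `toy_combined` and its values are made up for illustration.

```r
# Hypothetical long-format table with the same columns as `combined_data`
toy_combined <- data.frame(
  phosphoSite = rep(c("site_1", "site_2", "site_3"), times = 2),
  sample = factor("sample_A"),
  abundance = c(20.1, 19.8, 19.4, 20.1, NA, 19.4),
  imputation_status = factor(rep(c("imputed", "not imputed"), each = 3))
)

# Count NA values per sample (rows) and imputation status (columns)
missing_counts <- with(
  toy_combined,
  tapply(is.na(abundance), list(sample, imputation_status), sum)
)
missing_counts
```

With the real data, the same call can be run on `combined_data` in place of `toy_combined`.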
Plot the distributions of the imputed and non-imputed data for each sample to check the imputation quality:
```{r}
combined_data %>%
  ggplot(aes(x = abundance,
             y = sample,
             color = imputation_status,
             fill = imputation_status)) +
  geom_density_ridges(
    alpha = 0.5
  ) +
  scale_fill_manual(values = c("#D55E0050", "#0072B250"),
                    labels = c("imputed", "not imputed")) +
  scale_color_manual(values = c("#D55E00", "#0072B2"), guide = "none") +
  coord_cartesian(clip = "off") +
  guides(fill = guide_legend(
    override.aes = list(
      fill = c("#D55E00A0", "#0072B2A0"),
      color = NA)
    )
  ) +
  labs(fill = NULL, color = NULL,
       x = "log2(Intensity)", y = "Sample")
```