[r] add lsi, var feature selection #156
base: main
Conversation
Looks like a good start! A couple high-level thoughts/comments while this is still in the draft stage:
…`save_lsi`, remove `save_in_memory` in `lsi()`
I'm leaving the memory saving stuff to just be saving to a temporary file for now. Let's discuss further in person, then see the optimal way of implementing!
Good work on the comparisons with ArchR + Scanpy -- it made it much simpler for me to get in and try some of the code myself to better understand it.
Most comments are in-line. There are two analyses we might want to do (off github) to check methods:
- LSI: Try the analysis from the tutorial section "What if we skip z-score normalization on the LSI matrix?", comparing z-score along genes, along cells (current implementation), and no z-score normalization. (EDIT: also worth trying mean-centering along the gene axis but not a full z-score normalization)
- variable genes: Try making some scatter plots of gene mean vs. normalized dispersion (or vs. variance, raw dispersion etc.). Visualize bin boundaries also. Goal is to check for uneven genes per bin and any obvious artifacts that might be caused from most genes falling into one bin (confirm if this happens on e.g. 10x pbmc3k dataset from Seurat/Scanpy's example datasets)
Then for organization of real_data tests:
- Should add in a folder when a feature takes multiple files to test
- Can make relative paths work a bit more robustly by detecting the path of the currently running script, see snippet below
- I might prefer avoiding the `config.csv` system, and relying on either `library(ArchR)`, or doing `devtools::load_all()` of BPCells based on finding the parent directory of the currently running script.
Example detection of the current script's directory (note: `str_detect()`/`str_replace()` require stringr):
```r
library(stringr)

script_dir <- commandArgs(trailingOnly = FALSE)
script_dir <- script_dir[str_detect(script_dir, "--file=")] |>
  str_replace("--file=", "") |>
  dirname() |>
  normalizePath()
# Find a parent directory of interest (e.g. could use to get to the BPCells
# R package folder for use with devtools::load_all)
script_root <- script_dir
while (!("config_vars.sh" %in% list.files(script_root)) && dirname(script_root) != script_root) {
  script_root <- dirname(script_root)
}
```
Note this might be possible to make work better for interactive use by setting `script_dir <- getwd()` when there is no `--file` argument in `commandArgs`
EDIT: Immediate to-dos based on discussion
- Analyze the feature selection + LSI to make sure the methods look reasonable on example datasets (as discussed above)
- Make Iterative LSI prototype implementation (no need to go too flexible on arguments, just start with a quick simple version with reasonable defaults and we can discuss what args to add for flexibility later)
- Probably implement `iterative_lsi(mat)` which returns a custom LSI object, and `project_lsi(lsi_object, new_mat)`
- Leave the `Transformer` and `Pipeline` steps alone for now, then we can decide how to utilize/adjust them based on what would work well for refactoring Iterative LSI
```r
npeaks <- colSums(mat) # Finding that sums are non-multithreaded and there's no interface to pass it in, but there is implementation in `ConcatenateMatrix.h`
tf <- mat %>% multiply_cols(1 / npeaks)
idf_ <- ncol(mat) / rowSums(mat)
```
I think the `rowSums(mat)` and `colSums(mat)` should be possible to calculate multithreaded in a single pass of `matrix_stats()`. Then `ncol(mat)/rowSums(mat)` is just `1/row_means`, and `npeaks` is `col_means * nrow(mat)`
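A hedged sketch of that single-pass version (assumes `matrix_stats()` returns stat matrices indexed by a `"mean"` row, per the BPCells docs; untested):

```r
# One multithreaded pass over the matrix for both axis means
stats  <- matrix_stats(mat, row_stats = "mean", col_stats = "mean", threads = threads)
idf_   <- 1 / stats$row_stats["mean", ]           # equals ncol(mat) / rowSums(mat)
npeaks <- stats$col_stats["mean", ] * nrow(mat)   # equals colSums(mat)
```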
```r
if (z_score_norm) {
  cell_peak_stats <- matrix_stats(mat_log_tfidf, col_stats = "variance", threads = threads)$col_stats
```
I'm really second-guessing the z-score normalization along the cell axis. I know I did it that way in the BPCells tutorial but I'm thinking now that was probably a mistake. The wikipedia page suggests that the columns should have zero empirical mean, but it has columns as features, not as observations. In our case I think we want to z-score normalize by row.
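A tiny dense illustration of the two axes (illustrative only; the PR itself operates on BPCells matrices, not dense ones):

```r
m <- matrix(rnorm(12), nrow = 3)                 # 3 features x 4 cells
row_z <- (m - rowMeans(m)) / apply(m, 1, sd)     # z-score along rows (features)
col_z <- scale(m)                                # z-score along columns (cells)
stopifnot(all(abs(rowMeans(row_z)) < 1e-8))      # each feature now has mean 0
```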
```r
mat_tfidf <- tf %>% multiply_rows(idf_)
mat_log_tfidf <- log1p(scale_factor * mat_tfidf)
# Save to prevent re-calculation of queued operations
mat_log_tfidf <- write_matrix_dir(mat_log_tfidf, tempfile("mat_log_tfidf"), compress = FALSE)
```
The temp matrix saving is fine for now. We could optimize space a little bit by saving as floats rather than doubles (12 bytes -> 8 bytes per non-zero), and actually keeping compression enabled might provide some benefit despite what the warning says (8 bytes -> ~5 bytes per non-zero)
```r
# Give a small value to features with 0 mean, helps with 0 division
feature_means[feature_means == 0] <- 1e-12
# Calculate dispersion, and log normalize
feature_dispersion <- feature_vars / feature_means
feature_dispersion[feature_dispersion == 0] <- NA
```
Are the `1e-12` and presence of `NA` really necessary here? What if we just say `feature_dispersion[feature_means == 0] <- 0`?
```r
dplyr::mutate(
  bin_mean = mean(dispersion, na.rm = TRUE),
  bin_sd = sd(dispersion, na.rm = TRUE),
  bin_sd_is_na = is.na(bin_sd),
  bin_sd = ifelse(bin_sd_is_na, bin_mean, bin_sd), # Set feats that are in bins with only one feat to have a norm dispersion of 1
  bin_mean = ifelse(bin_sd_is_na, 0, bin_mean),
  feature_dispersion_norm = (dispersion - bin_mean) / bin_sd
) %>%
```
This feels like it's also inheriting weird logic from being a too-literal translation of scanpy code. Would this work just as well? (assuming we don't set any dispersion to `NA` above)
Suggested change:
```r
dplyr::mutate(
  feature_dispersion_norm = (dispersion - mean(dispersion)) / sd(dispersion),
  feature_dispersion_norm = dplyr::if_else(n() == 1, 1, feature_dispersion_norm) # Set feats that are in bins with only one feat to have a norm dispersion of 1
) %>%
```
```r
# Bin by mean, and normalize dispersion with each bin
features_df <- features_df %>%
  dplyr::mutate(bin = cut(mean, n_bins, labels=FALSE)) %>%
```
This feature of `cut` gives me pause, though I think it's probably okay: `cut(c(1:100,1000), 5, labels=FALSE)` puts the numbers 1-100 in the lowest bin, then leaves the middle 3 bins empty and puts 1000 in the largest bin.
The `cell_ranger` flavor in Scanpy bins by decile (i.e., it uses deciles for the bin boundaries), which is the main alternative we could do.
From a quick test on SeuratData's `pbmc3k` dataset, it appears that 92% of genes get put into one bin, though there's no huge enrichment of which bins genes get picked from vs. others. Either worth a bit of follow-up analysis to see if there is bias for high/low expression within the huge first bin, or at least figuring out a function naming so it's clear this is one variable-genes option among multiple possibilities
```r
dplyr::arrange(desc(feature_dispersion_norm)) %>%
  dplyr::slice(1:min(num_feats, nrow(.)))
```
`dplyr::slice_max()` is another way to accomplish the same thing fyi
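As a hedged sketch (assuming `features_df` and `num_feats` as in the PR), the `arrange()` + `slice()` pair could collapse to:

```r
features_df <- features_df %>%
  dplyr::slice_max(feature_dispersion_norm, n = num_feats, with_ties = FALSE)
```

`slice_max()` returns rows sorted descending by the ranking variable, and `n` larger than the row count simply returns all rows, so the `min(num_feats, nrow(.))` guard is unnecessary.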
```r
dplyr::slice(1:min(num_feats, nrow(.)))
if (save_feat_selection) {
  return(list(
    mat = mat[features_df$name,],
```
Is it intentional that this is also ordering the genes by normalized dispersion as opposed to simple filtering of rows without changing the order?
```r
lsi <- function(
    mat,
    z_score_norm = TRUE, n_dimensions = 50L, scale_factor = 1e4,
    save_lsi = FALSE,
    threads = 1L
```
I'd put `n_dimensions` ahead of `z_score_norm` since I think it's a more likely argument to need to be changed. Otherwise, the argument order seems good
```r
names(config) <- as.list(config_csv)$key

# Set up temp dir in case it's not already set
create_temp_dir <- function(dir = NULL) {
```
This is maybe a small enough function that it makes sense to just re-implement per file as needed. Keeping the `real_data` tests as standalone as possible I think will make them easier to re-run later when needed
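A per-file re-implementation could be as small as this (hedged sketch; the `dir = NULL` semantics are assumed from the signature quoted above):

```r
# Create a temp dir for test outputs; a NULL dir falls back to a fresh tempfile path
create_temp_dir <- function(dir = NULL) {
  if (is.null(dir)) dir <- tempfile("bpcells_test")
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dir
}
```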
Description
As discussed previously, we would like to have a built-in function for creating latent representations of input matrices. The most common way to do this is using LSI. We implemented the functions `lsi()` and `highly_variable_features()` for this purpose.
Tests
`lsi()` and `highly_variable_features()` each have a rudimentary test that checks the output shapes, to determine whether they look reasonable. For `lsi()`, the `.computeLSI()` function was taken as the reference, which does not include the iterative steps and extra bells and whistles. As low-rank SVD approximation is non-deterministic, we re-project from the latent space back to the log(tf-idf) matrix, then compare against both packages. For `highly_variable_features()`, we compare against the Scanpy function `_highly_variable_genes.highly_variable_genes()`. Top features, (non-)normalized dispersions, and means were all compared against each other.
Details
We are not looking for a one-to-one implementation of what has been done in ArchR, which is an iterative LSI approach with clustering, variable feature selection, etc. Instead, we implement feature selection as a separate step, as well as an LSI procedure using log(tf-idf) -> z-score norm -> SVD.
As for projection into LSI space, that will be built in a follow-up PR for the sklearn interface
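For orientation, the log(tf-idf) -> z-score norm -> SVD procedure described above can be sketched on a dense toy matrix (illustrative only; the actual `lsi()` uses BPCells streaming operations, and all names here are local to the sketch):

```r
set.seed(1)
mat <- matrix(rpois(1500, 1), nrow = 50)       # 50 features x 30 cells
tf  <- sweep(mat, 2, colSums(mat), "/")        # term frequency per cell
idf <- ncol(mat) / rowSums(mat)                # inverse document frequency per feature
x   <- log1p(1e4 * tf * idf)                   # log(tf-idf), scale_factor = 1e4
x   <- t(scale(t(x)))                          # z-score normalize along features
sv  <- svd(x, nu = 5, nv = 5)                  # truncated SVD
embedding <- sv$v %*% diag(sv$d[1:5])          # cells x latent dimensions
```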