[r] add lsi, var feature selection #156

Open · wants to merge 12 commits into main

Conversation

@immanuelazn (Collaborator) commented Nov 4, 2024

Description

As discussed previously, we would like a built-in function for creating latent representations of input matrices. The most common way to do this is LSI. We added the functions lsi() and highly_variable_features() for this.

Tests

  • Both lsi() and highly_variable_features() have a rudimentary test that checks the output shapes to determine whether they look reasonable.
  • For lsi(), we compare against ArchR's .computeLSI() function, which does not include the iterative steps or extra bells and whistles. Because the low-rank SVD approximation is non-deterministic, we re-project from the latent space back to the log(tf-idf) matrix and then compare the reconstructions between the two packages.
  • For highly_variable_features(), we compare against the Scanpy function _highly_variable_genes.highly_variable_genes(). Top features, normalized and non-normalized dispersions, and means were all compared.

Details

We are not looking for a one-to-one implementation of what has been done in ArchR, which is an iterative LSI approach with clustering, variable feature selection, etc. Instead, we implement feature selection as a separate step, along with an LSI procedure that runs log(tf-idf) -> z-score normalization -> SVD.

Projection into LSI space will be built in a follow-up PR for the sklearn interface.

@bnprks (Owner) commented Nov 5, 2024

Looks like a good start! A couple high-level thoughts/comments while this is still in the draft stage:

  • One of the main goals of the matrix_stats function is to allow collection of multiple row and column stats in a single pass over the matrix. There are a couple spots where calls to rowSums() and colSums() or even separate calls to matrix_stats could be combined into a single call to reduce read passes over the matrix.

  • For highly variable features, what you've got is perfectly good, though I personally find the dplyr interfaces the easiest way to express this kind of logic. An underutilized feature is being able to do a group_by then mutate (rather than group_by + summarize). It would look something like this:

    features_df %>%
      mutate(bin=cut(mean, n_bins, labels=FALSE)) %>%
      group_by(bin) %>%
      mutate(normalized_dispersion=(dispersion-mean(dispersion))/sd(dispersion))
  • It's a good thought to save intermediate data to avoid re-calculating normalizations. I think saving to an in-memory matrix is not the right approach for BPCells because it re-introduces a memory bottleneck unnecessarily, but saving to disk is a reasonable option to offer (rough sketch after this list). Options that I used when benchmarking for the paper are:

    • stage_none -- don't save any intermediates. This is totally fine for almost everything downstream except for PCA, which will be slower than needed
    • stage_int -- Stage the counts matrix subset to just the features of interest for PCA. When utilizing a lot of parallelization, it's better to do this than to save normalized data, since disk I/O becomes the real bottleneck rather than CPU
    • stage_float -- Save the latest normalized copy of the data just prior to any dense transformations like z-scoring. This is best for PCA speed at low core counts, but at high core counts it will be surpassed by stage_int. You can save PCA disk bandwidth by saving as float rather than double and keeping compression enabled (gives about 30% total space savings on float, which is not nothing)
  • I'm not sure what the best options are to offer to end-users and how to explain them. If we got a fancier PCA solver we also might be able to make stage_none work fine in all cases by doing fewer dense matrix multiplies rather than more vector multiplies
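A rough sketch of what the stage_int and stage_float options described above could look like with existing BPCells verbs (a sketch only; names like variable_features are illustrative and the staging API itself is still to be decided):

# Hypothetical sketch only -- `variable_features` is an illustrative placeholder
library(BPCells)

# stage_none: keep the queued (lazy) normalization, no intermediate on disk

# stage_int: stage the integer counts matrix subset to the selected features
counts_staged <- write_matrix_dir(mat[variable_features, ], tempfile("stage_int"))

# stage_float: stage the normalized matrix just before dense transforms,
# stored as float with compression kept on to save disk bandwidth
norm_staged <- mat_log_tfidf |>
  convert_matrix_type("float") |>
  write_matrix_dir(tempfile("stage_float"), compress = TRUE)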

@immanuelazn (Collaborator, Author):

I'm leaving the memory saving stuff to just be saving to a temporary file for now. Let's discuss further in person, then see the optimal way of implementing!

@immanuelazn immanuelazn marked this pull request as ready for review November 8, 2024 00:16
@bnprks (Owner) left a comment

Good work on the comparisons with ArchR + Scanpy -- it made it much simpler for me to get in and try some of the code myself to better understand it.

Most comments are in-line. There are two analyses we might want to do (off github) to check methods:

  • LSI: Try the analysis from the tutorial section "What if we skip z-score normalization on the LSI matrix?", comparing z-score along genes, along cells (current implementation), and no z-score normalization. (EDIT: also worth trying mean-centering along the gene axis but not a full z-score normalization)
  • variable genes: Try making some scatter plots of gene mean vs. normalized dispersion (or vs. variance, raw dispersion, etc.). Visualize bin boundaries as well. The goal is to check for an uneven number of genes per bin and any obvious artifacts that might be caused by most genes falling into one bin (confirm whether this happens on e.g. the 10x pbmc3k dataset from Seurat/Scanpy's example datasets). A rough plotting sketch follows this list.
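A rough sketch of the kind of diagnostic plot meant above, assuming a features_df with mean, feature_dispersion_norm, and bin columns as in the PR code:

# Hedged sketch: gene mean vs. normalized dispersion with bin boundaries overlaid
library(ggplot2)

bin_upper <- tapply(features_df$mean, features_df$bin, max)  # rightmost mean in each bin

ggplot(features_df, aes(x = log10(mean), y = feature_dispersion_norm)) +
  geom_point(size = 0.3, alpha = 0.5) +
  geom_vline(xintercept = log10(bin_upper), linetype = "dashed", color = "red") +
  labs(x = "log10(gene mean)", y = "Normalized dispersion")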

Then for organization of real_data tests:

  • Should add a folder when a feature takes multiple files to test
  • Can make relative paths work a bit more robustly by detecting the path of the currently running script, see snippet below
  • I might prefer avoiding the config.csv system and relying on either library(ArchR), or doing devtools::load_all() of BPCells based on finding the parent directory of the currently running script.
Example detection of current script's directory
library(stringr)  # str_detect / str_replace below come from stringr
script_dir <- commandArgs(trailingOnly=FALSE)
script_dir <- script_dir[str_detect(script_dir, "--file=")] |> str_replace("--file=", "") |> dirname() |> normalizePath()

# Find a parent directory of interest (e.g. could use to get to the BPCells R package folder for use with devtools::load_all)
script_root <- script_dir
while (!("config_vars.sh" %in% list.files(script_root)) && dirname(script_root) != script_root) {
    script_root <- dirname(script_root)
}

Note: it might be possible to make this work better for interactive use by setting script_dir <- getwd() when there is no --file argument in commandArgs.
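A rough sketch of that fallback in base R:

# Hedged sketch: fall back to getwd() when running interactively (no --file= argument)
args <- commandArgs(trailingOnly = FALSE)
file_arg <- grep("^--file=", args, value = TRUE)
script_dir <- if (length(file_arg) == 0) {
  getwd()
} else {
  normalizePath(dirname(sub("^--file=", "", file_arg)))
}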

EDIT: Immediate to-dos based on discussion

  • Analyze the feature selection + LSI to make sure the methods look reasonable on example datasets (as discussed above)
  • Make Iterative LSI prototype implementation (no need to go too flexible on arguments, just start with a quick simple version with reasonable defaults and we can discuss what args to add for flexibility later)
    • Probably implement iterative_lsi(mat), which returns a custom LSI object, and project_lsi(lsi_object, new_mat) (rough interface sketch after this list)
  • Leave the Transformer and Pipeline steps alone for now, then we can decide how to utilize/adjust them based on what would work well for refactoring Iterative LSI
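For reference, a rough skeleton of the proposed interface (placeholder bodies only, not an implementation; argument and field names beyond mat, lsi_object, and new_mat are illustrative):

# Hedged interface sketch only -- bodies are placeholders
iterative_lsi <- function(mat, n_dimensions = 50L, n_iterations = 2L, threads = 1L) {
  # ... LSI -> cluster -> reselect variable features -> repeat ...
  structure(
    list(cell_embeddings = NULL, feature_loadings = NULL, selected_features = NULL),
    class = "lsi"
  )
}

project_lsi <- function(lsi_object, new_mat) {
  # ... apply the stored TF-IDF + SVD transform to new_mat ...
  stop("not implemented yet")
}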

Comment on lines +108 to +110
npeaks <- colSums(mat) # Finding that sums are non-multithreaded and there's no interface to pass it in, but there is implementation in `ConcatenateMatrix.h`
tf <- mat %>% multiply_cols(1 / npeaks)
idf_ <- ncol(mat) / rowSums(mat)
@bnprks (Owner):

I think the rowSums(mat) and colSums(mat) should be possible to calculate multithreaded in a single pass of matrix_stats(). Then ncol(mat)/rowSums(mat) is just 1/row_means, and npeaks is col_means * nrow(mat)
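A rough sketch of that single-pass version (assuming matrix_stats() returns stats-by-feature matrices under $row_stats / $col_stats, as used elsewhere in this PR):

# Hedged sketch: one multithreaded pass for both row and column means
stats  <- matrix_stats(mat, row_stats = "mean", col_stats = "mean", threads = threads)
npeaks <- stats$col_stats["mean", ] * nrow(mat)  # equivalent to colSums(mat)
idf_   <- 1 / stats$row_stats["mean", ]          # equivalent to ncol(mat) / rowSums(mat)
tf     <- multiply_cols(mat, 1 / npeaks)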

Comment on lines +116 to +117
if (z_score_norm) {
cell_peak_stats <- matrix_stats(mat_log_tfidf, col_stats = "variance", threads = threads)$col_stats
@bnprks (Owner):

I'm really second-guessing the z-score normalization along the cell axis. I know I did it that way in the BPCells tutorial, but I'm thinking now that was probably a mistake. The Wikipedia page suggests that the columns should have zero empirical mean, but it treats columns as features, not as observations. In our case I think we want to z-score normalize by row.

mat_tfidf <- tf %>% multiply_rows(idf_)
mat_log_tfidf <- log1p(scale_factor * mat_tfidf)
# Save to prevent re-calculation of queued operations
mat_log_tfidf <- write_matrix_dir(mat_log_tfidf, tempfile("mat_log_tfidf"), compress = FALSE)
@bnprks (Owner):

The temp matrix saving is fine for now. We could optimize space a little bit by saving as floats rather than doubles (12 bytes -> 8 bytes per non-zero), and actually keeping compression enabled might provide some benefit despite what the warning says (8 bytes -> ~5 bytes per non-zero)
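A rough sketch of that space optimization using BPCells' convert_matrix_type() (a sketch only; the exact savings will vary by dataset):

# Hedged sketch: store the intermediate as float with compression kept on
mat_log_tfidf <- mat_log_tfidf |>
  convert_matrix_type("float") |>
  write_matrix_dir(tempfile("mat_log_tfidf"), compress = TRUE)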

Comment on lines +177 to +181
# Give a small value to features with 0 mean, helps with 0 division
feature_means[feature_means == 0] <- 1e-12
# Calculate dispersion, and log normalize
feature_dispersion <- feature_vars / feature_means
feature_dispersion[feature_dispersion == 0] <- NA
@bnprks (Owner):

Is the 1e-12 and the presence of NA really necessary here? What if we just set feature_dispersion[feature_means == 0] <- 0?

Comment on lines +195 to +202
dplyr::mutate(
bin_mean = mean(dispersion, na.rm = TRUE),
bin_sd = sd(dispersion, na.rm = TRUE),
bin_sd_is_na = is.na(bin_sd),
bin_sd = ifelse(bin_sd_is_na, bin_mean, bin_sd), # Set feats that are in bins with only one feat to have a norm dispersion of 1
bin_mean = ifelse(bin_sd_is_na, 0, bin_mean),
feature_dispersion_norm = (dispersion - bin_mean) / bin_sd
) %>%
@bnprks (Owner):

This feels like it's also inheriting weird logic from being a too-literal translation of scanpy code. Would this work just as well? (assuming we don't set any dispersion to NA above)

Suggested change (replacing the mutate block quoted above):

dplyr::mutate(
feature_dispersion_norm = (dispersion - mean(dispersion)) / sd(dispersion),
feature_dispersion_norm = dplyr::if_else(n() == 1, 1, feature_dispersion_norm) # Set feats that are in bins with only one feat to have a norm dispersion of 1
) %>%


# Bin by mean, and normalize dispersion with each bin
features_df <- features_df %>%
dplyr::mutate(bin = cut(mean, n_bins, labels=FALSE)) %>%
@bnprks (Owner):

This feature of cut gives me pause, though I think it's probably okay: cut(c(1:100,1000), 5, labels=FALSE) puts the numbers 1-100 in the lowest bin, then leaves the middle 3 bins empty and puts 1000 in the largest bin.

The cell_ranger flavor in Scanpy bins by deciles of the mean instead, which is the main alternative we could do.

From a quick test on SeuratData's pbmc3k dataset, it appears that 92% of genes get put into one bin, though there's no huge enrichment of which bins genes get picked from vs. others. It is either worth a bit of follow-up analysis to see whether there is bias toward high/low expression within the huge first bin, or at least worth figuring out a function name that makes it clear this is one variable-genes option among multiple possibilities.
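A rough sketch of the decile-style alternative mentioned above, binning on quantiles of the mean so each bin holds roughly the same number of genes (n_bins as in the current code):

# Hedged sketch: quantile-based bins instead of equal-width cut() bins
features_df <- features_df |>
  dplyr::mutate(
    bin = cut(
      mean,
      breaks = unique(quantile(mean, probs = seq(0, 1, length.out = n_bins + 1))),
      labels = FALSE,
      include.lowest = TRUE
    )
  )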

Comment on lines +205 to +206
dplyr::arrange(desc(feature_dispersion_norm)) %>%
dplyr::slice(1:min(num_feats, nrow(.)))
@bnprks (Owner):

dplyr::slice_max() is another way to accomplish the same thing fyi
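For reference, the slice_max() equivalent (assuming the same features_df and num_feats):

# Hedged sketch: slice_max() keeps the top rows by a column in one step;
# an n larger than nrow() just returns all rows, matching the min() above
features_df |>
  dplyr::slice_max(feature_dispersion_norm, n = num_feats, with_ties = FALSE)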

dplyr::slice(1:min(num_feats, nrow(.)))
if (save_feat_selection) {
return(list(
mat = mat[features_df$name,],
@bnprks (Owner):

Is it intentional that this is also ordering the genes by normalized dispersion as opposed to simple filtering of rows without changing the order?
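If plain filtering without reordering is what's intended, something like this would keep the original row order (a sketch, assuming rownames(mat) match features_df$name):

# Hedged sketch: subset rows without changing their original order
mat[rownames(mat) %in% features_df$name, ]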

Comment on lines +94 to +98
lsi <- function(
mat,
z_score_norm = TRUE, n_dimensions = 50L, scale_factor = 1e4,
save_lsi = FALSE,
threads = 1L
@bnprks (Owner):

I'd put n_dimensions ahead of z_score_norm since I think it's a more likely argument to need to be changed. Otherwise, the argument order seems good

names(config) <- as.list(config_csv)$key

# Set up temp dir in case it's not already set
create_temp_dir <- function(dir = NULL) {
@bnprks (Owner):

This is maybe a small enough function that it makes sense to just re-implement it per file as needed. I think keeping the real_data tests as standalone as possible will make them easier to re-run later when needed.
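A minimal standalone version that could be copied into each test file might look like this (a sketch; the original helper's body isn't shown here):

# Hedged sketch of a per-file temp-dir helper
create_temp_dir <- function(dir = NULL) {
  if (is.null(dir)) dir <- tempfile("bpcells_real_data_test_")
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dir
}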
