[r] add lsi, var feature selection #156

Open · wants to merge 12 commits into main

Conversation

@immanuelazn (Collaborator) commented Nov 4, 2024

Description

As discussed previously, we would like a built-in function for creating latent representations of input matrices. The most common way to do this is LSI. We added the functions lsi() and highly_variable_features() for this.

Tests

  • Both lsi() and highly_variable_features() have a rudimentary test that checks the output shapes to determine whether they look reasonable.
  • For lsi(), we compare against ArchR's .computeLSI() function, which does not include the iterative steps or extra bells and whistles. Because the low-rank SVD approximation is non-deterministic, we re-project from the latent space back to the log(tf-idf) matrix and then compare the reconstructions between the two packages.
  • For highly_variable_features(), we compare against the Scanpy function _highly_variable_genes.highly_variable_genes(). Top features, normalized and non-normalized dispersions, and means were all compared.

Details

We are not looking for a one-to-one implementation of what has been done in ArchR, which is an iterative LSI approach with clustering, variable feature selection, etc. Instead, we implement feature selection as a separate step, along with an LSI procedure that runs log(tf-idf) -> z-score normalization -> SVD.

Projection into LSI space will be built in a follow-up PR for the sklearn interface.

@bnprks (Owner) commented Nov 5, 2024

Looks like a good start! A couple high-level thoughts/comments while this is still in the draft stage:

  • One of the main goals of the matrix_stats function is to allow collection of multiple row and column stats in a single pass over the matrix. There are a couple spots where calls to rowSums() and colSums() or even separate calls to matrix_stats could be combined into a single call to reduce read passes over the matrix.

  • For highly variable features, what you've got is perfectly good, though I personally find the dplyr interfaces the easiest way to express this kind of logic. An underutilized feature is being able to do a group_by then mutate (rather than group_by + summarize). It would look something like this:

    features_df %>%
      mutate(bin=cut(mean, n_bins, labels=FALSE)) %>%
      group_by(bin) %>%
      mutate(normalized_dispersion=(dispersion-mean(dispersion))/sd(dispersion))
  • It's a good thought to save intermediate data to avoid re-calculating normalizations. I think saving to an in-memory matrix is not the right approach for BPCells because it re-introduces a memory bottleneck unnecessarily, but saving to disk is a reasonable option to offer (rough sketch after this list). Options that I used when benchmarking for the paper are:

    • stage_none -- don't save any intermediates. This is totally fine for almost everything downstream except for PCA, which will be slower than needed
    • stage_int -- Stage the counts matrix subset to just the features of interest for PCA. When utilizing a lot of parallelization, it's better to do this than to save normalized data, since disk I/O becomes the real bottleneck rather than CPU
    • stage_float -- Save the latest normalized copy of the data just prior to any dense transformations like z-scoring. This is best for PCA speed at low core counts, but at high core counts it will be surpassed by stage_int. You can save PCA disk bandwidth by saving as float rather than double and keeping compression enabled (gives about 30% total space savings on float, which is not nothing)
  • I'm not sure what the best options are to offer to end-users and how to explain them. If we got a fancier PCA solver we also might be able to make stage_none work fine in all cases by doing fewer dense matrix multiplies rather than more vector multiplies
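A rough sketch of what the stage_int and stage_float options described above could look like with existing BPCells verbs (a sketch only; names like variable_features are illustrative and the staging API itself is still to be decided):

# Hypothetical sketch only -- `variable_features` is an illustrative placeholder
library(BPCells)

# stage_none: keep the queued (lazy) normalization, no intermediate on disk

# stage_int: stage the integer counts matrix subset to the selected features
counts_staged <- write_matrix_dir(mat[variable_features, ], tempfile("stage_int"))

# stage_float: stage the normalized matrix just before dense transforms,
# stored as float with compression kept on to save disk bandwidth
norm_staged <- mat_log_tfidf |>
  convert_matrix_type("float") |>
  write_matrix_dir(tempfile("stage_float"), compress = TRUE)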

@immanuelazn (Collaborator, Author):

I'm leaving the memory saving stuff to just be saving to a temporary file for now. Let's discuss further in person, then see the optimal way of implementing!

@immanuelazn immanuelazn marked this pull request as ready for review November 8, 2024 00:16
@bnprks (Owner) left a comment

Good work on the comparisons with ArchR + Scanpy -- it made it much simpler for me to get in and try some of the code myself to better understand it.

Most comments are in-line. There are two analyses we might want to do (off github) to check methods:

  • LSI: Try the analysis from the tutorial section "What if we skip z-score normalization on the LSI matrix?", comparing z-score along genes, along cells (current implementation), and no z-score normalization. (EDIT: also worth trying mean-centering along the gene axis but not a full z-score normalization)
  • variable genes: Try making some scatter plots of gene mean vs. normalized dispersion (or vs. variance, raw dispersion, etc.). Visualize bin boundaries as well. The goal is to check for an uneven number of genes per bin and any obvious artifacts that might be caused by most genes falling into one bin (confirm whether this happens on e.g. the 10x pbmc3k dataset from Seurat/Scanpy's example datasets). A rough plotting sketch follows this list.
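A rough sketch of the kind of diagnostic plot meant above, assuming a features_df with mean, feature_dispersion_norm, and bin columns as in the PR code:

# Hedged sketch: gene mean vs. normalized dispersion with bin boundaries overlaid
library(ggplot2)

bin_upper <- tapply(features_df$mean, features_df$bin, max)  # rightmost mean in each bin

ggplot(features_df, aes(x = log10(mean), y = feature_dispersion_norm)) +
  geom_point(size = 0.3, alpha = 0.5) +
  geom_vline(xintercept = log10(bin_upper), linetype = "dashed", color = "red") +
  labs(x = "log10(gene mean)", y = "Normalized dispersion")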

Then for organization of real_data tests:

  • Should add a folder when a feature takes multiple files to test
  • Can make relative paths work a bit more robustly by detecting the path of the currently running script, see snippet below
  • I might prefer avoiding the config.csv system and relying on either library(ArchR), or doing devtools::load_all() of BPCells based on finding the parent directory of the currently running script.
Example detection of current script's directory
library(stringr)  # str_detect / str_replace below come from stringr
script_dir <- commandArgs(trailingOnly=FALSE)
script_dir <- script_dir[str_detect(script_dir, "--file=")] |> str_replace("--file=", "") |> dirname() |> normalizePath()

# Find a parent directory of interest (e.g. could use to get to the BPCells R package folder for use with devtools::load_all)
script_root <- script_dir
while (!("config_vars.sh" %in% list.files(script_root)) && dirname(script_root) != script_root) {
    script_root <- dirname(script_root)
}

Note: it might be possible to make this work better for interactive use by setting script_dir <- getwd() when there is no --file argument in commandArgs.
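A rough sketch of that fallback in base R:

# Hedged sketch: fall back to getwd() when running interactively (no --file= argument)
args <- commandArgs(trailingOnly = FALSE)
file_arg <- grep("^--file=", args, value = TRUE)
script_dir <- if (length(file_arg) == 0) {
  getwd()
} else {
  normalizePath(dirname(sub("^--file=", "", file_arg)))
}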

EDIT: Immediate to-dos based on discussion

  • Analyze the feature selection + LSI to make sure the methods look reasonable on example datasets (as discussed above)
  • Make Iterative LSI prototype implementation (no need to go too flexible on arguments, just start with a quick simple version with reasonable defaults and we can discuss what args to add for flexibility later)
    • Probably implement iterative_lsi(mat), which returns a custom LSI object, and project_lsi(lsi_object, new_mat) (rough interface sketch after this list)
  • Leave the Transformer and Pipeline steps alone for now, then we can decide how to utilize/adjust them based on what would work well for refactoring Iterative LSI
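For reference, a rough skeleton of the proposed interface (placeholder bodies only, not an implementation; argument and field names beyond mat, lsi_object, and new_mat are illustrative):

# Hedged interface sketch only -- bodies are placeholders
iterative_lsi <- function(mat, n_dimensions = 50L, n_iterations = 2L, threads = 1L) {
  # ... LSI -> cluster -> reselect variable features -> repeat ...
  structure(
    list(cell_embeddings = NULL, feature_loadings = NULL, selected_features = NULL),
    class = "lsi"
  )
}

project_lsi <- function(lsi_object, new_mat) {
  # ... apply the stored TF-IDF + SVD transform to new_mat ...
  stop("not implemented yet")
}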

Comment on lines +108 to +110
npeaks <- colSums(mat) # Finding that sums are non-multithreaded and there's no interface to pass it in, but there is implementation in `ConcatenateMatrix.h`
tf <- mat %>% multiply_cols(1 / npeaks)
idf_ <- ncol(mat) / rowSums(mat)
@bnprks (Owner):

I think the rowSums(mat) and colSums(mat) should be possible to calculate multithreaded in a single pass of matrix_stats(). Then ncol(mat)/rowSums(mat) is just 1/row_means, and npeaks is col_means * nrow(mat)
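A rough sketch of that single-pass version (assuming matrix_stats() returns stats-by-feature matrices under $row_stats / $col_stats, as used elsewhere in this PR):

# Hedged sketch: one multithreaded pass for both row and column means
stats  <- matrix_stats(mat, row_stats = "mean", col_stats = "mean", threads = threads)
npeaks <- stats$col_stats["mean", ] * nrow(mat)  # equivalent to colSums(mat)
idf_   <- 1 / stats$row_stats["mean", ]          # equivalent to ncol(mat) / rowSums(mat)
tf     <- multiply_cols(mat, 1 / npeaks)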

Comment on lines +116 to +117
if (z_score_norm) {
cell_peak_stats <- matrix_stats(mat_log_tfidf, col_stats = "variance", threads = threads)$col_stats
@bnprks (Owner):

I'm really second-guessing the z-score normalization along the cell axis. I know I did it that way in the BPCells tutorial, but I'm thinking now that was probably a mistake. The Wikipedia page suggests that the columns should have zero empirical mean, but it treats columns as features, not as observations. In our case I think we want to z-score normalize by row.

mat_tfidf <- tf %>% multiply_rows(idf_)
mat_log_tfidf <- log1p(scale_factor * mat_tfidf)
# Save to prevent re-calculation of queued operations
mat_log_tfidf <- write_matrix_dir(mat_log_tfidf, tempfile("mat_log_tfidf"), compress = FALSE)
@bnprks (Owner):

The temp matrix saving is fine for now. We could optimize space a little bit by saving as floats rather than doubles (12 bytes -> 8 bytes per non-zero), and actually keeping compression enabled might provide some benefit despite what the warning says (8 bytes -> ~5 bytes per non-zero)
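A rough sketch of that space optimization using BPCells' convert_matrix_type() (a sketch only; the exact savings will vary by dataset):

# Hedged sketch: store the intermediate as float with compression kept on
mat_log_tfidf <- mat_log_tfidf |>
  convert_matrix_type("float") |>
  write_matrix_dir(tempfile("mat_log_tfidf"), compress = TRUE)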

Comment on lines +177 to +181
# Give a small value to features with 0 mean, helps with 0 division
feature_means[feature_means == 0] <- 1e-12
# Calculate dispersion, and log normalize
feature_dispersion <- feature_vars / feature_means
feature_dispersion[feature_dispersion == 0] <- NA
@bnprks (Owner):

Is the 1e-12 and the presence of NA really necessary here? What if we just set feature_dispersion[feature_means == 0] <- 0?

Comment on lines +195 to +202
dplyr::mutate(
bin_mean = mean(dispersion, na.rm = TRUE),
bin_sd = sd(dispersion, na.rm = TRUE),
bin_sd_is_na = is.na(bin_sd),
bin_sd = ifelse(bin_sd_is_na, bin_mean, bin_sd), # Set feats that are in bins with only one feat to have a norm dispersion of 1
bin_mean = ifelse(bin_sd_is_na, 0, bin_mean),
feature_dispersion_norm = (dispersion - bin_mean) / bin_sd
) %>%
@bnprks (Owner):

This feels like it's also inheriting weird logic from being a too-literal translation of scanpy code. Would this work just as well? (assuming we don't set any dispersion to NA above)

Suggested change (replacing the mutate block quoted above):

dplyr::mutate(
feature_dispersion_norm = (dispersion - mean(dispersion)) / sd(dispersion),
feature_dispersion_norm = dplyr::if_else(n() == 1, 1, feature_dispersion_norm) # Set feats that are in bins with only one feat to have a norm dispersion of 1
) %>%


# Bin by mean, and normalize dispersion with each bin
features_df <- features_df %>%
dplyr::mutate(bin = cut(mean, n_bins, labels=FALSE)) %>%
@bnprks (Owner):

This feature of cut gives me pause, though I think it's probably okay: cut(c(1:100,1000), 5, labels=FALSE) puts the numbers 1-100 in the lowest bin, then leaves the middle 3 bins empty and puts 1000 in the largest bin.

The cell_ranger flavor in Scanpy bins by deciles of the mean instead, which is the main alternative we could do.

From a quick test on SeuratData's pbmc3k dataset, it appears that 92% of genes get put into one bin, though there's no huge enrichment of which bins genes get picked from vs. others. It is either worth a bit of follow-up analysis to see whether there is bias toward high/low expression within the huge first bin, or at least worth figuring out a function name that makes it clear this is one variable-genes option among multiple possibilities.
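A rough sketch of the decile-style alternative mentioned above, binning on quantiles of the mean so each bin holds roughly the same number of genes (n_bins as in the current code):

# Hedged sketch: quantile-based bins instead of equal-width cut() bins
features_df <- features_df |>
  dplyr::mutate(
    bin = cut(
      mean,
      breaks = unique(quantile(mean, probs = seq(0, 1, length.out = n_bins + 1))),
      labels = FALSE,
      include.lowest = TRUE
    )
  )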

Comment on lines +205 to +206
dplyr::arrange(desc(feature_dispersion_norm)) %>%
dplyr::slice(1:min(num_feats, nrow(.)))
@bnprks (Owner):

dplyr::slice_max() is another way to accomplish the same thing fyi
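For reference, the slice_max() equivalent (assuming the same features_df and num_feats):

# Hedged sketch: slice_max() keeps the top rows by a column in one step;
# an n larger than nrow() just returns all rows, matching the min() above
features_df |>
  dplyr::slice_max(feature_dispersion_norm, n = num_feats, with_ties = FALSE)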

dplyr::slice(1:min(num_feats, nrow(.)))
if (save_feat_selection) {
return(list(
mat = mat[features_df$name,],
@bnprks (Owner):

Is it intentional that this is also ordering the genes by normalized dispersion as opposed to simple filtering of rows without changing the order?
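If plain filtering without reordering is what's intended, something like this would keep the original row order (a sketch, assuming rownames(mat) match features_df$name):

# Hedged sketch: subset rows without changing their original order
mat[rownames(mat) %in% features_df$name, ]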

Comment on lines +94 to +98
lsi <- function(
mat,
z_score_norm = TRUE, n_dimensions = 50L, scale_factor = 1e4,
save_lsi = FALSE,
threads = 1L
@bnprks (Owner):

I'd put n_dimensions ahead of z_score_norm since I think it's a more likely argument to need to be changed. Otherwise, the argument order seems good

names(config) <- as.list(config_csv)$key

# Set up temp dir in case it's not already set
create_temp_dir <- function(dir = NULL) {
@bnprks (Owner):

This is maybe a small enough function that it makes sense to just re-implement it per file as needed. I think keeping the real_data tests as standalone as possible will make them easier to re-run later when needed.
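A minimal standalone version that could be copied into each test file might look like this (a sketch; the original helper's body isn't shown here):

# Hedged sketch of a per-file temp-dir helper
create_temp_dir <- function(dir = NULL) {
  if (is.null(dir)) dir <- tempfile("bpcells_real_data_test_")
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dir
}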
