In version 2.0.0 of clusterExperiment
we have made some major changes to the package. The most relevant changes for the user are:
- Changed how we handle dimensionality reduction and filtering of genes -- this includes changes to the names of important arguments of key functions!
- Added support for the main data matrix to be stored in a HDF5 file (i.e. not read into memory)
- Improved implementation of subsampling (in C++) to improve speed
- Some functions changed names
- Some additional helper functions were added
These changes are fully detailed in the NEWS file of the package (all releases since October 31, 2017).
Below we will explain the first two of these changes in more detail. We would also note that in Bioconductor 3.6 (the release before) there were many more plotting functions that were introduced.
This version consists of a major update of how dimensionality reduction and filtering is done. The ClusterExperiment
class has been updated to extend the new bioconductor class, SingleCellExperiment
(see the SingleCellExperiment package).
Briefly, the SingleCellExperiment
class extends the SummarizedExperiment
class to give a structure for saving the results of dimensionality reductions. This is done by adding a slot reducedDims
, which is a SimpleList
of datasets that have the same number of observations but reduced dimensions (i.e. genes). This gives a unified slot for saving the results of applying a dimensionality reduction method and helper functions to access them, etc. The user gives them names, e.g. "PCA" or "tSNE".
Dimensionality Reduction clusterExperiment
now makes use these stored dimensionality reductions in functions like RSEC
and clusterMany
. This allows clusterExperiment
to make use of any dimensionality reduction method so long as the user saves it in the appropriate slot in a SingleCellExperiment
object. The user can also choose not to calculate any dimensionality reduction and just do as before and have a function like clusterMany
do the dimensionality reduction (i.e. PCA) internally. The difference is that now the results of the PCA will be stored in the appropriate slot so that they will not need to be recalculated in the future.
Filtering of genes We also added in clusterExperiment
package a similar option for filtering statistics. In particular, in clusterExperiment
we have always allowed the user -- instead of dimensionality reduction via PCA-like methods -- to instead reduce the dimensionality of the problem by filtering to the top set of genes, e.g. the top 500 most variable genes. In this case clusterExperiment
will calculate the variance for every gene and reduce down to the top 500 genes. Now clusterExperiment
when calculating statistics (like var
or mad
) will add the per-gene value of the statistic in a column of the rowData
of the ClusterExperiment
object (rowData
is a standard slot of a SummarizedExperiment
). Similarly, if the user has already calculated a per-gene statistic and saved it as a column in the rowData
slot, this user-defined statistic can be used for filtering. This means that the user is not limited to the built-in functions provided in clusterExperiment
.
Note that simplicity we call "dimensionality reduction methods" to be those like PCA that reduce the data in a way that is not a simple selection of existing variables, but rather create new variables to represent the data. This is because a simple selection of variables can be stored as a single vector of the length of the number of genes and the reduced data can be obtained from the original matrix. The more complicated methods actually have to save a matrix with a value for each observation for each new variable.
A great deal has changed under the hood of any function that allowed for dimensionality reduction and greatly simplified and unified our treatment of filtering and dimensionality reduction. The main function affected is clusterMany
, which runs the clustering, but makeDendrogram
is another example (and of course RSEC
which is a wrapper around these).
To make the function compatible with SingleCellExperiment
, we have changed many of the names of our arguments related to dimensionality reduction. This is because the slot names and related functions of SingleCellExperiment
take the form of reducedDims
and our previous versions of clusterExperiment
used dimReduce
format instead. We also changed to the name of arguments to be less tied to 'PCA' and 'var':
nPCADims
changed tonReducedDims
in clusterMany-related functionsnVarDims
changed tonFilterDims
in clusterMany-related functionsdimReduce
changed toreduceMethod
across functionsplotDimReduce
toplotReducedDims
ndims
tonDims
inclusterSingle
andmakeDendrogram
to keep consistency.
The package clusterExperiment
has built in functions for both dimensionality reduction (right now only PCA) and for filtering. These can be obtained by the new functions listBuiltInReducedDims
and listBuiltInFilterStats
, which give a character vector of the names of currently available functions for dimensionality reduction and statistics for filtering genes, respectively.
The argument reduceMethod
, like the previous argument dimReduce
, defines either the dimensionality reduction method or the filtering method to be used to reduce the number of dimensions of the data that will be used (and for clusterMany
this can be a mixture of the two if the user wants to compare them). In places where multiple values can be given (i.e. clusterMany
or RSEC
) the user has the choice to give to reduceMethod
either
- the names of stored values (which can be a mixture of names of stored statistics for filtering and names of stored dimensionality reduction methods) OR
- names of built-in functions provided by
clusterExperiment
to be calculated internally (as given bylistBuiltInReducedDims
andlistBuiltInFilterStats
)
The user cannot do both (i.e. give reducedMethod
2 names that match user-defined stored values and 3 names that are built-in functions). To do this the user can call the new functions makeReducedDims
and makeFilterStats
that will apply the built-in method (and store them appropriately) the built-in methods for dimensionality reductions and statistics for filtering the data, respectively. These are indeed the functions called internally by functions like clusterMany
. In the above example, after calling these functions for the 3 built-in functions, the user can then call clusterMany
on all 5 of the names that are stored values (the 2 user defined, and the 3 that were created by the built-in functions).
The clusterExperiment
packages is now compatible with SummarizedExperiment
objects that have DelayedArray
classes in their assay
slot (which includes HDF5Matrix
and DelayedMatrix
). These are classes from the DelayedArray
and HDF5Array
packages that allow the assay to be stored on file rather than in memory.
Note, however, that while the package allows for these objects, it doesn't mean that it makes use of the HDF5 structure. Many times if the assay
object must be actually used beyond simple subsetting, it will call the entire matrix into memory for the computations. The advantages currently of having the full dataset in HDF5 format are:
- If you are not using the full assay for calculations, but rather a version after dimensionality reduction, then the full assay will not be needed in memory, but only the reduced version of the data. This is true for filtering as well as the more complicated dimensionality reductions stored in the
reducedDims
slot, since subsetting of the HDF5 matrix doesn't call the entire matrix into memory. Furthermore, some (but not all) of our built-in filtering statistics are HDF5 aware, and do not call the entire matrix into memory to calculate the statistics used for filtering. - The matrix is not continually held in memory in R workspace. So even if it is called into memory for a specific calculation, it will free up that memory once the calculation is done (assuming it can hold the full data in memory).