Image-based profiling represents a series of data processing steps that turn image-based readouts into more manageable data matrices for downstream analyses (Caicedo et al. 2017). Typically, you derive image-based readouts using software like CellProfiler (McQuin et al. 2018), which segments cells and extracts so-called hand-engineered single-cell morphology measurements. In this folder, we process the CellProfiler-derived morphology features for the LINCS Cell Painting dataset using pycytominer, a tool enabling reproducible image-based profiling.
Specifically, we include:
- Data processing scripts to perform the full unified, image-based profiling pipeline
- Processed data for each Cell Painting plate (for several "data levels")
- Instructions on how to reproduce the profiling pipeline
Pycytominer is a codebase built by @gwaygenomics et al. It allows for easy processing of CellProfiler data and contains all the functions used to create the data in this repository. Below, we describe the different steps of the pipeline. Please check the pycytominer repo for more details.
The steps from Level 3 to Level 4b can be found in the profile_cells script, the steps for spherizing can be found in this script, and the final aggregation to the consensus data is found in this notebook.
Note that we do not include the intermediate step of generating .sqlite files per plate using a tool called cytominer-database. This repository and workflow begin after we applied cytominer-database.
We use the aggregation method twice in the workflow: first at this point, and later for the creation of the consensus profiles. Here, the median of all cells within a well is aggregated to one profile per well. The aggregation method does not persist the metadata, which is why this step is followed by an annotation step that adds the MOA data and other metadata.
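The following is a minimal sketch of these two steps using pycytominer's `aggregate` and `annotate` functions; the input files, plate-map schema, and join columns are illustrative assumptions, not the exact configuration used in this repository:

```python
import pandas as pd
from pycytominer import aggregate, annotate

# Hypothetical inputs; the real single-cell data comes from per-plate .sqlite files
single_cell_df = pd.read_csv("example_single_cells.csv.gz")
platemap_df = pd.read_csv("example_platemap.csv")

# Collapse single cells into one profile per well using the per-feature median
aggregated_df = aggregate(
    population_df=single_cell_df,
    strata=["Metadata_Plate", "Metadata_Well"],
    features="infer",
    operation="median",
)

# Re-attach plate-map metadata (MOA, dose, etc.) that aggregation does not carry over
annotated_df = annotate(
    profiles=aggregated_df,
    platemap=platemap_df,
    join_on=["Metadata_well_position", "Metadata_Well"],  # assumed plate-map columns
)
```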
Normalization can be done via different methods, either over all wells in a plate or over only the negative controls (DMSO wells). In this case, we used the `mad_robustize` method, and the outputs of both the whole-plate and the DMSO normalization are saved in this repository. It is important to note that we normalize over each plate, but not over the full batch.
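Both variants can be sketched with pycytominer's `normalize` function; the DMSO query string assumes a `Metadata_broad_sample` column and is an assumption about the metadata schema:

```python
from pycytominer import normalize

# Whole-plate normalization: estimate center and scale from all wells on the plate
normalized_whole_plate = normalize(
    profiles=annotated_df,
    features="infer",
    samples="all",
    method="mad_robustize",
)

# DMSO normalization: estimate center and scale from negative-control wells only
normalized_dmso = normalize(
    profiles=annotated_df,
    features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # assumed metadata column
    method="mad_robustize",
)
```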
The `feature_select` method incorporates the operations `["variance_threshold", "correlation_threshold", "drop_na_columns", "drop_outliers"]`. We developed these functions to drop redundant and invariant features and to improve post-processing.
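A minimal sketch of this step, assuming the normalized profiles from above as input:

```python
from pycytominer import feature_select

# Drop invariant, highly correlated, mostly-missing, and outlier features
feature_select_ops = [
    "variance_threshold",
    "correlation_threshold",
    "drop_na_columns",
    "drop_outliers",
]

selected_df = feature_select(
    profiles=normalized_dmso,
    features="infer",
    operation=feature_select_ops,
)
```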
Spherizing (aka whitening) is a transformation of the data that tries to correct batch effects. Within pycytominer, spherizing is available through the normalization function. Unlike the other normalizations, we perform spherizing on the full batch (all plates). Check the code for details and relevant papers.
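As a sketch, spherizing reuses `normalize` with `method="spherize"`; the per-plate file names and the DMSO query column below are assumptions for illustration:

```python
import pandas as pd
from pycytominer import normalize

# Hypothetical per-plate profile files; concatenate the whole batch first
plate_files = ["plate_1_normalized.csv.gz", "plate_2_normalized.csv.gz"]
full_batch_df = pd.concat([pd.read_csv(f) for f in plate_files], ignore_index=True)

# Fit the spherize transform on DMSO wells, then apply it to all profiles in the batch
spherized_df = normalize(
    profiles=full_batch_df,
    features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # assumed metadata column
    method="spherize",
)
```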
The consensus function is another aggregation step and completes the pipeline. We aggregate the normalized and feature-selected data (level 4b) via the median or MODZ functions into consensus data. This means that the five replicates of each treatment are combined into one profile representing one perturbation at a given dose.
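A sketch of the consensus step; the replicate grouping columns are assumptions about the metadata schema:

```python
from pycytominer import consensus

# Collapse replicate wells into one profile per perturbation and dose
consensus_df = consensus(
    profiles=selected_df,
    replicate_columns=["Metadata_broad_sample", "Metadata_dose_recode"],  # assumed
    operation="median",  # "modz" computes a weighted consensus instead
    features="infer",
)
```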
While having the same overall goal, these two processes target different artifacts:
The normalization step simply aligns all plates by scaling each plate's values to a "unit height". The specific transformation used is RobustMad.
On the other hand, we use a spherize (a.k.a. whiten) transformation to adjust for plate and plate position effects (e.g. some plates may influence the data while the position within a plate could also create artifacts). The spherize transformation adjusts for plate position effects by transforming the profile data such that the DMSO profiles are left with an identity covariance matrix. Note that this operation is done on the full dataset, not per plate. See spherize-batch-effects.ipynb for implementation details.
Since CellProfiler data is not naturally normalized to zero mean, all data is 'MadRobustized' before spherizing to maximize the usefulness of the spherize transformation.
We include two batches of Cell Painting data in this repository: `2016_04_01_a549_48hr_batch1` and `2017_12_05_Batch2`.
For each batch, we include:
Data level | Description | File format | Included in this repo? | Version control
---|---|---|---|---
Level 1 | Cell images | `.tif` | No^ | NA
Level 2 | Single cell profiles | `.sqlite` | No^ | NA
Level 3 | Aggregated profiles with metadata | `.csv.gz` | Yes | dvc
Level 4a | Normalized profiles with metadata | `.csv.gz` | Yes | dvc
Level 4b | Normalized and feature selected profiles with metadata | `.csv.gz` | Yes | dvc
Level 5 | Consensus perturbation profiles | `.csv.gz` | Yes | git lfs
Importantly, we include files for two different types of normalization: whole-plate normalization and DMSO-specific normalization. See profile_cells.py for more details.
Note: If you use normalized profiles without feature selection, you may need to remove additional outlier features. See issues/65 for more details and a list of features.
We include four different spherized profile files, two per batch (one per input normalization). Each file combines the level 4b profiles from all plates in its batch.
Batch | Input data | Spherized output file | Version control
---|---|---|---
2016_04_01_a549_48hr_batch1 | DMSO normalized | 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso.csv.gz | git lfs
2016_04_01_a549_48hr_batch1 | Whole plate normalized | 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz | git lfs
2017_12_05_Batch2 | DMSO normalized | 2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_dmso.csv.gz | git lfs
2017_12_05_Batch2 | Whole plate normalized | 2017_12_05_Batch2_dmso_spherized_profiles_with_input_normalized_by_whole_plate.csv.gz | git lfs
We use a combination of git lfs and dvc to version all the data in this repository (see the tables above for a breakdown). In order to access the data locally, you must first install both tools.
```bash
# Download consensus profiles and spherized data
git lfs pull

# Download individual plate data
dvc pull
```
Note: The DVC remote is an AWS S3 bucket. You must have the AWS CLI installed and configured (see details).
To access the files stored via DVC, you will need to create an AWS IAM user that can, at a minimum, Get objects and List buckets. One way of achieving this is to attach the `AmazonS3ReadOnlyAccess` policy, which is:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3-object-lambda:Get*",
                "s3-object-lambda:List*"
            ],
            "Resource": "*"
        }
    ]
}
```
(Note that the `s3-object-lambda:Get*` and `s3-object-lambda:List*` actions are not required, but they do not hurt.)
TBD
After activating the conda environment with:

```bash
conda activate lincs
```

you can reproduce the pipeline by simply executing the following:

```bash
# Make sure you are in the profiles/ directory
./run.sh
```
The Drug Repurposing Hub collected data on 6 to 7 dose points per compound. In general, most doses are very near the following 7 dose points (micromoles per liter): `[0.04, 0.12, 0.37, 1.11, 3.33, 10, 20]`
Therefore, to make it easier to filter by dose when comparing compounds, we first align each dose collected in the dataset to the nearest of the dose points listed above. We then recode the dose points into ascending numerical levels and add a new metadata annotation `Metadata_dose_recode` to all profiles (see the sketch after the table below).
Dose | Dose Recode
---|---
0 (DMSO) | 0
~0.04 | 1
~0.12 | 2
~0.37 | 3
~1.11 | 4
~3.33 | 5
~10 | 6
~20 | 7
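As a simplified sketch of this recoding (the actual logic lives in the profiling scripts; the function below is illustrative, not the repository's implementation):

```python
import numpy as np

# Canonical dose points in ascending order; their recode levels are 1 through 7
DOSE_POINTS = np.array([0.04, 0.12, 0.37, 1.11, 3.33, 10, 20])

def recode_dose(dose: float) -> int:
    """Map a measured dose to the recode level of its nearest canonical dose point."""
    if dose == 0:  # DMSO control wells
        return 0
    return int(np.argmin(np.abs(DOSE_POINTS - dose))) + 1

# Example: measured doses snap to the nearest canonical dose point
print([recode_dose(d) for d in [0, 0.05, 1.0, 9.8, 21.3]])  # [0, 1, 4, 6, 7]
```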
There are several critical details that are important for understanding data generation and processing. See profile_cells.py for more details about the specific processing steps and decisions.