Optimizing the Image-Based Cell Profiling Pipeline

In this project, we explore images of cancer cells reacting to some treatments (compounds) in a high-throughput assay. The aim is to improve the morphological profiling pipeline steps which is converting cell images into a signature for each treatment.

We compare different methods to select appropriate features, to select compounds showing a strong phenotype (names as hit selection), and to evaluate whether the profiles having the same Mechanism of Action (MOA) are indeed more similar.

We found out that an unsupervised feature selection based on SVD-entropy combined with a similarity metric based on Jaccard distance for the selection of the hits was improving the specificity of the signatures of each compound compared to the actual pipeline by 30%. To evaluate this specificity, we defined a metric called the enrichment ratio. It uses the MOA information in order to evaluate the global accuracy of the signatures. We additionally defined a visualization method called ''waterfalls plot'', that helps to determine the optimal dose that should be used for each drug.

The feature selection based on SVD-entropy can be found in the following paper: "Novel Unsupervised Feature Filtering of Biological Data" by R. Varshavsky and al.

Getting Started

These files give examples in order to perform the cell profiling pipeline on two different dataset (the BBBC022 dataset and the Repurposig dataset).

The BBBC022 images can be found here.

The Repurposing dataset is private.

The code is each time for the two different datasets, because the MOAs annotations files is different (separation based on , or |). Moreover the metadata are also different (the BBBC022 data is older).

Abbreviation:

sc = single cell
fs = feature selection
SVD = SVD-entropy feature selection
fc = findCorrelation feature selection

Prerequisites

You need to use R. The R libraries should be installed using: install.packages('name of the package').

The following packages are needed:

dplyr / magrittr / ggplot2 / reshape2 / foreach / doMC / stringr / tidyverse / caret / Rcpp / RcppArmadillo

You also need to install cytominer.

Most important files

BBBC022_waterfallsPlot.Rmd: example to create waterfalls plot on BBBC022 dataset
enrichment_ratio_function.R: function to calculate the enrichment ratio on BBBC022 dataset
enrichment_ratio_function_repurposing.R: function to calculate the enrichment ratio on the Repurposing dataset
feature_selection_SVD_entropy_cpp_old.Rmd: example of feature selection using SVD-entropy
feature_selection_findCorrelation.Rmd: example of feature selection using findCorrelation on BBBC022
global_pipeline.R: example of pipeline going from hit selection (with two different metrics: Pearson's correlation and the Jaccard distance) to the enrichment ratio.
hit_selection_correlation_function.R: hit selection function using correlation metric
hit_selection_jaccard_function.R: hit selection function using the Jaccard distance metric
jaccard_distance_function.cpp: c++ code for the Jaccard distance
random_feature_selection.R: code to select randomly features in a dataset
ranking_SVD_entropy.cpp: c++ code to do the SVD-entropy feature selection
repurposing_feature_selection_sc_fc.R: example of feature selection with findCorrelation on Repurposing dataset
repurposing_waterfallPlot.Rmd: example to create waterfalls plot on the Repurposing dataset

Author

Marie Duc

Acknowledgments

Dr. Mohammad Rohban
Dr. Anne Carpenter
All the Imaging Platform at the Broad Institute

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Optimizing the Image-Based Cell Profiling Pipeline

Getting Started

Prerequisites

Most important files

Author

Acknowledgments

Files

README.md

Latest commit

History

README.md

File metadata and controls

Optimizing the Image-Based Cell Profiling Pipeline

Getting Started

Prerequisites

Most important files

Author

Acknowledgments