import polars as pl
The shapes of the available datasets are:
@@ -270,11 +270,11 @@Retrieve JUMP profiles
cpg0016-jump[compound]
: Chemical perturbations.
Their explicit location is determined by the transformations that produce the datasets. The AWS paths of the dataframes are built from a prefix below:
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"
We use a version-controlled csv to release the latest corrected profiles
profile_index = pl.read_csv(INDEX_FILE)
profile_index.head()
We do not need the ‘etag’ (used to check file integrity) column nor the ‘interpretable’ (i.e., before major modifications)
selected_profiles = profile_index.filter(
    pl.col("subset").is_in(("crispr", "orf", "compound"))
).select(pl.exclude("etag"))
@@ -344,7 +344,7 @@ ).select(pl.exclude(Retrieve JUMP profiles
We will lazy-load the dataframes and print the number of rows and columns
info = {k: [] for k in ("dataset", "#rows", "#cols", "#Metadata cols", "Size (MB)")}
for name, path in filepaths.items():
    data = pl.scan_parquet(path)
@@ -414,7 +414,7 @@ data Retrieve JUMP profiles
Let us now focus on the crispr dataset and use a regex to select the metadata columns. We will then sample rows and display the overview. Note that calling collect() forces the sampled data to be loaded into memory.
data = pl.scan_parquet(filepaths["crispr"])
data.select(pl.col("^Metadata.*$").sample(n=5, seed=1)).collect()
The following line excludes the metadata columns:
data_only = data.select(pl.all().exclude("^Metadata.*$").sample(n=5, seed=1)).collect()
data_only
Finally, we can convert this to pandas if we want to perform analyses with that tool. Keep in mind that this loads the entire dataframe into memory.
data_only.to_pandas()
Incorporate metadata into profiles
A very common task when processing morphological profiles is knowing which ones are treatments and which ones are controls. Here we will explore how we can use broad-babel to accomplish this task.
import polars as pl
from broad_babel.query import get_mapper
We will be using the CRISPR dataset specified in our index CSV.
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"
CRISPR_URL = pl.read_csv(INDEX_FILE).filter(pl.col("subset") == "crispr").item(0, "url")
profiles = pl.scan_parquet(CRISPR_URL)
@@ -275,7 +275,7 @@ profiles Incorporate metadata into profiles
For simplicity, the contents of our processed profiles are minimal: the profile origin (source, plate and well) and the unique JUMP identifier for that perturbation. We will use broad-babel to further expand on this metadata, but to keep things short let us sample a subset of the data.
jcp_ids = (
    profiles.select(pl.col("Metadata_JCP2022")).unique().collect().to_series().sort()
)
@@ -298,7 +298,7 @@ )Incorporate metadata into profiles
We will use these JUMP ids to obtain a mapper that indicates the perturbation type (trt, negcon or, rarely, poscon), that is, treatment, negative control or positive control.
pert_mapper = get_mapper(
    subsample, input_column="JCP2022", output_columns="JCP2022,pert_type"
)
@@ -319,7 +319,7 @@ )Incorporate metadata into profiles
A couple of important notes about broad_babel’s get_mapper and other functions. First, they must be fed tuples: calls are cached, which provides significant speed-ups for repeated calls. Second, get_mapper works for datasets of up to a few tens of thousands of samples. If you try to use it to get a mapper for the entirety of the ‘compound’ dataset it is likely to fail. For these cases we suggest the more general function run_query. You can read more on this and other use-cases in Babel’s readme.
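The tuple requirement is easier to see with a small stand-alone sketch. It assumes the caching behaves like Python's functools.lru_cache, which is an assumption about broad_babel's internals and not something confirmed here. The function cached_lookup is a hypothetical stand-in for an expensive query.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_lookup(ids):
    """Hypothetical stand-in for an expensive metadata query."""
    return tuple(f"looked-up:{i}" for i in ids)


ids = ("JCP2022_800002",)
first = cached_lookup(ids)   # computed
second = cached_lookup(ids)  # served from the cache
hits = cached_lookup.cache_info().hits

# Lists are not hashable, so they cannot be used as a cache key.
try:
    cached_lookup(list(ids))
    list_accepted = True
except TypeError:
    list_accepted = False

print(hits, list_accepted)
```

Under that assumption, converting ids to a tuple before the call is what makes repeated lookups cheap.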
We will now repeat the process to get their ‘standard’ name
name_mapper = get_mapper(
    (*subsample, "JCP2022_800002"),
    input_column="JCP2022",
@@ -341,7 +341,7 @@ input_columnIncorporate metadata into profiles
To wrap up, we will fetch all the available profiles for these perturbations and use the mappers to add the missing metadata. We also select a few features to showcase how selection can be performed in polars.
subsample_profiles = profiles.filter(
    pl.col("Metadata_JCP2022").is_in(subsample)
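The idea of using a mapper to add missing metadata can be sketched in plain Python, assuming the mapper behaves like a dictionary from JUMP id to the requested value. The mapper contents and profile rows below are made up, except that JCP2022_800002 is the negative control added to the subsample above.

```python
# Hypothetical mapper; a real one would come from get_mapper.
toy_pert_mapper = {
    "JCP2022_800002": "negcon",
    "JCP2022_000001": "trt",  # made-up id for illustration
}

# Hypothetical profile rows with a single feature each.
toy_profiles = [
    {"Metadata_JCP2022": "JCP2022_000001", "X_1": 0.12},
    {"Metadata_JCP2022": "JCP2022_800002", "X_1": -0.03},
    {"Metadata_JCP2022": "JCP2022_999999", "X_1": 0.40},  # absent from the mapper
]

# Add a pert_type field to every row; ids missing from the mapper get None.
annotated = [
    {**row, "pert_type": toy_pert_mapper.get(row["Metadata_JCP2022"])}
    for row in toy_profiles
]

for row in annotated:
    print(row)
```

Ids that the mapper does not know end up with a missing value, which is worth checking for before any downstream filtering on pert_type.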
diff --git a/howto/2_add_metadata.ipynb b/howto/2_add_metadata.ipynb
index a182273..9a41af7 100644
--- a/howto/2_add_metadata.ipynb
+++ b/howto/2_add_metadata.ipynb
@@ -10,7 +10,7 @@
"which ones are treatments and which ones are controls. Here we will\n",
"explore how we can use broad-babel to accomplish this task."
],
- "id": "9b0c4aca-4139-4e19-9963-54860e5b2c26"
+ "id": "b7c23e83-3a1c-4867-a24a-5ea1e6619f45"
},
{
"cell_type": "code",
@@ -23,7 +23,7 @@
"import polars as pl\n",
"from broad_babel.query import get_mapper"
],
- "id": "3e7e9095"
+ "id": "50c8f860"
},
{
"cell_type": "markdown",
@@ -31,7 +31,7 @@
"source": [
"We will be using the CRISPR dataset specificed in our index csv."
],
- "id": "f46085f0-e839-4061-81fc-8d4c59de2596"
+ "id": "83e02149-8f7c-42be-829a-e1d1dad3fae8"
},
{
"cell_type": "code",
@@ -54,7 +54,7 @@
"profiles = pl.scan_parquet(CRISPR_URL)\n",
"print(profiles.collect_schema().names()[:6])"
],
- "id": "cb0495b5"
+ "id": "b8bcb0fb"
},
{
"cell_type": "markdown",
@@ -65,7 +65,7 @@
"for that perturbation. We will use broad-babel to further expand on this\n",
"metadata, but for simplicity’s sake let us sample subset of data."
],
- "id": "46c6af9b-0680-4ac4-809e-a7d904d742b1"
+ "id": "e5f8e5e6-8079-436b-a039-a61440ef6896"
},
{
"cell_type": "code",
@@ -103,7 +103,7 @@
"subsample = (*subsample, \"JCP2022_800002\")\n",
"subsample"
],
- "id": "4359c0f0"
+ "id": "de261922"
},
{
"cell_type": "markdown",
@@ -112,7 +112,7 @@
"We will use these JUMP ids to obtain a mapper that indicates the\n",
"perturbation type (trt, negcon or, rarely, poscon)"
],
- "id": "bd58da00-1aa9-4910-973d-cf774b093503"
+ "id": "c6761bb3-3f38-4ff6-9a02-d60723934736"
},
{
"cell_type": "code",
@@ -147,7 +147,7 @@
")\n",
"pert_mapper"
],
- "id": "7c3e462f"
+ "id": "635c1625"
},
{
"cell_type": "markdown",
@@ -164,7 +164,7 @@
"\n",
"We will now repeat the process to get their ‘standard’ name"
],
- "id": "59a9634c-eab8-4a6a-8ea4-c274639d9d8a"
+ "id": "9b30d97e-8c92-49c2-8a55-dfe7654be290"
},
{
"cell_type": "code",
@@ -201,7 +201,7 @@
")\n",
"name_mapper"
],
- "id": "eb65c06c"
+ "id": "3c645adf"
},
{
"cell_type": "markdown",
@@ -212,7 +212,7 @@
"select a few features to showcase how how selection can be performed in\n",
"polars."
],
- "id": "e2d6af67-0def-4485-9caf-20cd06890c96"
+ "id": "89e60df7-2aee-4de5-9e15-10f34b936e14"
},
{
"cell_type": "code",
@@ -243,7 +243,7 @@
" pl.col((\"name\", \"pert_type\", \"^Metadata.*$\", \"^X_[0-3]$\"))\n",
").sort(by=\"pert_type\")"
],
- "id": "ae5ef7e7"
+ "id": "499181c8"
}
],
"nbformat": 4,
diff --git a/howto/3_calculate_activity.html b/howto/3_calculate_activity.html
index c6f8c62..ea97c3b 100644
--- a/howto/3_calculate_activity.html
+++ b/howto/3_calculate_activity.html
@@ -261,7 +261,7 @@ ).collect()Calculate phenotypic activity
A common first analysis for morphological datasets is to measure the activity of the cells’ phenotypes. We will use the copairs package, which uses mean average precision to obtain a metric of replicability for any set of morphological profiles. In other words, it indicates how similar the profiles of a given set of perturbations are to each other, relative to their negative controls, which are usually cells that have experienced no perturbation.
-
+
import polars as pl
import polars.selectors as cs
import seaborn as sns
@@ -269,13 +269,13 @@ Calculate phenotypic activity
from copairs.map import average_precision
We will be using the CRISPR dataset specified in our index CSV, but we will select a subset of perturbations and the controls present.
-
+
INDEX_FILE = "https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"
CRISPR_URL = pl.read_csv(INDEX_FILE).filter(pl.col("subset") == "crispr").item(0, "url")
profiles = pl.scan_parquet(CRISPR_URL)
Sample perturbations and add known negative control.
-
+
jcp_ids = (
    profiles.select(pl.col("Metadata_JCP2022")).unique().collect().to_series().sort()
)
@@ -291,7 +291,7 @@ )Calculate phenotypic activity
perts_controls.head()
Now we create a mapper to label treatments and controls. See the previous tutorial for details on fetching metadata.
-
+
pert_mapper = get_mapper(
    subsample, input_column="JCP2022", output_columns="JCP2022,pert_type"
)
@@ -300,7 +300,7 @@ )Calculate phenotypic activity
)
Finally we use the parameters from . See the copairs wiki for more details on the parameters that copairs requires.
-
+
pos_sameby = ["Metadata_JCP2022"]  # We want to match perturbations
pos_diffby = []
neg_sameby = []
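To make "match perturbations" concrete, the sketch below enumerates the pairs that share the same Metadata_JCP2022 value, which is what a same-by rule on that column is assumed to mean. It is an illustration in plain Python with made-up ids, not copairs' own pair-finding code.

```python
from itertools import combinations

# Hypothetical toy profiles: (row index, Metadata_JCP2022 value).
rows = [
    (0, "JCP2022_A"),
    (1, "JCP2022_A"),
    (2, "JCP2022_B"),
    (3, "JCP2022_B"),
    (4, "JCP2022_A"),
]

# Positive pairs: two different rows that share the same perturbation id.
positive_pairs = [
    (i, j)
    for (i, id_i), (j, id_j) in combinations(rows, 2)
    if id_i == id_j
]

print(positive_pairs)
```

Leaving pos_diffby empty, as in the code above, is assumed to add no further restriction on which of these pairs count.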
@@ -326,12 +326,12 @@ neg_sameby Calculate phenotypic activity
result.head()
@@ -426,7 +426,7 @@ Calculate phenotypic activity
The result of copairs is a dataframe containing, in addition to the original metadata, the average precision with which perturbations were retrieved. Perturbations that look more similar to each other than to the negative controls present in the same plates will score higher. Perturbations that do not differentiate themselves from negative controls will be closer to zero.
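For intuition about the metric itself, here is a minimal sketch of average precision over a ranked list. It assumes the neighbours of a query profile have already been sorted from most to least similar and flagged as replicate (1) or negative control (0). The helper toy_average_precision is hypothetical and is not the copairs implementation.

```python
def toy_average_precision(relevance):
    """Average precision for 0/1 flags ordered from most to least similar."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: fraction of relevant items seen so far.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Replicates ranked ahead of all controls.
best = toy_average_precision([1, 1, 0, 0])
# Replicates ranked behind all controls.
worst = toy_average_precision([0, 0, 1, 1])

print(best, worst)
```

The first ranking yields 1.0 and the second about 0.42, which matches the reading that perturbations whose replicates outrank the controls score higher.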
To wrap up we pull the standard gene symbol and plot the distribution of average precision.
-
+
name_mapper = get_mapper(
    subsample, input_column="JCP2022", output_columns="JCP2022,standard_key"
)
@@ -451,7 +451,7 @@ )Calculate phenotypic activity