Built site for gh-pages
Quarto_GHA_Runner committed Sep 12, 2024
1 parent 9d47d63 commit 1438c02
Showing 24 changed files with 373 additions and 373 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
Original file line number Diff line number Diff line change
@@ -1 +1 @@
- 681c0219
+ 0bb361a4
2 changes: 1 addition & 1 deletion explanations/FAQ.ipynb
@@ -152,7 +152,7 @@
" of these replicates’ value was in turn the mean of all the sites\n",
" and cells in a given well."
],
- "id": "8d5696cb-d7b6-4d19-9daf-316acf4fb80e"
+ "id": "76f1d9ce-f912-41b2-bcba-c93d55b1541d"
}
],
"nbformat": 4,
2 changes: 1 addition & 1 deletion explanations/Resources.ipynb
@@ -28,7 +28,7 @@
" [website](https://www.springscience.com/jump-cp) for data\n",
" exploration (account needed)."
],
- "id": "929910cb-1283-4ad0-b8b1-01ca932ba8c2"
+ "id": "9638699e-e48e-4448-a927-06063f27081c"
}
],
"nbformat": 4,
2 changes: 1 addition & 1 deletion explanations/glossary.ipynb
@@ -63,7 +63,7 @@
"for compound probes). q-value: Expected False Discovery Rate (FDR): the\n",
"proportion of false positives among all positive results."
],
- "id": "8a5d959e-1f0a-43a9-ad87-b95ef603e94e"
+ "id": "c06bf953-fe68-450b-a7a6-c8fd8e4c9a2b"
}
],
"nbformat": 4,
16 changes: 8 additions & 8 deletions howto/1_retrieve_profiles.html
@@ -260,7 +260,7 @@ <h1 class="title">Retrieve JUMP profiles</h1>


<p>This is a tutorial on how to access profiles from the <a href="https://github.com/jump-cellpainting/datasets">JUMP Cell Painting datasets</a>. We will use polars to fetch the data frames lazily, with the help of <code>s3fs</code> and <code>pyarrow</code>. We prefer lazy loading because the data can be too big to be handled in memory.</p>
- <div id="62ea9adc" class="cell" title="Imports" data-execution_count="1">
+ <div id="dcc0eab6" class="cell" title="Imports" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> polars <span class="im">as</span> pl</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>The shapes of the available datasets are:</p>
@@ -270,11 +270,11 @@ <h1 class="title">Retrieve JUMP profiles</h1>
<li><code>cpg0016-jump[compound]</code>: Chemical perturbations.</li>
</ol>
<p>Their explicit location is determined by the transformations that produce the datasets. The aws paths of the dataframes are built from a prefix below:</p>
- <div id="d992d9d3" class="cell" title="Paths" data-execution_count="2">
+ <div id="18eed3b7" class="cell" title="Paths" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>INDEX_FILE <span class="op">=</span> <span class="st">"https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>We use a version-controlled csv to release the latest corrected profiles</p>
- <div id="94abcbc3" class="cell" data-execution_count="3">
+ <div id="c6e5ae1a" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>profile_index <span class="op">=</span> pl.read_csv(INDEX_FILE)</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>profile_index.head()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="3">
@@ -333,7 +333,7 @@ <h1 class="title">Retrieve JUMP profiles</h1>
</div>
</div>
<p>We do not need the ‘etag’ (used to check file integrity) column nor the ‘interpretable’ (i.e., before major modifications)</p>
- <div id="f1ddc271" class="cell" data-execution_count="4">
+ <div id="1af7c2fb" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>selected_profiles <span class="op">=</span> profile_index.<span class="bu">filter</span>(</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> pl.col(<span class="st">"subset"</span>).is_in((<span class="st">"crispr"</span>, <span class="st">"orf"</span>, <span class="st">"compound"</span>))</span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>).select(pl.exclude(<span class="st">"etag"</span>))</span>
@@ -344,7 +344,7 @@ <h1 class="title">Retrieve JUMP profiles</h1>
</div>
</div>
<p>We will lazy-load the dataframes and print the number of rows and columns</p>
- <div id="54e44892" class="cell" data-execution_count="5">
+ <div id="5d88c9f0" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>info <span class="op">=</span> {k: [] <span class="cf">for</span> k <span class="kw">in</span> (<span class="st">"dataset"</span>, <span class="st">"#rows"</span>, <span class="st">"#cols"</span>, <span class="st">"#Metadata cols"</span>, <span class="st">"Size (MB)"</span>)}</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> name, path <span class="kw">in</span> filepaths.items():</span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> data <span class="op">=</span> pl.scan_parquet(path)</span>
@@ -414,7 +414,7 @@ <h1 class="title">Retrieve JUMP profiles</h1>
</div>
</div>
<p>Let us now focus on the <code>crispr</code> dataset and use a regex to select the metadata columns. We will then sample rows and display the overview. Note that the collect() method enforces loading some data into memory.</p>
- <div id="eaf2605d" class="cell" data-execution_count="6">
+ <div id="8c8bbef7" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>data <span class="op">=</span> pl.scan_parquet(filepaths[<span class="st">"crispr"</span>])</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>data.select(pl.col(<span class="st">"^Metadata.*$"</span>).sample(n<span class="op">=</span><span class="dv">5</span>, seed<span class="op">=</span><span class="dv">1</span>)).collect()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="6">
@@ -480,7 +480,7 @@ <h1 class="title">Retrieve JUMP profiles</h1>
</div>
</div>
<p>The following line excludes the metadata columns:</p>
- <div id="772cde92" class="cell" data-execution_count="7">
+ <div id="795fc4f1" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>data_only <span class="op">=</span> data.select(pl.<span class="bu">all</span>().exclude(<span class="st">"^Metadata.*$"</span>).sample(n<span class="op">=</span><span class="dv">5</span>, seed<span class="op">=</span><span class="dv">1</span>)).collect()</span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>data_only</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="7">
@@ -1043,7 +1043,7 @@ <h1 class="title">Retrieve JUMP profiles</h1>
</div>
</div>
<p>Finally, we can convert this to <code>pandas</code> if we want to perform analyses with that tool. Keep in mind that this loads the entire dataframe into memory.</p>
- <div id="5123b93b" class="cell" data-execution_count="8">
+ <div id="da384f48" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>data_only.to_pandas()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="8">
<div>
32 changes: 16 additions & 16 deletions howto/1_retrieve_profiles.ipynb
@@ -12,7 +12,7 @@
"and `pyarrow`. We prefer lazy loading because the data can be too big to\n",
"be handled in memory."
],
- "id": "e97fd426-67e8-4ede-a761-020bf876ed25"
+ "id": "c207e014-f2db-4136-8569-5fac72a40e20"
},
{
"cell_type": "code",
@@ -24,7 +24,7 @@
"source": [
"import polars as pl"
],
- "id": "bda87b61"
+ "id": "050fdbe0"
},
{
"cell_type": "markdown",
@@ -40,7 +40,7 @@
"produce the datasets. The aws paths of the dataframes are built from a\n",
"prefix below:"
],
- "id": "12964272-caf6-4d5a-bffd-79e266306ec4"
+ "id": "456b6441-9b59-4668-a3f6-6ca9fd0638e2"
},
{
"cell_type": "code",
@@ -52,15 +52,15 @@
"source": [
"INDEX_FILE = \"https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv\""
],
- "id": "40c00f57"
+ "id": "dff5e9f9"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use a version-controlled csv to release the latest corrected profiles"
],
- "id": "eaedc4b3-e2ff-497e-8b06-fa8ccad23f15"
+ "id": "294aca9f-26b0-4db9-b3ae-4cd328db8c6c"
},
{
"cell_type": "code",
@@ -81,7 +81,7 @@
"profile_index = pl.read_csv(INDEX_FILE)\n",
"profile_index.head()"
],
- "id": "da1585b1"
+ "id": "be3d7bc9"
},
{
"cell_type": "markdown",
@@ -90,7 +90,7 @@
"We do not need the ‘etag’ (used to check file integrity) column nor the\n",
"‘interpretable’ (i.e., before major modifications)"
],
- "id": "fb519c55-606c-44a5-9de3-9646b3904a9b"
+ "id": "8748e5de-4a85-4aec-a40e-17fe629c4da9"
},
{
"cell_type": "code",
@@ -112,7 +112,7 @@
"filepaths = dict(selected_profiles.iter_rows())\n",
"print(filepaths)"
],
- "id": "fa128411"
+ "id": "f0d2b220"
},
{
"cell_type": "markdown",
@@ -121,7 +121,7 @@
"We will lazy-load the dataframes and print the number of rows and\n",
"columns"
],
- "id": "0f9f9d9b-5fad-4c4d-848c-917f996870b2"
+ "id": "8211888a-f29b-4140-8ada-2eb7fd26b9e2"
},
{
"cell_type": "code",
@@ -153,7 +153,7 @@
"\n",
"pl.DataFrame(info)"
],
- "id": "be9cab4b"
+ "id": "110f0995"
},
{
"cell_type": "markdown",
@@ -163,7 +163,7 @@
"metadata columns. We will then sample rows and display the overview.\n",
"Note that the collect() method enforces loading some data into memory."
],
- "id": "ce4f1a2f-13ee-4154-aca2-d1337ae63131"
+ "id": "1ba2c4e6-32d7-490c-8ae4-ea2ed71af70a"
},
{
"cell_type": "code",
@@ -184,15 +184,15 @@
"data = pl.scan_parquet(filepaths[\"crispr\"])\n",
"data.select(pl.col(\"^Metadata.*$\").sample(n=5, seed=1)).collect()"
],
- "id": "c6ee77bb"
+ "id": "6c420d21"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following line excludes the metadata columns:"
],
- "id": "8722a143-fe76-4580-a487-3299ad49840f"
+ "id": "1be03951-55c7-4e1c-bda9-0be92a0083d4"
},
{
"cell_type": "code",
@@ -213,7 +213,7 @@
"data_only = data.select(pl.all().exclude(\"^Metadata.*$\").sample(n=5, seed=1)).collect()\n",
"data_only"
],
- "id": "1c4f6b4c"
+ "id": "7ac64f6f"
},
{
"cell_type": "markdown",
@@ -223,7 +223,7 @@
"with that tool. Keep in mind that this loads the entire dataframe into\n",
"memory."
],
- "id": "0bdebe1f-9cdd-457a-a46c-f1b07966f120"
+ "id": "14c7a423-e8c6-4b18-b3f2-ee46f7350fa7"
},
{
"cell_type": "code",
@@ -245,7 +245,7 @@
"source": [
"data_only.to_pandas()"
],
- "id": "80e1a977"
+ "id": "c8a420b3"
}
],
"nbformat": 4,
12 changes: 6 additions & 6 deletions howto/2_add_metadata.html
@@ -260,12 +260,12 @@ <h1 class="title">Incorporate metadata into profiles</h1>


<p>A very common task when processing morphological profiles is knowing which ones are treatments and which ones are controls. Here we will explore how we can use broad-babel to accomplish this task.</p>
- <div id="661e1bda" class="cell" title="Imports" data-execution_count="1">
+ <div id="619eda04" class="cell" title="Imports" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> polars <span class="im">as</span> pl</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> broad_babel.query <span class="im">import</span> get_mapper</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>We will be using the CRISPR dataset specified in our index csv.</p>
- <div id="e82dc867" class="cell" title="Fetch the CRISPR dataset" data-execution_count="2">
+ <div id="b778ce48" class="cell" title="Fetch the CRISPR dataset" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>INDEX_FILE <span class="op">=</span> <span class="st">"https://raw.githubusercontent.com/jump-cellpainting/datasets/50cd2ab93749ccbdb0919d3adf9277c14b6343dd/manifests/profile_index.csv"</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>CRISPR_URL <span class="op">=</span> pl.read_csv(INDEX_FILE).<span class="bu">filter</span>(pl.col(<span class="st">"subset"</span>) <span class="op">==</span> <span class="st">"crispr"</span>).item(<span class="dv">0</span>, <span class="st">"url"</span>)</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>profiles <span class="op">=</span> pl.scan_parquet(CRISPR_URL)</span>
@@ -275,7 +275,7 @@ <h1 class="title">Incorporate metadata into profiles</h1>
</div>
</div>
<p>For simplicity, the contents of our processed profiles are minimal: “The profile origin” (source, plate and well) and the unique JUMP identifier for that perturbation. We will use broad-babel to further expand on this metadata, but for simplicity’s sake let us sample a subset of the data.</p>
- <div id="6b6af033" class="cell" title="Subset data" data-execution_count="3">
+ <div id="c72a1211" class="cell" title="Subset data" data-execution_count="3">
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>jcp_ids <span class="op">=</span> (</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> profiles.select(pl.col(<span class="st">"Metadata_JCP2022"</span>)).unique().collect().to_series().sort()</span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>)</span>
@@ -298,7 +298,7 @@ <h1 class="title">Incorporate metadata into profiles</h1>
</div>
</div>
<p>We will use these JUMP ids to obtain a mapper that indicates the perturbation type (trt, negcon or, rarely, poscon)</p>
- <div id="b214eb16" class="cell" title="Pull mapper" data-execution_count="4">
+ <div id="67377dbd" class="cell" title="Pull mapper" data-execution_count="4">
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>pert_mapper <span class="op">=</span> get_mapper(</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a> subsample, input_column<span class="op">=</span><span class="st">"JCP2022"</span>, output_columns<span class="op">=</span><span class="st">"JCP2022,pert_type"</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>)</span>
@@ -319,7 +319,7 @@ <h1 class="title">Incorporate metadata into profiles</h1>
</div>
<p>A couple of important notes about broad_babel’s <code>get_mapper</code> and other functions:</p>
<ul>
<li>Their inputs must be passed as tuples, because calls are cached, which provides significant speed-ups for repeated calls.</li>
<li><code>get_mapper</code> works for datasets of up to a few tens of thousands of samples. If you try to use it to get a mapper for the entirety of the ‘compounds’ dataset it is likely to fail. For these cases we suggest the more general function ‘run_query’. You can read more on this and other use-cases on Babel’s <a href="https://github.com/broadinstitute/monorepo/tree/main/libs/jump_babel">readme</a>.</li>
</ul>
<p>We will now repeat the process to get their ‘standard’ name</p>
- <div id="aa62c9c5" class="cell" title="Fetch standard name" data-execution_count="5">
+ <div id="4a4119ac" class="cell" title="Fetch standard name" data-execution_count="5">
<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>name_mapper <span class="op">=</span> get_mapper(</span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a> (<span class="op">*</span>subsample, <span class="st">"JCP2022_800002"</span>),</span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> input_column<span class="op">=</span><span class="st">"JCP2022"</span>,</span>
@@ -341,7 +341,7 @@ <h1 class="title">Incorporate metadata into profiles</h1>
</div>
</div>
<p>To wrap up, we will fetch all the available profiles for these perturbations and use the mappers to add the missing metadata. We also select a few features to showcase how selection can be performed in polars.</p>
- <div id="7474b214" class="cell" title="Filter profiles and merge metadata" data-execution_count="6">
+ <div id="afc5221d" class="cell" title="Filter profiles and merge metadata" data-execution_count="6">
<div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>subsample_profiles <span class="op">=</span> profiles.<span class="bu">filter</span>(</span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a> pl.col(<span class="st">"Metadata_JCP2022"</span>).is_in(subsample)</span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>).collect()</span>