update documentation (add filtering, update images)

pinellolab · Sep 22, 2023 · c0a94ea · c0a94ea
1 parent 514fca7
commit c0a94ea
Show file tree

Hide file tree

Showing 7 changed files with 91 additions and 32 deletions.
diff --git a/README.md b/README.md
@@ -8,16 +8,22 @@
 This is an analysis toolkit for the pooled CRISPR reporter or sensor data. The reporter technique transfects cells with plasmid with not only sgRNA but with the **target sequence surrogate** which we call **reporter** or **sensor**.  
 
 
-<img src="imgs/reporter_construct.svg" alt="Reporter construct" width="500"/>
+<img src="imgs/reporter.jpg" alt="Reporter construct" width="700"/>
 
 ## Overview
-`crispr-bean` supports the following functionalities.
-* 1. [`bean-count`, `bean-count-sample`](#count-reporter-screen-data): Base-editing-aware mapping of guide, optionally with reporter from `.fastq` files.  
-* 2. [`bean-qc`](#bean-qc-qc-of-reporter-screen-data): Quality control report and filtering out / masking of aberrant sample and guides  
-* 3. [`bean-filter`](#bean-filter-filtering-and-optionally-translating-alleles): Filter reporter alleles
-* 4. [`bean-run`](#bean-run-quantify-variant-effects): Quantify targeted variants' effect sizes from screen data.
+`crispr-bean` supports end-to-end analysis of pooled sorting screens, with or without reporter.
+<img src="imgs/dag_bean.png" alt="dag_bean.svg" width="700"/>
+1. [`bean-count-sample`](#bean-count-samples-count-reporter-screen-data): Base-editing-aware **mapping** of guide, optionally with reporter from `.fastq` files.  
+2. [`bean-qc`](#bean-qc-qc-of-reporter-screen-data): Quality control report and filtering out / masking of aberrant sample and guides  
+3. [`bean-filter`](#bean-filter-filtering-and-optionally-translating-alleles): Filter reporter alleles; essential for `tiling` mode that allows for all alleles generated from gRNA.
+4. [`bean-run`](#bean-run-quantify-variant-effects): Quantify targeted variants' effect sizes from screen data.  
+
+BEAN stores mapped gRNA and allele counts in `ReporterScreen` object which is compatible with [AnnData](https://anndata.readthedocs.io/en/latest/index.html). See [Data Structure](#data-structure) section for more information.
+
 We provide example scripts in `tests/`. Running `pytest --sparse-ordering` generates example input/output files from running 1 and 2-4 sequentially.
 
+<br/><br/>
+
 ## Installation 
 ### Full installation
 First install [pytorch](https://pytorch.org/) >=12.1,<2.
@@ -32,8 +38,10 @@ pip install crispr-bean
 ```
 This wouldn't have variant effect size quantification (`bean-run`) functionality.  
 
-## Count reporter screen data  
-`bean-count-samples` or `bean-count` maps guide into guide counts, **allowing for base transition in spacer sequence**. When the matched reporter information is provided, it can count the **target site edits** and **alleles produced by each guide**. Mapping is efficiently done based on [CRISPResso2](https://github.com/pinellolab/CRISPResso2) modified for base-edit-aware mapping.
+<br/><br/>
+
+## `bean-count-samples`: Count (reporter) screen data  
+`bean-count-samples` (or `bean-count` for a single sample) maps guide into guide counts, **allowing for base transition in spacer sequence**. When the matched reporter information is provided, it can count the **target site edits** and **alleles produced by each guide**. Mapping is efficiently done based on [CRISPResso2](https://github.com/pinellolab/CRISPResso2) modified for base-edit-aware mapping.
 
 
 
@@ -47,8 +55,9 @@ bean-count-samples \
   -t 12                     `# number of threads` \
   --name my_sorting_screen  `# name of this sample run`
 ```
+
 ### Input file format
-#### gRNA_library.csv
+#### 1. gRNA_library.csv
 File should contain following columns.
 * `name`: gRNA ID column
 * `sequence`: gRNA sequence
@@ -67,7 +76,7 @@ File should contain following columns.
   * `start_pos`: gRNA starting position in the genome. Required when you provide `strand` column. Should specify the smaller coordinate value among start and end position regardless of gRNA strandedness.
 
 
-#### sample_list.csv
+#### 2. sample_list.csv
 File should contain following columns with header.
 * `R1_filepath`: Path to read 1 `.fastq[.gz]` file
 * `R2_filepath`: Path to read 1 `.fastq[.gz]` file
@@ -81,16 +90,11 @@ Optional columns are not required but can be provided for compatibility with `be
 
 ### Output file format
 `count` or `count-samples` produces `.h5ad` and `.xlsx` file with guide and per-guide allele counts.  
-* `.h5ad`: This output file follows annotated matrix format compatible with `AnnData` and is based on `Screen` object in [purturb_tools](https://github.com/pinellolab/perturb-tools). The object contains the per-guide allele counts.
-  * `.guides`: guide information provided in input (`gRNA_library.csv` in above example)
-  * `.samples`: sample information provided in input (`sample_list.csv` in above example)
-  * `.X`: Main guide count matrix, where row corresponds to each guide in `.guides` and columns correspond to samples in `.samples`.
-Following attributes are included if matched reporter is provided and you chose to read edit/allele information from the reporter using `-r` option.
-  * `.X_bcmatch [Optional]`: Contains information about number of barcode-matched reads. Information about R2 barcode should be specified as `barcode` column in your `gRNA_library.csv` file.
-  * `.X_edits [Optional]`: If target position of each guide is specified as `target_pos` in input `gRNA_library.csv` file and `--match-target-position` option is provided, the result has the matrix with the number of target edit at the specified positions.
-  * `.allele_tables [Optional]`: Dictionary with a single allele count table that counts per guide and allele combination, what is the count per sample. 
-* `.xlsx`: This output file contains `.guides`, `.samples`, `.X[_bcmatch,_edits]`. (`allele_tables` are often too large to write into an Excel!)
-<img src="imgs/screendata.svg" alt="screendata" width="700"/>
+* `.h5ad`: This output file follows annotated matrix format compatible with `AnnData` and is based on `Screen` object in [purturb_tools](https://github.com/pinellolab/perturb-tools). See [Data Structure](#data-structure) section for more information.  
+* `.xlsx`: This output file contains `.guides`, `.samples`, `.X[_bcmatch,_edits]`. (`allele_tables` are often too large to write into an Excel!)  
+
+
+<br/><br/>
 
 ## `bean-qc`: QC of reporter screen data
 ```
@@ -123,32 +127,69 @@ Above command produces
 * `--posctrl-val` (default: `PosCtrl`): Value in .h5ad.guides[`posctrl_col`] that specifies guide will be used as the positive control in calculating log fold change.
 * `--lfc-thres` (default: `0.1`): Positive guides' correlation threshold to filter out.
 
+<br/><br/>
+
+
 ## `bean-filter`: Filtering (and optionally translating) alleles
+As `tiling` mode of `bean-run` accounts for any robustly observed alleles, `bean-filter` filters for such alleles.
+```
+bean-filter my_sorting_screen_masked.h5ad -o my_sorting_screen_filtered.h5ad 
 ```
-bean-filter my_sorting_screen.h5ad -o my_sorting_screen_masked.h5ad [--translate [--translate-fasta gene_exon.fa, --translate-fastas-csv gene_exon_fas.csv]]
+
+If you want to obtain **amino acid level variant** for coding sequence tiling screens, provide coding sequence positions which variants occuring within the coding sequence will be translated. 
+```
+bean-filter my_sorting_screen.h5ad -o my_sorting_screen_masked.h5ad --translate [--translate-fasta gene_exon.fa, --translate-fastas-csv gene_exon_fas.csv]
 ```
-If you want to obtain amino acid level variant
-* When library covers single gene: Feed `--translate --translate-fasta gene_exon.fa` argument where `gene_exon.fa` is the FASTA file with exon sequence and range (following `range=` tag in the FASTA header line) for  OR
+* When library covers single gene: Feed `--translate --translate-fasta gene_exon.fa` argument where `gene_exon.fa` is the FASTA file with exon sequence and range (following `range=` tag in the FASTA header line) 
 * When library covers multiple genes: Feed `--translate --translate-fastas-csv gene_exon_fas.csv` where `gene_exon_fas.csv` is the csv file with lines `gene_id,gene_exon_fasta_path` without header.
 * Exon FASTA files can be downloaded from UCSC Genomic sequences / Table Browser: [see the instruction](https://www.youtube.com/watch?v=T4E0Ez5Vjz8)
+    * If you provide exon positions in `.fa`, uppercase positions will be considered as coding sequence.
+* Translation will keep the variants outside the coding sequence as nucleotide-level variants, while aggregating variants leading to the same coding sequence variants.
 
+### Output
+Above command produces 
+* `my_sorting_screen_filtered.h5ad` with filtered alleles stored in `.uns`,   
+* `my_sorting_screen_filtered.filtered_allele_stats.pdf`, and `my_sorting_screen_filtered.filter_log.txt` that report allele count stats in each filtering step.
+
+
+### Additional parameters
+* `-o`, `--output-prefix` (default: `None`): Output prefix for log and ReporterScreen file with allele assignment.
+* `-p`, `--plasmid-path` (default: `None`): Plasmid `ReporterScreen` object path. 
+  * If provided, alleles are filtered based on if a nucleotide edit is more significantly enriched in sample compared to the plasmid data. 
+  * Negative control data where no edit is expected can be fed in instead of plasmid library.
+* `-w`, `--filter-window` (default: `False`): "Only consider edit within window provided by (`edit_start_pos`, `edit_end_pos`). If this flag is not provided, `--edit-start-pos` and `--edit-end-pos` flags are ignored.
+* `-s`, `--edit-start-pos` (default: `2`): 0-based start posiiton (inclusive) of edit relative to the start of guide spacer.
+* `-e`, `--edit-end-pos` (default: `7`): 0-based end position (exclusive) of edit relative to the start of guide spacer.
+* `-b`, `--filter-target-basechange` (default: `False`): Filter for target (intended) base change of edits (stored in `bdata.uns['target_base_change']`)
+* `-j`, `--jaccard-threshold` (default: `0.3`): Jaccard Index threshold when the alleles are mapped to the most similar alleles. In each filtering step, allele counts of filtered out alleles will be mapped to the most similar allele only if they have Jaccard Index of shared edit higher than this threshold.
+* `--translate` (default: `False`): Translate nucleotide-level variants prior to allele proportion filtering.
+* `-f`, `--translate-fasta` (defulat: `None`): fasta file path with exon positions. If not provided and `--translate` flag is provided, LDLR hg19 coordinates will be used.
+* `-fs`, `--translate-fastas-csv` (defulat: `None`): .csv with two columns with gene IDs and FASTA file path corresponding to each gene.
+* `-ap`, `--filter-allele-proportion` (default: `0.05`): If provided, only the alleles that exceed `filter_allele_proportion` in `filter-sample-proportion` will be retained.
+* `-ac`, `--filter-allele-count` (default: `5`): If provided, alleles that exceed `filter_allele_proportion` AND `filter_allele_count` in `filter-sample-proportion` will be retained.
+* `sp`, `--filter-sample-proportion` (default: `0.2`): "If `filter_allele_proportion` is provided, alleles that exceed `filter_allele_proportion` in `filter-sample-proportion` will be retained.
+
+
+<br/><br/>
 
 ## `bean-run`: Quantify variant effects
+BEAN uses Bayesian network to incorporate gRNA editing outcome to provide posterior estimate of variant phenotype.  
+<img src="imgs/bean.gif" alt="model" width="700"/>
 ```
-bean-run variant[tiling] my_sorting_screen_masked.h5ad --scale-by-acc --acc-bw-path accessibility_signal.bw -o output_prefix/ --fit-negctrl
+bean-run variant[tiling] my_sorting_screen_filtered.h5ad --scale-by-acc --acc-bw-path accessibility_signal.bw -o output_prefix/ --fit-negctrl
 ```
 
 ### Output
+<img src="imgs/model_output.png" alt="model" width="700"/>
+
 Above command produces
 * `output_prefix/bean_element_result.[model_type].csv` with following columns:
-  * `mu`: Mean of variant phenotype, given the wild type has standard normal phenotype distribution of `mu = 0, sd = 1`.
+  * `mu` (Effect size): Mean of variant phenotype, given the wild type has standard normal phenotype distribution of `mu = 0, sd = 1`.
   * `mu_sd`: Mean of variant phenotype `mu` is modeled as normal distribution. The column shows fitted standard deviation of `mu` that quantify the uncertainty of the variant effect.
   * `mu_z`: z-score of `mu`
   * `sd`: Standard deviation of variant phenotype, given the wild type has standard normal phenotype distribution of `mu = 0, sd = 1`.
-  * `fdr_dec`: FDR of `mu` being smaller than 0
-  * `fdr_inc`: FDR of `mu` being larger than 0
-  * `fdr`: FDR of `mu` being nonzero (`=min(fdr_dec, fdr_inc)`)
-When negative control is provided, above columns with `_adj` suffix are provided, which are the corresponding values adjusted for negative control.  
+  * `CI[0.025`, `0.975]`: Credible interval of `mu`
+  * When negative control is provided, above columns with `_adj` suffix are provided, which are the corresponding values adjusted for negative control.  
 * `output_prefix/bean_sgRNA_result.[model_type].csv`: 
   * `edit_rate`: Estimated editing rate at the target loci.
 
@@ -177,12 +218,30 @@ When negative control is provided, above columns with `_adj` suffix are provided
   * `-uq`, `--sorting-bin-upper-quantile-column` (default: `upper_quantile`): Column name with upper quantile values of each sorting bin in bdata.samples
   * `-lq`, `--sorting-bin-lower-quantile-column` (default: `lower_quantile`): Column name with lower quantile values of each sorting bin in bdata.samples
 
-## Using as python module
+<br></br>
+
+## Data Structure
+### ReporterScreen object
+BEAN stores mapped gRNA and allele counts in `ReporterScreen` object which is compatible with [AnnData](https://anndata.readthedocs.io/en/latest/index.html).  
+
+<img src="imgs/data_structure_v2.png" alt="ReporterScreen object structure" width="700" />
+
+  * `.guides`: guide information provided in input (`gRNA_library.csv` in above example)
+  * `.samples`: sample information provided in input (`sample_list.csv` in above example)
+  * `.X`: Main guide count matrix, where row corresponds to each guide in `.guides` and columns correspond to samples in `.samples`.
+Following attributes are included if matched reporter is provided and you chose to read edit/allele information from the reporter using `-r` option.
+  * `.X_bcmatch [Optional]`: Contains information about number of barcode-matched reads. Information about R2 barcode should be specified as `barcode` column in your `gRNA_library.csv` file.
+  * `.X_edits [Optional]`: If target position of each guide is specified as `target_pos` in input `gRNA_library.csv` file and `--match-target-position` option is provided, the result has the matrix with the number of target edit at the specified positions.
+  * `.allele_tables [Optional]`: Dictionary with a single allele count table that counts per guide and allele combination, what is the count per sample. 
+
+### Using BEAN as Python module
 ```
-import crispr_bean as be
+import bean as be
 cdata = be.read_h5ad("bean_counts_sample.h5ad")
 ```
-See the [**ReporterScreen API tutorial**](docs/ReporterScreen_api.ipynb) for more detail.
+Python package `bean` supports multiple data wrangling functionalities for `ReporterScreen` objects. See the [**ReporterScreen API tutorial**](docs/ReporterScreen_api.ipynb) for more detail.
+
+<br></br>
 
 ## Run time
 * Installation takes 14.4 mins after pytorch installation with pytorch in Dell XPS 13 Ubuntu WSL.

diff --git a/imgs/bean.gif b/imgs/bean.gif
diff --git a/imgs/dag_bean.png b/imgs/dag_bean.png
diff --git a/imgs/data_structure_v2.png b/imgs/data_structure_v2.png
diff --git a/imgs/model_output.png b/imgs/model_output.png
diff --git a/imgs/reporter.jpg b/imgs/reporter.jpg
diff --git a/imgs/reporter_construct.png b/imgs/reporter_construct.png