MSc Thesis Project
This project produces SHAP summary plots for the features of several CRISPR sgRNA design tools when run on different datasets. Given a tool name and a dataset name, it runs the SHAP analysis and produces a plot of the SHAP values for the 20 most important features. The values are also saved in pickle files so they can be reloaded quickly afterwards.
The scripts in this project require Python 3 and the following packages (installable with pip3), together with their dependencies:
joblib==0.15.1
matplotlib==3.2.2
pandas==1.0.4
pybedtools==0.8.1
scikit-learn==0.20.0
seaborn==0.10.1
shap==0.35.0
For a fast set-up, create a conda environment from the provided environment.yml file, which contains all the packages needed by the project. For instructions on how to load and use an environment, refer to: conda env
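As a minimal sketch (the environment name is whatever environment.yml defines; replace env-name accordingly), the environment can typically be created and activated with:
conda env create -f environment.yml
conda activate env-name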
To run the complete SHAP analysis:
python computeShapVals.py --tool ToolName --data DatasetName
Where the tool name is any of [tuscan-classification, tuscan-regression, sgRNAScorer2, wu-crispr, ssc, chop-chop-xu, chop-chop-doench, chop-chop-moreno] and the dataset name is any of [xu, doench]. Both arguments are case insensitive.
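For example, to run the analysis for Wu-CRISPR on the Xu dataset:
python computeShapVals.py --tool wu-crispr --data xu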
The plot will appear on the screen and the SHAP values will be saved in results/SHAP-toolName-datasetName as a pickle file. This is an example output:
To produce the plot from previously saved SHAP values:
python plotShapFromPickle.py --file pickleFileName
Where pickleFileName must be a pickle file in the results directory containing two pickled objects (one with the SHAP values and one with the data). These files are produced automatically by the computeShapVals.py script.
The plot produced will be the same as the above one, but the running time will be much faster.
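If you want to inspect the saved values directly, the two objects can be read back with the standard pickle module. The snippet below is only a sketch: the file name is hypothetical (following the results/SHAP-toolName-datasetName pattern) and the order of the two pickled objects is assumed from the description above.

import pickle

# Hypothetical file name; adjust to an actual file in results/
with open("results/SHAP-wu-crispr-xu", "rb") as f:
    shap_values = pickle.load(f)  # first pickled object: the SHAP values (assumed order)
    data = pickle.load(f)         # second pickled object: the feature data (assumed order)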
To produce plots with the SHAP values for positional features:
python plotPositionsFromPickle.py --file pickleFileName
Where pickleFileName is the same as above (a file from the results directory). This script extracts the positional features (e.g. G at position 19 in the guide) and shows their SHAP values as a bar plot.
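For example, reusing the pickle file produced earlier (hypothetical file name; depending on how the script resolves paths you may need to include or omit the results/ prefix):
python plotPositionsFromPickle.py --file SHAP-wu-crispr-xu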
The plot will appear on the screen and will show the SHAP values for each position in the guide. This is an example output:
To compare positions across different methods:
python comparePositionTools.py
This script produces a plot similar to the one above (all positional features for one tool), but combines multiple tools, allowing their positional preferences to be compared. It requires the pickle files for the chosen tools to have been generated.
An example of a produced plot, which compares the positional features of SSC, Tuscan-Classification and Wu-CRISPR:
To create a heatmap and CSV file of the average SHAP values across all the tools:
python shapToHeatMap.py
This script takes the average SHAP values of all positional features across all the tools and produces a heatmap showing the values together. It requires the pickle files for all tools to have been generated. For now, the tools and datasets to run are selected inside the file.
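As an illustration only (the actual variable names inside shapToHeatMap.py may differ), the selection could look something like:
# Hypothetical selection lists, edited inside shapToHeatMap.py
tools = ["tuscan-classification", "sgRNAScorer2", "wu-crispr", "ssc"]
datasets = ["xu"]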
An example of a heatmap with all the tools run on Xu:
To create spider plots comparing the positional features across all the tools:
python shapToHeatMap.py
This script requires the pickle files to have been generated. For now, the choice of models and datasets is customized via arrays declared inside the file.
An example of a spider plot with all the tools run on Xu:
datasets: scripts for extracting the guide sequences (in tool-specific format) from the original dataset files
results: contains pickle files with saved SHAP values for all the models and tools run
plots: contains saved plots produced throughout this project
src/tooldata.py: interface class for the tool-specific data files, so that they can be run by shapleyvals.py
src/tool-model: contains the files necessary for running the specific tool
src/utils.py: general methods used by the other scripts
- Explain your model with the SHAP values
- TUSCAN paper and code
- sgRNAScorer paper and code
- WU-CRISPR paper and code
- SSC paper and code
- CHOP-CHOP paper and code
For each tool, the parts of its code that construct the model and/or score the gRNA have been extracted and adapted to fit the tooldata interface. To add another tool, its code only needs to be put into the format specified by the interface.
Some of the tools require different package versions (of scikit-learn) to unpickle their files. A warning will inform you if the wrong version is used.
The scripts take guides of length 20 and score them assuming an NGG PAM. All guide positions are standardized to 0-19, where 19 is the position adjacent to the PAM.
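As a small illustration of this convention (the guide sequence below is made up):
guide = "ACGTACGTACGTACGTACGT"  # a 20-nt guide, with an NGG PAM immediately 3' of it
# guide[0] is position 0 (the 5'-most base); guide[19] is position 19, adjacent to the PAM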
- Better interface for the comparison scripts (to allow specifying which tools to include from the command line)