keras accessibility models for genomic data
See full workflow example here: https://github.com/kundajelab/cardiogenesis
- pysam
- tiledb>=0.5.2
- psutil
- tables
- numpy>=1.9
- keras>=2.4 (for tag 2.5; earlier keras versions can be used with tag<=2.4)
- tensorflow>=2.3 (for tag 2.5; earlier tensorflow versions can be used with tag<=2.4)
- h5py
- pandas
- pybigwig (we have encountered errors with the pip version; the bioconda package is recommended: https://anaconda.org/bioconda/pybigwig)
- deeplift
- abstention
- boto3
To install the module:
pip install -e kerasAC
kerasAC takes tiledb and hdf5 inputs generated by seqdataloader (https://github.com/kundajelab/seqdataloader).
Profile models can be trained on tiledb databases generated by https://github.com/kundajelab/seqdataloader#dbingest
Binned models can be trained from outputs generated by https://github.com/kundajelab/seqdataloader#labelgen or from outputs generated by https://github.com/kundajelab/seqdataloader#dbingest
See the seqdataloader examples (https://github.com/kundajelab/seqdataloader/tree/master/examples) for input generation workflows.
The examples in the examples folder use an ENCODE canonical cell line DNase tiledb dataset, generated by this script: https://github.com/kundajelab/seqdataloader/blob/master/examples/dbingest/run_db_ingest.sh
kerasAC provides the following commands:
Find the counts and profile loss weights for a bpnet model from an input dataset. See the example at https://github.com/kundajelab/kerasAC/blob/master/examples/bpnet_loss_weights/get_loss_weights.sh; it returns the counts loss weight and profile loss weight for a specific task.
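As an illustration only: one common heuristic in BPNet-style tooling sets the counts loss weight proportional to the median total read count over peak regions, so that the mean-squared-error counts term is on a comparable scale to the multinomial profile term. The sketch below shows that heuristic; the file paths and the scaling factor alpha are assumptions, and the exact formula used by get_loss_weights.sh may differ.
# Hedged sketch: counts loss weight ~ alpha * median(total counts per peak).
# peaks.bed, counts_5p.bw, and alpha are illustrative assumptions.
import numpy as np
import pyBigWig

bw = pyBigWig.open("counts_5p.bw")          # 5' count bigwig for the task
totals = []
with open("peaks.bed") as f:                # 3+ column BED of peak regions
    for line in f:
        chrom, start, end = line.split()[:3]
        vals = np.nan_to_num(bw.values(chrom, int(start), int(end), numpy=True))
        totals.append(vals.sum())           # total reads falling inside the peak
bw.close()

counts_loss_weight = 1.0 * np.median(totals)  # alpha = 1.0 assumed
profile_loss_weight = 1.0
print(counts_loss_weight, profile_loss_weight)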
Train a neural network.
Examples of training a binned model from an hdf5 input: https://github.com/kundajelab/kerasAC/tree/master/examples/hdf5_train_examples
Examples of training bpnet on K562 DNase from tiledb input: https://github.com/kundajelab/kerasAC/blob/master/examples/tiledb_train_examples/train.wrapper.sh
When training on a tiledb dataset, tasks are separated by commas, while individual inputs/outputs are separated by spaces; see the example below. This model has 2 tasks (+ and - strands), 3 inputs (sequence, control profile, control counts), and 2 outputs (predicted profile, predicted counts):
CUDA_VISIBLE_DEVICES=$gpu kerasAC_train \
--seed $seed \
--revcomp \
--batch_size 50 \
--ref_fasta /mnt/data/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
--tdb_array /srv/scratch/annashch/encode_dnase_tiledb/db/histone \
--tdb_partition_attribute_for_upsample overlap_peak \
--tdb_partition_thresh_for_upsample 1 \
--tdb_input_source_attribute seq control_count_bigwig_plus_5p,control_count_bigwig_minus_5p control_count_bigwig_plus_5p,control_count_bigwig_minus_5p \
--tdb_input_aggregation None None sum \
--tdb_input_transformation None None log \
--tdb_input_flank 3000 500 500 \
--tdb_output_source_attribute count_bigwig_plus_5p,count_bigwig_minus_5p count_bigwig_plus_5p,count_bigwig_minus_5p \
--tdb_output_flank 500 500 \
--tdb_output_aggregation None sum \
--tdb_output_transformation None log \
--tdb_ambig_attribute ambig_peak \
--tdb_input_min None None None \
--tdb_input_max None None None \
--tdb_output_min None 4.6 \
--tdb_output_max None 11.5 \
--num_inputs 3 \
--num_outputs 2 \
--fold $fold \
--genome hg38 \
--num_train 10000 \
--num_valid 10000 \
--num_tasks 2 \
--upsample_threads 24 \
--threads 0 \
--max_queue_size 20 \
--patience 3 \
--patience_lr 2 \
--model_prefix $outdir/$model_name.$fold \
--architecture_spec profile_bpnet_chipseq \
--model_params params.txt \
--use_multiprocessing False \
--tasks K562 \
--upsample_ratio_list_train 1.0 \
--upsample_ratio_list_eval 1.0 \
--trackables logcount_predictions_loss loss profile_predictions_loss val_logcount_predictions_loss val_loss val_profile_predictions_loss
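For reference, the --tdb_input_aggregation/--tdb_output_aggregation and --tdb_input_transformation/--tdb_output_transformation flags above control how the raw bigwig coverage over each flank window is reduced and transformed before being passed to the model. The sketch below shows the usual reading of the sum aggregation and log transformation (with the pseudocount set by --tdb_transformation_pseudocount); it is an illustration, not the exact kerasAC implementation.
# Hedged sketch of how the aggregation/transformation flags are commonly interpreted;
# not necessarily the exact kerasAC implementation.
import numpy as np

def aggregate(track, how):
    # track: base-resolution coverage over a window of length 2*flank
    if how in (None, "None"):
        return track                        # keep the full profile
    if how == "sum":
        return np.array([track.sum()])      # collapse the window to a single total count
    raise ValueError(how)

def transform(values, how, pseudocount=0.001):
    if how in (None, "None"):
        return values
    if how == "log":
        return np.log(values + pseudocount) # e.g. log total counts for the counts head
    raise ValueError(how)

# Example: counts output = log(sum of 5' counts over the +/-500 bp output window)
window = np.random.poisson(0.5, size=1000).astype(float)
counts_label = transform(aggregate(window, "sum"), "log")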
Running prediction (kerasAC_predict_tdb) generates a single predictions file in hdf5 format.
The file will have keys:
- coords
- lab_0, lab_1, ..., lab_n, where the index {0,1,...,n} corresponds to the loss index (i.e. for bpnet, lab_0 is generally the profile label and lab_1 the count label).
- pred_0, pred_1, ..., pred_n, where the index {0,1,...,n} corresponds to the loss index (i.e. for bpnet, pred_0 is generally the profile prediction and pred_1 the count prediction).
The format for lab_i will be (N, L, T):
- N = number of samples
- L = length of profile (or 1 for counts)
- T = number of tasks (i.e. 2 for ChIP-seq, one per strand; 1 for DNase)
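The predictions file can be inspected with h5py (already in the dependency list). A minimal sketch, assuming the keys described above are stored as plain hdf5 datasets; the file name is illustrative, and if the file was written via pandas/PyTables, use pandas.read_hdf instead.
# Hedged sketch: inspect a kerasAC predictions hdf5 file.
# Assumes the coords / lab_i / pred_i keys described above; the exact on-disk
# layout and the file name are illustrative.
import h5py

with h5py.File("model.0.predictions", "r") as f:
    print(list(f.keys()))          # e.g. ['coords', 'lab_0', 'lab_1', 'pred_0', 'pred_1']
    lab_profile = f["lab_0"][:]    # (N, L, T) profile labels
    pred_profile = f["pred_0"][:]  # (N, L, T) profile predictions
    lab_counts = f["lab_1"][:]     # (N, 1, T) log-count labels
    pred_counts = f["pred_1"][:]   # (N, 1, T) log-count predictions
    print(lab_profile.shape, pred_counts.shape)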
Example prediction script for H3K27ac dataset:
CUDA_VISIBLE_DEVICES=$gpu kerasAC_predict_tdb \
--batch_size 20 \
--ref_fasta /mnt/data/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
--tdb_array /srv/scratch/annashch/encode_dnase_tiledb/db/histone \
--tdb_partition_attribute_for_upsample overlap_peak \
--tdb_partition_thresh_for_upsample 2 \
--tdb_input_source_attribute seq control_count_bigwig_plus_5p,control_count_bigwig_minus_5p control_count_bigwig_plus_5p,control_count_bigwig_minus_5p \
--tdb_input_aggregation None None sum \
--tdb_input_transformation None None log \
--tdb_input_flank 3000 500 500 \
--tdb_output_source_attribute count_bigwig_plus_5p,count_bigwig_minus_5p count_bigwig_plus_5p,count_bigwig_minus_5p \
--tdb_output_flank 500 500 \
--tdb_output_aggregation None sum \
--tdb_output_transformation None log \
--num_inputs 3 \
--num_outputs 2 \
--tdb_ambig_attribute ambig_peak \
--chrom_sizes ~/hg38.chrom.sizes \
--fold $fold \
--genome hg38 \
--upsample_ratio_list_predict 1 \
--predictions_and_labels_hdf5 $outdir/$model_name.$fold \
--load_model_hdf5 $outdir/$model_name.$fold.hdf5 \
--tasks K562 \
--upsample_threads 1 \
--tdb_transformation_pseudocount 0.001
Calibrates the model's predictions using isotonic regression (for a regression model) or Platt scaling (for a classification model).
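As an illustration of these two calibration approaches (not the kerasAC implementation itself), a minimal sketch using scikit-learn on synthetic held-out predictions and labels:
# Hedged sketch of the two calibration approaches described above, using scikit-learn;
# the data below is synthetic and purely illustrative.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Regression model (e.g. predicted log counts): isotonic regression on a held-out set.
val_pred = np.random.normal(5, 1, 1000)
val_true = val_pred + np.random.normal(0, 0.5, 1000)
iso = IsotonicRegression(out_of_bounds="clip").fit(val_pred, val_true)
calibrated_counts = iso.predict(val_pred)

# Classification model (e.g. binned peak probabilities): Platt scaling, i.e. a
# logistic regression fit on the model's logits against the true labels.
val_prob = np.clip(np.random.beta(2, 5, 1000), 1e-6, 1 - 1e-6)
val_label = (np.random.rand(1000) < val_prob).astype(int)
logits = np.log(val_prob / (1 - val_prob)).reshape(-1, 1)
platt = LogisticRegression().fit(logits, val_label)
calibrated_prob = platt.predict_proba(logits)[:, 1]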
Example scoring script (kerasAC_score_bpnet) for the H3K27ac model:
outdir=$1
model_name=$2
for fold in 0 #`seq 0 4`
do
kerasAC_score_bpnet \
--predictions $outdir/$model_name.$fold.predictions \
--outf $outdir/$model_name.$fold.scores \
--title "K562 H3K27ac, counts loss x 25, seed 1234" \
--label_min_to_score 3 \
--label_max_to_score 11.5 \
--num_tasks 2 \
--losses profile counts
done
Example interpretation script (kerasAC_bpnet_shap_wrapper) for computing DeepSHAP importance scores on peak regions:
CUDA_VISIBLE_DEVICES=3 kerasAC_bpnet_shap_wrapper \
--model_hdf5 /srv/scratch/annashch/cardiogenesis/bias_corrected_bpnet/bias_corrected_model.$fold.hdf5 \
--ref_fasta /mnt/data/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
--bed_regions peaks.bed \
--bed_regions_center summit \
--tdb_array /srv/scratch/annashch/cardiogenesis/tiledb/db \
--chrom_sizes hg38.chrom.sizes \
--tasks cf_new \
--batch_size 200 \
--tdb_output_source_attribute count_bigwig_unstranded_5p count_bigwig_unstranded_5p \
--tdb_output_flank 500 500 \
--tdb_output_aggregation None sum \
--tdb_output_transformation None log \
--tdb_input_source_attribute seq \
--tdb_input_flank 673 \
--tdb_input_aggregation None \
--tdb_input_transformation None \
--out_pickle CF_NEW.fold$fold.deepSHAP \
--num_threads 10
kerasAC allows you to load the training profile data using one of two settings. Below is a brief description of each setting and the dataset distribution it generates.
- Setting 1 - Training on the full genome, with some degree of upsampling of positive regions. Example script: https://github.com/kundajelab/kerasAC/blob/master/examples/tiledb_train_examples/train.sh. The relevant arguments are:
--tdb_partition_attribute_for_upsample overlap_peak
--tdb_partition_thresh_for_upsample 1
--upsample_ratio_list_train 0.8
--upsample_ratio_list_eval 0.8
This means that the code finds all genomic coordinates where overlap_peak>=1 (i.e. there is a peak covering that position). When generating a batch of training data, 80% of the batch is sampled from this list of positive positions and 20% is sampled at random from the genome. Typically for BPNET we train only on peaks, so we have been using --upsample_ratio_list_train 1 and --upsample_ratio_list_eval 1: the code then finds all bases covered by a peak and shuffles them to serve as the pool of indices for batch creation. For example, suppose the dataset contains three peaks: Peak 1 is 100 bases wide, Peak 2 is 200 bases wide, and Peak 3 is 300 bases wide. With this setting we first generate 100+200+300=600 candidate positions and then sample 80% of each batch from these candidates (both sampling schemes are sketched in code after the note below).
- Setting 2 - Training on bed regions. Example script: https://github.com/kundajelab/chrombpnet/blob/master/k562_dnase/bpnet/joint_tier1_train/train.sh. The relevant arguments are:
--bed_regions /srv/scratch/annashch/deeplearning/profile/joint_peak_set/dnase/dnase.tier1.overlap.merged.bed
--bed_regions_center random
--bed_regions_jitter 5
This setting overrides Setting 1 whenever it is used. In this workflow, training coordinates are drawn from the regions in the bed file, and the code adds additional jittered regions based on the --bed_regions_jitter parameter. Consider the same example as above, now read in from the provided bed file: Peak 1 is 100 bases wide, Peak 2 is 200 bases wide, and Peak 3 is 300 bases wide. With --bed_regions_jitter 5 we sample 5 positions at random within each peak, generating 15 data points, all of which are used for training (see the sketch after the note below). Refer to the kerasAC/train.py file for the other available options for --bed_regions_center.
NOTE: If your test-data distribution consists only of summit-centered regions, then Setting 1 will introduce a dataset bias in which longer peaks dominate the training data, since every base in a peak is a candidate and wider peaks therefore contribute proportionally more training examples.
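To make the two sampling schemes concrete, here is a minimal sketch of the coordinate selection described above (an illustration under the stated assumptions, not the actual kerasAC generator code):
# Hedged sketch of the two coordinate-sampling schemes described above;
# an illustration, not the actual kerasAC generator code.
import numpy as np

rng = np.random.default_rng(1234)

# Three example peaks (chrom, start, end) with widths 100 / 200 / 300 as in the text.
peaks = [("chr1", 1000, 1100), ("chr1", 5000, 5200), ("chr2", 8000, 8300)]
genome_size = 3_000_000_000  # rough hg38 size, for the "random genome" fraction

# Setting 1: every base covered by a peak is a candidate; each batch draws
# upsample_ratio (e.g. 0.8) of its positions from the candidates and the rest at random.
candidates = [(c, pos) for c, s, e in peaks for pos in range(s, e)]  # 600 positions

def setting1_batch(batch_size=50, upsample_ratio=0.8):
    n_pos = int(round(batch_size * upsample_ratio))
    idx = rng.choice(len(candidates), size=n_pos, replace=True)
    batch = [candidates[i] for i in idx]
    batch += [("random_chrom", int(rng.integers(genome_size))) for _ in range(batch_size - n_pos)]
    return batch

# Setting 2: sample bed_regions_jitter positions at random within each bed region
# (here 5 per peak -> 15 training coordinates) and train only on those.
def setting2_coords(bed_regions, jitter=5):
    return [(c, int(rng.integers(s, e))) for c, s, e in bed_regions for _ in range(jitter)]

print(len(candidates), len(setting1_batch()), len(setting2_coords(peaks)))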