diff --git a/.gitignore b/.gitignore
index e8a744c..0ffac01 100644
--- a/.gitignore
+++ b/.gitignore
@@ -12,6 +12,7 @@ join.log
 *.npy
 *.npz
 *.db
-*.tsv
 __pycache__
 round1
+*.pyc
+
diff --git a/README.md b/README.md
index cc17590..1301c17 100644
--- a/README.md
+++ b/README.md
@@ -1,15 +1,25 @@
-# ENCODE Imputation Challenge Scoring and Validation Scripts
+# ENCODE Imputation Challenge Scoring/Ranking and Validation Scripts

 ## Installation

-1) [Install Conda](https://docs.conda.io/en/latest/miniconda.html) first.
+1) [Install Conda 4.6.14](https://docs.conda.io/en/latest/miniconda.html) first. Answer `yes` to all Y/N questions. Use default installation paths. Re-login after installation.
+    ```bash
+    $ wget https://repo.anaconda.com/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh
+    $ bash Miniconda3-4.6.14-Linux-x86_64.sh
+    ```

 2) Install `numpy`, `scikit-learn` and `pyBigWig`.
     ```bash
-    $ conda install -c bioconda numpy scikit-learn pyBigWig sqlite scipy
+    $ conda install -y -c bioconda numpy scikit-learn pyBigWig sqlite scipy
     ```

-## Example (hg38)
+## Validating a submission
+
+```bash
+$ python validate.py [YOUR_SUBMISSION_BIGWIG]
+```
+
+## Scoring a submission

 1) Download ENCFF622DXZ and ENCFF074VQD from ENCODE portal.
     ```bash
@@ -18,59 +28,41 @@ $ wget https://www.encodeproject.org/files/ENCFF074VQD/@@download/ENCFF074VQD.bigWig
     ```

-2) Convert it to numpy array.
+2) Convert them to numpy arrays. This speeds up scoring multiple submissions; `score.py` can also take bigwigs, so you can skip this step.
     ```bash
-    $ python build_npy_from_bigwig.py test/hg38/ENCFF622DXZ.bigWig
-    $ python build_npy_from_bigwig.py test/hg38/ENCFF074VQD.bigWig
+    $ python bw_to_npy.py test/hg38/ENCFF622DXZ.bigWig
+    $ python bw_to_npy.py test/hg38/ENCFF074VQD.bigWig
     ```

 3) Run it. If you score without a variance `.npy` file specified as `--var-npy`, then `msevar` metric will be `0.0`.
     ```bash
-    $ python score.py test/hg38/ENCFF622DXZ.npy test/hg38/ENCFF074VQD.npy \
-        --chrom chr20 --out-file test/hg38/ENCFF622DXZ.ENCFF074VQD.score.txt
-    ```
-
-4) Output looks like: (header: bootstrap_index, mse, mse1obs, mse1imp, gwcorr, match1, catch1obs, catch1imp, aucobs1, aucimp1, mseprom, msegene, mseenh).
-    ```bash
-    bootstrap_-1 20.45688606636623 1730.3503548526915 195.52252657980728 0.01705378703206674 848 3462 2976 0.5852748736100822 0.590682173511888 376.1018309950674 31.24613030186926 94.01719916101615
+    $ python score.py test/hg38/ENCFF622DXZ.npy test/hg38/ENCFF074VQD.npy --chrom chr20
     ```
-
-## Validation for submissions
-
-In order to validate your BIGWIG. Use `validate.py`.
-
-```bash
-$ python validate.py [YOUR_SUBMISSION_BIGWIG]
-```
-
 ## Ranking for submissions

-1) [Generate bootstrap label](#how-to-generate-bootstrap-labels)
-
-2) In order to speed up scoring, convert `TRUTH_BIGWIG` into numpy array/object (binned at `25`). Repeat this for each pair of cell type and assay.
+1) Create a score database.
     ```bash
-    $ python build_npy_from_bigwig.py [TRUTH_BIGWIG] --out-npy-prefix [TRUTH_NPY_PREFIX]
+    $ python db.py [NEW_SCORE_DB_FILE]
     ```

-3) Create a score database.
+2) In order to speed up scoring, convert `TRUTH_BIGWIG` into a numpy array/object (binned at `25`). `--out-npy-prefix [TRUTH_NPY_PREFIX]` is optional. Repeat this for every truth bigwig, i.e. for each pair of cell type and assay.
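A rough sketch of what this 25-bp binning computes, following `bw_to_dict()` in `bw_to_npy.py` further down in this diff (the actual script additionally applies blacklist filtering and patches the last partial bin); treat it as an illustration rather than the exact implementation:

```python
import numpy
import pyBigWig

def bin_bigwig(bw_file, chrom, window_size=25):
    """Average a bigwig signal over fixed-size windows (one value per bin)."""
    bw = pyBigWig.open(bw_file)
    chrom_len = bw.chroms()[chrom]
    num_step = (chrom_len - 1) // window_size + 1

    # read per-base values and pad the vector to a multiple of the window size
    raw = bw.values(chrom, 0, chrom_len, numpy=True)
    padded = numpy.zeros(num_step * window_size)
    padded[:raw.shape[0]] = raw

    # pyBigWig reports uncovered bases as NaN; treat them as zero, then
    # reshape to (num_step, window_size) and average within each window
    binned = numpy.nan_to_num(padded).reshape(-1, window_size).mean(axis=1)
    bw.close()
    return binned

# e.g. chr20_signal = bin_bigwig('test/hg38/ENCFF622DXZ.bigWig', 'chr20')
```

The actual command for this step is shown in the next block.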
    ```bash
-    $ python create_db.py [SCORE_DB_FILE]
+    $ python bw_to_npy.py [TRUTH_BIGWIG] --out-npy-prefix [TRUTH_NPY_PREFIX]
     ```

-4) For each assay type, build a variance `.npy` file, which calculates a variance for each bin for each chromosome across all cell types. Without this variance file, `msevar` will be `0.0`.
-    $ python build_var_npy.py [TRUTH_NPY_CELL1] [TRUTH_NPY_CELL2] ... \
-        --out-npy-prefix var_[ASSAY_OR_MARK_ID]
+3) For each assay type, build a variance `.npy` file, which calculates a variance for each bin for each chromosome across all cell types. Without this variance file, `msevar` will be `0.0`.
+    ```bash
+    $ python build_var_npy.py [TRUTH_NPY_CELL1] [TRUTH_NPY_CELL2] ... --out-npy-prefix var_[ASSAY_OR_MARK_ID]
+    ```

-5) Score each submission with bootstrap labels. `--validated` is only for validated submissions binned at `25`. With this flag turned on, `score.py` will skip interpolation of intervals in a bigwig. For ranking, you need to define all metadata for a submission like `--cell [CELL_ID] --assay [ASSAY_OR_MARK_ID] -t [TEAM_ID_INT] -s [SUBMISSION_ID_INT]`. These values will be written to a database file together with bootstrap scores. Repeat this for each submission (one submission per team for each pair of cell type and assay).
+4) Score each submission. `--validated` is only for a validated bigwig submission binned at `25`. With this flag turned on, `score.py` will skip interpolation of intervals in a bigwig. For ranking, you need to define metadata for a submission like `-t [TEAM_ID_INT] -s [SUBMISSION_ID_INT]`. These values will be written to a database file together with bootstrap scores. Repeat this for each submission (one submission per team for each pair of cell type and assay).
     ```bash
-    $ python score.py [YOUR_VALIDATED_SUBMISSION] [TRUTH_NPY] \
-        --bootstrapped-label-npy [BOOTSTRAP_LABEL_NPY] \
-        --var-npy var_[ASSAY_OR_MARK_ID].npy
-        --out-db-file [SCORE_DB_FILE] \
-        --cell [CELL_ID] --assay [ASSAY_OR_MARK_ID] \
-        -t [TEAM_ID_INT] -s [SUBMISSION_ID_INT] \
-        --validated
+    $ python score.py [YOUR_VALIDATED_SUBMISSION_BIGWIG_OR_NPY] [TRUTH_NPY] \
+        --var-npy var_[ASSAY_OR_MARK_ID].npy \
+        --db-file [SCORE_DB_FILE] \
+        --validated \
+        -t [TEAM_ID_INT] -s [SUBMISSION_ID_INT]
     ```

 5) Calculate ranks based on DB file
@@ -78,18 +70,42 @@ $ python validate.py [YOUR_SUBMISSION_BIGWIG]
     $ python rank.py [SCORE_DB_FILE]
     ```

+## Setting up a leaderboard server (admins only)
-## For challenge admins
+1) Create a server instance on AWS.
-### How to generate bootstrap labels?
+2) Install Synapse client.
+    ```bash
+    $ pip install synapseclient
+    ```
-Download `submission_template.bigwig` from Synapse imputation challenge site. The following command will make 10-fold (default) bootstrap index for each chromosome. Output is a single `.npy` file which have all bootstrap labels for corresponding bootstrap index and chromosomes.
+3) Authenticate yourself on the server.
+    ```bash
+    $ synapse login --remember-me -u [USERNAME] -p [PASSWORD]
+    ```
-```bash
-$ python build_bootstrapped_label.py submission_template.bigwig
-```
+4) Create a score database.
+    ```bash
+    $ python db.py [NEW_SCORE_DB_FILE]
+    ```
-### How to use bootstrapped label?
+5) Run `score_leaderboard.py`. Files in `TRUTH_NPY_DIR` should be named like `CXXMYY.npy`. Files in `VAR_NPY_DIR` should be named like `var_MYY.npy`. Submissions will be downloaded to `SUBMISSION_DOWNLOAD_DIR`.
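For example, with the cell/assay pairs listed in `LEADERBOARD_ROUND_VALID_CELL_ASSAY` in `score_leaderboard.py` below, `TRUTH_NPY_DIR` would contain files such as `C02M22.npy`, `C03M02.npy` and `C04M16.npy`, and `VAR_NPY_DIR` would contain one file per assay/mark such as `var_M02.npy`, `var_M16.npy` and `var_M22.npy`; the exact set depends on the leaderboard round.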
+ ```bash + $ NTH=3 # number of threads to parallelize bootstrap scoring + $ python score_leaderboard.py [EVALUATION_QUEUE_ID] [TRUTH_NPY_DIR] \ + --var-npy-dir [VAR_NPY_DIR] \ + --submission-dir [SUBMISSION_DOWNLOAD_DIR] \ + --send-msg-to-admin \ + --send-msg-to-user \ + --db-file [SCORE_DB_FILE] \ + --nth $NTH \ + --project-id [SYNAPSE_PROJECT_ID] \ + --leaderboard-wiki-id [LEADERBOARD_WIKI_ID] \ + --bootstrap-chrom chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr6,chr7,chr8,chr9,chrX chr1,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr16,chr17,chr18,chr19,chr2,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chrX chr1,chr10,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9 chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr19,chr2,chr20,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX + ``` -Simply run `score.py` with `--bootstrapped-label-npy bootstrapped_label.npy`. + Example: + ```bash + $ python score_leaderboard.py $EVAL_Q_ID /mnt/imputation-challenge/output/score_robust_min_max/validation_data_npys --var-npy-dir /mnt/imputation-challenge/output/score_robust_min_max/var_npys --submission-dir /mnt/imputation-challenge/data/submissions/round2 --db-file $DB --nth $NTH --bootstrap-chrom chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr6,chr7,chr8,chr9,chrX chr1,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr16,chr17,chr18,chr19,chr2,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chrX chr1,chr10,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9 chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr19,chr2,chr20,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX --send-msg-to-admin --send-msg-to-user --team-name-tsv data/team_name_round1.tsv + ``` diff --git a/build_npy_from_bigwig.py b/build_npy_from_bigwig.py deleted file mode 100644 index 630b48e..0000000 --- a/build_npy_from_bigwig.py +++ /dev/null @@ -1,210 +0,0 @@ -#!/usr/bin/env python3 -"""Imputation challenge scoring script -Author: - Jin Lee (leepc12@gmail.com) and Jacob Schreiber (jmschreiber91@gmail.com) -""" 
- -import sys -import os -import argparse -import numpy -import pyBigWig -import logging -from score import read_annotation_bed - -logging.basicConfig( - format='[%(asctime)s %(levelname)s] %(message)s', - stream=sys.stdout) -log = logging.getLogger(__name__) - - -def parse_arguments(): - py_path = os.path.dirname(os.path.realpath(__file__)) - - parser = argparse.ArgumentParser( - prog='ENCODE Imputation Challenge scoring script') - parser.add_argument('bw', - help='Bigwig file or .npy file (for blacklist filtering)') - parser.add_argument('--out-npy-prefix', - help='Output prefix for .npy or .npz') - p_score = parser.add_argument_group( - title='Scoring parameters') - p_score.add_argument('--chrom', nargs='+', - default=['all'], - help='List of chromosomes to be combined to be ' - 'scored. ' - 'Set as "all" (default) to score for all ' - 'chromosomes. ' - '(e.g. "all" or "chr3 chr21") ' - 'It should be "all" to write scores to DB file') - parser.add_argument('--blacklist-file', - default=os.path.join( - py_path, - 'annot/hg38/hg38.blacklist.bed.gz'), - help='Blacklist BED file. Bootstrap label will be ' - 'generated after removing overlapping regions ' - 'defined in this file.') - p_score.add_argument('--window-size', default=25, type=int, - help='Window size for bigwig in bp') - p_score.add_argument('--validated', action='store_true', - help='For validated submissions ' - 'with fixed interval length of 25 and valid ' - 'chromosome lengths. It will skip interpolation') - parser.add_argument('--log-level', default='INFO', - choices=['NOTSET', 'DEBUG', 'INFO', 'WARNING', - 'CRITICAL', 'ERROR', 'CRITICAL'], - help='Log level') - args = parser.parse_args() - - # some submission files have whitespace in path... - args.bw = args.bw.strip("'") - if args.chrom == ['all']: - args.chrom = ['chr' + str(i) for i in range(1, 23)] + ['chrX'] - - log.setLevel(args.log_level) - log.info(sys.argv) - return args - - -def bw_to_dict(bw, chrs, window_size=25, validated=False): - """ - Returns: - { chr: [] } where [] is a numpy 1-dim array - """ - result = {} - for c in chrs: - log_msg = 'Reading chromosome {} from bigwig...'.format(c) - log.info(log_msg) - result_per_chr = [] - chrom_len = bw.chroms()[c] - - num_step = (chrom_len-1)//window_size+1 - - if validated: - all_steps = bw.intervals(c) - assert(num_step==len(all_steps)) - - for step in range(num_step): - start = step*window_size - end = min((step+1)*window_size, chrom_len) - result_per_chr.append(all_steps[step][2]) - else: - # reshape raw vector as (num_step, window_size) - raw = bw.values(c, 0, chrom_len, numpy=True) - reshaped = numpy.zeros((num_step*window_size,)) - reshaped[:raw.shape[0]] = raw - # pyBigWig returns nan for values out of bounds - # convert nan to zero - x = numpy.nan_to_num(reshaped) - y = numpy.reshape(x, (-1, window_size)) - # bin it - # reduce dimension to (num_step, 0) by averaging - # all values in a step - result_per_chr = y.mean(axis=1) - - # special treatment for last step (where the first nan is) - # above averaging method does not work with the end step - # bw.intervals(c)[-1] is the last interval in bigwig - last_step = bw.intervals(c)[-1][1]//window_size - start = last_step*window_size - end = min((last_step+1)*window_size, chrom_len) - stat = bw.stats(c, start, end, exact=True) - if stat[0] is None: - result_per_chr[last_step]=0.0 - else: - result_per_chr[last_step]=stat[0] - - # # too expensive - # for step in range(num_step): - # start = step*window_size - # end = min((step+1)*window_size, chrom_len) - # stat = 
bw.stats(c, start, end, exact=True) - # if stat[0] is None: - # result_per_chr.append(0) - # else: - # result_per_chr.append(stat[0]) - - result[c] = numpy.array(result_per_chr) - return result - - -def blacklist_filter(d, blacklist): - result = {} - for c in d: - result_per_chr = d[c] - - # remove bins overlapping blacklisted region - if blacklist is None: - bfilt_result_per_chr = result_per_chr - else: - bfilt_result_per_chr = [] - for i, val in enumerate(result_per_chr): - if i in blacklist[c]: - continue - else: - bfilt_result_per_chr.append(val) - - result[c] = numpy.array(bfilt_result_per_chr) - return result - - -def get_blacklist_bin_ids(blacklist, chroms, window_size=25): - """ - Returns: - { chrom: [] }: label that overlaps with blackstlisted region - """ - # make empty sets per chr - bins = {} - for c in chroms: - bins[c] = set() - - for line in blacklist: - c, start, end = line.split() - start_bin_id = int(start) // window_size - end_bin_id = int(end) // window_size + 1 - bins[c] |= set(range(start_bin_id, end_bin_id)) - - # convert into list and then numpy array - result = {} - for c in chroms: - result[c] = numpy.array(list(bins[c]), dtype=numpy.int64) - - return result - - -def main(): - # read params - args = parse_arguments() - - if args.bw.lower().endswith(('bw', 'bigwig')): - log.info('Opening bigwig file...') - bw = pyBigWig.open(args.bw) - y_dict = bw_to_dict(bw, args.chrom, args.window_size, - args.validated) - elif args.bw.lower().endswith(('npy', 'npz')): - y_dict = numpy.load(args.bw, allow_pickle=True)[()] - - if args.blacklist_file is None: - bfilt_y_dict = y_dict - else: - log.info('Reading from blacklist bed file...') - blacklist_lines = read_annotation_bed(args.blacklist_file) - blacklist_bin_ids = get_blacklist_bin_ids( - blacklist_lines, args.chrom, args.window_size) - bfilt_y_dict = blacklist_filter(y_dict, blacklist_bin_ids) - - log.info('Writing to npy or npz...') - if args.out_npy_prefix is None: - prefix, _ = os.path.splitext(args.bw) - out_npy_prefix = prefix - else: - out_npy_prefix = args.out_npy_prefix - - numpy.save(out_npy_prefix, bfilt_y_dict) - - log.info('All done') - - -if __name__ == '__main__': - main() - diff --git a/build_var_npy.py b/build_var_npy.py index 85e7866..fc7610a 100644 --- a/build_var_npy.py +++ b/build_var_npy.py @@ -1,25 +1,41 @@ #!/usr/bin/env python3 """Imputation challenge scoring script + Author: - Jin Lee (leepc12@gmail.com) and Jacob Schreiber (jmschreiber91@gmail.com) + Jin Lee (leepc12@gmail.com) """ -import sys -import argparse import numpy import pyBigWig -from score import bw_to_dict -import logging +from score_metrics import normalize_dict +from bw_to_npy import write_dict_to_npy, load_npy +from logger import log + + +def build_var_dict(npys, chroms): + y_all = {} + for c in chroms: + y_all[c] = [] + + for f in npys: + y_dict = load_npy(f) + y_dict_norm = normalize_dict(y_dict, chroms) + + for c in chroms: + y_all[c].append(y_dict_norm[c]) + + var = {} + for c in chroms: + var[c] = numpy.std(numpy.array(y_all[c]), axis=0) ** 2 -logging.basicConfig( - format='[%(asctime)s %(levelname)s] %(message)s', - stream=sys.stdout) -log = logging.getLogger(__name__) + return var def parse_arguments(): + import argparse + parser = argparse.ArgumentParser( - prog='ENCODE Imputation Challenge variance .npy builder') + description='ENCODE Imputation Challenge variance .npy builder') parser.add_argument('npy', nargs='+', help='Binned truth .npy file') parser.add_argument('--out-npy-prefix', required=True, @@ -36,10 +52,6 @@ def 
parse_arguments(): 'It should be "all" to write scores to DB file') p_score.add_argument('--window-size', default=25, type=int, help='Window size for bigwig in bp') - parser.add_argument('--log-level', default='INFO', - choices=['NOTSET', 'DEBUG', 'INFO', 'WARNING', - 'CRITICAL', 'ERROR', 'CRITICAL'], - help='Log level') args = parser.parse_args() # some submission files have whitespace in path... @@ -48,36 +60,17 @@ def parse_arguments(): if args.chrom == ['all']: args.chrom = ['chr' + str(i) for i in range(1, 23)] + ['chrX'] - log.setLevel(args.log_level) - log.info(sys.argv) return args def main(): - # read params args = parse_arguments() - y_all = {} - for c in args.chrom: - y_all[c] = [] - - log.info('Opening npy files...') - for f in args.npy: - log.info('Reading from {}...'.format(f)) - y_dict = numpy.load(f, allow_pickle=True)[()] - for c in args.chrom: - y_all[c].append(y_dict[c]) - - var = {} - for c in args.chrom: - var[c] = numpy.std(numpy.array(y_all[c]), axis=0) ** 2 - - log.info('Writing to npy...') - numpy.save(args.out_npy_prefix, var) + var = build_var_dict(args.npy, args.chrom) + write_dict_to_npy(var, args.out_npy_prefix) log.info('All done') if __name__ == '__main__': main() - diff --git a/bw_to_npy.py b/bw_to_npy.py new file mode 100644 index 0000000..dbe0d53 --- /dev/null +++ b/bw_to_npy.py @@ -0,0 +1,237 @@ +#!/usr/bin/env python3 +"""Imputation challenge bigwig to numpy array converter + +Author: + Jin Lee (leepc12@gmail.com) +""" + +import numpy +import gzip +import pyBigWig +from score_metrics import find_robust_min_max +from logger import log + + +def load_bed(bed): + """Read gzipped/uncompressed BED + """ + log.info('Reading from BED {}...'.format(bed)) + result = [] + if bed.endswith('gz'): + with gzip.open(bed, 'r') as infile: + for line in infile: + result.append(line.decode("ascii")) + else: + with open(bed, 'r') as infile: + for line in infile: + result.append(line) + return result + + +def bw_to_dict(bw_file, chrs, window_size=25, + blacklist_file=None, validated=False): + """ + Build numpy array from bigwig or npy (raw, blacklist unfiltered). + Then blacklist filter it and calculate robust min/max for normalization + + Args: + bw: submission bigwig file (.bigwig, .npy or .npz) + + Returns: + { 'chr1': [], 'chr2': [], ... 
, robust_min: , robust_max: } + where [] is a numpy 1-dim array + """ + if bw_file.lower().endswith(('npy', 'npz')): + return load_npy(bw_file) + + elif bw_file.lower().endswith(('bw', 'bigwig')): + log.info('Opening bigwig file...') + bw = pyBigWig.open(bw_file) + y_dict = {} + for c in chrs: + log_msg = 'Reading chromosome {} from bigwig...'.format(c) + log.info(log_msg) + y_dict_per_chr = [] + chrom_len = bw.chroms()[c] + + num_step = (chrom_len-1)//window_size+1 + + if validated: + all_steps = bw.intervals(c) + assert(num_step==len(all_steps)) + + for step in range(num_step): + start = step*window_size + end = min((step+1)*window_size, chrom_len) + y_dict_per_chr.append(all_steps[step][2]) + else: + # reshape raw vector as (num_step, window_size) + raw = bw.values(c, 0, chrom_len, numpy=True) + reshaped = numpy.zeros((num_step*window_size,)) + reshaped[:raw.shape[0]] = raw + # pyBigWig returns nan for values out of bounds + # convert nan to zero + x = numpy.nan_to_num(reshaped) + y = numpy.reshape(x, (-1, window_size)) + # bin it + # reduce dimension to (num_step, 0) by averaging + # all values in a step + y_dict_per_chr = y.mean(axis=1) + + # special treatment for last step (where the first nan is) + # above averaging method does not work with the end step + # bw.intervals(c)[-1] is the last interval in bigwig + last_step = bw.intervals(c)[-1][1]//window_size + start = last_step*window_size + end = min((last_step+1)*window_size, chrom_len) + stat = bw.stats(c, start, end, exact=True) + if stat[0] is None: + y_dict_per_chr[last_step]=0.0 + else: + y_dict_per_chr[last_step]=stat[0] + + y_dict[c] = numpy.array(y_dict_per_chr) + + def blacklist_filter(d, blacklist): + result = {} + for c in d: + result_per_chr = d[c] + + # remove bins overlapping blacklisted region + if blacklist is None: + bfilt_result_per_chr = result_per_chr + else: + bfilt_result_per_chr = [] + for i, val in enumerate(result_per_chr): + if i in blacklist[c]: + continue + else: + bfilt_result_per_chr.append(val) + + result[c] = numpy.array(bfilt_result_per_chr) + return result + + def get_blacklist_bin_ids(blacklist, chroms, window_size=25): + """ + Returns: + { chrom: [] }: label that overlaps with blackstlisted region + """ + # make empty sets per chr + bins = {} + for c in chroms: + bins[c] = set() + + for line in blacklist: + c, start, end = line.split() + start_bin_id = int(start) // window_size + end_bin_id = int(end) // window_size + 1 + bins[c] |= set(range(start_bin_id, end_bin_id)) + + # convert into list and then numpy array + result = {} + for c in chroms: + result[c] = numpy.array(list(bins[c]), dtype=numpy.int64) + + return result + + if blacklist_file is None: + bfilt_y_dict = y_dict + else: + blacklist_lines = load_bed(blacklist_file) + blacklist_bin_ids = get_blacklist_bin_ids( + blacklist_lines, chrs, window_size) + bfilt_y_dict = blacklist_filter(y_dict, blacklist_bin_ids) + + #bfilt_y_array = dict_to_arr(bfilt_y_dict, chrs) + #robust_min, robust_max = find_robust_min_max(bfilt_y_array) + #bfilt_y_dict['robust_min'] = robust_min + #bfilt_y_dict['robust_max'] = robust_max + + else: + raise NotImplementedError('Unsupported file type') + + return bfilt_y_dict + + +def dict_to_arr(d, chroms): + """Concat vectors in d + """ + result = [] + for c in chroms: + result.extend(d[c]) + return numpy.array(result) + + +def load_npy(npy_file): + return numpy.load(npy_file, allow_pickle=True)[()] + + +def write_dict_to_npy(d, npy_prefix): + log.info('Writing dict to npy or npz...') + return numpy.save(npy_prefix, 
d) + + +def parse_arguments(): + import argparse + import os + py_path = os.path.dirname(os.path.realpath(__file__)) + + parser = argparse.ArgumentParser( + description='ENCODE Imputation Challenge bigwig to npy') + parser.add_argument('bw', + help='Bigwig file or .npy file (for blacklist filtering)') + parser.add_argument('--out-npy-prefix', + help='Output prefix for .npy or .npz') + p_score = parser.add_argument_group( + title='Scoring parameters') + p_score.add_argument('--chrom', nargs='+', + default=['all'], + help='List of chromosomes to be combined to be ' + 'scored. ' + 'Set as "all" (default) to score for all ' + 'chromosomes. ' + '(e.g. "all" or "chr3 chr21") ' + 'It should be "all" to write scores to DB file') + parser.add_argument('--blacklist-file', + default=os.path.join( + py_path, + 'annot/hg38/hg38.blacklist.bed.gz'), + help='Blacklist BED file. Bootstrap label will be ' + 'generated after removing overlapping regions ' + 'defined in this file.') + p_score.add_argument('--window-size', default=25, type=int, + help='Window size for bigwig in bp') + p_score.add_argument('--validated', action='store_true', + help='For validated submissions ' + 'with fixed interval length of 25 and valid ' + 'chromosome lengths. It will skip interpolation') + args = parser.parse_args() + + # some submission files have whitespace in path... + args.bw = args.bw.strip("'") + if args.chrom == ['all']: + args.chrom = ['chr' + str(i) for i in range(1, 23)] + ['chrX'] + + return args + + +def main(): + import os + + # read params + args = parse_arguments() + + bfilt_y_dict = bw_to_dict(args.bw, args.chrom, + args.window_size, args.blacklist_file) + if args.out_npy_prefix is None: + npy_prefix, _ = os.path.splitext(args.bw) + else: + npy_prefix = args.out_npy_prefix + + write_dict_to_npy(bfilt_y_dict, npy_prefix) + + log.info('All done') + + +if __name__ == '__main__': + main() diff --git a/create_db.py b/create_db.py deleted file mode 100644 index 5da2d3b..0000000 --- a/create_db.py +++ /dev/null @@ -1,84 +0,0 @@ -#!/usr/bin/env python3 -"""Imputation challenge SQLite3 DB creation script -Author: - Jin Lee (leepc12@gmail.com) -""" - -import os -import sys -import argparse -import sqlite3 -import logging -from score import DB_TABLE_SCORE, ScoreDBRecord - -logging.basicConfig( - format='[%(asctime)s %(levelname)s] %(message)s', - stream=sys.stdout) -log = logging.getLogger(__name__) - - -SCORE_DB_RECORD_VAR_TYPE = ScoreDBRecord( - submission_id='integer NOT NULL', - team_id='integer NOT NULL', - submission_fname='text NOT NULL', - cell='text NOT NULL', - assay='text NOT NULL', - bootstrap_id='integer NOT NULL', - mse='double NOT NULL', - gwcorr='double NOT NULL', - gwspear='double NOT NULL', - mseprom='double NOT NULL', - msegene='double NOT NULL', - mseenh='double NOT NULL', - msevar='double NOT NULL', - mse1obs='double NOT NULL', - mse1imp='double NOT NULL' -) - - -def parse_arguments(): - import os - py_path = os.path.dirname(os.path.realpath(__file__)) - - parser = argparse.ArgumentParser( - prog='ENCODE Imputation Challenge SQLite3 Database creator.') - parser.add_argument('db_file', - help='DB file.') - parser.add_argument('--log-level', default='INFO', - choices=['NOTSET', 'DEBUG', 'INFO', 'WARNING', - 'CRITICAL', 'ERROR', 'CRITICAL'], - help='Log level') - args = parser.parse_args() - - if os.path.exists(args.db_file): - raise ValueError('DB file already exists.') - - log.setLevel(args.log_level) - log.info(sys.argv) - return args - - -def main(): - # read params - args = parse_arguments() - - 
log.info('Creating database...') - - try: - conn = sqlite3.connect(args.db_file) - c = conn.cursor() - c.execute('CREATE TABLE IF NOT EXISTS {} ({});'.format( - DB_TABLE_SCORE, - ','.join([attr + ' ' + getattr(SCORE_DB_RECORD_VAR_TYPE, attr) - for attr in SCORE_DB_RECORD_VAR_TYPE._fields]))) - except Exception as e: - print(e) - sys.exit(1) - finally: - conn.close() - - log.info('All done.') - - -if __name__ == '__main__': - main() diff --git a/data/team_name_round1.tsv b/data/team_name_round1.tsv new file mode 100644 index 0000000..4624e8a --- /dev/null +++ b/data/team_name_round1.tsv @@ -0,0 +1,21 @@ +0 Avocado_p0 +1 Avocado_p1 +2 Avocado_p2 +3 Avocado_p3 +4 Avocado_p4 +5 Avocado_p5 +6 Avocado_p6 +7 Avocado_p7 +8 Avocado_p8 +9 Avocado_p9 +10 Avocado_p10 +20 Average +1000 BrokenNodes +1010 ENCODE_DCC_Imputes +1020 Hongyang_Li_and_Yuanfang_Guan +1030 KKT-ENCODE-Impute-model_1 +1031 KKT-ENCODE-Impute-model_2 +1040 LiPingChun +1050 MLeipzig +1060 noml + diff --git a/db.py b/db.py new file mode 100644 index 0000000..f3e810d --- /dev/null +++ b/db.py @@ -0,0 +1,168 @@ +#!/usr/bin/env python3 +"""Imputation challenge sqlite3 DB functions + +Author: + Jin Lee (leepc12@gmail.com) +""" + +import sys +import sqlite3 +from collections import namedtuple +from score_metrics import Score +from logger import log + + +ScoreDBRecord = namedtuple( + 'ScoreDBRecord', + ('submission_id', 'team_id', 'submission_fname', 'cell', 'assay', + 'bootstrap_id') + Score._fields +) + +DB_TABLE_SCORE = 'score' +DB_QUERY_INSERT = 'INSERT INTO {table} ({cols}) VALUES ({values});' +#DB_QUERY_GET = 'SELECT * FROM {table} ORDER BY bootstrap_id, submission_id;' +# to select latest submission +DB_QUERY_GET = 'SELECT t.* FROM {table} t \ +INNER JOIN ( \ + SELECT team_id, cell, assay, bootstrap_id, max(submission_id) as MaxSID \ + FROM {table} \ + GROUP BY team_id, cell, assay, bootstrap_id \ +) tm WHERE t.team_id = tm.team_id AND t.cell = tm.cell \ +AND t.assay = tm.assay AND t.bootstrap_id = tm.bootstrap_id \ +AND t.submission_id = tm.MaxSID \ +ORDER BY t.bootstrap_id, t.submission_id;' + +SCORE_DB_RECORD_VAR_TYPE = ScoreDBRecord( + submission_id='integer NOT NULL', + team_id='integer NOT NULL', + submission_fname='text NOT NULL', + cell='text NOT NULL', + assay='text NOT NULL', + bootstrap_id='integer NOT NULL', + mse='double NOT NULL', + gwcorr='double NOT NULL', + gwspear='double NOT NULL', + mseprom='double NOT NULL', + msegene='double NOT NULL', + mseenh='double NOT NULL', + msevar='double NOT NULL', + mse1obs='double NOT NULL', + mse1imp='double NOT NULL' +) + + +def write_to_db(score_db_record, db_file): + cols = [] + values = [] + for attr in score_db_record._fields: + cols.append(str(attr)) + val = getattr(score_db_record, attr) + if isinstance(val, str): + val = '"' + val + '"' + else: + val = str(val) + values.append(val) + + query = DB_QUERY_INSERT.format( + table=DB_TABLE_SCORE, cols=','.join(cols), values=','.join(values)) + log.info('SQL query: {}'.format(query)) + while True: + try: + conn = sqlite3.connect(db_file) + c = conn.cursor() + c.execute(query) + c.close() + conn.commit() + conn.close() + except sqlite3.OperationalError as e: + print(e) + conn.close() + time.sleep(1) + continue + else: + break + + +def read_scores_from_db(db_file, chroms): + """Read all rows by matching chromosomes + Args: + chroms: + List of chromosome used for scoring. This will be + converted into a comma-separated string and only + rows with matching "chroms" field will be retrieved. 
+ This is to filter out records scored with different + set of chromosomes. + Returns: + All rows ordered by bootstrap_id and team_id + """ + def score_record_factory(cursor, row): + return ScoreDBRecord(*row) + + query = DB_QUERY_GET.format(table=DB_TABLE_SCORE) #, chroms=valid_chrs_str) + log.info(query) + + while True: + try: + conn = sqlite3.connect(db_file) + conn.row_factory = score_record_factory + c = conn.cursor() + c.execute(query) + result = c.fetchall() + c.close() + conn.close() + except sqlite3.OperationalError as e: + print(e) + conn.close() + time.sleep(1) + continue + else: + break + + #if len(result) == 0: + # print('No records found. ' + # 'Did you forget to specify "--chrom"?') + + return result + + +def create_db(db_file): + log.info('Creating database...') + + try: + conn = sqlite3.connect(db_file) + c = conn.cursor() + c.execute('CREATE TABLE IF NOT EXISTS {} ({});'.format( + DB_TABLE_SCORE, + ','.join([attr + ' ' + getattr(SCORE_DB_RECORD_VAR_TYPE, attr) + for attr in SCORE_DB_RECORD_VAR_TYPE._fields]))) + except Exception as e: + print(e) + sys.exit(1) + finally: + conn.close() + + log.info('All done.') + + +def parse_arguments(): + import argparse + import os + + parser = argparse.ArgumentParser( + description='ENCODE Imputation Challenge SQLite3 Database creator.') + parser.add_argument('db_file', help='DB file.') + args = parser.parse_args() + + if os.path.exists(args.db_file): + raise ValueError('DB file already exists.') + + return args + + +def main(): + args = parse_arguments() + create_db(args.db_file) + + +if __name__ == '__main__': + main() diff --git a/logger.py b/logger.py new file mode 100644 index 0000000..9fab2a6 --- /dev/null +++ b/logger.py @@ -0,0 +1,11 @@ +#!/usr/bin/env python3 + +import sys +import logging +logging.basicConfig( + format='[%(asctime)s %(levelname)s] %(message)s', + stream=sys.stdout) +log = logging.getLogger(__name__) +log.setLevel(logging.INFO) + + diff --git a/rank.py b/rank.py index c7f7a8b..e7cac9c 100644 --- a/rank.py +++ b/rank.py @@ -5,23 +5,19 @@ Jin Lee (leepc12@gmail.com) """ -import sys -import argparse -import sqlite3 import numpy -import time -from collections import namedtuple, defaultdict -from score import DB_TABLE_SCORE, ScoreDBRecord +from collections import namedtuple, defaultdict, OrderedDict from scipy.stats import rankdata -import logging +from score_metrics import RANK_METHOD_FOR_EACH_METRIC +from db import DB_QUERY_GET, read_scores_from_db +from logger import log -logging.basicConfig( - format='[%(asctime)s %(levelname)s] %(message)s', - stream=sys.stdout) -log = logging.getLogger(__name__) + +GlobalScore = namedtuple( + 'GlobalScore', + ['team_id', 'name', 'score_lb', 'score_mean', 'score_ub', 'rank']) -DB_QUERY_GET = 'SELECT * FROM {table} ORDER BY bootstrap_id, submission_id;' CELL_NAME = { 'C02': 'adrenal_gland', 'C20': 'heart_left_ventricle', @@ -75,6 +71,7 @@ 'C14': 'ES-I3', 'C26': 'lower_leg_skin' } + ASSAY_NAME = { 'M01': 'ATAC-seq', 'M02': 'DNase-seq', @@ -113,52 +110,6 @@ 'M35': 'H4K91ac' } -# Ascending (the bigger the better) or descending order for each metric -RANK_METHOD_FOR_EACH_METRIC = { - 'mse': 'DESCENDING', - 'gwcorr': 'ASCENDING', - 'gwspear': 'ASCENDING', - 'mseprom': 'DESCENDING', - 'msegene': 'DESCENDING', - 'mseenh': 'DESCENDING', - 'msevar': 'DESCENDING', - 'mse1obs': 'DESCENDING', - 'mse1imp': 'DESCENDING' -} - -GlobalScore = namedtuple( - 'GlobalScore', - ['team_id', 'name', 'score_lb', 'score_mean', 'score_ub', 'rank']) - - - -TEAM_ID_FOR_ROUND1 = { - 0: 'Avocado_p0', - 1: 
'Avocado_p1', - 2: 'Avocado_p2', - 3: 'Avocado_p3', - 4: 'Avocado_p4', - 5: 'Avocado_p5', - 6: 'Avocado_p6', - 7: 'Avocado_p7', - 8: 'Avocado_p8', - 9: 'Avocado_p9', - 10: 'Avocado_p10', - 20: 'Average', - 1000 : 'BrokenNodes', - 1010 : 'ENCODE_DCC_Imputes', - 1020 : 'Hongyang_Li_and_Yuanfang_Guan', - 1030 : 'KKT-ENCODE-Impute-model_1', - 1031 : 'KKT-ENCODE-Impute-model_2', - 1040 : 'LiPingChun', - 1050 : 'MLeipzig', - 1060 : 'noml' -} - - -def get_team_name(team_id): - return TEAM_ID_FOR_ROUND1[team_id] - def get_cell_name(cell_id): if cell_id in CELL_NAME: @@ -174,48 +125,27 @@ def get_assay_name(assay_id): return str(assay_id) -def score_record_factory(cursor, row): - return ScoreDBRecord(*row) +def get_team_name(syn, team_name_dict, team_id): + if team_name_dict is not None and team_id in team_name_dict and int(team_id)<=100: + return team_name_dict[team_id] + if syn is not None: + team = syn.restGET('/team/{id}'.format(id=team_id)) + if 'name' in team: + return team['name'] + if team_name_dict is not None: + return team_name_dict[team_id] + return team_id -def read_scores_from_db(db_file, chroms): - """Read all rows by matching chromosomes - Args: - chroms: - List of chromosome used for scoring. This will be - converted into a comma-separated string and only - rows with matching "chroms" field will be retrieved. - This is to filter out records scored with different - set of chromosomes. - Returns: - All rows ordered by bootstrap_id and team_id - """ - # valid_chrs_str = ','.join(sorted(chroms)) - query = DB_QUERY_GET.format(table=DB_TABLE_SCORE) #, chroms=valid_chrs_str) - log.info(query) - - while True: - try: - conn = sqlite3.connect(db_file) - conn.row_factory = score_record_factory - c = conn.cursor() - c.execute(query) - result = c.fetchall() - c.close() - conn.close() - except sqlite3.OperationalError as e: - print(e) - conn.close() - time.sleep(1) - continue - else: - break - - if len(result) == 0: - print('No records found. 
' - 'Did you forget to specify "--chrom"?') - - return result +def parse_team_name_tsv(tsv): + team_name_dict = {} + with open(tsv, 'r') as fp: + for line in fp.read().strip('\n').split('\n'): + arr = line.split('\t') + team_id = int(arr[0]) + team_name = arr[1] + team_name_dict[team_id] = team_name + return team_name_dict def calc_combined_ranks(rows, measures_to_use): @@ -244,13 +174,15 @@ def calc_combined_ranks(rows, measures_to_use): return dict(zip(zip(team_ids, submission_ids), ranks)) -def calc_global_ranks(rows, measures_to_use): +def calc_global_ranks(rows, measures_to_use, team_name_dict=None, syn=None): """Calculate global ranks Outputs: Markdown table for ranks """ + markdown_per_cell_assay = defaultdict(OrderedDict) + sample_grpd_results = defaultdict(lambda: defaultdict(list)) all_users = set() for x in rows: @@ -277,13 +209,15 @@ def calc_global_ranks(rows, measures_to_use): for team_id in all_users - obs_users: global_scores[index][team_id].append(0.5) - print('# {} {} ({} {})'.format(cell, get_cell_name(cell), assay, get_assay_name(assay))) - print(' | '.join(('Team', 'name', 'rank'))) - print('|'.join(('----',)*3)) + markdown = '# {} {} ({} {})\n'.format(cell, get_cell_name(cell), assay, get_assay_name(assay)) + markdown += ' | '.join(('Team', 'name', 'rank')) + '\n' + markdown += '|'.join(('----',)*3) + '\n' for (team_id, submission_id), ranks in sorted( user_ranks.items(), key=lambda x: sorted(x[1])[1]): - print('%d | %s | %.2f' % (team_id, get_team_name(team_id), sorted(ranks)[1])) - print() + markdown += '%d | %s | %.2f' % ( + team_id, get_team_name(syn, team_name_dict, team_id), sorted(ranks)[1]) + '\n' + markdown += '\n\n' + markdown_per_cell_assay[cell][assay] = markdown # group the scores by user user_grpd_global_scores = defaultdict(list) @@ -299,52 +233,24 @@ def calc_global_ranks(rows, measures_to_use): for team_id, scores in sorted( user_grpd_global_scores.items(), key=lambda x: sum(x[1])): global_data.append(GlobalScore(*[ - team_id, get_team_name(team_id), + team_id, get_team_name(syn, team_name_dict, team_id), min(scores), sum(scores)/len(scores), max(scores), sorted(user_grpd_global_ranks[team_id])[1] ])) global_data = sorted(global_data, key=lambda x: (x.rank, x.score_mean)) - print() - print('# Overall Results') - print(' | '.join(('Team name', 'rank', 'Lower bound', - 'Mean', 'Upperbound'))) - print('|'.join(('----',)*6)) + markdown_overall = '# Overall Results\n' + markdown_overall += ' | '.join(('Team name', 'rank', 'Lower bound', + 'Mean', 'Upperbound')) + '\n' + markdown_overall += '|'.join(('----',)*6) + '\n' for x in global_data: - print('%s | %.2f | %.2f | %.2f | %.2f' % ( - x.name, x.rank, x.score_lb, x.score_mean, x.score_ub)) + markdown_overall += '%s | %.2f | %.2f | %.2f | %.2f' % ( + x.name, x.rank, x.score_lb, x.score_mean, x.score_ub) + '\n' - return rv, global_data + return rv, global_data, markdown_per_cell_assay, markdown_overall -def parse_arguments(): - parser = argparse.ArgumentParser(prog='ENCODE Imputation Challenge ranking' - 'script.') - parser.add_argument('db_file', - help='SQLite3 DB file with all scores.') - parser.add_argument('--show-score-only', action='store_true', - help='Show score (from DB) only') - parser.add_argument('--chrom', nargs='+', - default=['all'], - help='List of chromosomes to be used for ranking') - parser.add_argument('--measures-to-use', nargs='+', - default=['mse', 'gwcorr', 'gwspear', 'mseprom', - 'msegene', 'mseenh', 'msevar', 'mse1obs', - 'mse1imp'], - help='List of performance measures to be used 
for ranking') - args = parser.parse_args() - - if args.chrom == ['all']: - args.chrom = ['chr' + str(i) for i in range(1, 23)] + ['chrX'] - - args.log_level = 'CRITICAL' - - log.setLevel(args.log_level) - log.info(sys.argv) - return args - - -def show_score(rows): +def show_score(rows, team_name_dict=None): print('\t'.join(['submission_id', 'team', 'cell_id', 'cell', 'assay_id', 'assay', 'bootstraip_id', 'mse', 'gwcorr', 'gwspear', 'mseprom', 'msegene', 'mseenh', 'msevar', 'mse1obs', 'mse1imp'])) @@ -360,7 +266,7 @@ def show_score(rows): mse1imp = x.mse1imp submission_id= x.submission_id - team= get_team_name(x.team_id) + team= get_team_name(None, team_name_dict, x.team_id) cell_id = x.cell cell= get_cell_name(x.cell) assay_id = x.assay @@ -373,6 +279,33 @@ def show_score(rows): msevar, mse1obs, mse1imp]])) +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser( + description='ENCODE Imputation Challenge ranking script.') + parser.add_argument('db_file', + help='SQLite3 DB file with all scores.') + parser.add_argument('--team-name-tsv', + help='TSV file with team_id/team_name (1st col/2nd col).') + parser.add_argument('--show-score-only', action='store_true', + help='Show score (from DB) only') + parser.add_argument('--chrom', nargs='+', + default=['all'], + help='List of chromosomes to be used for ranking') + parser.add_argument('--measures-to-use', nargs='+', + default=['mse', 'gwcorr', 'gwspear', 'mseprom', + 'msegene', 'mseenh', 'msevar', 'mse1obs', + 'mse1imp'], + help='List of performance measures to be used for ranking') + args = parser.parse_args() + + if args.chrom == ['all']: + args.chrom = ['chr' + str(i) for i in range(1, 23)] + ['chrX'] + + return args + + def main(): # read params args = parse_arguments() @@ -380,12 +313,24 @@ def main(): log.info('Reading from DB file...') rows = read_scores_from_db(args.db_file, args.chrom) + if args.team_name_tsv is not None: + team_name_dict = parse_team_name_tsv(args.team_name_tsv) + else: + team_name_dict = None + print(team_name_dict) + if args.show_score_only: log.info('List all scores...') - show_score(rows) + show_score(rows, team_name_dict) else: log.info('Calculate ranks...') - rv, global_data = calc_global_ranks(rows, args.measures_to_use) + rv, global_data, markdown_per_cell_assay, markdown_overall = \ + calc_global_ranks( + rows, args.measures_to_use, team_name_dict) + print(markdown_overall) + for _, markdown_per_assay in markdown_per_cell_assay.items(): + for _, markdown in markdown_per_assay.items(): + print(markdown) log.info('All done.') diff --git a/score.py b/score.py index 9d7b494..39f8d32 100644 --- a/score.py +++ b/score.py @@ -1,208 +1,38 @@ #!/usr/bin/env python3 """Imputation challenge scoring script Author: - Jin Lee (leepc12@gmail.com) and Jacob Schreiber (jmschreiber91@gmail.com) + Jin Lee (leepc12@gmail.com) """ -import sys -import argparse -import numpy -import pyBigWig import os -import gzip -import sqlite3 import time +import re import gc -from collections import namedtuple -from sklearn.metrics import roc_auc_score -from scipy.stats import norm, spearmanr -import logging +from score_metrics import Score, normalize_dict +from score_metrics import mse, mseprom, msegene, mseenh, msevar, mse1obs, mse1imp +from score_metrics import gwcorr, gwspear +from db import write_to_db, ScoreDBRecord +from bw_to_npy import load_bed, load_npy, bw_to_dict, dict_to_arr +from logger import log -logging.basicConfig( - format='[%(asctime)s %(levelname)s] %(message)s', - stream=sys.stdout) -log = 
logging.getLogger(__name__) +RE_PATTERN_SUBMISSION_FNAME = r'^C\d\dM\d\d\..*(bw|bigwig|bigWig|BigWig|npy|npz|np)' -Score = namedtuple( - 'Score', - ('mse', 'gwcorr', 'gwspear', 'mseprom', 'msegene', 'mseenh', - 'msevar', 'mse1obs', 'mse1imp') -) - -ScoreDBRecord = namedtuple( - 'ScoreDBRecord', - ('submission_id', 'team_id', 'submission_fname', 'cell', 'assay', - 'bootstrap_id') + Score._fields -) - -DB_TABLE_SCORE = 'score' -DB_QUERY_INSERT = 'INSERT INTO {table} ({cols}) VALUES ({values});' - - -def mse(y_true, y_pred): - return ((y_true - y_pred) ** 2.).mean() - - -def gwcorr(y_true, y_pred): - return numpy.corrcoef(y_true, y_pred)[0, 1] - - -def gwspear(y_true, y_pred): - return spearmanr(y_true, y_pred)[0] - - -def mseprom(y_true_dict, y_pred_dict, chroms, - gene_annotations, - window_size=25, prom_loc=80): +def parse_submission_filename(bw_file): + """Filename should be CXXMYY.bigwig or CXXMYY.bw """ - Args: - y_true_dict: truth vector per chromosome - { chr: y_true } where y_true is a numpy 1-dim array. - - y_pre_dict: predicted vector per chromosome - { chr: y_pred } where y_pred is a numpy 1-dim array. - """ - sse, n = 0., 0. - - for chrom in chroms: - y_true = y_true_dict[chrom] - y_pred = y_pred_dict[chrom] - - for line in gene_annotations: - chrom_, start, end, _, _, strand = line.split() - start = int(start) // window_size - end = int(end) // window_size + 1 - - # if chrom_ in ('chrX', 'chrY', 'chrM'): - # continue - - if chrom_ != chrom: - continue - - if strand == '+': - sse += ((y_true[start-prom_loc: start] - - y_pred[start-prom_loc: start]) ** 2).sum() - n += y_true[start-prom_loc: start].shape[0] - - else: - sse += ((y_true[end: end+prom_loc] - - y_pred[end: end+prom_loc]) ** 2).sum() - n += y_true[end: end+prom_loc].shape[0] - - return sse / n - - -def msegene(y_true_dict, y_pred_dict, chroms, - gene_annotations, - window_size=25): - sse, n = 0., 0. - - for chrom in chroms: - y_true = y_true_dict[chrom] - y_pred = y_pred_dict[chrom] - - for line in gene_annotations: - chrom_, start, end, _, _, strand = line.split() - start = int(start) // window_size - end = int(end) // window_size + 1 - - # if chrom_ in ('chrX', 'chrY', 'chrM'): - # continue - - if chrom_ != chrom: - continue - sse += ((y_true[start:end] - y_pred[start:end]) ** 2).sum() - n += end - start - - return sse / n - - -def mseenh(y_true_dict, y_pred_dict, chroms, - enh_annotations, - window_size=25): - sse, n = 0., 0. - - for chrom in chroms: - y_true = y_true_dict[chrom] - y_pred = y_pred_dict[chrom] - - for line in enh_annotations: - chrom_, start, end, _, _, _, _, _, _, _, _, _ = line.split() - start = int(start) // window_size - end = int(end) // window_size + 1 - - if chrom_ != chrom: - continue - sse += ((y_true[start:end] - y_pred[start:end]) ** 2).sum() - n += end - start - - return sse / n - - -def msevar(y_true, y_pred, y_all=None, var=None): - """Calculates the MSE weighted by the cross-cell-type variance. - - According to the wiki: Computing this measure involves computing, - for an assay carried out in cell type x and assay type y, a vector of - variance values across all assays of type y. The squared error between - the predicted and true value at each genomic position is multiplied by - this variance (normalized to sum to 1 across all bins) before being - averaged across the genome. 
- - Parameters - ---------- - y_true: numpy.ndarray, shape=(n_positions,) - The true signal - - y_pred: numpy.ndarray, shape=(n_positions,) - The predicted signal - - y_all: numpy.ndarray, shape=(n_celltypes, n_positions) - The true signal from all the cell types to calculate the variance over. - mutually exclusive with var - - var: numpy.ndarray, shape=(n_positions,) - pre-computed var vector - mutually exclusive with y_all - - Returns - ------- - mse: float - The mean-squared error that's weighted at each position by the - variance. - """ - - if var is None and y_all is None: - return 0.0 - if var is None: - var = numpy.std(y_all, axis=0) ** 2 - - return ((y_true - y_pred) ** 2).dot(var)/var.sum() - - -def mse1obs(y_true, y_pred): - n = int(y_true.shape[0] * 0.01) - y_true_sorted = numpy.sort(y_true) - y_true_top1 = y_true_sorted[-n] - idx = y_true >= y_true_top1 - - return mse(y_true[idx], y_pred[idx]) - - -def mse1imp(y_true, y_pred): - n = int(y_true.shape[0] * 0.01) - y_pred_sorted = numpy.sort(y_pred) - y_pred_top1 = y_pred_sorted[-n] - idx = y_pred >= y_pred_top1 - - return mse(y_true[idx], y_pred[idx]) + basename = os.path.basename(bw_file) + if len(re.findall(RE_PATTERN_SUBMISSION_FNAME, basename)) == 0: + raise Exception('Wrong submission filename. It should be CXXMYY.*bigwig') + cell = basename[0:3] + assay = basename[3:6] + return cell, assay def score(y_pred_dict, y_true_dict, chroms, gene_annotations, enh_annotations, window_size=25, prom_loc=80, - y_var_true_dict=None): + y_var_dict=None): """Calculate score Args: @@ -218,85 +48,56 @@ def score(y_pred_dict, y_true_dict, chroms, random seed. """ # concat all chromosomes + y_pred_dict_norm = normalize_dict(y_pred_dict, chroms) + y_true_dict_norm = normalize_dict(y_true_dict, chroms) + y_pred = dict_to_arr(y_pred_dict, chroms) y_true = dict_to_arr(y_true_dict, chroms) - if y_var_true_dict is None: + + y_pred_norm = dict_to_arr(y_pred_dict_norm, chroms) + y_true_norm = dict_to_arr(y_true_dict_norm, chroms) + + if y_var_dict is None: y_var_true = None else: - y_var_true = dict_to_arr(y_var_true_dict, chroms) + y_var_true = dict_to_arr(y_var_dict, chroms) output = Score( - mse=mse(y_true, y_pred), + mse=mse(y_true_norm, y_pred_norm), gwcorr=gwcorr(y_true, y_pred), gwspear=gwspear(y_true, y_pred), - mseprom=mseprom(y_true_dict, y_pred_dict, chroms, + mseprom=mseprom(y_true_dict_norm, y_pred_dict_norm, chroms, gene_annotations, window_size, prom_loc), - msegene=msegene(y_true_dict, y_pred_dict, chroms, + msegene=msegene(y_true_dict_norm, y_pred_dict_norm, chroms, gene_annotations, window_size), - mseenh=mseenh(y_true_dict, y_pred_dict, chroms, + mseenh=mseenh(y_true_dict_norm, y_pred_dict_norm, chroms, enh_annotations, window_size), - msevar=msevar(y_true, y_pred, var=y_var_true), - mse1obs=mse1obs(y_true, y_pred), - mse1imp=mse1imp(y_true, y_pred), + msevar=msevar(y_true_norm, y_pred_norm, var=y_var_true), + mse1obs=mse1obs(y_true_norm, y_pred_norm), + mse1imp=mse1imp(y_true_norm, y_pred_norm), ) return output -def dict_to_arr(d, chroms): - """Concat vectors in d - """ - result = [] - for c in chroms: - result.extend(d[c]) - return numpy.array(result) - - -def write_to_db(score_db_record, db_file): - cols = [] - values = [] - for attr in score_db_record._fields: - cols.append(str(attr)) - val = getattr(score_db_record, attr) - if isinstance(val, str): - val = '"' + val + '"' - else: - val = str(val) - values.append(val) - - query = DB_QUERY_INSERT.format( - table=DB_TABLE_SCORE, cols=','.join(cols), values=','.join(values)) - 
log.info('SQL query: {}'.format(query)) - while True: - try: - conn = sqlite3.connect(db_file) - c = conn.cursor() - c.execute(query) - c.close() - conn.commit() - conn.close() - except sqlite3.OperationalError as e: - print(e) - conn.close() - time.sleep(1) - continue - else: - break - - def parse_arguments(): + import argparse import os + py_path = os.path.dirname(os.path.realpath(__file__)) parser = argparse.ArgumentParser( - prog='ENCODE Imputation Challenge scoring script. .npy or .npz must be built ' + description='ENCODE Imputation Challenge scoring script. .npy or .npz must be built ' 'with a correct --window-size. i.e. containing a value for each bin') - parser.add_argument('npy_pred', - help='Submission .npy file to be scored') - parser.add_argument('npy_true', - help='Truth .npy file') + parser.add_argument('pred_npy_or_bw', + help='Submission .npy or .bigwig file to be scored') + parser.add_argument('true_npy_or_bw', + help='Truth .npy or .bigwig file') + parser.add_argument('--download-submissions-from-syn-eval-queue', action='store_true', + help='Download RECEIVED submissions from Synapse ' + 'evaluation queue.') parser.add_argument('--var-npy', help='Truth .npy file filled with a variance for each bin ' 'instead of a raw signal value. ' @@ -332,71 +133,44 @@ def parse_arguments(): py_path, 'annot/hg38/F5.hg38.enhancers.bed.gz'), help='Enhancer annotations BED file ') + p_score.add_argument('--blacklist-file', + default=os.path.join( + py_path, + 'annot/hg38/hg38.blacklist.bed.gz'), + help='Blacklist BED file. Bootstrap label will be ' + 'generated after removing overlapping regions ' + 'defined in this file.') p_score.add_argument('--window-size', default=25, type=int, help='Window size for bigwig in bp') p_score.add_argument('--prom-loc', default=80, type=int, help='Promoter location in a unit of window size ' '(--window-size). This is not in bp') p_score.add_argument('--validated', action='store_true', - help='For validated submissions (not for truth bigwigs)' + help='For validated submissions (not for truth bigwigs) ' 'with fixed interval length of 25 and valid ' 'chromosome lengths. It will skip interpolation. ' 'For truth bigwigs, it is recommended to convert ' 'them into npy\'s or npz\'s by using ' - 'build_npy_from_bigwig.py') + 'bw_to_npy.py') + #p_score.add_argument('--normalize-with-robust-min-max', action='store_true', + # help='Normalize with robust min max.') p_out = parser.add_argument_group( title='Output to file (TSV or DB)') - p_out.add_argument('--out-file', default='output.tsv', - help='Write scores to TSV file') - p_out.add_argument('--out-db-file', + p_out.add_argument('--db-file', help='Write metadata/scores to SQLite DB file') p_meta = parser.add_argument_group( title='Submission metadata, which will be written to ' 'DB together with scores. This will be used for ' 'ranking later') - p_meta.add_argument('--cell', '-c', - help='Cell type. e.g. C01') - p_meta.add_argument('--assay', '-a', - help='Assay or histone mark. e.g. M01') p_meta.add_argument('--team-id', '-t', type=int, help='Team ID (unique ID from Synapse)') p_meta.add_argument('--submission-id', '-s', type=int, help='Submission ID (unique ID from Synapse)') - parser.add_argument('--log-level', default='INFO', - choices=['NOTSET', 'DEBUG', 'INFO', 'WARNING', - 'CRITICAL', 'ERROR', 'CRITICAL'], - help='Log level') args = parser.parse_args() - # some submission files have whitespace in path... 
- if args.npy_pred is not None: - args.npy_pred = args.npy_pred.strip("'") - if args.npy_true is not None: - args.npy_true = args.npy_true.strip("'") - if args.var_npy is not None: - args.var_npy = args.var_npy.strip("'") - if args.out_file is not None: - args.out_file = args.out_file.strip("'") - - if args.out_db_file is not None: - args.out_db_file = args.out_db_file.strip("'") - if not os.path.exists(args.out_db_file): + if args.db_file is not None: + if not os.path.exists(args.db_file): raise ValueError('DB file does not exists') - # if args.chrom != ['all']: - # raise ValueError( - # 'Define "--chrom all" to write scores to DB file') - if args.cell is None: - raise ValueError( - 'Define "--cell CXX" to write scores to DB file') - if args.assay is None: - raise ValueError( - 'Define "--assay MXX" to write scores to DB file') - if args.team_id is None: - raise ValueError( - 'Define "--team-id" to write scores to DB file') - if args.submission_id is None: - raise ValueError( - 'Define "--submission-id" to write scores to DB file') if args.chrom == ['all']: args.chrom = ['chr' + str(i) for i in range(1, 23)] + ['chrX'] @@ -409,111 +183,53 @@ def parse_arguments(): args.bootstrap_chrom[i] = (i, args.bootstrap_chrom[i].split(',')) print(args.bootstrap_chrom) - log.setLevel(args.log_level) - log.info(sys.argv) return args -def read_annotation_bed(bed): - result = [] - if bed.endswith('gz'): - with gzip.open(bed, 'r') as infile: - for line in infile: - result.append(line.decode("ascii")) - else: - with open(bed, 'r') as infile: - for line in infile: - result.append(line) - return result - - -def del_unused_chroms_from_dict(d, chroms): - """Remove reference to data for unused chroms from dict and - do GC - """ - if d is None: - return - unused_chroms = set(d.keys()) - set(chroms) - for c in unused_chroms: - del d[c] - - gc.collect() - return - - def main(): - # read params args = parse_arguments() - gc.disable() - - if args.npy_pred.endswith(('.npy', '.npz')): - log.info('Opening prediction numpy array file...') - y_pred_dict = numpy.load(args.npy_pred, allow_pickle=True)[()] - elif args.npy_pred.lower().endswith(('.bw','.bigwig')): - log.info('Opening prediction bigwig file...') - bw_pred = pyBigWig.open(args.npy_pred) - y_pred_dict = bw_to_dict(bw_pred, args.chrom, args.window_size, - args.validated) - else: - raise NotImplementedError('Unsupported file type') - - # free unused chroms from memory (GC) - del_unused_chroms_from_dict(y_pred_dict, args.chrom) - - if args.npy_true.endswith(('.npy', '.npz')): - log.info('Opening truth numpy array file...') - y_true_dict = numpy.load(args.npy_true, allow_pickle=True)[()] - elif args.npy_true.lower().endswith(('.bw','.bigwig')): - log.info('Opening truth bigwig file...') - bw_true = pyBigWig.open(args.npy_true) - y_true_dict = bw_to_dict(bw_true, args.chrom, args.window_size) - else: - raise NotImplementedError('Unsupported file type') - - del_unused_chroms_from_dict(y_true_dict, args.chrom) - + y_pred_dict = bw_to_dict(args.pred_npy_or_bw, args.chrom, + args.window_size, args.blacklist_file) + y_true_dict = bw_to_dict(args.true_npy_or_bw, args.chrom, + args.window_size, args.blacklist_file) if args.var_npy is None: - y_var_true_dict = None + y_var_dict = None elif args.var_npy.endswith(('.npy', '.npz')): log.info('Opening truth var numpy array file...') - y_var_true_dict = numpy.load(args.var_npy, allow_pickle=True)[()] + y_var_dict = load_npy(args.var_npy) else: raise ValueError('Var true file should be a binned .npy or .npz.') - 
del_unused_chroms_from_dict(y_var_true_dict, args.chrom) - - log.info('Reading from enh_annotations...') - enh_annotations = read_annotation_bed(args.enh_annotations) - - log.info('Reading from gene_annotations...') - gene_annotations = read_annotation_bed(args.gene_annotations) - - with open(args.out_file, 'w') as fp: - for k, bootstrap_chrom in args.bootstrap_chrom: - log.info('Calculating score for bootstrap {} case...'.format(k)) - - score_output = score(y_pred_dict, y_true_dict, bootstrap_chrom, - gene_annotations, enh_annotations, - args.window_size, args.prom_loc, - y_var_true_dict) - # write to TSV - s = "\t".join(['bootstrap_'+str(k)]+[str(o) for o in score_output]) - fp.write(s+'\n') - print(s) - - # write to DB - if args.out_db_file is not None: - score_db_record = ScoreDBRecord( - args.submission_id, - args.team_id, - os.path.basename(args.npy_pred), - args.cell, - args.assay, - k, - *score_output) - write_to_db(score_db_record, args.out_db_file) - gc.collect() + enh_annotations = load_bed(args.enh_annotations) + gene_annotations = load_bed(args.gene_annotations) + + gc.disable() + + cell, assay = parse_submission_filename(args.pred_npy_or_bw) + + for k, bootstrap_chrom in args.bootstrap_chrom: + log.info('Calculating score for bootstrap {} case...'.format(k)) + + score_output = score(y_pred_dict, y_true_dict, bootstrap_chrom, + gene_annotations, enh_annotations, + args.window_size, args.prom_loc, + y_var_dict) + s = "\t".join(['bootstrap_'+str(k)]+[str(o) for o in score_output]) + print(s) + + # write to DB + if args.db_file is not None: + score_db_record = ScoreDBRecord( + args.submission_id, + args.team_id, + os.path.basename(bw), + cell, + assay, + k, + *score_output) + write_to_db(score_db_record, args.db_file) + gc.collect() log.info('All done') diff --git a/score_leaderboard.py b/score_leaderboard.py new file mode 100644 index 0000000..1133c17 --- /dev/null +++ b/score_leaderboard.py @@ -0,0 +1,515 @@ +#!/usr/bin/env python3 +"""Imputation challenge scoring script for leaderboard +Author: + Jin Lee (leepc12@gmail.com) +""" + +import os +import re +import time +import shutil +#import gc +import traceback +import synapseclient +import multiprocessing +from bw_to_npy import load_bed, load_npy, bw_to_dict +from score import parse_submission_filename, score +from score_metrics import Score +from rank import calc_global_ranks, get_cell_name, get_assay_name, get_team_name, parse_team_name_tsv +from db import write_to_db, ScoreDBRecord, DB_QUERY_GET, read_scores_from_db +from io import StringIO +from logger import log + + +BIG_INT = 99999999 # for multiprocessing + + +LEADERBOARD_ROUND_VALID_CELL_ASSAY = [ +'C02M22', +'C03M02', +'C04M16', +'C09M20', +'C10M17', +'C12M16', +'C12M32', +'C13M20', +'C16M17', +'C17M04', +'C17M19', +'C17M29', +'C17M32', +'C18M21', +'C18M25', +'C20M22', +'C23M03', +'C23M07', +'C23M26', +'C23M34', +'C24M17', +'C24M25', +'C25M21', +'C25M26', +'C27M03', +'C27M13', +'C27M24', +'C27M26', +'C29M29', +'C31M25', +'C32M08', +'C32M12', +'C32M20', +'C34M02', +'C34M32', +'C36M18', +'C37M29', +'C45M22', +'C46M10', +'C46M18', +'C46M21', +'C46M35', +'C47M18', +'C48M16', +'C50M02' +] + +def is_valid_leaderboard_cell_assay(cell, assay): + return (cell + assay) in LEADERBOARD_ROUND_VALID_CELL_ASSAY + + +def mkdir_p(path): + import errno + import os + + try: + os.makedirs(path) + except OSError as exc: # Python >2.5 + if exc.errno == errno.EEXIST and os.path.isdir(path): + pass + else: + raise + + +def send_message(syn, user_ids, subject, message): + if len(user_ids) == 
0: + return None + try: + response = syn.sendMessage( + userIds=user_ids, + messageSubject=subject, + messageBody=message, + contentType="text/html") + log.info("Message sent: {}".format(str(response).encode('utf-8'))) + except Exception as ex0: + log.error("Failed to send message: {}".format(ex0)) + response = None + + return response + + +# markdown supertable to show submission status +WIKI_TEMPLATE_SUBMISSION_STATUS = \ +'${{supertable?path=%2Fevaluation%2Fsubmission%2Fquery%3Fquery%3D\ +select+%2A+from+evaluation_{eval_queue_id}+\ +&paging=true&queryTableResults=true&showIfLoggedInOnly=false&pageSize=100\ +&showRowNumber=false&jsonResultsKeyName=rows\ +&columnConfig0=none%2CID%2CobjectId%3B%2CNONE\ +&columnConfig1=none%2CteamId%2CteamId%3B%2CNONE\ +&columnConfig2=none%2Cteam%2Cteam%3B%2CNONE\ +&columnConfig3=epochdate%2CDate%2CcreatedOn%3B%2CNONE\ +&columnConfig4=none%2Cname%2Cname%3B%2CNONE\ +&columnConfig5=none%2Cstatus%2Cstatus%3B%2CNONE%2C4\ +}}\n\n' + +# markdown supertable to show submission status +WIKI_TEMPLATE_SUBMISSION_SCORE = \ +'${{supertable?path=%2Fevaluation%2Fsubmission%2Fquery%3Fquery%3D\ +select+%2A+from+evaluation_{eval_queue_id}+\ +where%2Bstatus%3D%3D%2522SCORED%2522%2Band%2Bcell%3D%3D%2522{cell}%2522%2B\ +and%2Bassay%3D%3D%2522{assay}%2522%2B\ +&paging=true&queryTableResults=true&showIfLoggedInOnly=false&pageSize=100\ +&showRowNumber=false&jsonResultsKeyName=rows\ +&columnConfig0=none%2CID%2CobjectId%3B%2CNONE\ +&columnConfig1=none%2CteamId%2CteamId%3B%2CNONE\ +&columnConfig2=none%2Cteam%2Cteam%3B%2CNONE\ +&columnConfig3=epochdate%2CDate%2CcreatedOn%3B%2CNONE\ +&columnConfig4=none%2Cmse%2Cmse%3B%2CNONE\ +&columnConfig5=none%2Cgwcorr%2Cgwcorr%3B%2CNONE\ +&columnConfig6=none%2Cgwspear%2Cgwspear%3B%2CNONE\ +&columnConfig7=none%2Cmseprom%2Cmseprom%3B%2CNONE\ +&columnConfig8=none%2Cmsegene%2Cmsegene%3B%2CNONE\ +&columnConfig9=none%2Cmseenh%2Cmseenh%3B%2CNONE\ +&columnConfig10=none%2Cmsevar%2Cmsevar%3B%2CNONE\ +&columnConfig11=none%2Cmse1obs%2Cmse1obs%3B%2CNONE\ +&columnConfig12=none%2Cmse1imp%2Cmse1imp%3B%2CNONE\ +}}\n\n' + +RE_PATTERN_SUBMISSION_FNAME = r'^C\d\dM\d\d.*(bw|bigwig|bigWig|BigWig)' + +def update_wiki(syn, team_name_dict, args): + # calculate ranks and update leaderboard wiki + log.info('Updating wiki...') + rows = read_scores_from_db(args.db_file, args.chrom) + _, _, markdown_per_cell_assay, markdown_overall = calc_global_ranks( + rows, args.measures_to_use, team_name_dict, syn) + + wiki_id_map = { + k.split(':')[0]: k.split(':')[1] for k in args.leaderboard_wiki_id.split(',') + } + for k, wiki_id in wiki_id_map.items(): + w = syn.getWiki(args.project_id, wiki_id) + + if k == 'submission_status': + title = 'Submission status' + markdown = WIKI_TEMPLATE_SUBMISSION_STATUS.format( + eval_queue_id=args.eval_queue_id) + + elif k == 'overall': + title = 'Overall ranks' + markdown = markdown_overall + + elif k.startswith('C'): + title = '{} {}'.format(k, get_cell_name(k)) + markdown = '' + if k in markdown_per_cell_assay: + for assay, m in markdown_per_cell_assay[k].items(): + markdown += m + '\n' + markdown += WIKI_TEMPLATE_SUBMISSION_SCORE.format( + eval_queue_id=args.eval_queue_id, + cell=k, assay=assay) + else: + continue + + w.markdown = markdown + w.title = title + w = syn.store(w) + + return None + +def parse_arguments(): + import argparse + import os + + py_path = os.path.dirname(os.path.realpath(__file__)) + + parser = argparse.ArgumentParser( + description='ENCODE Imputation Challenge leaderboard script. 
')
+    parser.add_argument('eval_queue_id',
+                        help='Synapse evaluation queue ID to retrieve submissions from.')
+    parser.add_argument('true_npy_dir',
+                        help='Directory for truth .npy files. '
+                             'All .npy files will be used for scoring against the submission')
+    parser.add_argument('--var-npy-dir',
+                        help='Directory for var .npy files. '
+                             'All var_MXX.npy files will be used for scoring.')
+    p_score = parser.add_argument_group(
+        title='Scoring parameters')
+    p_score.add_argument('--chrom', nargs='+',
+                         default=['all'],
+                         help='List of chromosomes to be combined and scored. '
+                              'Set as "all" (default) to score all chromosomes '
+                              '(e.g. "all" or "chr3 chr21"). '
+                              'It must be "all" to write scores to a DB file.')
+    p_score.add_argument('--bootstrap-chrom', nargs='*', default=[],
+                         help='Bootstrapped chromosome groups. '
+                              'Delimiter is whitespace for groups and '
+                              'comma(,) in each group. Order is important. '
+                              'e.g. "chr1,chr2 chr1,chrX chr2,chrX" means '
+                              'three groups: (chr1,chr2), (chr1,chrX), (chr2,chrX)')
+    p_score.add_argument('--gene-annotations',
+                         default=os.path.join(
+                             py_path,
+                             'annot/hg38/gencode.v29.genes.gtf.bed.gz'),
+                         help='Gene annotations BED file')
+    p_score.add_argument('--enh-annotations',
+                         default=os.path.join(
+                             py_path,
+                             'annot/hg38/F5.hg38.enhancers.bed.gz'),
+                         help='Enhancer annotations BED file')
+    p_score.add_argument('--blacklist-file',
+                         default=os.path.join(
+                             py_path,
+                             'annot/hg38/hg38.blacklist.bed.gz'),
+                         help='Blacklist BED file. Bootstrap label will be '
+                              'generated after removing overlapping regions '
+                              'defined in this file.')
+    p_score.add_argument('--window-size', default=25, type=int,
+                         help='Window size for bigwig in bp')
+    p_score.add_argument('--prom-loc', default=80, type=int,
+                         help='Promoter location in a unit of window size '
+                              '(--window-size). This is not in bp')
+    p_score.add_argument('--measures-to-use', nargs='+',
+                         default=['mse', 'gwcorr', 'gwspear', 'mseprom',
+                                  'msegene', 'mseenh', 'msevar', 'mse1obs',
+                                  'mse1imp'],
+                         help='List of performance measures to be used for ranking')
+    p_score.add_argument('--validated', action='store_true',
+                         help='For validated submissions '
+                              'with fixed interval length of 25 and valid '
+                              'chromosome lengths. It will skip interpolation')
+    p_score.add_argument('--update-wiki-only', action='store_true',
+                         help='Update wiki based on DB file (--db-file) without '
+                              'scoring submissions')
+    #p_score.add_argument('--normalize-with-robust-min-max', action='store_true',
+    #                     help='Normalize with robust min max.')
+    p_out = parser.add_argument_group(
+        title='Output database file')
+    p_out.add_argument('--db-file',
+                       help='Write metadata/scores to SQLite DB file')
+    p_sys = parser.add_argument_group(
+        title='System and resource settings')
+    p_sys.add_argument('--nth', type=int, default=1,
+                       help='Number of threads to parallelize scoring per submission')
+    p_sys.add_argument('--team-name-tsv',
+                       help='TSV file with team_id/team_name (1st col/2nd col).')
+    p_syn = parser.add_argument_group(
+        title='Communication with synapse')
+    p_syn.add_argument('--dry-run', action='store_true',
+                       help='Do not update submission\'s status on synapse.')
+    p_syn.add_argument('--project-id', default='syn17083203',
+                       help='Synapse project ID.')
+    p_syn.add_argument('--leaderboard-wiki-id',
+                       default='overall:594046,submission_status:594047,'
+                               'C02:594048,C03:594049,C04:594050,C09:594051,C10:594052,'
+                               'C12:594053,C13:594054,C16:594055,C17:594056,C18:594057,'
+                               'C20:594058,C23:594059,C24:594060,C25:594061,C27:594062,'
+                               'C29:594063,C31:594064,C32:594065,C34:594068,C36:594069,'
+                               'C37:594070,C45:594071,C46:594072,C47:594073,C48:594074,'
+                               'C50:594075',
+                       help='Comma-delimited Synapse wiki IDs for the leaderboard. '
+                            'Required items: overall, submission_status, C??. '
+                            'Format example: "overall:594046,'
+                            'submission_status:594047"')
+    p_syn.add_argument('--submission-dir', default='./submissions',
+                       help='Download submissions here.')
+    p_syn.add_argument('--send-msg-to-admin', action='store_true',
+                       help='Send message to admin.')
+    p_syn.add_argument('--send-msg-to-user', action='store_true',
+                       help='Send message to user.')
+    p_syn.add_argument('--period', default=1800, type=int,
+                       help='Time period in seconds to download submissions from synapse '
+                            'and score them')
+    p_syn.add_argument('--admin-id', nargs='+', default=['3345120'],
+                       help='Admin\'s Synapse ID (as string)')
+    args = parser.parse_args()
+
+    if args.db_file is not None:
+        args.db_file = args.db_file.strip("'")
+        if not os.path.exists(args.db_file):
+            raise ValueError('DB file does not exist')
+
+    if args.chrom == ['all']:
+        args.chrom = ['chr' + str(i) for i in range(1, 23)] + ['chrX']
+    args.chrom = sorted(args.chrom)
+
+    if len(args.bootstrap_chrom) == 0:
+        args.bootstrap_chrom = [(-1, args.chrom)]
+    else:
+        for i, _ in enumerate(args.bootstrap_chrom):
+            args.bootstrap_chrom[i] = (i, args.bootstrap_chrom[i].split(','))
+    log.info(args.bootstrap_chrom)
+
+    return args
+
+
+def score_submission(submission, status, args, syn,
+                     gene_annotations, enh_annotations):
+    status['status'] = 'INVALID'
+
+    submission_dir = os.path.join(
+        os.path.abspath(args.submission_dir), submission.id)
+    mkdir_p(submission_dir)
+
+    chosen_score = None  # first bootstrap score
+    metadata = {
+        'id': submission.id,
+        'team': get_team_name(syn, None, submission.teamId)
+    }
+
+    try:
+        log.info('Downloading submission... 
{}'.format(submission.id)) + submission = syn.getSubmission( + submission, + downloadLocation=submission_dir, + ifcollision='overwrite.local' + ) + print() + submission_fname = submission.filePath + cell, assay = parse_submission_filename(submission_fname) + if not is_valid_leaderboard_cell_assay(cell, assay): + raise Exception('Invalid cell/assay combination for ' + 'leaderboard round') + + log.info('Downloading done {}, {}, {}, {}, {}'.format( + submission_fname, submission.id, + submission.teamId, cell, assay)) + + # read pred npy (submission) + log.info('Converting to dict...{}'.format(submission.id)) + y_pred_dict = bw_to_dict(submission_fname, args.chrom, + args.window_size, args.blacklist_file, + args.validated) + #gc.collect() + # read truth npy + npy_true = os.path.join( + args.true_npy_dir, + '{}{}.npy'.format(cell, assay)) + y_true_dict = bw_to_dict(npy_true, args.chrom, + args.window_size, args.blacklist_file) + #gc.collect() + # read var npy + if args.var_npy_dir is not None: + var_npy = os.path.join( + args.var_npy_dir, + 'var_{}.npy'.format(assay)) + y_var_dict = load_npy(var_npy) + else: + y_var_dict = None + #gc.collect() + + score_outputs = [] + for k, bootstrap_chrom in args.bootstrap_chrom: + # score it for each bootstrap chroms + log.info('Scoring... k={}, submission_id={}'.format(k, submission.id)) + r = score(y_pred_dict, y_true_dict, bootstrap_chrom, + gene_annotations, enh_annotations, + args.window_size, args.prom_loc, + y_var_dict) + #gc.collect() # free memory for bootstrapped arrays + log.info('Scored: {}, {}, {}, {}'.format( + submission.id, submission.teamId, k, r)) + score_outputs.append((k, r)) + + # score to be shown on wiki (first bootstrap score) + chosen_score = score_outputs[0][1] + + # write to db and report + for k, score_output in score_outputs: + if not args.dry_run: + score_db_record = ScoreDBRecord( + int(submission.id), + int(submission.teamId), + submission_fname, + cell, + assay, + k, + *score_output) + write_to_db(score_db_record, args.db_file) + # mark is as scored + status['status'] = 'SCORED' + + # free memory + y_pred_dict = None + y_true_dict = None + y_var_dict = None + #gc.collect() + + subject = 'Successfully scored submission %s %s %s:\n' % ( + submission.name, submission.id, submission.teamId) + message = 'Score (bootstrap_idx: score)\n' + message += '\n'.join( + ['{}: {}'.format(k, s) for k, s in score_outputs]) + log.info(subject + message) + + except Exception as ex1: + subject = 'Error scoring submission %s %s %s:\n' % ( + submission.name, submission.id, submission.teamId) + st = StringIO() + traceback.print_exc(file=st) + message = st.getvalue() + log.error(subject + message) + + finally: + # remove submissions (both bigwig, npy) to save disk space + shutil.rmtree(submission_dir) + pass + + # send message + users_to_send_msg = [] + if args.send_msg_to_user: + users_to_send_msg.append(submission.userId) + if args.send_msg_to_admin: + users_to_send_msg.extend(args.admin_id) + send_message(syn, users_to_send_msg, subject, message) + + if not args.dry_run: + # update metadata with score + if chosen_score is not None: + for field in Score._fields: + metadata[field] = getattr(chosen_score, field) + metadata['cell'] = cell + metadata['assay'] = assay + + status['annotations'] = synapseclient.annotations.to_submission_status_annotations( + metadata, is_private=False) + status = syn.store(status) + + return status + + +def main(): + log.info('Parsing arguments...') + + args = parse_arguments() + + if args.team_name_tsv is not None: + 
team_name_dict = parse_team_name_tsv(args.team_name_tsv) + else: + team_name_dict = None + print(team_name_dict) + + enh_annotations = load_bed(args.enh_annotations) + gene_annotations = load_bed(args.gene_annotations) + + # do GC manually + #gc.disable() + + syn = synapseclient.login() + t0 = time.perf_counter() + + while True: + try: + if not args.update_wiki_only: + evaluation = syn.getEvaluation(args.eval_queue_id) + + # init multiprocessing + pool = multiprocessing.Pool(args.nth) + + # distribute jobs + ret_vals = [] + for submission, status in syn.getSubmissionBundles(evaluation, status='RECEIVED'): + ret_vals.append( + pool.apply_async(score_submission, + (submission, status, args, syn, + gene_annotations, enh_annotations))) + # gather + for r in ret_vals: + r.get(BIG_INT) + + pool.close() + pool.join() + + update_wiki(syn, team_name_dict, args) + + except Exception as ex1: + st = StringIO() + traceback.print_exc(file=st) + message = st.getvalue() + + subject = 'Server error:' + if args.send_msg_to_admin: + send_message(syn, args.admin_id, subject, message) + log.error(message) + + log.info('Waiting for new submissions...') + while time.perf_counter() - t0 < args.period: + time.sleep(60) + t0 = time.perf_counter() + + log.info('All done') + + +if __name__ == '__main__': + main() + diff --git a/score_metrics.py b/score_metrics.py new file mode 100644 index 0000000..5dcb343 --- /dev/null +++ b/score_metrics.py @@ -0,0 +1,218 @@ +#!/usr/bin/env python3 +"""Imputation challenge metrics + +Author: + Jacob Schreiber (jmschreiber91@gmail.com) + Jin Lee (leepc12@gmail.com) +""" + +import numpy +from collections import namedtuple +from sklearn.metrics import roc_auc_score +from scipy.stats import norm, spearmanr +from logger import log + + +Score = namedtuple( + 'Score', + ('mse', 'gwcorr', 'gwspear', 'mseprom', 'msegene', 'mseenh', + 'msevar', 'mse1obs', 'mse1imp') +) + +# Ascending (the bigger the better) or descending order for each metric +RANK_METHOD_FOR_EACH_METRIC = { + 'mse': 'DESCENDING', + 'gwcorr': 'ASCENDING', + 'gwspear': 'ASCENDING', + 'mseprom': 'DESCENDING', + 'msegene': 'DESCENDING', + 'mseenh': 'DESCENDING', + 'msevar': 'DESCENDING', + 'mse1obs': 'DESCENDING', + 'mse1imp': 'DESCENDING' +} + + +def mse(y_true, y_pred): + return ((y_true - y_pred) ** 2.).mean() + + +def gwcorr(y_true, y_pred): + return numpy.corrcoef(y_true, y_pred)[0, 1] + + +def gwspear(y_true, y_pred): + return spearmanr(y_true, y_pred)[0] + + +def mseprom(y_true_dict, y_pred_dict, chroms, + gene_annotations, + window_size=25, prom_loc=80): + """ + Args: + y_true_dict: truth vector per chromosome + { chr: y_true } where y_true is a numpy 1-dim array. + + y_pre_dict: predicted vector per chromosome + { chr: y_pred } where y_pred is a numpy 1-dim array. + """ + sse, n = 0., 0. 
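+    # Accumulate squared error over promoter-proximal bins only:
+    # for each gene annotation on this chromosome, take `prom_loc` windows
+    # upstream of the TSS (before `start` for + strand genes, after `end`
+    # for - strand genes) and add up (y_true - y_pred)^2 over those bins.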
+ + for chrom in chroms: + y_true = y_true_dict[chrom] + y_pred = y_pred_dict[chrom] + + for line in gene_annotations: + chrom_, start, end, _, _, strand = line.split() + start = int(start) // window_size + end = int(end) // window_size + 1 + + # if chrom_ in ('chrX', 'chrY', 'chrM'): + # continue + + if chrom_ != chrom: + continue + + if strand == '+': + sse += ((y_true[start-prom_loc: start] - + y_pred[start-prom_loc: start]) ** 2).sum() + n += y_true[start-prom_loc: start].shape[0] + + else: + sse += ((y_true[end: end+prom_loc] - + y_pred[end: end+prom_loc]) ** 2).sum() + n += y_true[end: end+prom_loc].shape[0] + + return sse / n + + +def msegene(y_true_dict, y_pred_dict, chroms, + gene_annotations, + window_size=25): + sse, n = 0., 0. + + for chrom in chroms: + y_true = y_true_dict[chrom] + y_pred = y_pred_dict[chrom] + + for line in gene_annotations: + chrom_, start, end, _, _, strand = line.split() + start = int(start) // window_size + end = int(end) // window_size + 1 + + # if chrom_ in ('chrX', 'chrY', 'chrM'): + # continue + + if chrom_ != chrom: + continue + sse += ((y_true[start:end] - y_pred[start:end]) ** 2).sum() + n += end - start + + return sse / n + + +def mseenh(y_true_dict, y_pred_dict, chroms, + enh_annotations, + window_size=25): + sse, n = 0., 0. + + for chrom in chroms: + y_true = y_true_dict[chrom] + y_pred = y_pred_dict[chrom] + + for line in enh_annotations: + chrom_, start, end, _, _, _, _, _, _, _, _, _ = line.split() + start = int(start) // window_size + end = int(end) // window_size + 1 + + if chrom_ != chrom: + continue + sse += ((y_true[start:end] - y_pred[start:end]) ** 2).sum() + n += end - start + + return sse / n + + +def msevar(y_true, y_pred, y_all=None, var=None): + """Calculates the MSE weighted by the cross-cell-type variance. + + According to the wiki: Computing this measure involves computing, + for an assay carried out in cell type x and assay type y, a vector of + variance values across all assays of type y. The squared error between + the predicted and true value at each genomic position is multiplied by + this variance (normalized to sum to 1 across all bins) before being + averaged across the genome. + + Parameters + ---------- + y_true: numpy.ndarray, shape=(n_positions,) + The true signal + + y_pred: numpy.ndarray, shape=(n_positions,) + The predicted signal + + y_all: numpy.ndarray, shape=(n_celltypes, n_positions) + The true signal from all the cell types to calculate the variance over. + mutually exclusive with var + + var: numpy.ndarray, shape=(n_positions,) + pre-computed var vector + mutually exclusive with y_all + + Returns + ------- + mse: float + The mean-squared error that's weighted at each position by the + variance. 
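+
+    Notes
+    -----
+    With a per-bin variance vector ``var``, the returned value is
+    ``sum(var * (y_true - y_pred)**2) / sum(var)``, i.e. the squared
+    error averaged with variance-proportional weights.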
+ """ + + if var is None and y_all is None: + return 0.0 + if var is None: + var = numpy.std(y_all, axis=0) ** 2 + + return ((y_true - y_pred) ** 2).dot(var)/var.sum() + + +def mse1obs(y_true, y_pred): + n = int(y_true.shape[0] * 0.01) + y_true_sorted = numpy.sort(y_true) + y_true_top1 = y_true_sorted[-n] + idx = y_true >= y_true_top1 + + return mse(y_true[idx], y_pred[idx]) + + +def mse1imp(y_true, y_pred): + n = int(y_true.shape[0] * 0.01) + y_pred_sorted = numpy.sort(y_pred) + y_pred_top1 = y_pred_sorted[-n] + idx = y_pred >= y_pred_top1 + + return mse(y_true[idx], y_pred[idx]) + + +def find_robust_min_max(x, pct_thresh=0.05, top_bottom_bin_range=2000000): + y = x[x > 0] + idxs = numpy.argsort(y) + abs_max = y[idxs[-1]] + abs_min = y[idxs[0]] + robust_max = y[idxs[-int(pct_thresh * top_bottom_bin_range)]] + robust_min = y[idxs[int(pct_thresh * top_bottom_bin_range)]] + log.info('Array length original, non-zero: {}, {}'.format(len(x), len(y))) + log.info('Absolute min, max: {}, {}'.format(abs_min, abs_max)) + log.info('Robust min, max: {}, {}'.format(robust_min, robust_max)) + return robust_min, robust_max + + +def normalize_dict(y_dict, chroms): + #robust_min = y_dict['robust_min'] + #robust_max = y_dict['robust_max'] + #y_dict_norm = {} + #for c in chroms: + # y = numpy.array(y_dict[c]) + # y[y <= robust_min] = robust_min + # y[y >= robust_max] = robust_max + # y_dict_norm[c] = (y - robust_min) / robust_max + #return y_dict_norm + return y_dict diff --git a/validate.py b/validate.py index 954fbd9..4d15008 100644 --- a/validate.py +++ b/validate.py @@ -1,18 +1,15 @@ -# Imputed track evaluations -# Author: Jacob Schreiber, Jin Lee -# Contact: jmschreiber91@gmail.com, leepc12@gmail.com +#!/usr/bin/env python3 +"""Imputation challenge submission bigwig validator + +Author: + Jin Lee (leepc12@gmail.com) +""" -import sys -import argparse import pyBigWig import json -import logging import traceback +from logger import log -logging.basicConfig( - format='[%(asctime)s %(levelname)s] %(message)s', - stream=sys.stdout) -log = logging.getLogger(__name__) # cat hg38.chrom.sizes | grep -P "chr[\dX]" | grep -v _ CHRSZ = { @@ -41,10 +38,10 @@ 'chrX': 156040895 } -WINDOW_SIZE = 25 - +def validate(bw_file, window_size=25): + log.info('Opening bigwig file...') + bw = pyBigWig.open(bw_file.strip("'")) -def validate(bw): try: valid = True # print chrsz @@ -79,7 +76,7 @@ def validate(bw): for start, end, v in bw.intervals(c): if end == s: continue - if end-start != WINDOW_SIZE: + if end-start != window_size: print('Invalid window size for chromosome {}. 
' 'start: {}, end: {}, value: {}'.format( c, start, end, v)) @@ -96,18 +93,14 @@ def validate(bw): return 1 def parse_arguments(): - parser = argparse.ArgumentParser(prog='ENCODE Imputation Challenge' + import argparse + parser = argparse.ArgumentParser(description='ENCODE Imputation Challenge' 'validation script.') parser.add_argument('bw', type=str, help='Bigwig file to be validated.') - parser.add_argument('--log-level', default='INFO', - choices=['NOTSET', 'DEBUG', 'INFO', 'WARNING', - 'CRITICAL', 'ERROR', 'CRITICAL'], - help='Log level') + parser.add_argument('--window-size', default=25, type=int, + help='Window size for bigwig in bp') args = parser.parse_args() - - log.setLevel(args.log_level) - log.info(sys.argv) return args @@ -115,10 +108,7 @@ def main(): # read params args = parse_arguments() - log.info('Opening bigwig file...') - bw = pyBigWig.open(args.bw.strip("'")) - - return validate(bw) + return validate(args.bw, args.window_size) if __name__ == '__main__':