Merge pull request #6 from ENCODE-DCC/dev_normalization_robust_min_max
leaderboard
leepc12 authored Jul 23, 2019
2 parents 90d1d4f + 73ee068 commit a1a8943
Showing 14 changed files with 1,462 additions and 925 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -12,6 +12,7 @@ join.log
*.npy
*.npz
*.db
*.tsv
__pycache__
round1
*.pyc

112 changes: 64 additions & 48 deletions README.md
# ENCODE Imputation Challenge Scoring/Ranking and Validation Scripts

## Installation

1) [Install Conda 4.6.14](https://docs.conda.io/en/latest/miniconda.html) first. Answer `yes` to all Y/N questions. Use default installation paths. Re-login after installation.
```bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh
$ bash Miniconda3-4.6.14-Linux-x86_64.sh
```

2) Install dependencies: `numpy`, `scikit-learn`, `pyBigWig`, `sqlite` and `scipy`.
```bash
$ conda install -y -c bioconda numpy scikit-learn pyBigWig sqlite scipy
```

## Validating a submission

```bash
$ python validate.py [YOUR_SUBMISSION_BIGWIG]
```
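
What exactly `validate.py` checks is not documented in this README. Purely as an illustration of the kind of check implied elsewhere in this document (submissions are bigwigs binned at `25`), a rough sketch could look like the following; the function name and the specific checks are assumptions, not the script's actual behavior.

```python
# Hypothetical sketch only -- not the project's validate.py.
# Verifies that the file opens as a bigwig and that its intervals sit on a
# fixed 25-bp grid (the bin size this README assumes for submissions).
import sys
import pyBigWig

def rough_validate(bw_path, bin_size=25):
    bw = pyBigWig.open(bw_path)
    try:
        if not bw.isBigWig():
            raise ValueError(f"{bw_path} is not a bigWig file")
        for chrom, length in bw.chroms().items():
            for start, end, _value in bw.intervals(chrom) or ():
                # intervals should start on the grid; the last interval of a
                # chromosome is allowed to end at the chromosome end
                if start % bin_size or (end % bin_size and end != length):
                    raise ValueError(
                        f"{chrom}:{start}-{end} is not on a {bin_size}-bp grid")
    finally:
        bw.close()

if __name__ == "__main__":
    rough_validate(sys.argv[1])
    print("rough check passed")
```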

## Scoring a submission

1) Download ENCFF622DXZ and ENCFF074VQD from the ENCODE portal.
```bash
$ wget https://www.encodeproject.org/files/ENCFF622DXZ/@@download/ENCFF622DXZ.bigWig
$ wget https://www.encodeproject.org/files/ENCFF074VQD/@@download/ENCFF074VQD.bigWig
```

2) Convert them to numpy arrays. This speeds up scoring multiple submissions; `score.py` can also take bigwigs directly, so you can skip this step.
```bash
$ python bw_to_npy.py test/hg38/ENCFF622DXZ.bigWig
$ python bw_to_npy.py test/hg38/ENCFF074VQD.bigWig
```
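
For reference, the conversion that `bw_to_npy.py` performs can be approximated with `pyBigWig` as in the sketch below: signal is aggregated into 25-bp bins per chromosome (the exact aggregation used by the script is an assumption here). This is an illustration of the idea, not the actual script; the function name and the saved dict layout are assumptions.

```python
# Illustrative sketch, not the project's bw_to_npy.py: average bigwig signal
# into fixed 25-bp bins and collect one array per chromosome.
import numpy as np
import pyBigWig

def bin_bigwig(bw_path, bin_size=25):
    """Return {chrom: float32 array of mean signal per bin}."""
    bw = pyBigWig.open(bw_path)
    binned = {}
    for chrom, length in bw.chroms().items():
        n_bins = (length + bin_size - 1) // bin_size
        # exact=True averages raw base-pair values instead of zoom levels
        means = bw.stats(chrom, 0, length, type="mean", nBins=n_bins, exact=True)
        binned[chrom] = np.array(
            [m if m is not None else 0.0 for m in means], dtype=np.float32)
    bw.close()
    return binned

# e.g. np.save("test/hg38/ENCFF622DXZ.npy", bin_bigwig("test/hg38/ENCFF622DXZ.bigWig"))
```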

3) Run it. If you score without a variance `.npy` file specified as `--var-npy`, the `msevar` metric will be `0.0`.
```bash
$ python score.py test/hg38/ENCFF622DXZ.npy test/hg38/ENCFF074VQD.npy --chrom chr20
```

4) Output looks like this (columns: bootstrap_index, mse, mse1obs, mse1imp, gwcorr, match1, catch1obs, catch1imp, aucobs1, aucimp1, mseprom, msegene, mseenh).
```bash
bootstrap_-1 20.45688606636623 1730.3503548526915 195.52252657980728 0.01705378703206674 848 3462 2976 0.5852748736100822 0.590682173511888 376.1018309950674 31.24613030186926 94.01719916101615
```
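
The scoring metrics themselves are not defined in this README. As a heavily hedged illustration, the two simplest columns could be computed from a pair of binned arrays roughly as below; reading `mse` as plain mean squared error and `gwcorr` as genome-wide Pearson correlation is an assumption, and the remaining columns are challenge-specific and not sketched here.

```python
# Rough illustration only -- not the project's score.py. `imp` and `obs` are
# 1-D numpy arrays of 25-bp-binned signal (imputed and observed).
import numpy as np
from scipy.stats import pearsonr

def mse(imp, obs):
    # mean squared error over all bins
    return float(np.mean((imp - obs) ** 2))

def gwcorr(imp, obs):
    # genome-wide Pearson correlation over all bins (assumed meaning)
    return float(pearsonr(imp, obs)[0])
```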



## Ranking submissions


1) Create a score database.
```bash
$ python db.py [NEW_SCORE_DB_FILE]
```
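
The score database is a SQLite file (note that `sqlite` is pulled in during installation). A minimal sketch of what creating one could look like is below; the table and column names are assumptions for illustration, not the real schema used by `db.py`.

```python
# Minimal sketch with an assumed schema -- not the project's db.py.
import sqlite3

def create_score_db(db_file):
    conn = sqlite3.connect(db_file)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS score (
            team_id         INTEGER,
            submission_id   INTEGER,
            cell            TEXT,
            assay           TEXT,
            bootstrap_index INTEGER,
            mse             REAL,
            gwcorr          REAL
            -- ... one column per remaining metric in the real schema
        )""")
    conn.commit()
    conn.close()

# create_score_db("scores.db")
```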

2) To speed up scoring, convert each `TRUTH_BIGWIG` into a numpy array/object (binned at `25`). `--out-npy-prefix [TRUTH_NPY_PREFIX]` is optional. Repeat this for every truth bigwig (one per pair of cell type and assay).
```bash
$ python bw_to_npy.py [TRUTH_BIGWIG] --out-npy-prefix [TRUTH_NPY_PREFIX]
```

3) For each assay type, build a variance `.npy` file, which stores the per-bin variance of the truth signal for each chromosome, computed across all cell types. Without this variance file, `msevar` will be `0.0`.
```bash
$ python build_var_npy.py [TRUTH_NPY_CELL1] [TRUTH_NPY_CELL2] ... --out-npy-prefix var_[ASSAY_OR_MARK_ID]
```
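
A hedged sketch of the variance computation described in this step is shown below, assuming each truth `.npy` stores a per-chromosome dict of binned arrays (as in the earlier conversion sketch); this is not the actual `build_var_npy.py`.

```python
# Illustrative only -- not build_var_npy.py. For one assay, compute the
# variance of the binned truth signal at every position across cell types.
import numpy as np

def per_bin_variance(truth_npy_paths):
    """Return {chrom: per-bin variance across the given cell types}."""
    cells = [np.load(p, allow_pickle=True).item() for p in truth_npy_paths]
    return {
        chrom: np.var(np.stack([c[chrom] for c in cells]), axis=0)
        for chrom in cells[0]
    }

# Hypothetical file names following the CXXMYY convention used later on:
# np.save("var_M02.npy", per_bin_variance(["C05M02.npy", "C06M02.npy"]))
```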

4) Score each submission. `--validated` is only for a validated bigwig submission binned at `25`; with this flag turned on, `score.py` skips interpolation of intervals in the bigwig. For ranking, you also need to define metadata for the submission with `-t [TEAM_ID_INT] -s [SUBMISSION_ID_INT]`. These values are written to the database file together with the bootstrap scores. Repeat this for each submission (one submission per team for each pair of cell type and assay).
```bash
$ python score.py [YOUR_VALIDATED_SUBMISSION_BIGWIG_OR_NPY] [TRUTH_NPY] \
--var-npy var_[ASSAY_OR_MARK_ID].npy \
--db-file [SCORE_DB_FILE] \
--validated \
-t [TEAM_ID_INT] -s [SUBMISSION_ID_INT]
```
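
How `score.py` records the metadata and scores is not shown here; purely as an illustration, and reusing the assumed schema from the earlier database sketch, writing one bootstrap score row could look like this.

```python
# Assumed schema from the earlier sketch -- not the real score.py logic.
import sqlite3

def write_score_row(db_file, team_id, submission_id, cell, assay,
                    bootstrap_index, mse, gwcorr):
    conn = sqlite3.connect(db_file)
    conn.execute(
        "INSERT INTO score VALUES (?, ?, ?, ?, ?, ?, ?)",
        (team_id, submission_id, cell, assay, bootstrap_index, mse, gwcorr))
    conn.commit()
    conn.close()
```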

5) Calculate ranks based on the score DB file.
```bash
$ python rank.py [SCORE_DB_FILE]
```
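
The ranking rule itself is not described in this README. One common scheme, assumed here purely for illustration, is to rank teams within each bootstrap index for a metric and then average those ranks; the sketch below is not the real `rank.py`.

```python
# One plausible ranking scheme, assumed for illustration -- not rank.py.
# Average each team's rank for a single metric over all bootstrap indices.
import sqlite3
from collections import defaultdict

def average_ranks(db_file, metric="mse", lower_is_better=True):
    conn = sqlite3.connect(db_file)
    rows = conn.execute(
        f"SELECT team_id, bootstrap_index, {metric} FROM score").fetchall()
    conn.close()

    by_bootstrap = defaultdict(list)          # bootstrap_index -> [(team, value)]
    for team, boot, value in rows:
        by_bootstrap[boot].append((team, value))

    rank_sum, n = defaultdict(float), defaultdict(int)
    for scores in by_bootstrap.values():
        scores.sort(key=lambda tv: tv[1], reverse=not lower_is_better)
        for rank, (team, _value) in enumerate(scores, start=1):
            rank_sum[team] += rank
            n[team] += 1
    return sorted((rank_sum[t] / n[t], t) for t in rank_sum)

# average_ranks("[SCORE_DB_FILE]", metric="mse")
```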

## Setting up a leaderboard server (admins only)

1) Create a server instance on AWS.

2) Install the Synapse client.
```bash
$ pip install synapseclient
```

3) Authenticate yourself on the server.
```bash
$ synapse login --remember-me -u [USERNAME] -p [PASSWORD]
```
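
For context, `score_leaderboard.py` interacts with Synapse through `synapseclient`. A hedged sketch of pulling submissions from an evaluation queue is shown below; the queue ID and download directory are placeholders, and this is not the script's actual logic.

```python
# Illustration of fetching submissions from a Synapse evaluation queue with
# synapseclient -- not the actual score_leaderboard.py.
import synapseclient

syn = synapseclient.Synapse()
syn.login()  # reuses the credentials cached by `synapse login --remember-me`

EVALUATION_QUEUE_ID = "1234567"        # hypothetical [EVALUATION_QUEUE_ID]
DOWNLOAD_DIR = "submissions/round2"    # hypothetical [SUBMISSION_DOWNLOAD_DIR]

for submission in syn.getSubmissions(EVALUATION_QUEUE_ID):
    # download the submitted file along with its metadata
    sub = syn.getSubmission(submission["id"], downloadLocation=DOWNLOAD_DIR)
    print(sub["id"], sub["filePath"])
```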

4) Create a score database.
```bash
$ python db.py [NEW_SCORE_DB_FILE]
```

5) Run `score_leaderboard.py`. Files in `TRUTH_NPY_DIR` should be named like `CXXMYY.npy`, and files in `VAR_NPY_DIR` like `var_MYY.npy`. Submissions will be downloaded to `SUBMISSION_DOWNLOAD_DIR`.
```bash
$ NTH=3 # number of threads to parallelize bootstrap scoring
$ python score_leaderboard.py [EVALUATION_QUEUE_ID] [TRUTH_NPY_DIR] \
--var-npy-dir [VAR_NPY_DIR] \
--submission-dir [SUBMISSION_DOWNLOAD_DIR] \
--send-msg-to-admin \
--send-msg-to-user \
--db-file [SCORE_DB_FILE] \
--nth $NTH \
--project-id [SYNAPSE_PROJECT_ID] \
--leaderboard-wiki-id [LEADERBOARD_WIKI_ID] \
--bootstrap-chrom chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr6,chr7,chr8,chr9,chrX chr1,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr16,chr17,chr18,chr19,chr2,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chrX chr1,chr10,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9 chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr19,chr2,chr20,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX
```

Example:
```bash
$ python score_leaderboard.py $EVAL_Q_ID /mnt/imputation-challenge/output/score_robust_min_max/validation_data_npys --var-npy-dir /mnt/imputation-challenge/output/score_robust_min_max/var_npys --submission-dir /mnt/imputation-challenge/data/submissions/round2 --db-file $DB --nth $NTH --bootstrap-chrom chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr4,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr5,chr6,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr6,chr7,chr8,chr9,chrX chr1,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr7,chr8,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr9,chrX chr1,chr10,chr11,chr12,chr13,chr14,chr16,chr17,chr18,chr19,chr2,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chrX chr1,chr10,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr2,chr20,chr21,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9 chr1,chr10,chr11,chr12,chr13,chr14,chr15,chr19,chr2,chr20,chr22,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chrX --send-msg-to-admin --send-msg-to-user --team-name-tsv data/team_name_round1.tsv
```
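
The `CXXMYY.npy` / `var_MYY.npy` naming convention in step 5 encodes the cell type and assay/mark IDs in the file names. A small hypothetical helper (not part of these scripts, and assuming two-digit IDs) illustrates how such names can be parsed.

```python
# Hypothetical helpers illustrating the CXXMYY / var_MYY naming convention
# from step 5 -- not code from score_leaderboard.py. Two-digit IDs assumed.
import os
import re

def parse_truth_npy(path):
    """'C05M17.npy' -> ('C05', 'M17')."""
    m = re.fullmatch(r"(C\d{2})(M\d{2})\.npy", os.path.basename(path))
    if m is None:
        raise ValueError(f"unexpected truth npy name: {path}")
    return m.group(1), m.group(2)

def parse_var_npy(path):
    """'var_M17.npy' -> 'M17'."""
    m = re.fullmatch(r"var_(M\d{2})\.npy", os.path.basename(path))
    if m is None:
        raise ValueError(f"unexpected var npy name: {path}")
    return m.group(1)
```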

210 changes: 0 additions & 210 deletions build_npy_from_bigwig.py

This file was deleted.

