CN_Learn Pipeline Issues #1

Open
eugenegardner opened this issue Jan 3, 2019 · 3 comments

@eugenegardner

Hello Girirajan Lab,

Thank you for the recent bioRxiv manuscript on your method for QCing WES CNVs; the metrics appear to be helpful for generating a good, QC'd set of WES CNVs. I am doing QC on some calls of my own and have a few questions about the code deposited here on GitHub:

  1. What is the major difference between steps 5a/b and step 5? I am having issues running merge_overlapping_CNVs_readdepth.sh due to excessive memory usage by bedtools intersect, and I am wondering whether the intended behavior is to get the coverage at every basepair over a WES probe, or just the mean coverage across that probe. My bp_coverage_dir/*.bpcov files have approximately 80-90M lines per file (see the sketch below).

    • As an aside, the script the README calls generate_bp_coverage.sh is actually named extract_bp_coverage.sh, which is confusing when running the pipeline.
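
For what it's worth, here is the distinction I have in mind, as a sketch with hypothetical file names (probes.bed, sample.bam), assuming a recent bedtools where -a takes the probe intervals:

# Per-base depth at every position of every probe (very large output, roughly one line per bp)
bedtools coverage -a probes.bed -b sample.bam -d > sample.bpcov

# Mean depth per probe (one line per probe, far smaller)
bedtools coverage -a probes.bed -b sample.bam -mean > sample.meancov

The first form matches the ~80-90M line files I am seeing; the second is what I would expect if only a per-probe summary is needed.
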
  2. Throughout the pipeline (e.g. merge_overlapping_CNVs_endjoin.sh, calculate_CNV_overlap.sh), you have ``for sample in `cat ${SAMPLE_LIST} | head -10`;``. As written, this takes only the first 10 samples in the sample list. Is this intended? I would have imagined that all samples need to be processed through the pipeline (see the sketch below).
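
For clarity, here is what I assumed the loop should look like (just a sketch that drops the head call so every sample in the list is processed):

# Process every sample in the list rather than only the first ten
for sample in `cat ${SAMPLE_LIST}`;
do
    # ... per-sample processing as in the original scripts ...
done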

  3. I think the script merge_overlapping_CNVs_endjoin.sh is slightly broken as written. The line

${BEDTOOLS_DIR}intersectBed -wao -a ${PRED_DIR}CONSENSUS_${sample}_${cnv_type}.txt \
                                  -b ${PRED_DIR}${caller}_${sample}_${cnv_type}.txt | cut -f1-6,${col_after_conc_column},$((${col_after_conc_column} + 1)) \
                                  | awk -v OFS='\t' '{if ($8 > 0) print $0;}' >> ${DATA_DIR}CONSENSUS_caller_ov.txt;

when using only three callers (as I am) prints what I imagine is incorrect output, something like:

`7       142099521       142104483       DEL     SC_WES_INT5824323       CONSENSUS       SC_WES_INT5824323       CLAMMS`

rather than what I imagine is the intended output of:

`7       142099521       142104483       DEL     SC_WES_INT5824323       CONSENSUS       CLAMMS  4962`

I assume this is likely due to the code:

last_caller_column=$((6 + ${CALLER_COUNT}))
concordance_column=$((${last_caller_column} + 1))
col_after_conc_column=$((${concordance_column} + 1))

The output from bedtools intersectBed shouldn't change based on the total number of callers used, or am I mistaken?
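
If I am reading the -wao output correctly, the positions of the caller-name and overlap columns are fixed by the widths of the two input files, not by CALLER_COUNT. Something along these lines (only a sketch, assuming both the CONSENSUS and per-caller files are 6-column BED-like files; the a_cols/b_cols variable names are mine) would select the intended columns regardless of how many callers are used:

# Column counts of the A (CONSENSUS) and B (per-caller) inputs
a_cols=$(head -1 ${PRED_DIR}CONSENSUS_${sample}_${cnv_type}.txt | awk '{print NF}')
b_cols=$(head -1 ${PRED_DIR}${caller}_${sample}_${cnv_type}.txt | awk '{print NF}')

# -wao output layout: all A columns, then all B columns, then the bp overlap
caller_name_column=$((a_cols + b_cols))
overlap_column=$((a_cols + b_cols + 1))

${BEDTOOLS_DIR}intersectBed -wao -a ${PRED_DIR}CONSENSUS_${sample}_${cnv_type}.txt \
                                  -b ${PRED_DIR}${caller}_${sample}_${cnv_type}.txt \
    | cut -f1-6,${caller_name_column},${overlap_column} \
    | awk -v OFS='\t' '{if ($8 > 0) print $0;}' >> ${DATA_DIR}CONSENSUS_caller_ov.txt;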

  4. A minor point, since it is easy to fix in each individual use case, but the hard-coding in some places (e.g. pointing to config.params as /data/CN_learn/config.params, and the CpG file path in extract_gc_map_vals.sh) may make it difficult for some users to follow your code. I suppose this depends on how you intend the code base to be used (reproducing your paper, or broader use like my own). A sketch of one alternative is below.
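
As an example of what I mean, the config location could be taken from a command-line argument or an environment variable, falling back to the current default (a sketch; CN_LEARN_CONFIG is a name I made up):

# Allow the config path to be overridden instead of hard-coding it
CONFIG_FILE=${1:-${CN_LEARN_CONFIG:-/data/CN_learn/config.params}}
source ${CONFIG_FILE}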

Thanks in advance for the help!

@eugenegardner
Author

P.S. Another issue in merge_overlapping_CNVs_endjoin.sh is at the bottom:

cut -f6,7 --complement ${DATA_DIR}CONSENSUS_caller_ov_prop.txt > ${DATA_DIR}final_preds.txt
cut -f6 --complement ${DATA_DIR}extn_grouped_nonconc_preds.txt >> ${DATA_DIR}final_preds.txt

This results in a BED file whose rows have differing numbers of columns, which bedtools does not like during the extract_gc_map_vals.sh step. I modified both lines to use cut -f6,7 --complement, but I am unsure whether this will result in unexpected behavior.
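
Concretely, the change I made drops the same columns from both files (with the caveat above that I am not sure this is what was intended, and it only yields uniform rows if the two inputs themselves have the same number of columns):

cut -f6,7 --complement ${DATA_DIR}CONSENSUS_caller_ov_prop.txt > ${DATA_DIR}final_preds.txt
cut -f6,7 --complement ${DATA_DIR}extn_grouped_nonconc_preds.txt >> ${DATA_DIR}final_preds.txt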

@girirajanlab
Owner

Hello Eugene,
Our apologies for the delay in response. Ever since we posted the initial draft of the manuscript on bioRxiv, we have been working on simplifying and testing the pipeline. Specifically, we have simplified the software installation process by providing a Docker image with all the software tools preinstalled. We will address each issue you reported here and respond within the next week. Once you hear back from us, please feel free to download the latest version and try using CN-Learn again. Sorry for the inconvenience. We will keep you posted.

Vijay

@girirajanlab
Owner

Hello Eugene,
We have restructured and rewritten the tool to improve usability. We have also added an option to use a Docker image with all the software tools preinstalled. Each of the four issues you pointed out has been fixed, and the tool itself has been tested thoroughly across several scenarios. You should now be able to clone the repo again and start using the newer version of the tool. If you run into other issues, please feel free to let us know. We appreciate your patience.

Vijay
