CN_Learn Pipeline Issues #1

Open
eugenegardner opened this issue Jan 3, 2019 · 3 comments

@eugenegardner

Hello Girirajan Lab,

Thank you for the recent bioRxiv manuscript on your method for QCing WES CNVs; the metrics appear to be helpful for generating a good, QC'd set of WES CNVs. I am doing QC on some calls of my own and have a few questions about the code deposited here on GitHub:

  1. What is the major difference between steps 5a/b and step 5? I am having issues running merge_overlapping_CNVs_readdepth.sh due to excessive memory usage by bedtools intersect, and I am wondering whether the intended behavior is to get the coverage at every basepair over a WES probe, or just the mean coverage across that probe. My bp_coverage_dir/*.bpcov files have approximately 80-90M lines per file (see the sketch below).

    • As an aside, the script the README calls generate_bp_coverage.sh is actually named extract_bp_coverage.sh, which is confusing when running the pipeline.
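
For what it's worth, here is the distinction I have in mind, as a sketch with hypothetical file names (probes.bed, sample.bam), assuming a recent bedtools where -a takes the probe intervals:

# Per-base depth at every position of every probe (very large output, roughly one line per bp)
bedtools coverage -a probes.bed -b sample.bam -d > sample.bpcov

# Mean depth per probe (one line per probe, far smaller)
bedtools coverage -a probes.bed -b sample.bam -mean > sample.meancov

The first form matches the ~80-90M line files I am seeing; the second is what I would expect if only a per-probe summary is needed.
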
  2. Throughout the pipeline (e.g. merge_overlapping_CNVs_endjoin.sh, calculate_CNV_overlap.sh), you have ``for sample in `cat ${SAMPLE_LIST} | head -10`;``. As written, this takes only the first 10 samples in the sample list. Is this intended? I would have imagined that all samples need to be processed through the pipeline (see the sketch below).
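
For clarity, here is what I assumed the loop should look like (just a sketch that drops the head call so every sample in the list is processed):

# Process every sample in the list rather than only the first ten
for sample in `cat ${SAMPLE_LIST}`;
do
    # ... per-sample processing as in the original scripts ...
done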

  3. I think the script merge_overlapping_CNVs_endjoin.sh is slightly broken as written. The line

${BEDTOOLS_DIR}intersectBed -wao -a ${PRED_DIR}CONSENSUS_${sample}_${cnv_type}.txt \
                                  -b ${PRED_DIR}${caller}_${sample}_${cnv_type}.txt | cut -f1-6,${col_after_conc_column},$((${col_after_conc_column} + 1)) \
                                  | awk -v OFS='\t' '{if ($8 > 0) print $0;}' >> ${DATA_DIR}CONSENSUS_caller_ov.txt;

when using only three callers (as I am) prints what I imagine is incorrect output, something like:

`7       142099521       142104483       DEL     SC_WES_INT5824323       CONSENSUS       SC_WES_INT5824323       CLAMMS`

rather than what I imagine is the intended output of:

`7       142099521       142104483       DEL     SC_WES_INT5824323       CONSENSUS       CLAMMS  4962`

I assume this is likely due to the code:

last_caller_column=$((6 + ${CALLER_COUNT}))
concordance_column=$((${last_caller_column} + 1))
col_after_conc_column=$((${concordance_column} + 1))

The output from bedtools intersectBed shouldn't change based on the total number of callers used, or am I mistaken?
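
If I am reading the -wao output correctly, the positions of the caller-name and overlap columns are fixed by the widths of the two input files, not by CALLER_COUNT. Something along these lines (only a sketch, assuming both the CONSENSUS and per-caller files are 6-column BED-like files; the a_cols/b_cols variable names are mine) would select the intended columns regardless of how many callers are used:

# Column counts of the A (CONSENSUS) and B (per-caller) inputs
a_cols=$(head -1 ${PRED_DIR}CONSENSUS_${sample}_${cnv_type}.txt | awk '{print NF}')
b_cols=$(head -1 ${PRED_DIR}${caller}_${sample}_${cnv_type}.txt | awk '{print NF}')

# -wao output layout: all A columns, then all B columns, then the bp overlap
caller_name_column=$((a_cols + b_cols))
overlap_column=$((a_cols + b_cols + 1))

${BEDTOOLS_DIR}intersectBed -wao -a ${PRED_DIR}CONSENSUS_${sample}_${cnv_type}.txt \
                                  -b ${PRED_DIR}${caller}_${sample}_${cnv_type}.txt \
    | cut -f1-6,${caller_name_column},${overlap_column} \
    | awk -v OFS='\t' '{if ($8 > 0) print $0;}' >> ${DATA_DIR}CONSENSUS_caller_ov.txt;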

  4. A minor point, since it is easy to fix in each individual use case, but the hard-coding in some places (e.g. pointing to config.params as /data/CN_learn/config.params, and the CpG file path in extract_gc_map_vals.sh) may make it difficult for some users to follow your code. I suppose this depends on how you intend the code base to be used (reproducing your paper, or broader use like my own). A sketch of one alternative is below.
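
As an example of what I mean, the config location could be taken from a command-line argument or an environment variable, falling back to the current default (a sketch; CN_LEARN_CONFIG is a name I made up):

# Allow the config path to be overridden instead of hard-coding it
CONFIG_FILE=${1:-${CN_LEARN_CONFIG:-/data/CN_learn/config.params}}
source ${CONFIG_FILE}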

Thanks in advance for the help!

@eugenegardner
Author

P.S. Another issue in merge_overlapping_CNVs_endjoin.sh is at the bottom:

cut -f6,7 --complement ${DATA_DIR}CONSENSUS_caller_ov_prop.txt > ${DATA_DIR}final_preds.txt
cut -f6 --complement ${DATA_DIR}extn_grouped_nonconc_preds.txt >> ${DATA_DIR}final_preds.txt

This results in a BED file whose rows have differing numbers of columns, which bedtools does not like during the extract_gc_map_vals.sh step. I modified both lines to use cut -f6,7 --complement, but I am unsure whether this will result in unexpected behavior.
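
Concretely, the change I made drops the same columns from both files (with the caveat above that I am not sure this is what was intended, and it only yields uniform rows if the two inputs themselves have the same number of columns):

cut -f6,7 --complement ${DATA_DIR}CONSENSUS_caller_ov_prop.txt > ${DATA_DIR}final_preds.txt
cut -f6,7 --complement ${DATA_DIR}extn_grouped_nonconc_preds.txt >> ${DATA_DIR}final_preds.txt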

@girirajanlab
Owner

Hello Eugene,
Our apologies for the delay in response. Ever since we posted the initial draft of the manuscript on bioRxiv, we have been working on simplifying and testing the pipeline. Specifically, we have simplified the software installation process by providing a Docker image with all the software tools preinstalled. We will address each issue you reported here and respond within the next week. Once you hear back from us, please feel free to download the latest version and try using CN-Learn again. Sorry for the inconvenience. We will keep you posted.

Vijay

@girirajanlab
Owner

Hello Eugene,
We have restructured and rewritten the tool to improve usability. We have also added an option to use a Docker image with all the software tools preinstalled. Each of the four issues you pointed out has been fixed, and the tool itself has been tested thoroughly across several scenarios. You should now be able to clone the repo again and start using the newer version of the tool. If you run into other issues, please feel free to let us know. We appreciate your patience.

Vijay
