dbcreator_encode: K562 bug #93

Open
mdozmorov opened this issue Jul 26, 2015 · 4 comments

@mdozmorov

K562 is the cell line parsed out of the file names, but sometimes it is labeled as K562b or K562E (URLs below). These are deprecated cell line names and refer to the same cell line as K562.

Can we hard-code these exceptions, so the files are downloaded with the original names but processed specially? E.g.,
wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k27me3bUcdPk.narrowPeak.gz - should be K562-H3k27me3b-Sydh
wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EfosUniPk.narrowPeak.gz - should be K562-Fos-Uchicago

All noted bugs:
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k27me3bUcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k4me1UcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k4me3bUcdPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneK562bH3k9acbUcdPk.narrowPeak.gz

hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EfosUniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562Egata2UniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562Ehdac8UniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EjunbUniPk.narrowPeak.gz
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsUchicagoK562EjundUniPk.narrowPeak.gz

lkscara added a commit that referenced this issue Jul 29, 2015
@mdozmorov

At least for the K562E bug, we need to hard-code these particular files. Simply replacing K562E with K562 will break the "K562Ezh2" files, where Ezh2 is a legitimate factor.

So, in the mislabeled files, K562E should become K562 and the factor names should also be capitalized, e.g., Fos.
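
A minimal sketch of what hard-coding these files could look like. This is not the actual dbcreator code: the names `K562E_FILES` and `parse_k562e` are hypothetical, and the regular expression assumes the `K562E<factor>UniPk` pattern seen in the five file names listed in this issue. Any file not in the set, including the legitimate "K562Ezh2" files, returns None and would go through the normal parsing.

```python
import re

# Hypothetical hard-coded set: only the five mislabeled files noted in this issue.
K562E_FILES = {
    "wgEncodeAwgTfbsUchicagoK562EfosUniPk.narrowPeak.gz",
    "wgEncodeAwgTfbsUchicagoK562Egata2UniPk.narrowPeak.gz",
    "wgEncodeAwgTfbsUchicagoK562Ehdac8UniPk.narrowPeak.gz",
    "wgEncodeAwgTfbsUchicagoK562EjunbUniPk.narrowPeak.gz",
    "wgEncodeAwgTfbsUchicagoK562EjundUniPk.narrowPeak.gz",
}


def parse_k562e(file_name):
    """Return (cell, factor) for a known mislabeled file, else None."""
    if file_name not in K562E_FILES:
        return None  # e.g. a "K562Ezh2" file is left to the regular parser
    # Assumed pattern: lowercase factor between "K562E" and "UniPk".
    match = re.search(r"K562E([a-z0-9]+)UniPk", file_name)
    if match is None:
        return None
    # Capitalize the factor, e.g. "fos" becomes "Fos".
    return "K562", match.group(1).capitalize()
```

The file is still downloaded under its original name; only the cell line and factor labels derived from it change.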

@mdozmorov

Regarding K562b: we can rename the files. There are no factors that start with a lowercase "b", but there are some factors starting with a capital "B", e.g., Btf3. So, if the split algorithms in the dbcreator are case-sensitive (to my knowledge, yes), we should be fine renaming K562b to K562.
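
A small sketch of the case-sensitive rename, assuming it is applied to the file name before parsing. The function name `rename_k562b` is hypothetical and not part of dbcreator. Python's `str.replace` is case-sensitive, so a name containing "K562Btf3" is left untouched.

```python
def rename_k562b(file_name):
    """Replace the deprecated 'K562b' label with 'K562' (case-sensitive)."""
    # "K562b" (lowercase b) is rewritten; "K562B..." (capital B) does not match.
    return file_name.replace("K562b", "K562")
```

For example, `rename_k562b("wgEncodeSydhHistoneK562bH3k4me1UcdPk.narrowPeak.gz")` gives `"wgEncodeSydhHistoneK562H3k4me1UcdPk.narrowPeak.gz"`.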

lkscara added a commit that referenced this issue Jul 31, 2015
@mdozmorov

Won't fix. Temporary solution: delete the 'K562b' folders, as they contain the same files as the regular 'K562' folders. The 'K562E' files will be processed into the correct 'K562' folder, but the factor will be named like 'Efos'; this can be corrected manually in the 'gf_description' file.

@mdozmorov

Check for duplicates:

for file in `find grsnp_db/ -type f -name "*.bed.gz"`; do echo `basename $file`; done | sort | uniq | wc -l

should be equal to

for file in `find grsnp_db/ -type f -name "*.bed.gz"`; do echo `basename $file`; done | wc -l

The "K562E" error is fixed. The "K562b" folders should be deleted manually:

  find . -type d -name "K562b" -exec rm -r {} \;

For hg19, 19,776 GFs become 19,771 after removing duplicates.

The duplicates are:

  2 K562-H3k9acb-SydhHistone.bed.gz
  2 K562-H3k4me3b-SydhHistone.bed.gz
  2 K562-H3k4me1-SydhHistone.bed.gz
  2 K562-H3k27me3b-SydhHistone.bed.gz
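
The same duplicate check can be sketched in Python, which also reports which base names are repeated. This is an illustrative helper, not part of the project: the function name `duplicate_basenames` is hypothetical, and the default pattern simply mirrors the `find` commands above.

```python
from collections import Counter
from pathlib import Path


def duplicate_basenames(root, pattern="*.bed.gz"):
    """Return {basename: count} for file names found more than once under root."""
    # Count base names of all matching regular files, searching recursively.
    counts = Counter(p.name for p in Path(root).rglob(pattern) if p.is_file())
    # Keep only the names that occur in more than one folder.
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `duplicate_basenames("grsnp_db/")` returns an empty dictionary when every base name is unique.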

Filter grsnp_db/hg19/gf_descriptions.txt file:

  grep -v "/K562b/" gf_descriptions.txt  > gf_descriptions_nodups.txt
