Update parsing of several fields #19

Open
nakib103 opened this issue Oct 11, 2023 · 0 comments
nakib103 commented Oct 11, 2023

We need to update how hypsipyle parses VEP annotations for several fields, as described below.

Phenotype

Current header - "VAR_SYNONYMS"
Updated header - "PHENOTYPES"

pattern - <phenotype>+<source>+<feature affected>
Multiple phenotypes are separated by &

example -

ANGIOTENSINOGEN+MIM_morbid+ENSG00000135744&HYPERTENSION__ESSENTIAL+MIM_morbid+ENSG00000135744&RENAL_TUBULAR_DYSGENESIS+MIM_morbid+ENSG00000135744&NON_RARE_IN_EUROPE:_Essential_hypertension+Orphanet+ENSG00000135744&Renal_tubular_dysgenesis_of_genetic_origin+Orphanet+ENSG00000135744&ClinVar:_phenotype_not_specified+ClinVar+rs699&HYPERTENSION__ESSENTIAL__SUSCEPTIBILITY_TO+ClinVar+rs699&Hypertensive_disorder+ClinVar+rs699&Preeclampsia__susceptibility_to+ClinVar+rs699&RENAL_TUBULAR_DYSGENESIS+ClinVar+rs699&Susceptibility_to_progression_to_renal_failure_in_IgA_nephropathy+ClinVar+rs699&Coronary_Artery_Disease+NHGRI-EBI_GWAS_catalog+rs699&Diastolic_blood_pressure+NHGRI-EBI_GWAS_catalog+rs699&Mean_arterial_pressure+NHGRI-EBI_GWAS_catalog+rs699&Systolic_blood_pressure+NHGRI-EBI_GWAS_catalog+rs699
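
A minimal sketch of parsing the PHENOTYPES value described above, splitting entries on & and each entry on +. The names `Phenotype` and `parse_phenotypes` are hypothetical and not part of hypsipyle. Splitting from the right is an assumption so that a + inside a phenotype name would not break the split.

```python
from typing import List, NamedTuple


class Phenotype(NamedTuple):
    """One <phenotype>+<source>+<feature affected> entry (hypothetical type)."""
    name: str
    source: str
    feature: str


def parse_phenotypes(value: str) -> List[Phenotype]:
    """Split a PHENOTYPES value into its individual entries."""
    records = []
    for entry in value.split("&"):
        if not entry:
            continue  # skip empty entries, e.g. when the field itself is empty
        # Split from the right: source and feature are the last two parts.
        parts = entry.rsplit("+", 2)
        if len(parts) != 3:
            raise ValueError(f"unexpected phenotype entry: {entry!r}")
        records.append(Phenotype(*parts))
    return records


example = (
    "ANGIOTENSINOGEN+MIM_morbid+ENSG00000135744"
    "&Hypertensive_disorder+ClinVar+rs699"
)
for record in parse_phenotypes(example):
    print(record.name, record.source, record.feature)
```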

Alternative names (data not available in 110 because of a VEP cache issue)

Current header - none
Updated header - "VAR_SYNONYMS"

pattern - <source>::<name>&<name>..
Synonyms from different sources are separated by --; multiple names from the same source are separated by &

example -

ClinVar::RCV000835695&RCV000405686&RCV000242838&RCV000019693&RCV000019692&RCV000019691&VCV000018068--OMIM::106150.0001--PharmGKB::PA166153539--UniProt::VAR_007096--COSMIC::COSM42556
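
The VAR_SYNONYMS pattern above could be parsed roughly as follows. This is a sketch only: `parse_var_synonyms` is a hypothetical name, and it assumes that -- and :: never occur inside a source or synonym name.

```python
from typing import Dict, List


def parse_var_synonyms(value: str) -> Dict[str, List[str]]:
    """Map each source to its list of synonym names."""
    synonyms: Dict[str, List[str]] = {}
    for group in value.split("--"):
        if not group:
            continue  # tolerate an empty field
        source, sep, names = group.partition("::")
        if not sep:
            raise ValueError(f"missing '::' in synonym group: {group!r}")
        # Names from the same source are joined with '&'.
        synonyms.setdefault(source, []).extend(n for n in names.split("&") if n)
    return synonyms


example = "ClinVar::RCV000835695&VCV000018068--OMIM::106150.0001--COSMIC::COSM42556"
print(parse_var_synonyms(example))
```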

Ancestral allele

Current header - none
Updated header - "AA"

The value is just the allele sequence.
There are some exceptional cases: the value is - if there is an insertion, or ? if the chromosome cannot be looked up in the FASTA file.
example -
G
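
One possible way to handle the AA value and its exceptional cases is sketched here. `parse_ancestral_allele` is a hypothetical helper, and collapsing both placeholders to `None` is an assumption; the parser may prefer to keep them distinct.

```python
from typing import Optional

# Placeholder values described above: "-" for an insertion,
# "?" when the chromosome cannot be looked up in the FASTA file.
AA_PLACEHOLDERS = {"-", "?"}


def parse_ancestral_allele(value: str) -> Optional[str]:
    """Return the allele sequence, or None when no allele is available."""
    if value == "" or value in AA_PLACEHOLDERS:
        return None
    return value


print(parse_ancestral_allele("G"))
print(parse_ancestral_allele("?"))
```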

Frequency

Current header - "FREQ" (outside of CSQ - directly under INFO)
Updated header - frequency fields under CSQ, coming from the --custom annotation for gnomAD and the --af_1kg flag for 1000 Genomes

full list of fields:

gnomAD_exomes -

gnomAD_exomes|gnomAD_exomes_AF|gnomAD_exomes_AC|gnomAD_exomes_AN|gnomAD_exomes_AF_afr|gnomAD_exomes_AC_afr|gnomAD_exomes_AN_afr|gnomAD_exomes_AF_amr|gnomAD_exomes_AC_amr|gnomAD_exomes_AN_amr|gnomAD_exomes_AF_asj|gnomAD_exomes_AC_asj|gnomAD_exomes_AN_asj|gnomAD_exomes_AF_eas|gnomAD_exomes_AC_eas|gnomAD_exomes_AN_eas|gnomAD_exomes_AF_fin|gnomAD_exomes_AC_fin|gnomAD_exomes_AN_fin|gnomAD_exomes_AF_nfe|gnomAD_exomes_AC_nfe|gnomAD_exomes_AN_nfe|gnomAD_exomes_AF_oth|gnomAD_exomes_AC_oth|gnomAD_exomes_AN_oth|gnomAD_exomes_AF_sas|gnomAD_exomes_AC_sas|gnomAD_exomes_AN_sas

example -

rs699|0.548141|137772|251344|0.845055|13722|16238|0.719601|24891|34590|0.439968|4434|10078|0.838792|15417|18380|0.441035|9544|21640|0.419703|47706|113666|0.500163|3069|6136|0.620231|18989|30616

gnomAD_genomes -
Same as gnomAD_exomes, except the fields have the gnomAD_genomes prefix.
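
As a rough sketch, the gnomAD fields could be picked out of a CSQ entry by prefix, pairing the CSQ field names with the values of one entry. `parse_frequency_fields` is a hypothetical function. It assumes the bare prefix field carries the matched variant ID (as the rs699 example suggests), that AF values are floats, and that AC and AN values are integers. It does not handle several values joined in one field.

```python
from typing import Dict, List, Union

Value = Union[str, int, float, None]


def parse_frequency_fields(
    header: List[str], values: List[str], prefix: str = "gnomAD_exomes"
) -> Dict[str, Value]:
    """Collect the fields named `prefix` or `prefix_*` from one CSQ entry."""
    if len(header) != len(values):
        raise ValueError("CSQ header and values differ in length")
    result: Dict[str, Value] = {}
    for name, raw in zip(header, values):
        if name != prefix and not name.startswith(prefix + "_"):
            continue  # not a field of this annotation source
        if raw == "":
            result[name] = None  # empty value: no annotation for this variant
            continue
        # "AF", "AC" or "AN" for frequency fields, "" for the bare prefix field.
        kind = name[len(prefix) + 1:].split("_")[0]
        if kind == "AF":
            result[name] = float(raw)
        elif kind in ("AC", "AN"):
            result[name] = int(raw)
        else:
            result[name] = raw
    return result


header = "Allele|gnomAD_exomes|gnomAD_exomes_AF|gnomAD_exomes_AC|gnomAD_exomes_AN".split("|")
values = "G|rs699|0.548141|137772|251344".split("|")
print(parse_frequency_fields(header, values))
```

Passing `prefix="gnomAD_genomes"` would select the genome fields in the same way.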

1000Genomes -

AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF

example -

0.7051|0.9032|0.6354|0.8532|0.4115|0.636
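
The 1000 Genomes fields could be read in a similar spirit. The sketch below assumes one CSQ entry has already been turned into a name-to-value mapping; `KG_FIELDS` and `parse_1kg_frequencies` are hypothetical names.

```python
from typing import Dict, Optional

# Field names listed above for the --af_1kg output.
KG_FIELDS = ("AF", "AFR_AF", "AMR_AF", "EAS_AF", "EUR_AF", "SAS_AF")


def parse_1kg_frequencies(csq_record: Dict[str, str]) -> Dict[str, Optional[float]]:
    """Return the 1000 Genomes frequencies as floats, None where missing or empty."""
    return {
        field: float(csq_record[field]) if csq_record.get(field) else None
        for field in KG_FIELDS
    }


record = dict(zip(KG_FIELDS, "0.7051|0.9032|0.6354|0.8532|0.4115|0.636".split("|")))
print(parse_1kg_frequencies(record))
```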

Primary source

Current header - "source" (value taken from VCF file header)
Updated header - "SOURCE" (an INFO field - can be read from the INFO field for each variant)

example -

1	33878	ENSVVVI00100001	T	G	.	.	SOURCE=Ensembl;CSQ=G|missense_variant&splice_region_variant....
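
Reading SOURCE per variant might look like the sketch below, which splits a tab-separated VCF data line and looks up the key in the INFO column. `parse_info` and `variant_source` are hypothetical helpers, not existing hypsipyle functions.

```python
from typing import Dict, Optional, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split a VCF INFO column into key/value pairs; flags map to True."""
    fields: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def variant_source(vcf_line: str) -> Optional[str]:
    """Return the SOURCE value of one VCF data line, or None if absent."""
    columns = vcf_line.rstrip("\n").split("\t")
    if len(columns) < 8:
        raise ValueError("expected at least 8 tab-separated VCF columns")
    source = parse_info(columns[7]).get("SOURCE")  # INFO is the 8th column
    return source if isinstance(source, str) else None


line = "1\t33878\tENSVVVI00100001\tT\tG\t.\t.\tSOURCE=Ensembl;CSQ=G|missense_variant&splice_region_variant"
print(variant_source(line))
```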