Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting population allele frequencies #29

Merged
merged 31 commits into from
Jan 31, 2024
Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
fd86f09
Parsing populations from VCF
likhitha-surapaneni Dec 15, 2023
ae26a08
Logic to compute ref allele frequency, maf, hpmaf
likhitha-surapaneni Jan 17, 2024
6a384ac
Added endpoint to return population object
likhitha-surapaneni Jan 23, 2024
eae1b9a
Adding Population schema
likhitha-surapaneni Jan 23, 2024
0247d13
Fixed a bug in population frequencies
likhitha-surapaneni Jan 24, 2024
6865f77
Making allele_frequency nullable
likhitha-surapaneni Jan 24, 2024
22308a1
Reverting nullable allele_frequency for now
likhitha-surapaneni Jan 24, 2024
91699ce
Reverting nullable allele_frequency for now
likhitha-surapaneni Jan 24, 2024
36e7723
Added population metadata for gnomadg and gnomade
likhitha-surapaneni Jan 24, 2024
940ccf8
Added population metadata for gnomadg and gnomade
likhitha-surapaneni Jan 24, 2024
f6a2d27
Added examples for population and population_allele_frequencies
likhitha-surapaneni Jan 24, 2024
31ca0f1
Renamed file
likhitha-surapaneni Jan 24, 2024
4eb9e14
Minor typo
likhitha-surapaneni Jan 24, 2024
e6a5f3e
Added examples for gnomADg, gnomADe
likhitha-surapaneni Jan 24, 2024
bc8ba60
Cleanup of files
likhitha-surapaneni Jan 24, 2024
1870e41
Added version metadata for gnomADg and gnomADe
likhitha-surapaneni Jan 24, 2024
3a78f32
Allele_frequency as nullable
likhitha-surapaneni Jan 24, 2024
f661d72
Not returning population_allele_frequencies with allele_frequency null
likhitha-surapaneni Jan 25, 2024
5575ce0
Removed updates to example payload
likhitha-surapaneni Jan 25, 2024
4dfa7d3
Removed updates to example payload
likhitha-surapaneni Jan 25, 2024
bb6ae82
Modified endpoint from population to populations; Updated examples
likhitha-surapaneni Jan 25, 2024
d5a100c
Merge branch 'main' into population_scores
likhitha-surapaneni Jan 25, 2024
8a7557d
Fixing is_global values
likhitha-surapaneni Jan 29, 2024
42b7667
Merge branch 'population_scores' of https://github.com/likhitha-surap…
likhitha-surapaneni Jan 29, 2024
193ffe1
Fixing the nullability logic in super_population
likhitha-surapaneni Jan 29, 2024
a032adf
Fixing the nullability logic in super_population
likhitha-surapaneni Jan 29, 2024
518b5c8
Fixed typo in population metadata file and removed the redundant work…
likhitha-surapaneni Jan 29, 2024
f38e487
Fixed typo in population metadata file and removed the redundant work…
likhitha-surapaneni Jan 29, 2024
b74af67
Updated population.graphql to conform with VDM
likhitha-surapaneni Jan 29, 2024
f3bfa14
Handling multiple maf and hpmaf alleles
likhitha-surapaneni Jan 30, 2024
4052b2c
Added display_group_name
likhitha-surapaneni Jan 30, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
801 changes: 801 additions & 0 deletions common/file_model/population_metadata.json

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions common/file_model/populations.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
{
"1000_genomes": [
{"name": "1000GENOMES:phase_3:ALL", "frequencies": {"af": "AF"}},
{"name": "1000GENOMES:phase_3:AFR", "frequencies": {"af": "AFR_AF"}},
{"name": "1000GENOMES:phase_3:AMR", "frequencies": {"af": "AMR_AF"}},
{"name": "1000GENOMES:phase_3:EAS", "frequencies": {"af": "EAS_AF"}},
{"name": "1000GENOMES:phase_3:EUR", "frequencies": {"af": "EUR_AF"}},
{"name": "1000GENOMES:phase_3:SAS", "frequencies": {"af": "SAS_AF" }}
],
"gnomAD_exomes": [
{"name": "gnomADe:ALL", "frequencies": {"af": "gnomAD_exomes_AF", "ac": "gnomAD_exomes_AC", "an": "gnomAD_exomes_AN"}},
{"name": "gnomADe:afr", "frequencies": {"af": "gnomAD_exomes_AF_afr", "ac": "gnomAD_exomes_AC_afr", "an": "gnomAD_exomes_AN_afr"}},
{"name": "gnomADe:amr", "frequencies": {"af": "gnomAD_exomes_AF_amr", "ac": "gnomAD_exomes_AC_amr", "an": "gnomAD_exomes_AN_amr"}},
{"name": "gnomADe:asj", "frequencies": {"af": "gnomAD_exomes_AF_asj", "ac": "gnomAD_exomes_AC_asj", "an": "gnomAD_exomes_AN_asj"}},
{"name": "gnomADe:eas", "frequencies": {"af": "gnomAD_exomes_AF_eas", "ac": "gnomAD_exomes_AC_eas", "an": "gnomAD_exomes_AN_eas"}},
{"name": "gnomADe:fin", "frequencies": {"af": "gnomAD_exomes_AF_fin", "ac": "gnomAD_exomes_AC_fin", "an": "gnomAD_exomes_AN_fin"}},
{"name": "gnomADe:nfe", "frequencies": {"af": "gnomAD_exomes_AF_nfe", "ac": "gnomAD_exomes_AC_nfe", "an": "gnomAD_exomes_AN_nfe"}},
{"name": "gnomADe:oth", "frequencies": {"af": "gnomAD_exomes_AF_oth", "ac": "gnomAD_exomes_AC_oth", "an": "gnomAD_exomes_AN_oth"}},
{"name": "gnomADe:sas", "frequencies": {"af": "gnomAD_exomes_AF_sas", "ac": "gnomAD_exomes_AC_sas", "an": "gnomAD_exomes_AN_sas"}}
],
"gnomAD_genomes": [
{"name": "gnomADg:ALL", "frequencies": {"af": "gnomAD_genomes_AF", "ac": "gnomAD_genomes_AC", "an": "gnomAD_genomes_AN"}},
{"name": "gnomADg:afr", "frequencies": {"af": "gnomAD_genomes_AF_afr", "ac": "gnomAD_genomes_AC_afr", "an": "gnomAD_genomes_AN_afr"}},
{"name": "gnomADg:ami", "frequencies": {"af": "gnomAD_genomes_AF_ami", "ac": "gnomAD_genomes_AC_ami", "an": "gnomAD_genomes_AN_ami"}},
{"name": "gnomADg:amr", "frequencies": {"af": "gnomAD_genomes_AF_amr", "ac": "gnomAD_genomes_AC_amr", "an": "gnomAD_genomes_AN_amr"}},
{"name": "gnomADg:asj", "frequencies": {"af": "gnomAD_genomes_AF_asj", "ac": "gnomAD_genomes_AC_asj", "an": "gnomAD_genomes_AN_asj"}},
{"name": "gnomADg:eas", "frequencies": {"af": "gnomAD_genomes_AF_eas", "ac": "gnomAD_genomes_AC_eas", "an": "gnomAD_genomes_AN_eas"}},
{"name": "gnomADg:fin", "frequencies": {"af": "gnomAD_genomes_AF_fin", "ac": "gnomAD_genomes_AC_fin", "an": "gnomAD_genomes_AN_fin"}},
{"name": "gnomADg:mid", "frequencies": {"af": "gnomAD_genomes_AF_mid", "ac": "gnomAD_genomes_AC_mid", "an": "gnomAD_genomes_AN_mid"}},
{"name": "gnomADg:nfe", "frequencies": {"af": "gnomAD_genomes_AF_nfe", "ac": "gnomAD_genomes_AC_nfe", "an": "gnomAD_genomes_AN_nfe"}},
{"name": "gnomADg:oth", "frequencies": {"af": "gnomAD_genomes_AF_oth", "ac": "gnomAD_genomes_AC_oth", "an": "gnomAD_genomes_AN_oth"}},
{"name": "gnomADg:sas", "frequencies": {"af": "gnomAD_genomes_AF_sas", "ac": "gnomAD_genomes_AC_sas", "an": "gnomAD_genomes_AN_sas"}}
]
}
122 changes: 95 additions & 27 deletions common/file_model/variant.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ def __init__(self, record: Any, header: Any) -> None:
self.info = record.INFO
self.type = "Variant"
self.vep_version = re.search("v\d+", self.header.get_lines("VEP")[0].value).group()
self.population_map = {}

def get_alternative_names(self) -> List:
return []
Expand Down Expand Up @@ -196,7 +197,7 @@ def get_most_severe_consequence(self) -> Mapping:
csq_record_list = csq_record.split("|")
for cons in csq_record_list[consequence_index].split("&"):
rank = consequence_rank[cons]
consequence_map[rank] = cons
consequence_map[int(rank)] = cons
return{
"result": consequence_map[min(consequence_map.keys())] ,
"analysis_method": {
Expand Down Expand Up @@ -240,34 +241,101 @@ def get_info_key_index(self, key: str, info_id: str ="CSQ") -> int:
for index, value in enumerate(csq_list):
if value == key:
return index

def traverse_population_info(self) -> Mapping:
directory = os.path.dirname(__file__)
with open(os.path.join(directory,'populations.json')) as pop_file:
pop_mapping = json.load(pop_file)
population_frequency_map = {}
for csq_record in self.info["CSQ"]:
csq_record_list = csq_record.split("|")
allele_index = self.get_info_key_index("Allele")
if csq_record_list[allele_index] is not None:
if csq_record_list[allele_index] not in population_frequency_map.keys():
population_frequency_map[csq_record_list[allele_index]] = {}
for pop_key, pop in pop_mapping.items():
for sub_pop in pop:
if sub_pop["name"] in population_frequency_map[csq_record_list[allele_index]]:
continue
allele_count = allele_number = allele_frequency = None
for freq_key, freq_val in sub_pop["frequencies"].items():
col_index = self.get_info_key_index(freq_val)
if col_index and csq_record_list[col_index] is not None:
if freq_key == "af":
allele_frequency = csq_record_list[col_index] or None
elif freq_key == "an":
allele_number = csq_record_list[col_index] or None
elif freq_key == "ac":
allele_count = csq_record_list[col_index] or None
else:
raise Exception('Frequency metric is not recognised')
population_frequency = {
"population_name": sub_pop["name"],
"allele_frequency": allele_frequency,
"allele_count": allele_count,
"allele_number": allele_number,
"is_minor_allele": False,
"is_hpmaf": False
}
if allele_frequency:
population_frequency_map[csq_record_list[allele_index]][sub_pop["name"]] = population_frequency
return population_frequency_map

def set_frequency_flags(self, allele_list: List):
def set_frequency_flags(self):
"""
Calculates minor allele frequency by iterating through each allele
Assumption: Considers that population is only gnomAD (genomes) for now
Sets the maf as hpmaf as gnomAD is the only population at the moment
Calculates MAF (minor allele frequency) and HPMAF by iterating through each allele
"""
maf_frequency = -1
maf_index = -1
highest_frequency = -1
highest_frequency_index = -1
highest_maf_frequency = -1
highest_maf_frequency_index = -1
maf_map = {}
for allele_index, allele in enumerate(allele_list):
if(len(allele["population_frequencies"]) > 0):
pop = allele["population_frequencies"][0]
pop_allele_frequency = float(pop["allele_frequency"])
if ( pop_allele_frequency > maf_frequency and pop_allele_frequency < highest_frequency ):
maf_frequency = pop_allele_frequency
maf_index = allele_index
elif ( pop_allele_frequency > highest_frequency ):
maf_frequency = highest_frequency
maf_index = highest_frequency_index
highest_frequency = pop_allele_frequency
highest_frequency_index = allele_index
allele_list = self.get_alleles()
directory = os.path.dirname(__file__)
with open(os.path.join(directory,'populations.json')) as pop_file:
pop_mapping = json.load(pop_file)
pop_names = []
for pop in pop_mapping.values():
pop_names.extend([sub_pop["name"] for sub_pop in pop])
hpmaf = []
pop_frequency_map = self.traverse_population_info()

for pop_name in pop_names:
by_population = []
for allele in allele_list:
if allele.minimise_allele(allele.alt) in pop_frequency_map:
pop_freqs = pop_frequency_map[allele.minimise_allele(allele.alt)].values()
# Calculate reference allele from parsed frequency info of all alleles
if allele.get_allele_type()["accession_id"] == "biological_region" and len(by_population)>0 :
allele_frequency_ref = 1 - sum(list(zip(*by_population))[0])
if allele_frequency_ref <= 1 and allele_frequency_ref >= 0:
population_frequency_ref = {
"population_name": pop_name,
"allele_frequency": allele_frequency_ref ,
"allele_count": None,
"allele_number": None,
"is_minor_allele": False,
"is_hpmaf": False
}
minimised_allele = allele.minimise_allele(allele.alt)
if minimised_allele not in pop_frequency_map:
pop_frequency_map[minimised_allele] = {}
pop_frequency_map[minimised_allele][pop_name] = population_frequency_ref
by_population.append([allele_frequency_ref,minimised_allele,pop_name])

else:
for pop_freq in pop_freqs:
if pop_name == pop_freq["population_name"] and pop_freq["allele_frequency"]:
minimised_allele = allele.minimise_allele(allele.alt)
by_population.append([float(pop_freq["allele_frequency"]),minimised_allele, pop_name])



by_population_sorted = sorted(by_population, key=lambda item: item[0])
if len(by_population_sorted) >= 2:
maf_frequency, maf_allele, maf_population = by_population_sorted[-2]
pop_frequency_map[maf_allele][maf_population]["is_minor_allele"] = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have multiple population and corresponding MAF. We should somehow have a way to tell which MAF we want to represent in the GB drawer. Before, we only have one population and it was not a issue.

It does not effect the current EV design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may need further discussion for how we would like the API to communicate this, for example, through a field in variantor through the sorting order in population_allele_frequencies

hpmaf.append([maf_frequency,maf_allele,maf_population])
if len(hpmaf) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be multiple hpmaf allele. MAF with same highest frequency level in separate populations.

Not much effect in EV currently as not part of the view.

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nakib103 , this seems valid. Similarly, there can be multiple alleles having MAF, we need to mark them. Example to test multiple alleles having same MAF: 13:57932480:rs11276267

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Likhitha, also tested for 17:63992940:rs1183731126 hpmaf.

hpmaf_sorted = sorted(hpmaf, key=lambda item: item[0])
_, hpmaf_allele, hpmaf_population = hpmaf_sorted[-1]
pop_frequency_map[hpmaf_allele][hpmaf_population]["is_hpmaf"] = True
return pop_frequency_map


if maf_frequency>=0:
allele_list[maf_index]["population_frequencies"][0]["is_minor_allele"] = True
allele_list[maf_index]["population_frequencies"][0]["is_hpmaf"] = True

76 changes: 48 additions & 28 deletions common/file_model/variant_allele.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ def __init__(self, allele_index: str, alt: str, variant:dict) -> None:
self.type = "VariantAllele"
self.allele_sequence = alt
self.reference_sequence = variant.ref
self.population_map = []
self.info_map = self.traverse_csq_info()

def get_allele_type(self):
Expand All @@ -41,7 +42,12 @@ def get_prediction_results(self):
min_alt = self.minimise_allele(self.alt)
return self.info_map[min_alt]["prediction_results"] if min_alt in self.info_map else []


def get_population_allele_frequencies(self):
min_alt = self.minimise_allele(self.alt)
population_map = self.variant.set_frequency_flags()
return population_map[min_alt].values() if min_alt in population_map else []


def traverse_csq_info(self) -> Mapping:
"""
This function is to traverse the CSQ record and extract columns
Expand Down Expand Up @@ -70,7 +76,47 @@ def traverse_csq_info(self) -> Mapping:
current_prediction_results = info_map[csq_record_list[prediction_index_map["allele"]]]["prediction_results"]
info_map[csq_record_list[prediction_index_map["allele"]]]["prediction_results"] += self.create_allele_prediction_results(current_prediction_results, csq_record_list, prediction_index_map)
return info_map

# def traverse_population_info(self) -> Mapping:
# directory = os.path.dirname(__file__)
# with open(os.path.join(directory,'populations.json')) as pop_file:
# pop_mapping = json.load(pop_file)
# population_frequency_map = {}
# for csq_record in self.variant.info["CSQ"]:
# csq_record_list = csq_record.split("|")
# allele_index = self.variant.get_info_key_index("Allele")
# if csq_record_list[allele_index] is not None:
# if csq_record_list[allele_index] not in population_frequency_map.keys():
# population_frequency_map[csq_record_list[allele_index]] = {}
# for pop_key, pop in pop_mapping.items():
# for sub_pop in pop:
# if sub_pop["name"] in population_frequency_map[csq_record_list[allele_index]]:
# continue
# allele_count = allele_number = allele_frequency = None
# for freq_key, freq_val in sub_pop["frequencies"].items():
# col_index = self.variant.get_info_key_index(freq_val)
# if col_index is not None and csq_record_list[col_index] is not None:
# if freq_key == "af":
# allele_frequency = csq_record_list[col_index]
# elif freq_key == "an":
# allele_number = csq_record_list[col_index] or None
# elif freq_key == "ac":
# allele_count = csq_record_list[col_index] or None
# else:
# raise Exception('Frequency metric is not recognised')
# population_frequency = {
# "population": sub_pop["name"],
# "allele_frequency": allele_frequency or -1,
# "allele_count": allele_count,
# "allele_number": allele_number,
# "is_minor_allele": False,
# "is_hpmaf": False
# }

# population_frequency_map[csq_record_list[allele_index]][sub_pop["name"]] = population_frequency
# return population_frequency_map


def create_allele_prediction_results(self, current_prediction_results: Mapping, csq_record: List, prediction_index_map: dict) -> list:
prediction_results = []
if "cadd_phred" in prediction_index_map.keys():
Expand Down Expand Up @@ -219,30 +265,4 @@ def minimise_allele(self, alt: str):
elif len(alt) < len(self.variant.ref):
minimised_allele = "-"
return minimised_allele

def format_frequency(self, raw_frequency_list: List) -> Mapping:
freq_map = {}
for freq in raw_frequency_list:
key = freq.split(":")[0]
freq_list = freq.split(":")[1].split(",")
freq_map[key] = freq_list
return freq_map

def get_population_allele_frequencies(self) -> List:
frequency_map = {}
if "FREQ" in self.variant.info:
frequency_map = self.format_frequency(",".join(map(str,self.variant.info["FREQ"])).split("|"))
population_allele_frequencies = []
for key, pop_list in frequency_map.items():
## Adding only GnomAD population
if key == "GnomAD":
if pop_list[self.allele_index] not in ["None", "."]:
population_allele_frequencies.append({
"population": key,
"allele_frequency": pop_list[self.allele_index],
"is_minor_allele": False,
"is_hpmaf": False
})
return population_allele_frequencies



14 changes: 14 additions & 0 deletions common/schemas/population.graphql
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
type Population{
"""
Population
"""
name: String ## Requires to be nullable as super-population can be a null field
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the comment refer to the correct field? It correctly points out that the super_population field is nullable; but this is the name field, which I would have thought can never be null.

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the way GraphQL handles nullability, a nullable Type (Type population in case of super_population) having a non-null field name results in an error as super_population can be null but super_population->name cannot be null
See article.
The following is the error if we make name non-nullable. This needs to be handled better.

populations(genome_id: "a7335667-93e7-11ec-a39d-005056b38ce3")
  {
    
    name
    description
    is_global
    super_population {
      name # line causing error
    }
    sub_populations
    {
      name
    }
    
  }
}

Error: "message": "Cannot return null for non-nullable field Population.name.",

Copy link
Collaborator

@azangru azangru Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type population in case of super_population) having a non-null field name results in an error as super_population can be null but super_population->name cannot be null
...
Error: "message": "Cannot return null for non-nullable field Population.name.",

The error looks to me like the resolver is trying to return some empty super-population object rather than a true null. GraphQL type validator seems to be saying here that it received an object, rather than a null; and when it looked into its name field, there was nothing there. Is there any way you could inspect what it is you are actually returning from the super_population resolver when there is no super-population?

Consider this api: https://swapi-graphql.eskerda.vercel.app/

Person is a nullable field. But it has a non-nullable field that is ID. If you request a person with an existing id, you get the data

image

if you request a person with a non-existing id, there is an error about the missing person; but the person in the data is just set to null, and there are no errors about the non-nullable field Person.id

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @azangru , you are right. This seems be to an issue with the metadata file which should be fixed in the latest commits

size: Int!
description: String!
type: String
is_global: Boolean
is_from_genotypes: Boolean
display_group_name: String
super_population: Population
sub_populations: [Population]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like all fields in the schema should be non-nullable (unless display_group_name is allowed to be null? I do not know what this is).

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the file according to VDM, all the fields except super_population are non-nullable. display_group_name seems to be a string for displaying the population group on the website. Currently it seems to be redundant to name. I can check with the team once

}
12 changes: 7 additions & 5 deletions common/schemas/population_allele_frequency.graphql
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,10 @@ type PopulationAlleleFrequency{
"""
Population Allele Frequency
"""
population: String
allele_frequency: Float
is_minor_allele: Boolean
is_hpmaf: Boolean
}
population_name: String!
allele_frequency: Float!
allele_count: Int
allele_number: Int
is_minor_allele: Boolean!
is_hpmaf: Boolean!
}
1 change: 1 addition & 0 deletions common/schemas/query.graphql
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
type Query {
version: Version
variant(by_id: IdInput): Variant
population(genome_id: String!): [Population]
azangru marked this conversation as resolved.
Show resolved Hide resolved
}

input IdInput {
Expand Down
Binary file removed data/test_dbsnp.vcf.gz
Binary file not shown.
Binary file removed data/test_dbsnp.vcf.gz.tbi
Binary file not shown.
18 changes: 18 additions & 0 deletions examples/population/query.graphql
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
query pop_example {
population(genome_id: "a7335667-93e7-11ec-a39d-005056b38ce3")
{

name
description
is_global
super_population {
name
}
sub_populations
{
name
}

}
}

Loading