Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting population allele frequencies #29

Merged
merged 31 commits into from
Jan 31, 2024
Merged
Show file tree
Hide file tree
Changes from 26 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
fd86f09
Parsing populations from VCF
likhitha-surapaneni Dec 15, 2023
ae26a08
Logic to compute ref allele frequency, maf, hpmaf
likhitha-surapaneni Jan 17, 2024
6a384ac
Added endpoint to return population object
likhitha-surapaneni Jan 23, 2024
eae1b9a
Adding Population schema
likhitha-surapaneni Jan 23, 2024
0247d13
Fixed a bug in population frequencies
likhitha-surapaneni Jan 24, 2024
6865f77
Making allele_frequency nullable
likhitha-surapaneni Jan 24, 2024
22308a1
Reverting nullable allele_frequency for now
likhitha-surapaneni Jan 24, 2024
91699ce
Reverting nullable allele_frequency for now
likhitha-surapaneni Jan 24, 2024
36e7723
Added population metadata for gnomadg and gnomade
likhitha-surapaneni Jan 24, 2024
940ccf8
Added population metadata for gnomadg and gnomade
likhitha-surapaneni Jan 24, 2024
f6a2d27
Added examples for population and population_allele_frequencies
likhitha-surapaneni Jan 24, 2024
31ca0f1
Renamed file
likhitha-surapaneni Jan 24, 2024
4eb9e14
Minor typo
likhitha-surapaneni Jan 24, 2024
e6a5f3e
Added examples for gnomADg, gnomADe
likhitha-surapaneni Jan 24, 2024
bc8ba60
Cleanup of files
likhitha-surapaneni Jan 24, 2024
1870e41
Added version metadata for gnomADg and gnomADe
likhitha-surapaneni Jan 24, 2024
3a78f32
Allele_frequency as nullable
likhitha-surapaneni Jan 24, 2024
f661d72
Not returning population_allele_frequencies with allele_frequency null
likhitha-surapaneni Jan 25, 2024
5575ce0
Removed updates to example payload
likhitha-surapaneni Jan 25, 2024
4dfa7d3
Removed updates to example payload
likhitha-surapaneni Jan 25, 2024
bb6ae82
Modified endpoint from population to populations; Updated examples
likhitha-surapaneni Jan 25, 2024
d5a100c
Merge branch 'main' into population_scores
likhitha-surapaneni Jan 25, 2024
8a7557d
Fixing is_global values
likhitha-surapaneni Jan 29, 2024
42b7667
Merge branch 'population_scores' of https://github.com/likhitha-surap…
likhitha-surapaneni Jan 29, 2024
193ffe1
Fixing the nullability logic in super_population
likhitha-surapaneni Jan 29, 2024
a032adf
Fixing the nullability logic in super_population
likhitha-surapaneni Jan 29, 2024
518b5c8
Fixed typo in population metadata file and removed the redundant work…
likhitha-surapaneni Jan 29, 2024
f38e487
Fixed typo in population metadata file and removed the redundant work…
likhitha-surapaneni Jan 29, 2024
b74af67
Updated population.graphql to conform with VDM
likhitha-surapaneni Jan 29, 2024
f3bfa14
Handling multiple maf and hpmaf alleles
likhitha-surapaneni Jan 30, 2024
4052b2c
Added display_group_name
likhitha-surapaneni Jan 30, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
801 changes: 801 additions & 0 deletions common/file_model/population_metadata.json

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions common/file_model/populations.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
{
"1000_genomes": [
{"name": "1000GENOMES:phase_3:ALL", "frequencies": {"af": "AF"}},
{"name": "1000GENOMES:phase_3:AFR", "frequencies": {"af": "AFR_AF"}},
{"name": "1000GENOMES:phase_3:AMR", "frequencies": {"af": "AMR_AF"}},
{"name": "1000GENOMES:phase_3:EAS", "frequencies": {"af": "EAS_AF"}},
{"name": "1000GENOMES:phase_3:EUR", "frequencies": {"af": "EUR_AF"}},
{"name": "1000GENOMES:phase_3:SAS", "frequencies": {"af": "SAS_AF" }}
],
"gnomAD_exomes": [
{"name": "gnomADe:ALL", "frequencies": {"af": "gnomAD_exomes_AF", "ac": "gnomAD_exomes_AC", "an": "gnomAD_exomes_AN"}},
{"name": "gnomADe:afr", "frequencies": {"af": "gnomAD_exomes_AF_afr", "ac": "gnomAD_exomes_AC_afr", "an": "gnomAD_exomes_AN_afr"}},
{"name": "gnomADe:amr", "frequencies": {"af": "gnomAD_exomes_AF_amr", "ac": "gnomAD_exomes_AC_amr", "an": "gnomAD_exomes_AN_amr"}},
{"name": "gnomADe:asj", "frequencies": {"af": "gnomAD_exomes_AF_asj", "ac": "gnomAD_exomes_AC_asj", "an": "gnomAD_exomes_AN_asj"}},
{"name": "gnomADe:eas", "frequencies": {"af": "gnomAD_exomes_AF_eas", "ac": "gnomAD_exomes_AC_eas", "an": "gnomAD_exomes_AN_eas"}},
{"name": "gnomADe:fin", "frequencies": {"af": "gnomAD_exomes_AF_fin", "ac": "gnomAD_exomes_AC_fin", "an": "gnomAD_exomes_AN_fin"}},
{"name": "gnomADe:nfe", "frequencies": {"af": "gnomAD_exomes_AF_nfe", "ac": "gnomAD_exomes_AC_nfe", "an": "gnomAD_exomes_AN_nfe"}},
{"name": "gnomADe:oth", "frequencies": {"af": "gnomAD_exomes_AF_oth", "ac": "gnomAD_exomes_AC_oth", "an": "gnomAD_exomes_AN_oth"}},
{"name": "gnomADe:sas", "frequencies": {"af": "gnomAD_exomes_AF_sas", "ac": "gnomAD_exomes_AC_sas", "an": "gnomAD_exomes_AN_sas"}}
],
"gnomAD_genomes": [
{"name": "gnomADg:ALL", "frequencies": {"af": "gnomAD_genomes_AF", "ac": "gnomAD_genomes_AC", "an": "gnomAD_genomes_AN"}},
{"name": "gnomADg:afr", "frequencies": {"af": "gnomAD_genomes_AF_afr", "ac": "gnomAD_genomes_AC_afr", "an": "gnomAD_genomes_AN_afr"}},
{"name": "gnomADg:ami", "frequencies": {"af": "gnomAD_genomes_AF_ami", "ac": "gnomAD_genomes_AC_ami", "an": "gnomAD_genomes_AN_ami"}},
{"name": "gnomADg:amr", "frequencies": {"af": "gnomAD_genomes_AF_amr", "ac": "gnomAD_genomes_AC_amr", "an": "gnomAD_genomes_AN_amr"}},
{"name": "gnomADg:asj", "frequencies": {"af": "gnomAD_genomes_AF_asj", "ac": "gnomAD_genomes_AC_asj", "an": "gnomAD_genomes_AN_asj"}},
{"name": "gnomADg:eas", "frequencies": {"af": "gnomAD_genomes_AF_eas", "ac": "gnomAD_genomes_AC_eas", "an": "gnomAD_genomes_AN_eas"}},
{"name": "gnomADg:fin", "frequencies": {"af": "gnomAD_genomes_AF_fin", "ac": "gnomAD_genomes_AC_fin", "an": "gnomAD_genomes_AN_fin"}},
{"name": "gnomADg:mid", "frequencies": {"af": "gnomAD_genomes_AF_mid", "ac": "gnomAD_genomes_AC_mid", "an": "gnomAD_genomes_AN_mid"}},
{"name": "gnomADg:nfe", "frequencies": {"af": "gnomAD_genomes_AF_nfe", "ac": "gnomAD_genomes_AC_nfe", "an": "gnomAD_genomes_AN_nfe"}},
{"name": "gnomADg:oth", "frequencies": {"af": "gnomAD_genomes_AF_oth", "ac": "gnomAD_genomes_AC_oth", "an": "gnomAD_genomes_AN_oth"}},
{"name": "gnomADg:sas", "frequencies": {"af": "gnomAD_genomes_AF_sas", "ac": "gnomAD_genomes_AC_sas", "an": "gnomAD_genomes_AN_sas"}}
]
}
122 changes: 95 additions & 27 deletions common/file_model/variant.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ def __init__(self, record: Any, header: Any) -> None:
self.info = record.INFO
self.type = "Variant"
self.vep_version = re.search("v\d+", self.header.get_lines("VEP")[0].value).group()
self.population_map = {}

def get_alternative_names(self) -> List:
return []
Expand Down Expand Up @@ -196,7 +197,7 @@ def get_most_severe_consequence(self) -> Mapping:
csq_record_list = csq_record.split("|")
for cons in csq_record_list[consequence_index].split("&"):
rank = consequence_rank[cons]
consequence_map[rank] = cons
consequence_map[int(rank)] = cons
return{
"result": consequence_map[min(consequence_map.keys())] ,
"analysis_method": {
Expand Down Expand Up @@ -240,34 +241,101 @@ def get_info_key_index(self, key: str, info_id: str ="CSQ") -> int:
for index, value in enumerate(csq_list):
if value == key:
return index

def traverse_population_info(self) -> Mapping:
directory = os.path.dirname(__file__)
with open(os.path.join(directory,'populations.json')) as pop_file:
pop_mapping = json.load(pop_file)
population_frequency_map = {}
for csq_record in self.info["CSQ"]:
csq_record_list = csq_record.split("|")
allele_index = self.get_info_key_index("Allele")
if csq_record_list[allele_index] is not None:
if csq_record_list[allele_index] not in population_frequency_map.keys():
population_frequency_map[csq_record_list[allele_index]] = {}
for pop_key, pop in pop_mapping.items():
for sub_pop in pop:
if sub_pop["name"] in population_frequency_map[csq_record_list[allele_index]]:
continue
allele_count = allele_number = allele_frequency = None
for freq_key, freq_val in sub_pop["frequencies"].items():
col_index = self.get_info_key_index(freq_val)
if col_index and csq_record_list[col_index] is not None:
if freq_key == "af":
allele_frequency = csq_record_list[col_index] or None
elif freq_key == "an":
allele_number = csq_record_list[col_index] or None
elif freq_key == "ac":
allele_count = csq_record_list[col_index] or None
else:
raise Exception('Frequency metric is not recognised')
population_frequency = {
"population_name": sub_pop["name"],
"allele_frequency": allele_frequency,
"allele_count": allele_count,
"allele_number": allele_number,
"is_minor_allele": False,
"is_hpmaf": False
}
if allele_frequency:
population_frequency_map[csq_record_list[allele_index]][sub_pop["name"]] = population_frequency
return population_frequency_map

def set_frequency_flags(self, allele_list: List):
def set_frequency_flags(self):
"""
Calculates minor allele frequency by iterating through each allele
Assumption: Considers that population is only gnomAD (genomes) for now
Sets the maf as hpmaf as gnomAD is the only population at the moment
Calculates MAF (minor allele frequency) and HPMAF by iterating through each allele
"""
maf_frequency = -1
maf_index = -1
highest_frequency = -1
highest_frequency_index = -1
highest_maf_frequency = -1
highest_maf_frequency_index = -1
maf_map = {}
for allele_index, allele in enumerate(allele_list):
if(len(allele["population_frequencies"]) > 0):
pop = allele["population_frequencies"][0]
pop_allele_frequency = float(pop["allele_frequency"])
if ( pop_allele_frequency > maf_frequency and pop_allele_frequency < highest_frequency ):
maf_frequency = pop_allele_frequency
maf_index = allele_index
elif ( pop_allele_frequency > highest_frequency ):
maf_frequency = highest_frequency
maf_index = highest_frequency_index
highest_frequency = pop_allele_frequency
highest_frequency_index = allele_index
allele_list = self.get_alleles()
directory = os.path.dirname(__file__)
with open(os.path.join(directory,'populations.json')) as pop_file:
pop_mapping = json.load(pop_file)
pop_names = []
for pop in pop_mapping.values():
pop_names.extend([sub_pop["name"] for sub_pop in pop])
hpmaf = []
pop_frequency_map = self.traverse_population_info()

for pop_name in pop_names:
by_population = []
for allele in allele_list:
if allele.minimise_allele(allele.alt) in pop_frequency_map:
pop_freqs = pop_frequency_map[allele.minimise_allele(allele.alt)].values()
# Calculate reference allele from parsed frequency info of all alleles
if allele.get_allele_type()["accession_id"] == "biological_region" and len(by_population)>0 :
allele_frequency_ref = 1 - sum(list(zip(*by_population))[0])
if allele_frequency_ref <= 1 and allele_frequency_ref >= 0:
population_frequency_ref = {
"population_name": pop_name,
"allele_frequency": allele_frequency_ref ,
"allele_count": None,
"allele_number": None,
"is_minor_allele": False,
"is_hpmaf": False
}
minimised_allele = allele.minimise_allele(allele.alt)
if minimised_allele not in pop_frequency_map:
pop_frequency_map[minimised_allele] = {}
pop_frequency_map[minimised_allele][pop_name] = population_frequency_ref
by_population.append([allele_frequency_ref,minimised_allele,pop_name])

else:
for pop_freq in pop_freqs:
if pop_name == pop_freq["population_name"] and pop_freq["allele_frequency"]:
minimised_allele = allele.minimise_allele(allele.alt)
by_population.append([float(pop_freq["allele_frequency"]),minimised_allele, pop_name])



by_population_sorted = sorted(by_population, key=lambda item: item[0])
if len(by_population_sorted) >= 2:
maf_frequency, maf_allele, maf_population = by_population_sorted[-2]
pop_frequency_map[maf_allele][maf_population]["is_minor_allele"] = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have multiple population and corresponding MAF. We should somehow have a way to tell which MAF we want to represent in the GB drawer. Before, we only have one population and it was not a issue.

It does not effect the current EV design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may need further discussion for how we would like the API to communicate this, for example, through a field in variantor through the sorting order in population_allele_frequencies

hpmaf.append([maf_frequency,maf_allele,maf_population])
if len(hpmaf) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be multiple hpmaf allele. MAF with same highest frequency level in separate populations.

Not much effect in EV currently as not part of the view.

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nakib103 , this seems valid. Similarly, there can be multiple alleles having MAF, we need to mark them. Example to test multiple alleles having same MAF: 13:57932480:rs11276267

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Likhitha, also tested for 17:63992940:rs1183731126 hpmaf.

hpmaf_sorted = sorted(hpmaf, key=lambda item: item[0])
_, hpmaf_allele, hpmaf_population = hpmaf_sorted[-1]
pop_frequency_map[hpmaf_allele][hpmaf_population]["is_hpmaf"] = True
return pop_frequency_map


if maf_frequency>=0:
allele_list[maf_index]["population_frequencies"][0]["is_minor_allele"] = True
allele_list[maf_index]["population_frequencies"][0]["is_hpmaf"] = True

76 changes: 48 additions & 28 deletions common/file_model/variant_allele.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ def __init__(self, allele_index: str, alt: str, variant:dict) -> None:
self.type = "VariantAllele"
self.allele_sequence = alt
self.reference_sequence = variant.ref
self.population_map = []
self.info_map = self.traverse_csq_info()

def get_allele_type(self):
Expand All @@ -41,7 +42,12 @@ def get_prediction_results(self):
min_alt = self.minimise_allele(self.alt)
return self.info_map[min_alt]["prediction_results"] if min_alt in self.info_map else []


def get_population_allele_frequencies(self):
min_alt = self.minimise_allele(self.alt)
population_map = self.variant.set_frequency_flags()
return population_map[min_alt].values() if min_alt in population_map else []


def traverse_csq_info(self) -> Mapping:
"""
This function is to traverse the CSQ record and extract columns
Expand Down Expand Up @@ -70,7 +76,47 @@ def traverse_csq_info(self) -> Mapping:
current_prediction_results = info_map[csq_record_list[prediction_index_map["allele"]]]["prediction_results"]
info_map[csq_record_list[prediction_index_map["allele"]]]["prediction_results"] += self.create_allele_prediction_results(current_prediction_results, csq_record_list, prediction_index_map)
return info_map

# def traverse_population_info(self) -> Mapping:
# directory = os.path.dirname(__file__)
# with open(os.path.join(directory,'populations.json')) as pop_file:
# pop_mapping = json.load(pop_file)
# population_frequency_map = {}
# for csq_record in self.variant.info["CSQ"]:
# csq_record_list = csq_record.split("|")
# allele_index = self.variant.get_info_key_index("Allele")
# if csq_record_list[allele_index] is not None:
# if csq_record_list[allele_index] not in population_frequency_map.keys():
# population_frequency_map[csq_record_list[allele_index]] = {}
# for pop_key, pop in pop_mapping.items():
# for sub_pop in pop:
# if sub_pop["name"] in population_frequency_map[csq_record_list[allele_index]]:
# continue
# allele_count = allele_number = allele_frequency = None
# for freq_key, freq_val in sub_pop["frequencies"].items():
# col_index = self.variant.get_info_key_index(freq_val)
# if col_index is not None and csq_record_list[col_index] is not None:
# if freq_key == "af":
# allele_frequency = csq_record_list[col_index]
# elif freq_key == "an":
# allele_number = csq_record_list[col_index] or None
# elif freq_key == "ac":
# allele_count = csq_record_list[col_index] or None
# else:
# raise Exception('Frequency metric is not recognised')
# population_frequency = {
# "population": sub_pop["name"],
# "allele_frequency": allele_frequency or -1,
# "allele_count": allele_count,
# "allele_number": allele_number,
# "is_minor_allele": False,
# "is_hpmaf": False
# }

# population_frequency_map[csq_record_list[allele_index]][sub_pop["name"]] = population_frequency
# return population_frequency_map


def create_allele_prediction_results(self, current_prediction_results: Mapping, csq_record: List, prediction_index_map: dict) -> list:
prediction_results = []
if "cadd_phred" in prediction_index_map.keys():
Expand Down Expand Up @@ -219,30 +265,4 @@ def minimise_allele(self, alt: str):
elif len(alt) < len(self.variant.ref):
minimised_allele = "-"
return minimised_allele

def format_frequency(self, raw_frequency_list: List) -> Mapping:
freq_map = {}
for freq in raw_frequency_list:
key = freq.split(":")[0]
freq_list = freq.split(":")[1].split(",")
freq_map[key] = freq_list
return freq_map

def get_population_allele_frequencies(self) -> List:
frequency_map = {}
if "FREQ" in self.variant.info:
frequency_map = self.format_frequency(",".join(map(str,self.variant.info["FREQ"])).split("|"))
population_allele_frequencies = []
for key, pop_list in frequency_map.items():
## Adding only GnomAD population
if key == "GnomAD":
if pop_list[self.allele_index] not in ["None", "."]:
population_allele_frequencies.append({
"population": key,
"allele_frequency": pop_list[self.allele_index],
"is_minor_allele": False,
"is_hpmaf": False
})
return population_allele_frequencies



14 changes: 14 additions & 0 deletions common/schemas/population.graphql
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
type Population{
"""
Population
"""
name: String!
size: Int!
description: String!
type: String
is_global: Boolean
is_from_genotypes: Boolean
display_group_name: String
super_population: Population
sub_populations: [Population]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like all fields in the schema should be non-nullable (unless display_group_name is allowed to be null? I do not know what this is).

Copy link
Contributor Author

@likhitha-surapaneni likhitha-surapaneni Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the file according to VDM, all the fields except super_population are non-nullable. display_group_name seems to be a string for displaying the population group on the website. Currently it seems to be redundant to name. I can check with the team once

}
12 changes: 7 additions & 5 deletions common/schemas/population_allele_frequency.graphql
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,10 @@ type PopulationAlleleFrequency{
"""
Population Allele Frequency
"""
population: String
allele_frequency: Float
is_minor_allele: Boolean
is_hpmaf: Boolean
}
population_name: String!
allele_frequency: Float!
allele_count: Int
allele_number: Int
is_minor_allele: Boolean!
is_hpmaf: Boolean!
}
1 change: 1 addition & 0 deletions common/schemas/query.graphql
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
type Query {
version: Version
variant(by_id: IdInput): Variant
populations(genome_id: String!): [Population]
}

input IdInput {
Expand Down
Binary file removed data/test_dbsnp.vcf.gz
Binary file not shown.
Binary file removed data/test_dbsnp.vcf.gz.tbi
Binary file not shown.
2 changes: 1 addition & 1 deletion examples/population/query.graphql
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
query pop_example {
population(genome_id: "a7335667-93e7-11ec-a39d-005056b38ce3")
populations(genome_id: "a7335667-93e7-11ec-a39d-005056b38ce3")
{

name
Expand Down
Loading