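We currently index the virus table in RethinkDB on the strain name of each isolate. However, this indexing causes at least two problems:
sequence records can get linked to the wrong isolate id when the passage type for the same strain name differs
duplicate sequences can appear in the virus table when strain names get renamed in GISAID and are later reingested with the same isolate id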
As an example of the first issue, two different isolate ids exist for the strain name A/AbuDhabi/240/2018: one for a cell-passaged isolate and one for an egg-passaged isolate. Because we index virus records on strain name, we have only one record with the name A/AbuDhabi/240/2018 and one isolate id, EPI_ISL_312868, which is the cell-passaged isolate id. When we include the egg-passaged sequences for this strain in our builds, we report the incorrect isolate id.
As an example of the second issue, at one point the isolate id EPI_ISL_18430014 had a strain name of A/Moscow/MH144681S/2023 which was later renamed to A/Moscow/RII-MH144681S/2023. The isolate id and gene sequence id remain the same, but because we index on strain name, these appeared to be distinct records.
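Both failure modes can be reproduced with a plain dictionary standing in for the table. This is only a sketch: the field names (`isolate_id`, `strain`, `passage`), the helper `index_records`, the egg-passaged isolate id placeholder, and the passage value on the Moscow records are all assumptions made for illustration, not values taken from the real tables.

```python
def index_records(records, key):
    """Mimic an upsert into a table keyed on `key`: the last write wins."""
    table = {}
    for record in records:
        table[record[key]] = record
    return table


records = [
    # Issue 1: one strain name, two isolates that differ by passage type.
    # The egg-passaged isolate id below is a placeholder, not a real id.
    {"isolate_id": "EPI_ISL_HYPOTHETICAL_EGG", "strain": "A/AbuDhabi/240/2018", "passage": "egg"},
    {"isolate_id": "EPI_ISL_312868", "strain": "A/AbuDhabi/240/2018", "passage": "cell"},
    # Issue 2: one isolate whose strain name was changed between ingests.
    # The passage value here is made up for the example.
    {"isolate_id": "EPI_ISL_18430014", "strain": "A/Moscow/MH144681S/2023", "passage": "unknown"},
    {"isolate_id": "EPI_ISL_18430014", "strain": "A/Moscow/RII-MH144681S/2023", "passage": "unknown"},
]

by_strain = index_records(records, "strain")
by_isolate = index_records(records, "isolate_id")

# Keyed on strain: the egg-passaged AbuDhabi isolate is overwritten,
# and the renamed Moscow isolate is stored under both names.
print(sorted(by_strain))

# Keyed on isolate id: both AbuDhabi isolates survive,
# and the Moscow isolate appears once, under its latest name.
print(sorted(by_isolate))
```

Keying on `strain` loses one AbuDhabi isolate and stores the Moscow isolate twice, while keying on `isolate_id` keeps each isolate exactly once.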
Proposed solution
GISAID distinguishes viruses by their isolate ids and not by their strain names, allowing multiple versions of the same strain to be included in the database. I propose that we follow this data model in the RethinkDB table, too, by changing the viruses index key from strain to isolate_id.
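As a rough sketch of what re-keying could involve, the function below takes records exported from the current table and builds a mapping keyed on isolate id, setting aside any record that has no isolate id for manual review. The function name `rekey_by_isolate_id` and the field names are hypothetical, and the sketch works on in-memory dictionaries rather than on RethinkDB itself.

```python
def rekey_by_isolate_id(exported):
    """Re-key exported virus records on isolate id.

    Returns (table, skipped): `table` maps isolate id to the most recently
    seen record for that id, and `skipped` lists records without an isolate id.
    """
    table = {}
    skipped = []
    for record in exported:
        isolate_id = record.get("isolate_id")
        if not isolate_id:
            # Cannot be keyed on isolate id; leave for manual review.
            skipped.append(record)
            continue
        table[isolate_id] = record
    return table, skipped


# Example with made-up records: the second entry has no isolate id.
exported = [
    {"isolate_id": "EPI_ISL_312868", "strain": "A/AbuDhabi/240/2018"},
    {"strain": "A/Example/1/2020"},
]
table, skipped = rekey_by_isolate_id(exported)
```

Checking the size of `skipped` before importing would show how many existing records cannot be migrated as they stand.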
I realize this is a potentially breaking change, but I think we could make it with the following general steps (specifics may vary and be much messier):
Export the entire flu_viruses and flu_sequences tables to disk
Copy the existing tables to backup copies in the database
Delete all records in the original tables
Change the index key in the original flu_viruses table
Import all records from disk into the updated table
Test resolution of duplicates with a download of sequences
Update duplicate resolution logic to select the latest isolate id by passage type?
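One possible reading of the last step is sketched below: group records by strain name and passage type, then keep the record with the highest isolate id in each group. This rests on two assumptions that would need checking: that isolate ids end in a number (as in EPI_ISL_312868) and that a larger number means a later submission. The function names, field names, strain name, and isolate ids in the example are made up.

```python
def isolate_number(isolate_id):
    """Numeric suffix of an id such as 'EPI_ISL_312868' (assumed format)."""
    return int(isolate_id.rsplit("_", 1)[1])


def latest_by_passage(records):
    """Keep one record per (strain, passage): the one with the highest isolate number."""
    latest = {}
    for record in records:
        group = (record["strain"], record["passage"])
        current = latest.get(group)
        if current is None or isolate_number(record["isolate_id"]) > isolate_number(current["isolate_id"]):
            latest[group] = record
    return latest


# Made-up records: two cell-passaged isolates and one egg-passaged isolate
# that share a hypothetical strain name.
records = [
    {"isolate_id": "EPI_ISL_100", "strain": "A/Example/1/2020", "passage": "cell"},
    {"isolate_id": "EPI_ISL_200", "strain": "A/Example/1/2020", "passage": "cell"},
    {"isolate_id": "EPI_ISL_150", "strain": "A/Example/1/2020", "passage": "egg"},
]
resolved = latest_by_passage(records)
```

With this grouping, cell-passaged and egg-passaged isolates of the same strain are never collapsed into one another, which is the case that currently produces the wrong isolate id.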
duplicate sequences can appear in the virus table when strain names get renamed in GISAID and later reingested with the same isolate id
This can also happen if we modify the code around how the strain name is modified and then re-process the same data. I recently processed the entirety of H7Nx and HxN6 data and uploaded them to a test table in fauna. I won't upload them to the main table because of the issues outlined here.
From afar, the proposed solution seems good, although I can't speak to the steps we'd need to take to get there.