seasonal flu "virus" records should be indexed by isolate id instead of strain name #165

Open
huddlej opened this issue Oct 18, 2024 · 1 comment

Comments

@huddlej
Contributor

huddlej commented Oct 18, 2024

Description

We currently index the virus table in RethinkDB on the strain name of each isolate. However, this indexing causes at least two problems:

  1. sequence records can get linked to the wrong isolate id when the passage type for the same strain name differs
  2. duplicate sequences can appear in the virus table when strain names get renamed in GISAID and later reingested with the same isolate id

As an example of the first issue, two different isolate ids exist for the strain name A/AbuDhabi/240/2018: one for a cell-passaged isolate and one for an egg-passaged isolate. Because we index virus records on strain name, we have only one record with the name A/AbuDhabi/240/2018 and one isolate id, EPI_ISL_312868, which is the cell-passaged isolate id. When we include the egg-passaged sequences for this strain in our builds, we report the incorrect isolate id.

As an example of the second issue, at one point the isolate id EPI_ISL_18430014 had a strain name of A/Moscow/MH144681S/2023 which was later renamed to A/Moscow/RII-MH144681S/2023. The isolate id and gene sequence id remain the same, but because we index on strain name, these appeared to be distinct records.
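Both failure modes can be reproduced with a small in-memory sketch, where a plain Python dict stands in for a table keyed on strain name. This is only an illustration, not the actual ingest code: the helper `upsert_by_strain`, the record fields, and the egg-passaged isolate id `EPI_ISL_EGG_PLACEHOLDER` are hypothetical (the real egg-passaged id is not listed above).

```python
# A dict stands in for a table whose primary key is the strain name.
def upsert_by_strain(table, record):
    """Insert or overwrite a record, keyed on its strain name."""
    table[record["strain"]] = record


viruses = {}

# Problem 1: two isolates share one strain name but differ in passage type.
# The egg-passaged isolate id below is a placeholder, not a real GISAID id.
upsert_by_strain(viruses, {
    "strain": "A/AbuDhabi/240/2018",
    "isolate_id": "EPI_ISL_EGG_PLACEHOLDER",
    "passage": "egg",
})
upsert_by_strain(viruses, {
    "strain": "A/AbuDhabi/240/2018",
    "isolate_id": "EPI_ISL_312868",
    "passage": "cell",
})
# Only the cell-passaged record survives; the egg-passaged one was overwritten.
abudhabi = viruses["A/AbuDhabi/240/2018"]

# Problem 2: the same isolate is reingested after its strain name changed.
upsert_by_strain(viruses, {
    "strain": "A/Moscow/MH144681S/2023",
    "isolate_id": "EPI_ISL_18430014",
})
upsert_by_strain(viruses, {
    "strain": "A/Moscow/RII-MH144681S/2023",
    "isolate_id": "EPI_ISL_18430014",
})
# The same isolate id now appears under two different keys.
moscow = [v for v in viruses.values() if v["isolate_id"] == "EPI_ISL_18430014"]
```

After running this, the table holds one A/AbuDhabi/240/2018 record (the cell-passaged one) and two records for the single Moscow isolate.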

Proposed solution

GISAID distinguishes viruses by their isolate ids and not by their strain names, allowing multiple versions of the same strain to be included in the database. I propose that we follow this data model in the RethinkDB table, too, by changing the viruses index key from strain to isolate_id.
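As a rough sketch of why keying on isolate id would address both problems, here is the same kind of in-memory stand-in (a dict, not RethinkDB) keyed on `isolate_id` instead. The helper `upsert_by_isolate_id`, the record fields, and the egg-passaged id `EPI_ISL_EGG_PLACEHOLDER` are assumptions for illustration only.

```python
# A dict stands in for a table whose primary key is the isolate id.
def upsert_by_isolate_id(table, record):
    """Insert or overwrite a record, keyed on its isolate id."""
    table[record["isolate_id"]] = record


records = [
    # Two isolates with the same strain name; the egg id is a placeholder.
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_EGG_PLACEHOLDER", "passage": "egg"},
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_312868", "passage": "cell"},
    # One isolate ingested twice, before and after its strain was renamed.
    {"strain": "A/Moscow/MH144681S/2023", "isolate_id": "EPI_ISL_18430014"},
    {"strain": "A/Moscow/RII-MH144681S/2023", "isolate_id": "EPI_ISL_18430014"},
]

viruses = {}
for record in records:
    upsert_by_isolate_id(viruses, record)

# Both A/AbuDhabi/240/2018 isolates are kept as separate records.
abudhabi_ids = sorted(
    isolate_id
    for isolate_id, virus in viruses.items()
    if virus["strain"] == "A/AbuDhabi/240/2018"
)

# The renamed Moscow isolate is a single record carrying the newest name.
moscow_strain = viruses["EPI_ISL_18430014"]["strain"]
```

With this keying, the rename becomes an in-place update rather than a second record, and the strain name is free to repeat across passage types.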

I realize this is a potentially breaking change, but I think we could make it with the following general steps (specifics may vary and be much messier):

  1. Export the entire flu_viruses and flu_sequences tables to disk
  2. Copy the existing tables to backup copies in the database
  3. Delete all records in the original tables
  4. Change the index key in the original flu_viruses table
  5. Import all records from disk into the updated table
  6. Test resolution of duplicates with a download of sequences
  7. Update duplicate resolution logic to select the latest isolate id by passage type?
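For the last step, one possible shape of the duplicate resolution is sketched below. It makes several assumptions that would need checking: that "latest" can be approximated by the largest numeric suffix of the `EPI_ISL_` id, that each record carries a `passage_category` field, and that one record per strain and passage category is the desired outcome. The function names, the strain A/Example/1/2020, and the isolate ids are hypothetical.

```python
def isolate_number(isolate_id):
    """Return the numeric suffix of an id such as 'EPI_ISL_312868'."""
    return int(isolate_id.rsplit("_", 1)[1])


def select_latest_by_passage(records):
    """Keep one record per (strain, passage category).

    Assumption: a larger isolate number means a later isolate.
    """
    latest = {}
    for record in records:
        key = (record["strain"], record["passage_category"])
        current = latest.get(key)
        if current is None or (
            isolate_number(record["isolate_id"])
            > isolate_number(current["isolate_id"])
        ):
            latest[key] = record
    return list(latest.values())


# Hypothetical records: two cell-passaged isolates and one egg-passaged
# isolate that all share a strain name.
records = [
    {"strain": "A/Example/1/2020", "passage_category": "cell", "isolate_id": "EPI_ISL_100"},
    {"strain": "A/Example/1/2020", "passage_category": "egg", "isolate_id": "EPI_ISL_150"},
    {"strain": "A/Example/1/2020", "passage_category": "cell", "isolate_id": "EPI_ISL_200"},
]

selected = select_latest_by_passage(records)
selected_ids = sorted(record["isolate_id"] for record in selected)
```

Here the two cell-passaged isolates collapse to the one with the larger id while the egg-passaged isolate is kept, so a build could still report the correct isolate id for each passage type.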

Additional context

@jameshadfield
Member

> duplicate sequences can appear in the virus table when strain names get renamed in GISAID and later reingested with the same isolate id

This can also happen if we modify the code around how the strain name is modified and then re-process the same data. I recently processed the entirety of H7Nx and HxN6 data and uploaded them to a test table in fauna. I won't upload them to the main table because of the issues outlined here.

From afar, the proposed solution seems good, although I can't speak to the steps we'd need to take to get there.
