seasonal flu "virus" records should be indexed by isolate id instead of strain name #165

huddlej · 2024-10-18T18:53:23Z

Description

We currently index the virus table in RethinkDB on the strain name of each isolate. However, this indexing causes at least two problems:

sequence records can get linked to the wrong isolate id when the passage type for the same strain name differs
duplicate sequences can appear in the virus table when strain names get renamed in GISAID and later reingested with the same isolate id

As an example of the first issue, two different isolate ids exist for the strain name A/AbuDhabi/240/2018: one for a cell-passaged isolate and one for an egg-passaged isolate. Because we index virus records on strain name, we have only one record with the name A/AbuDhabi/240/2018 and one isolate id, EPI_ISL_312868, which is the cell-passaged isolate id. When we include the egg-passaged sequences for this strain in our builds, we report the incorrect isolate id.

As an example of the second issue, at one point the isolate id EPI_ISL_18430014 had a strain name of A/Moscow/MH144681S/2023 which was later renamed to A/Moscow/RII-MH144681S/2023. The isolate id and gene sequence id remain the same, but because we index on strain name, these appeared to be distinct records.
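To make the two failure modes concrete, here is a minimal in-memory sketch (plain Python dicts, not RethinkDB). The field names `strain`, `isolate_id`, and `passage` are assumptions for illustration, the egg-passaged isolate id is a placeholder because the real id is not given above, and the Moscow passage type is unknown here. The `index_by` helper is hypothetical and overwrites on key collision, which only approximates how an upsert into a keyed table behaves.

```python
# Illustrative records only. Field names are assumed, not taken from the real schema.
records = [
    # Placeholder id: the real egg-passaged isolate id is not stated in this issue.
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_PLACEHOLDER_EGG", "passage": "egg"},
    {"strain": "A/AbuDhabi/240/2018", "isolate_id": "EPI_ISL_312868", "passage": "cell"},
    # Same isolate before and after the rename in GISAID (passage not stated above).
    {"strain": "A/Moscow/MH144681S/2023", "isolate_id": "EPI_ISL_18430014", "passage": "unknown"},
    {"strain": "A/Moscow/RII-MH144681S/2023", "isolate_id": "EPI_ISL_18430014", "passage": "unknown"},
]


def index_by(records, key):
    """Build a dict keyed on `key`; a later record overwrites an earlier one."""
    table = {}
    for record in records:
        table[record[key]] = record
    return table


by_strain = index_by(records, "strain")
by_isolate = index_by(records, "isolate_id")

# Problem 1: the egg-passaged isolate is lost under the strain-name key.
abu_dhabi_ids = [r["isolate_id"] for r in by_strain.values()
                 if r["strain"] == "A/AbuDhabi/240/2018"]

# Problem 2: the renamed Moscow isolate shows up twice under the strain-name key.
moscow_copies = [r for r in by_strain.values()
                 if r["isolate_id"] == "EPI_ISL_18430014"]

print(abu_dhabi_ids)       # only one isolate id survives
print(len(moscow_copies))  # the same isolate id appears more than once
print(sorted(by_isolate))  # one entry per isolate id
```

Keyed on isolate id, both A/AbuDhabi/240/2018 isolates are retained and the renamed Moscow isolate collapses to a single record carrying its most recent name.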

Proposed solution

GISAID distinguishes viruses by their isolate ids and not by their strain names, allowing multiple versions of the same strain to be included in the database. I propose that we follow this data model in the RethinkDB table, too, by changing the viruses index key from strain to isolate_id.

I realize this is a potentially breaking change, but I think we could make it with the following general steps (specifics may vary and be much messier):

Export the entire flu_viruses and flu_sequences tables to disk
Copy the existing tables to backup copies in the database
Delete all records in the original tables
Change the index key in the original flu_viruses table
Import all records from disk into the updated table
Test resolution of duplicates with a download of sequences
Update duplicate resolution logic to select the latest isolate id by passage type?
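The re-keying and duplicate-resolution steps above could be prototyped offline against the exported records before touching the database. The sketch below is one possible shape, not the fauna implementation: the function names, the `strain`/`isolate_id`/`passage` fields, and the example records (with made-up ids) are all hypothetical, and "latest isolate id" is interpreted as the largest numeric suffix of the `EPI_ISL_` id, which is an assumption rather than something GISAID guarantees.

```python
def rekey_by_isolate_id(exported_viruses):
    """Re-key exported virus documents on isolate id.

    Returns (rekeyed, missing), where `missing` holds documents without an
    isolate id that need manual review before import.
    """
    rekeyed = {}
    missing = []
    for doc in exported_viruses:
        isolate_id = doc.get("isolate_id")
        if not isolate_id:
            missing.append(doc)
            continue
        rekeyed[isolate_id] = doc
    return rekeyed, missing


def isolate_number(isolate_id):
    """Numeric suffix of an id like 'EPI_ISL_123' (assumed proxy for recency)."""
    return int(isolate_id.rsplit("_", 1)[1])


def latest_per_strain_and_passage(viruses):
    """Keep one record per (strain, passage): the one with the largest isolate number."""
    best = {}
    for doc in viruses.values():
        key = (doc["strain"], doc["passage"])
        current = best.get(key)
        if current is None or isolate_number(doc["isolate_id"]) > isolate_number(current["isolate_id"]):
            best[key] = doc
    return best


# Made-up export for illustration; these strain names and ids are not real records.
exported = [
    {"strain": "A/Example/1/2020", "isolate_id": "EPI_ISL_100", "passage": "cell"},
    {"strain": "A/Example/1/2020", "isolate_id": "EPI_ISL_200", "passage": "egg"},
    {"strain": "A/Example/1/2020", "isolate_id": "EPI_ISL_300", "passage": "cell"},
    {"strain": "A/Example/2/2020", "passage": "cell"},  # no isolate id recorded
]

rekeyed, missing = rekey_by_isolate_id(exported)
latest = latest_per_strain_and_passage(rekeyed)

print(len(rekeyed), len(missing))
print(latest[("A/Example/1/2020", "cell")]["isolate_id"])
print(latest[("A/Example/1/2020", "egg")]["isolate_id"])
```

Records that lack an isolate id are set aside rather than dropped, since they cannot be keyed under the proposed scheme and would need a decision before import.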

Additional context

Design discussion on Slack about how to index records


jameshadfield · 2024-12-18T01:10:32Z

duplicate sequences can appear in the virus table when strain names get renamed in GISAID and later reingested with the same isolate id

This can also happen if we modify the code around how the strain name is modified and then re-process the same data. I recently processed the entirety of H7Nx and HxN6 data and uploaded them to a test table in fauna. I won't upload them to the main table because of the issues outlined here.

From afar, the proposed solution seems good, although I can't speak to the steps we'd need to take to get there.

This was referenced Oct 18, 2024

Remove gisaid_epi_isl from auspice_config JSONs nextstrain/seasonal-flu#188

Merged

Add GISAID isolate ID to acknowledgments download nextstrain/seasonal-flu#178

Open
