Decouple data curation and upload #162

joverlee521 · 2024-09-11T22:01:35Z

Additional context in Slack

Part of the data curation happens inside vdb/upload and tdb/upload, which makes curation issues difficult to debug and makes the curation steps hard to share with external groups.

Potential solutions

1. Detangle data curation and data upload within fauna.
2. Start brand new ingest workflows for curation where the results are then optionally uploaded to fauna.

joverlee521 · 2024-12-19T21:07:25Z

(2) also separately brought up by @jameshadfield and +1 by @trvrb in Slack, so that's the direction we should take here!

jameshadfield · 2024-12-19T21:11:18Z

Having worked in fauna for the first time in a few years, this decoupling would be much welcome. For the work I was doing in avian flu (no titers!) I'd propose (3):

Use fauna to mirror GISAID (indexing on isolate_id and accession), i.e. fauna contains no curation at all. We then have ingest pipelines which start by downloading from fauna, curate the data, and then either use it directly or upload to S3.
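As a rough illustration of the flow described above (download, curate, then use directly or upload), here is a minimal Python sketch. Everything in it is an assumption made for illustration: `curate_record`, `run_ingest`, the `fetch` and `upload` callables, and the sample field values are hypothetical and are not part of fauna or any existing ingest pipeline. Only the field names `isolate_id` and `accession` come from the discussion.

```python
from typing import Callable, Iterable, Optional

# A record is a flat mapping of field name to raw string value.
Record = dict[str, str]


def curate_record(record: Record) -> Record:
    """Hypothetical curation step: trim whitespace and mark empty values as '?'."""
    curated: Record = {}
    for key, value in record.items():
        value = value.strip()
        curated[key] = value if value else "?"
    return curated


def run_ingest(
    fetch: Callable[[], Iterable[Record]],
    upload: Optional[Callable[[list[Record]], None]] = None,
) -> list[Record]:
    """Download uncurated records, curate them, and optionally upload the result.

    `fetch` stands in for "download from the uncurated mirror" and `upload`
    stands in for "upload to S3"; both are placeholders supplied by the caller.
    """
    curated = [curate_record(record) for record in fetch()]
    if upload is not None:
        upload(curated)
    return curated


if __name__ == "__main__":
    # Made-up sample data, used only to exercise the sketch.
    raw = [{"isolate_id": " isolate-1 ", "accession": "acc-1", "host": ""}]
    print(run_ingest(lambda: raw))
```

Because curation lives in a plain function with no upload side effects, it can be run and debugged on its own, and the same function could be shared with external groups without any database access.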

j23414 · 2024-12-19T22:59:33Z

About time! 🥳
