Decouple data curation and upload #162

Open
joverlee521 opened this issue Sep 11, 2024 · 3 comments

Comments

@joverlee521
Contributor

Additional context in Slack

Part of the data curation happens inside vdb/upload and tdb/upload, which makes curation issues difficult to debug and makes it hard to share the curation steps with external groups.

Potential solutions

  1. Detangle data curation and data upload within fauna.
  2. Start brand new ingest workflows for curation where the results are then optionally uploaded to fauna.
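
For illustration only, here is a minimal sketch of the shape option (2) could take: curation is a pure function that can be run and debugged on its own, and upload is an optional final step. Nothing here is existing fauna code. The names `curate` and `run_ingest`, the `strain` field, and both curation rules are hypothetical placeholders.

```python
from typing import Callable, Iterable, Optional

Record = dict[str, str]


def curate(records: Iterable[Record]) -> list[Record]:
    """Pure curation step: no database access, so it can run and be tested alone."""
    curated = []
    for record in records:
        # Hypothetical rule: strip surrounding whitespace from every value.
        cleaned = {key: value.strip() for key, value in record.items()}
        # Hypothetical rule: drop records that have no strain name.
        if cleaned.get("strain"):
            curated.append(cleaned)
    return curated


def run_ingest(
    records: Iterable[Record],
    upload: Optional[Callable[[list[Record]], None]] = None,
) -> list[Record]:
    """Curate first; upload only if the caller supplies an upload function."""
    curated = curate(records)
    if upload is not None:
        upload(curated)
    return curated
```

Because `upload` is just a callable, external groups could reuse `curate` without any database, and a curation bug can be reproduced from the raw records alone.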
@joverlee521
Contributor Author

(2) was also brought up separately by @jameshadfield and +1'd by @trvrb in Slack, so that's the direction we should take here!

@jameshadfield
Member

Having worked in fauna for the first time in a few years, I'd very much welcome this decoupling. For the work I was doing in avian flu (no titers!) I'd propose (3):

  1. Use fauna to mirror GISAID (indexing on isolate_id and accession), i.e. fauna contains no curation at all. We then have ingest pipelines which start by downloading from fauna, curate the data, and then either use it directly or upload to S3.
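
As a rough sketch of this proposal, assuming an in-memory dict stands in for fauna: records are stored verbatim, keyed on (`isolate_id`, `accession`), and all curation happens downstream on copies. The function names `mirror` and `ingest`, the `strain` field, the whitespace rule, and the sample values in the usage lines are all hypothetical; only the two index fields come from the proposal above.

```python
Record = dict[str, str]
Key = tuple[str, str]


def mirror(store: dict[Key, Record], incoming: list[Record]) -> dict[Key, Record]:
    """Store records exactly as received, keyed on (isolate_id, accession)."""
    for record in incoming:
        key = (record["isolate_id"], record["accession"])
        # Copy the record so later changes to the caller's dict never alter the mirror.
        store[key] = dict(record)
    return store


def ingest(store: dict[Key, Record]) -> list[Record]:
    """Downstream curation: read from the mirror and return new, curated records."""
    return [
        # Hypothetical rule: trim whitespace from the strain name.
        {**record, "strain": record["strain"].strip()}
        for record in store.values()
    ]


# Usage with made-up placeholder values:
store = mirror({}, [{"isolate_id": "ISO_1", "accession": "ACC_1", "strain": " A/x/1 "}])
curated = ingest(store)
```

Re-running `mirror` with the same key overwrites the stored record in place, and `ingest` never writes back, so the mirror stays free of curation.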

@j23414
Contributor

j23414 commented Dec 19, 2024

About time! 🥳
